EVOLUTION OF MACHINE LEARNING
Machine learning (ML) is the process of learning a model that can be used for prediction based on data. Prediction involves assigning a data item to one of a set of classes or associating the data item with a number: the former activity is classification while the latter is regression. ML gained popularity because of the improved processing speed and storage capacity of computers and the availability of large data sets for experimentation. Deep learning (DL) is a branch of ML. The perceptron forms the basic building block of various DL architectures, including multi-layer perceptron networks, convolutional neural networks (CNNs) and recurrent neural networks (RNNs).
In the early days of artificial intelligence (AI), it was believed that mathematical logic was the ideal vehicle for building AI systems. Some of the initial contributions in this area, like the General Problem Solver (GPS), automatic theorem proving (ATP), rule-based systems and programming languages like Prolog and Lisp (based on lambda calculus), were all outcomes of this view.
However, the advent of efficient graphics processing units (GPUs) and platforms like TensorFlow and PyTorch, along with the demonstrated success stories of convolutional neural networks, gated recurrent units and generative models based on neural networks, have impacted every aspect of science and engineering activities across the globe. So, ML along with DL has become a state-of-the-art subject. Artificial neural networks form the backbone of DL.
A high-level view of AI is shown in the figure below. The tasks related to conventional AI are listed separately. Here, ML may be viewed as dealing with more than just pattern recognition (PR) tasks. Classification and clustering are the typical tasks of a PR system. However, ML deals with regression problems also. Data mining is the efficient organization of data in the form of a database.
Figure: Background topics of AI
Note that data structures and algorithms are basic to both conventional and current AI systems. Logic and discrete structures played an important role in the analysis and synthesis of conventional AI systems.
In ML, we deal with vectors and vector spaces, and these topics are better appreciated through linear algebra. The data input to an ML system may be viewed as a matrix, popularly called the data matrix. If there are n data items, each represented as an l-dimensional vector, then the corresponding data matrix A is of size n × l. Linear algebra is useful in analyzing the weights associated with the edges in a neural network. Matrix multiplication and eigen analysis are important in initializing the weights of the neural network and in weight updates. They can also help in weight normalization. The whole activity of clustering may be viewed as data matrix factorization.
Probability and statistics help in estimating the distributions underlying the data. Further, they play a crucial role in analysis and inference in ML.
Optimization (along with calculus) is essential in training neural networks, where gradients and their computations are important. Gradient descent-based optimization is an essential ingredient of any DL system.
Information theoretic concepts like entropy, mutual information and Kullback-Leibler divergence are essential to understand topics such as decision tree classifiers, feature selection and deep neural networks.
PARADIGMS FOR ML
There are different ways or paradigms for ML, such as learning by rote, learning by deduction, learning by abduction, learning by induction and reinforcement learning.
1 Learning by Rote
This involves memorization in an effective manner. It is a form of learning that is popular in elementary schools, where the alphabet and numbers are memorized. Memorizing simple addition and multiplication tables is also an example of rote learning. In the case of data caching, we store computed values so that we do not have to recompute them later. Caching is implemented by search engines and it may be viewed as another popular scheme of rote learning. When computation is more expensive than recall, this strategy can save a significant amount of time. Chess masters spend a lot of time memorizing the great games of the past. It is this rote learning that teaches them how to 'think' in chess. Various board positions and their potential to reach the winning configuration are exploited in games like chess and checkers.
2 Learning by Deduction
Deductive learning deals with the exploitation of deductions made earlier. This type of learning is based on reasoning that is truth preserving. Given A, and if A then B (A→B), we can deduce B. We can use B along with if B then C (B→C) to deduce C. Note that whenever A and A→B are True, then B is True, ratifying the truth-preserving nature of learning by deduction. Consider the following statements:
1. It is raining.
2. If it rains, the roads get wet.
3. If a road is wet, it is slippery.
From (1) and (2), we can infer using deduction that (4) the roads are wet. This deduction can then be used with (3) to deduce or learn that (5) the roads are slippery. Here, if statements (1), (2) and (3) are True, then statements (4) and (5) are automatically True.
A digital computer is primarily a deductive engine and is ideally suited for this form of learning. Deductive learning is applied in well-defined domains like game playing, including chess.
3 Learning by Abduction
Here, we infer A from B and (A→B). Notice that this is not truth preserving like deduction, as both B and (A→B) can be True while A is False. Consider the following inference:
1. An aeroplane is a flying object (aeroplane → flying object).
2. A is a flying object.
From (1) and (2), we infer using abduction that A is an aeroplane. This kind of reasoning may lead to incorrect conclusions. For example, A could be a bird or a kite.
4 Learning by Induction
This is the most popular and effective form of ML. Here, learning is achieved with the help of examples or observations. It may be categorized as follows:
Learning from Examples: Here, it is assumed that a collection of labelled examples is provided and the ML system uses these examples to make a prediction on a new data pattern. We deal with two ML problems: classification and regression.
1. Classification: Consider the handwritten digits shown in the figure below. Here, each row has 15 examples of each of the digits. The problem is to learn an ML model using such data to classify a new data pattern. This is also called supervised learning, as the model is learnt with the help of such exemplar data. In the case of handwritten digits, we have 10 class labels, one class label corresponding to each of the digits from 0 to 9. In classification, we would like to assign an appropriate class label from these labels to a new pattern.
Figure: Examples of handwritten digits labelled 0 to 9
2. Regression: Contrary to classification, there are several prediction applications where the labels come from a possibly infinite set. For example, the share value of a stock could be a positive real number. The stock may have different values at a particular time and each of these values is a real number. The need is to predict the share value of a stock at a future time instance based on past data in the form of examples. This is a typical regression or curve fitting problem.
Learning from Observations: Observations are also instances like examples, but they are different because observations need not be labelled. In this case, we cluster or group the observations into a smaller number of groups. Such grouping is performed with the help of a clustering algorithm that assigns similar patterns to the same group/cluster. Each cluster could be represented by its centroid or mean. Let x1, x2, x3, …, xp be the p elements of a cluster. Then the centroid of the cluster is defined by

centroid = (x1 + x2 + … + xp)/p
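As a concrete illustration of the centroid computation described above, the following minimal sketch (assuming NumPy is available; the cluster values are made up for illustration) averages the member vectors of a cluster component-wise:

```python
import numpy as np

# Hypothetical cluster of p = 4 two-dimensional observations
cluster = np.array([[1.0, 2.0],
                    [2.0, 1.0],
                    [1.5, 1.5],
                    [2.5, 2.5]])

# Centroid = component-wise mean of the cluster members
centroid = cluster.mean(axis=0)
print(centroid)  # [1.75 1.75]
```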
5. Reinforcement Learning:
Reinforcement Learning (RL) is a branch of machine learning focused on making decisions to maximize cumulative rewards in a given situation. Unlike supervised learning, which relies on a training dataset with predefined answers, RL involves learning through experience. In RL, an agent learns to achieve a goal in an uncertain, potentially complex environment by performing actions and receiving feedback through rewards or penalties.
TYPES OF DATA:
Data in machine learning are broadly categorized into two types − numerical (quantitative) and categorical (qualitative) data. The numerical data can be measured, counted or given a numerical value, for example, age, height, income, etc. The categorical data is non-numeric data that can be arranged in categories with or without meaningful order, for example, gender, blood group, etc.
Further, the numerical data can be categorized into discrete and continuous data. The categorical data can also be categorized into two types − nominal and ordinal. Let's understand these types of data in machine learning in detail.
Numerical (Quantitative) Data:
The numerical (quantitative) data is data that can be measured, counted or given a numerical value. The examples of numerical data are age, height, income, number of students in class, number of books in a shelf, shoe size, etc.
The numerical data can be categorized into the following two types −
Discrete Data
Continuous Data
Discrete Data
The discrete data is numerical data that is countable, finite, and can only take certain values, usually whole numbers. Examples of discrete data are number of students in class, number of books in a shelf, number of ducks in a pond, etc.
Continuous Data
The continuous data is numerical data that can take any value within a specified range including fractions and decimals. Examples of continuous data are age, height, weight, income, time, temperature, etc.
Categorical (Qualitative) Data
The categorical (qualitative) data can be categorized with or without a meaningful order. For example, gender, blood group, hair color, nationality, the school grades, level of education, range of income, ratings, etc.
The categorical data can be divided into the following two types −
Nominal Data
The nominal data is categorical data that can’t be arranged in an order or rank. The examples of nominal data are gender, blood group, hair color, nationality, etc.
Ordinal Data
The ordinal data is categorical data that can be ordered or ranked with respect to a specific attribute. Examples of ordinal data are school grades, level of education, range of income, ratings, etc.
MATCHING
Matching is an important activity in both supervised learning and learning from observations. Matching is carried out using a proximity measure, which can be a distance/dissimilarity measure or a similarity measure. Two data items, u and v, represented as l-dimensional vectors, match better when the distance between them is smaller or when the similarity between them is larger. A popular distance measure is the Euclidean distance and a popular similarity measure is the cosine of the angle between the vectors. The Euclidean distance is given by

d(u, v) = √((u(1) − v(1))² + (u(2) − v(2))² + … + (u(l) − v(l))²)

The cosine similarity is given by

cos(u, v) = uᵀv / (||u|| · ||v||)

where uᵀv is the dot product between vectors u and v and ||u|| is the Euclidean distance between u and the origin; it is also called the Euclidean norm. Some of the important applications of matching in ML are in:
Finding the Nearest Neighbor of a Pattern: Let x be an l-dimensional pattern vector. Let X = {x1, x2, x3, …, xn} be a collection of n data vectors. The nearest neighbor of x from X, denoted NNx(X), is xj if

d(x, xj) ≤ d(x, xi) for all xi ∈ X

This is approximate matching: the pattern in X that best matches x is obtained. If there is a tie, that is, when both xp ∈ X and xq ∈ X are nearest neighbors of x, we can break the tie arbitrarily and choose either of them to be the nearest neighbor of x.
Assigning to a Set with the Nearest Representative: Let C1, C2, C3, …, Ck be k sets with x1, x2, x3, …, xk as their respective representatives. A pattern x is assigned to Ci if
d(x, xi) ≤ d(x, xj) for all j ∈ {1, 2, …, k}
This idea is useful in clustering or learning from observations, where Ci is the ith group or cluster of patterns or observations and xi is the representative of Ci. The centroid of the data vectors in Ci is a popularly used representative xi of Ci.
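The two matching operations above can be sketched in a few lines, assuming NumPy; the vectors and cluster representatives below are illustrative only:

```python
import numpy as np

def euclidean(u, v):
    return np.linalg.norm(u - v)

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Nearest neighbor of x in a collection X (smallest Euclidean distance)
X = np.array([[1.0, 1.0], [3.0, 4.0], [0.5, 0.8]])
x = np.array([1.1, 0.9])
nn_index = np.argmin([euclidean(x, xi) for xi in X])
print("nearest neighbor:", X[nn_index])

# Assigning x to the set whose representative (centroid) is nearest
representatives = np.array([[1.0, 1.0], [5.0, 5.0]])   # centroids of C1, C2
assigned = np.argmin([euclidean(x, r) for r in representatives])
print("assigned to cluster:", assigned + 1)
```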
STAGES IN MACHINE LEARNING
Building a machine learning system involves a number of steps, as illustrated in below figure.
Figure: Important steps in a practical machine learning system
The available data is split into training, validation and test data. Training data is used in model learning or training and validation data is used to tune the ML model. Test data is used to examine how the learnt model is performing. We now describe the components of an ML system.
1 Data Acquisition
This depends on the domain of the application. For example, to distinguish between adults and children, measurements of their height or weight are adequate; however, to distinguish between normal and COVID-19-infected humans, their body temperature and chest congestion may be more important than their height or weight. Typically, data collection is carried out before feature engineering.
2 Feature Engineering
This step involves a combination of data preprocessing and data representation.
Data Preprocessing
In several practical applications, the raw data available needs to be updated before it can be used by an ML model. The common problems encountered with raw data are
a) Missing values
b) Data from different Domains
c) Outliers in the data
a). Missing Data: It is likely that in some domains, there could be missing data. This occurs as a consequence of the inability to measure a feature value or due to unavailability or erroneous data entry. Some ML algorithms can work even when there are a reasonable number of missing data values and, in such cases, there is no need for preprocessing. However, there are a large number of other cases where the ML models cannot handle missing values. So, there is a need to examine techniques for dealing with missing data. Different schemes are used for dealing with the prediction of missing values:
Use the nearest neighbor: Let x be an l-dimensional data vector that has its ith component x(i) missing. Let X = {x1, x2, x3, …, xn} be the set of n training pattern vectors. Let xj ∈ X be the nearest neighbor of x based on the remaining l − 1 components (excluding the ith). Predict the value of x(i) to be xj(i); that is, if the ith component x(i) of x is missing, use the ith component of xj = NNx(X) instead.
Use a larger neighborhood: Use the k-nearest neighbors (KNNs) of x to predict the missing value x(i). Let the KNNs of x from X, computed using the remaining l − 1 components, be x1, x2, …, xK. Now the predicted value of x(i) is the average of the ith components of these KNNs. That is, the predicted value of x(i) is

x(i) = (x1(i) + x2(i) + … + xK(i))/K
EXAMPLE: Consider the set of data vectors (1, 1, 1), (1, 1, 2), (1, 1, 3), (1, −, 2), (1, 1, −), (6, 6, 1).
There are 6 three-dimensional pattern vectors in the set. Missing values are indicated by −. Let us see how to predict the missing value in (1, −, 2). Let us use K = 3 and find the 3 nearest neighbors (NNs) based on the remaining feature values. The three NNs are (1, 1, 1), (1, 1, 2) and (1, 1, 3). Note that the second feature value of all three neighbors is 1, which is the predicted value for the missing value in (1, −, 2). So, the vector becomes (1, 1, 2).
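The KNN-based imputation in the example above can be sketched as follows (assuming NumPy; the neighbor search uses only the features that are present in the incomplete vector, and the incomplete vectors are excluded from the search):

```python
import numpy as np

# Complete training vectors (the incomplete ones are excluded from the search)
X = np.array([[1, 1, 1],
              [1, 1, 2],
              [1, 1, 3],
              [6, 6, 1]], dtype=float)

x = [1.0, None, 2.0]          # pattern with its 2nd component missing
known = [0, 2]                # indices of the available components
missing = 1
k = 3

# Distances computed on the available components only
d = np.linalg.norm(X[:, known] - np.array(x)[known].astype(float), axis=1)
knn = np.argsort(d)[:k]

# Predicted missing value = average of the neighbors' 2nd components
x[missing] = float(X[knn, missing].mean())
print(x)  # [1.0, 1.0, 2.0]
```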
Cluster the data and locate the nearest cluster: This approach is based on clustering the training data and locating the cluster to which x belongs based on the remaining l − 1 components. Let x, with its ith value missing, belong to cluster Cq. Let uq be the centroid of Cq. Then the predicted value of x(i) is uq(i), the ith component of uq.
EXAMPLE: Consider the following data matrix. It has 5 data vectors in a four- dimensional space.
5.1 3.5 1.4 0.2
4.9 3.0 1.4 0.2
4.7 3.2 1.3 0.2
4.6 3.1 1.5 0.2
5.0 3.6 1.4 0.2
Let us imagine that two values are missing, to obtain
NA  3.5 1.4 0.2
4.9 3.0 1.4 0.2
4.7 3.2 1.3 0.2
4.6 NA  1.5 0.2
5.0 3.6 1.4 0.2
If we assume that this data is from the same cluster, we can compute the mean in each case and use it to predict the missing value. The mean of the available 4 values in the first column is 4.8; similarly, the mean of the available 4 values in the second column is 3.325. So, using these mean values to replace the missing values in the matrix gives us the updated matrix
[[4.8, 3.5, 1.4, 0.2],
[4.9, 3.0, 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.325, 1.5, 0.2],
[5.0, 3.6, 1.4, 0.2]]
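A minimal sketch of this column-mean imputation, assuming NumPy (`np.nan` marks the missing entries):

```python
import numpy as np

A = np.array([[np.nan, 3.5,    1.4, 0.2],
              [4.9,    3.0,    1.4, 0.2],
              [4.7,    3.2,    1.3, 0.2],
              [4.6,    np.nan, 1.5, 0.2],
              [5.0,    3.6,    1.4, 0.2]])

# Column means computed over the available (non-missing) values
col_means = np.nanmean(A, axis=0)          # approximately [4.8, 3.325, 1.4, 0.2]

# Replace each missing entry by the mean of its column
rows, cols = np.where(np.isnan(A))
A[rows, cols] = col_means[cols]
print(A)
```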
EXAMPLE: Consider three clusters of points and their centroids:
a. Cluster1: ((1,1,1), (1,2,3), (1,3,2)) Centroid 1: (1,2,2)
b. Cluster2: ((3,4,3), (3,5,3), (3,3,3)) Centroid 2: (3,4,3)
c. Cluster3: ((6,6,6), (6,8,6), (6,7,6)) Centroid 3: (6,7,6)
Consider a pattern vector with a missing value given by (1, −, 2). Its nearest centroid among the three centroids, based on the remaining two features, is Centroid 1, (1, 2, 2). So, the missing value in the second location is predicted to be the second component of Centroid 1, that is, 2. So, the pattern with the missing value becomes (1, 2, 2).
b). Data from Different Domains: The scales of values of different features could be very different. This would bias the matching process to depend more upon features that assume larger values, toning down the contributions of features with smaller values. So, in applications where different components of the vectors have different domain ranges, it is possible for some components to dominate in contributing to the distance between any pair of patterns. Consider for example, classification of objects into one of two classes: adult or child. Let the objects be represented by height in metres and weight in grams.
| Height | Weight | Class |
|---|---|---|
| 1.6 | 75000 | Adult |
| 0.6 | 5000 | Child |
Assume that the domain of height is [0.5,2.5] and the domain of weight is [2000, 200000] in this example. So, there is a large difference in the ranges of values of these two features.
Now the Euclidean distance between the adult and child vectors given above is

d(adult, child) = √((1.6 − 0.6)² + (75000 − 5000)²) = √(1 + 4.9 × 10⁹) ≈ 70000

Similarly, the cosine of the angle between the adult and child vectors is

cos(adult, child) = (1.6 × 0.6 + 75000 × 5000) / (√(1.6² + 75000²) · √(0.6² + 5000²)) ≈ 1.0
Note that the proximity values computed between the two vectors, whether it is the Euclidean distance or the cosine of the angle between the two vectors, are dependent largely upon only one of the two features, that is, weight, while the contributions of height are negligible. This is because of the difference in the magnitudes of the ranges of values of the two features. This can be handled by scaling different components differently and such a process of scaling is called normalization. There are two popular normalization schemes:
Scaling using the range: On any categorical feature, the values of two patterns either match or mismatch, and the contribution to the distance is either 0 (zero) for a match or 1 for a mismatch. If we want to be consistent, then in the case of any numerical feature also we want the contribution to be in the range [0, 1]. This is achieved by scaling the difference by the range of the values of the feature. So, if the pth component is of numerical type, its contribution to the distance between patterns xi and xj is

|xi(p) − xj(p)| / Range_p

where Range_p is the range of the pth feature values. Note that the value of this term lies in [0, 1]: it is 1 when |xi(p) − xj(p)| = Range_p, and it is 0 (zero) if patterns xi and xj have the same value for the pth feature. Such a scaling ensures that the contribution to the distance of either a categorical feature or a numerical feature is in the range [0, 1].

Standardization: Here, the data is normalized so that it has 0 (zero) mean and unit variance. This is motivated by the property of the standard normal distribution, which is characterized by zero mean and unit variance.
EXAMPLE: Let there be 5 l-dimensional data vectors and let the qth components of the 5 vectors be 60, 80, 20, 100 and 40. The mean of this collection is (60 + 80 + 20 + 100 + 40)/5 = 60.
We get zero-mean data by subtracting this mean from each of the 5 data items to obtain 0, 20, −40, 40, −20 for their qth components. Note that this is zero-mean data as these values add up to 0. To make the standard deviation of this data 1, we divide each of the zero-mean data values by the standard deviation of the data. Note that the variance of the zero-mean data is

(0² + 20² + (−40)² + 40² + (−20)²)/5 = 800

and the standard deviation is 28.284. So, the scaled data is 0, 0.707, −1.414, 1.414, −0.707. Note that this data, corresponding to the qth feature value of the 5 vectors, has zero mean and unit variance.
Outliers in the Data: An outlier is a data item that is either noisy or erroneous. Noisy measuring instruments or erroneous data recordings are responsible for the presence of such outliers. A common problem across various applications is the presence of outliers. A data item is usually called an outlier if it
Assumes values that are far away from those of the average data items
Deviates from the normally behaving data item
Is not connected/similar to any other object in terms of its characteristics
Outliers can occur because of different reasons:
Noisy measurements: The measuring instruments may malfunction and may lead to the recording of noisy data. It is possible that the recorded value lies outside the domain of the data type.
Erroneous data entry: Outlying data can occur at the data entry level itself. For example, it is very common for spelling mistakes to occur when names are entered. Also, it is possible to enter numbers such as salary erroneously as 2000000 instead of 200000 by typing an extra zero (0).
Evolving systems: It is possible to encounter data items in sparse regions during the evolution of a system. For example, it is common to encounter isolated entities in the early days of a social network. Such isolated entities may or may not be outliers.
Very naturally: Instead of viewing an outlier as a noisy or unwanted data item, it may be better to view it as something useful. For example, a novel idea or breakthrough in a scientific discipline, a highly paid sportsperson or an expensive car can all be natural and influential outliers
An outlying data item can be either out-of-range or within-range. For example, consider an organization in which the salary could be from {10000, 150000, 225000, 300000}. In this case, an entry like 2250000 is an out-of-range outlier that occurs possibly because of an erroneous extra zero (0). Also, if there are only 500 people drawing 10000, 400 drawing 150000, 300 drawing 225000 and 175 drawing 300000, then an entry like 270000 could be a within-range outlier. There are different schemes for detecting outliers. They are based on the density around points in the data. If a data point is located in a sparse region, it could be a possible outlier. It is possible to use clustering to locate such outliers, whether they are within-range or out-of-range. If the clustering output has a singleton cluster, that is, a one-element cluster, then that element could be a possible outlier.
3 Data Representation
Representation is an important step in building ML models. This subsection introduces how data items are represented. It also discusses the importance of representation in ML. In the process, it deals with both feature selection and feature extraction and introduces different categories of dimensionality reduction.
It is often stated in DL literature that feature engineering is important in ML, but not in DL because DL systems have automatic representation learning capability. Even though representation is not explicit, it is implicitly handled in DL by choosing the appropriate number of layers and number of neurons in each layer of the neural network.
Representation of Data Items
The most active and state-of-the-art paradigm for ML is statistical machine learning. Here, each data item is represented as a vector. Typically, the important operations on the set of vectors are the addition of vectors in computing the mean or centroid of a collection of vectors, the multiplication of a vector by a scalar in dealing with operations on matrices, and the dot product between a pair of vectors for computing similarity. In most practical applications, the dimensionality of the data, or correspondingly the size L of the vectors representing data items, can be very large. For example, there are around 468 billion Google Ngrams. In this case, the dimensionality of the vectors is the vocabulary size or the number of Ngrams; so, the dimensionality could be very large. Such high-dimensional data is common in bioinformatics, information retrieval, satellite imagery and so on. So, representation is an important component of any ML system. An arbitrary representation may be adequate to build an ML model. However, the predictions made using such a model may not be meaningful.
Current-day applications deal with high-dimensional data. Some of the difficulties associated with ML using such high-dimensional data vectors are:
Computation time increases with the dimensionality.
Storage space requirement also increases with the dimensionality.
Performance of the model: It is well-known that as the dimensionality increases, we require a larger training data set to build an ML model. There is a result, popularly called the peaking phenomenon, that shows that as the dimensionality keeps increasing, the accuracy of a classification model increases until some value, and beyond that value, the accuracy starts decreasing.
This may be attributed to the well-known concept of overfitting. The model will tend to remember the training data and fail to perform well on validation data. With a larger training data set, we can afford to use a higher dimensional data set and still avoid overfitting. Even though the dimensionality of the data set in an application is large, it is possible that the number of available training vectors is small. In such cases, a popular technique used in ML is to reduce the dimensionality so that the learnt model does not overfit the available data.
Well-known dimensionality reduction approaches are:
Feature selection: Let F = {f1, f2, f3, …, fL } be the set of L features. In the feature selection approach, we would like to select a subset Fl of F having l (< L) features such that Fl maximizes the performance of the ML model.
Feature extraction: Here, from the set F of L features, a set H = { h1, h2, h3, …, hl} of l (<L) features is extracted. It is possible to categorize these schemes as follows:
1. Linear schemes: In this case,

hj = wj1 f1 + wj2 f2 + … + wjL fL

That is, each element of H is a linear combination of the original features (here the wji denote the combination weights). Note that feature selection is a specialization of feature extraction. Some prominent schemes under this category are:
a. Principal components (PCs): Consider a data set of n vectors in an L-dimensional space; this may be represented as a matrix A of size n × L. The covariance matrix Σ of size L × L associated with the data is computed, and the eigenvectors of Σ form the principal components. The eigenvector corresponding to the largest eigenvalue is the first principal component (PC). Similarly, the second largest eigenvalue provides its corresponding eigenvector as the second PC. Finally, the eigenvector corresponding to the lth largest eigenvalue is the lth PC. Both the original feature vectors and the PCs are sufficiently powerful to represent any data vector. So, PCs may be viewed as linear combinations of the given features.
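The principal components can be obtained from the eigenvectors of the covariance matrix, for example as in the following sketch (assuming NumPy; the data matrix is illustrative):

```python
import numpy as np

# Toy data matrix A of size n x L (n = 5 samples, L = 3 features)
A = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.1],
              [2.2, 2.9, 0.4],
              [1.9, 2.2, 0.6],
              [3.1, 3.0, 0.2]])

cov = np.cov(A, rowvar=False)              # L x L covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: the covariance matrix is symmetric

# Sort eigenvectors by decreasing eigenvalue; the columns are the PCs
order = np.argsort(eigvals)[::-1]
pcs = eigvecs[:, order]

# Project the (mean-centred) data onto the first l = 2 principal components
projected = (A - A.mean(axis=0)) @ pcs[:, :2]
print(projected.shape)   # (5, 2)
```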
b. Non-negative matrix factorization (NMF): Even when the data is non-negative, it is possible that the PCs have negative entries. However, it is useful to have representations using non-negative entries; NMF is such a factorization of A (of size n × L) into a product of B (of size n × l) and C (of size l × L). Its use is motivated by the notion that NMF can be used to characterize objects in an image represented by A. In NMF, the columns of B can be viewed as linear combinations of the columns of A because of linear independence.
2. Non-linear feature extraction: Here, we represent the data using H = {h1, h2, h3, …, hl} such that

hj = t(f1, f2, f3, …, fL)

where t is a non-linear function of the features. For example, if F = {f1, f2}, then h1 = a f1 + b f2 + c f1 f2 is one such non-linear combination; it is non-linear because we have a term of the form f1 f2 in h1.
The autoencoder is a popular, state-of-the-art, non-linear feature extraction tool. Here, a neural network which has an encoder and a decoder is used. The middle layer has l neurons, so that the l outputs from the middle layer give an l (< L)-dimensional representation of the L-dimensional pattern that is input to the autoencoder. Note that the encoder encodes or represents the L-dimensional pattern in the l-dimensional space, while the decoder decodes or converts the l-dimensional pattern back into the L-dimensional space. It is called an autoencoder because the same L-dimensional pattern is associated with the input and output layers.
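A minimal autoencoder sketch, assuming PyTorch is available; the layer sizes (L = 8 inputs, l = 2 middle-layer neurons) and the random training data are purely illustrative:

```python
import torch
import torch.nn as nn

L, l = 8, 2                      # input dimensionality and code dimensionality

encoder = nn.Sequential(nn.Linear(L, 4), nn.ReLU(), nn.Linear(4, l))
decoder = nn.Sequential(nn.Linear(l, 4), nn.ReLU(), nn.Linear(4, L))
autoencoder = nn.Sequential(encoder, decoder)

X = torch.randn(100, L)          # illustrative L-dimensional patterns
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(50):
    optimizer.zero_grad()
    reconstruction = autoencoder(X)      # input and output are the same patterns
    loss = loss_fn(reconstruction, X)
    loss.backward()
    optimizer.step()

codes = encoder(X)               # l-dimensional representation of each pattern
print(codes.shape)               # torch.Size([100, 2])
```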
4 Model Selection
Selection of the model to be used to train an ML system depends upon the nature of the data and knowledge of the application domain. For some applications, only a subset of the ML models can be used. For example, if some features are numerical and others are categorical, then classifiers based on perceptrons and support vector machines (SVMs) are not suitable, as they compute the dot product between vectors and dot products do not make sense when some values in the corresponding vectors are non-numerical. On the other hand, Bayesian models and decision tree-based models are ideally suited to deal with such data, as they depend upon the frequency of occurrence of values.
5 Model Learning
This step depends on the size and type of the training data. In practice, a subset of the labelled data is used as training data for learning the model and another subset is used for model validation or model evaluation. Some of the ML models are highly transparent while others are opaque or black box models. For example, decision tree-based models are ideally suited to provide transparency; this is because in a decision tree, at each internal or decision node, branching is carried out based on the value assumed by a feature. For example, if the height of an object is larger than 5 feet, it is likely to be an adult and not a child; such easy-to-understand rules are abstracted by decision trees. Neural networks are typically opaque as the outputs of intermediate/hidden layer neurons may not offer transparency.
6 Model Evaluation
This step is also called model validation. It requires specifically earmarked data called validation data. It is possible that the ML model works well on the training data; then we say that the model is well trained. However, it may not work well on the validation data. In such a case, we say that the ML model overfits the training data. In order to overcome overfitting, we typically use the validation data to tune the ML model so that it works well on both the training and validation data sets.
7 Model Prediction
This step deals with testing the model that is learnt and validated. It is used for prediction because both classification and regression tasks are predictive tasks. This step employs the test data set earmarked for the purpose. In the real world, the model is used for prediction as new patterns keep coming in. Imagine an ML model built for medical diagnosis. It is like a doctor who predicts and makes a diagnosis when a new patient comes in.
8 Model Explanation
This step is important to explain to an expert or a manager why a decision was taken by the ML model. This will help in obtaining explicit or implicit feedback from the user to further improve the model. Explanation had an important role earlier in expert systems and other AI systems. However, explanation has become very important in the era of DL. This is because DL systems typically employ neural networks that are relatively opaque. So, their functioning cannot be easily explained at a level of detail that can be appreciated by the domain expert/user. Such opaque behaviour has created the need for explainable AI.
SEARCH AND LEARNING
Search is a very basic and fundamental operation in both ML and AI. Search had a special role in conventional AI, where it was successfully used in problem solving, theorem proving, planning and knowledge-based systems.
Further, search plays an important role in several computer science applications. Some of them are as follows:
Exact search is popular in databases for answering queries, in operating systems for operations like grep, and in looking for entries in symbol tables.
In ML, search is important in learning a classification model, a proximity measure for clustering and classification, and the appropriate model for regression.
Inference is search in logic and probability. In linear algebra, matrix factorization is search. In optimization, we use a regularizer to simplify the search in finding a solution. In information theory, we search for purity (low entropy).
So, several activities including optimization, inference and matrix factorization that are essential for ML are all based on search. Learning itself is search.
EXPLANATION OFFERED BY THE MODEL
Conventional AI systems were logic-based or rule-based systems. So, the corresponding reasoning systems naturally exhibited transparency and, as a consequence, explainability. Both forward and backward reasoning were possible. In fact, the same knowledge base, based on experts' input, was used both in diagnosis and in teaching because of this flexibility. Specifically, the knowledge base used by the MYCIN expert system was used in tutoring medical students through another expert system called GUIDON.
However, there were some problems associated with conventional AI systems:
There was no general framework for building AI systems. Acquiring knowledge, using additional heuristics and dealing with exceptions led to adhocism; experience in building one AI system did not simplify the building of another AI system.
Acquiring knowledge was a great challenge. Different experts typically differed in their conclusions, leading to inconsistencies. Conventional logic-based systems found it difficult to deal with such inconsistent knowledge.
There has been a gradual shift from using knowledge to using data in building AI systems. Current-day AI systems, which are mostly based on DL, are by and large data dependent. They can learn representations automatically. They employ variants of multi-layer neural networks and backpropagation algorithms in training models.
Some difficulties associated with DL systems are:
They are data dependent. Their performance improves as the size of the data set increases. So, they need larger data sets. Fortunately, it is not difficult to provide large data sets in most of the current applications.
Learning in DL systems involves a simple change of weights in the neural network to optimize the objective function. This is done with the help of backpropagation, which is a gradient-descent algorithm and which can get stuck with a locally optimal solution. Combining this with large data sets may possibly lead to overfitting. This is typically avoided by using a variety of simplifications in the form of regularizers and other heuristics.
A major difficulty is that DL systems are black box systems and lack explanation capability. This problem is currently attracting the attention of AI researchers.
CHAPTER 2
Nearest Neighbor-Based Models
INTRODUCTION TO PROXIMITY MEASURES
Proximity measures are used to quantify the degree of similarity or dissimilarity between two or more pattern vectors. These pattern vectors can represent documents, images or even entire audio or video files. Proximity measures are often used by machine learning (ML) algorithms to compare and classify or group or make predictions using these patterns. There are many types of proximity measures and the popular ones are:
Euclidean distance: This is the most popular distance as it is intuitively appealing. It measures the straight-line distance between two points in a multi-dimensional space. So, it is also called the 'as the crow flies' distance.
Cosine similarity: This measures the cosine of the angle between two vectors and is often used in text analysis to compute similarity between a pair of documents.
Jaccard similarity: This measures the ratio of the cardinalities of intersection over union of two sets and is often used in recommendation systems to compare users' preferences.
Hamming distance: This measures the number of positions at which two binary strings differ and is often used in error correction codes.
DISTANCE MEASURES
A distance measure is used to find the dissimilarity between patterns represented as vectors. Patterns which are more similar should be closer. The distance function (d) could be a metric or a non-metric. The most popularly used family of distance metrics is called the Minkowski metric.
A metric is a type of measure that possesses three key attributes: positive reflexivity, symmetry and triangular inequality:
Positive reflexivity: d(x, x) = 0
Symmetry: d(x, y) = d(y, x)
Triangular inequality: d(x, y) ≤ d(x, z) + d(z, y), where x, y and z are any three patterns.
The different types of dissimilarity/similarity (proximity) measures between pattern vectors are:
Minkowski Distance
Weighted Distance Measure
Non-Metric Similarity Functions
Levenshtein Distance
Mutual Neighborhood Distance (MND)
Proximity Between Binary Patterns
Non-Metric Similarity Functions:
This category includes similarity functions that do not obey the triangular inequality or symmetry. They are commonly used for image or string data and they are resistant to outliers or extremely noisy data. The squared Euclidean distance is an example of a non-metric, but it provides the same ranking as the Euclidean distance metric. One example of a non-metric similarity function is the k-median distance between two vectors. Given x = (x(1), x(2), …, x(l)) and y = (y(1), y(2), …, y(l)), the k-median distance is

d(x, y) = k-median{|x(1) − y(1)|, …, |x(l) − y(l)|}

where the k-median operator returns the kth value of the ordered difference vector. Another similarity measure is

S(x, y) = xᵀy / (||x|| · ||y||)

which corresponds to the cosine of the angle between the vectors x and y. It is symmetric because cos(θ) = cos(−θ) and it represents the similarity between x and y. A possible distance function based on the cosine similarity S(x, y) is

d(x, y) = 1 − S(x, y)
Note that d(x, y) is symmetric as S(x, y) is symmetric. However, d(x, y) violates the triangular inequality.
EXAMPLE: Let x, y and z be three vectors in a two-dimensional space, where the angle between x and z is 45° and the angle between z and y is 45°, so that the angle between x and y is 90°. Here d(x, y) = 1 − cos(π/2) = 1, while d(x, z) + d(z, y) = (1 − cos(π/4)) + (1 − cos(π/4)) = 2 − √2 ≈ 0.586. So, d(x, z) + d(z, y) < d(x, y), violating the triangle inequality.
Proximity Between Binary Patterns
We can view l-dimensional binary patterns as binary strings of length l. Let p and q be two l-bit binary strings. Some of the popular proximity measures on such binary patterns are:
Hamming Distance (HD): If p(i) = q(i), we say that p and q match on their ith bit; otherwise (p(i) ≠ q(i)), p and q mismatch on the ith bit. The Hamming distance is the number of the l bit positions at which p and q mismatch.
EXAMPLE: Consider the two 10-bit patterns p and q given by
p = 1000001000
q = 0000001001
The Hamming distance is 2 as they mismatch in bit positions 1 and 10.
Simple Matching Coefficient (SMC): Let us define the following:
M01 is the number of bits where p is 0 and q is 1
M10 is the number of bits where p is 1 and q is 0
M00 is the number of bits where p is 0 and q is 0
M11 is the number of bits where p is 1 and q is 1
Now we define SMC as follows:
SMC(p, q) = (M11 + M00) / (M00 + M01 + M10 + M11)
Jaccard Coefficient (JC): It is defined as
JC(p, q) = M11 / (M01 + M10 + M11)
EXAMPLE: Consider again
p = 1000001000
q = 0000001001
Here, M11 = 1, M00 = 7 and M01 = M10 = 1. So,
SMC(p, q) = (1 + 7)/(7 + 1 + 1 + 1) = 8/10 = 0.8
JC(p, q) = 1/(1 + 1 + 1) = 1/3 = 0.33
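A short sketch that computes the Hamming distance, SMC and Jaccard coefficient for the bit strings in the example above (plain Python, no extra libraries assumed):

```python
p = "1000001000"
q = "0000001001"

m11 = sum(a == "1" and b == "1" for a, b in zip(p, q))
m00 = sum(a == "0" and b == "0" for a, b in zip(p, q))
m10 = sum(a == "1" and b == "0" for a, b in zip(p, q))
m01 = sum(a == "0" and b == "1" for a, b in zip(p, q))

hamming = m10 + m01                          # number of mismatching positions
smc = (m11 + m00) / (m00 + m01 + m10 + m11)  # simple matching coefficient
jc = m11 / (m01 + m10 + m11)                 # Jaccard coefficient

print(hamming, smc, round(jc, 2))            # 2 0.8 0.33
```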
DIFFERENT CLASSIFICATION ALGORITHMS BASED ON THE DISTANCE MEASURES
Many classification algorithms (classifiers) inherently depend upon some distance between a pair of patterns.
Nearest Neighbor Classifier (NNC)
K-Nearest Neighbor Classifier
Weighted k-Nearest Neighbour (WkNN)
Tree Based Nearest Neighbours Algorithm
Branch and Bound Method
Leader clustering
EXAMPLE: Let the training set consist of the following two-dimensional patterns with associated labels:
TABLE: Example data set

| Data Point | X | Y | Label |
|---|---|---|---|
| x1 | 0.7 | 0.7 | 1 |
| x2 | 0.8 | 0.8 | 1 |
| x3 | 1.1 | 0.7 | 1 |
| x4 | 0.7 | 1.1 | 1 |
| x5 | 1.1 | 1.1 | 1 |
| x6 | 3.0 | 2.0 | 2 |
| x7 | 3.7 | 2.7 | 2 |
| x8 | 4.1 | 2.7 | 2 |
| x9 | 3.7 | 3.1 | 2 |
| x10 | 4.1 | 3.1 | 2 |
| x11 | 4.3 | 2.7 | 2 |
| x12 | 4.3 | 3.1 | 2 |
| x13 | 3.1 | 0.3 | 3 |
| x14 | 3.1 | 0.6 | 3 |
| x15 | 3.7 | 0.4 | 3 |
| x16 | 3.4 | 0.6 | 3 |
| x17 | 3.9 | 0.9 | 3 |
| x18 | 3.9 | 0.6 | 3 |
K-Nearest Neighbor Classifier
In the k-nearest neighbor (KNN) algorithm, we find the k nearest neighbors of a test pattern x from the training data X and then assign the majority class label among the k neighbors to x. To determine the k nearest neighbors of x, it is necessary to calculate the distance between x and each of the n training patterns in the l-dimensional space. The distance metric used depends on the specific problem at hand, and can be the Euclidean distance, Manhattan distance or cosine distance, among others. The class label of x is then determined based on the majority class label among its k nearest neighbors.
Figure 2: The KNN algorithm applied to the test pattern T = (2.1, 0.7)

| Data Point | X | Y | Label | Distance from T = (2.1, 0.7) |
|---|---|---|---|---|
| x1 | 0.7 | 0.7 | 1 | 1.400 |
| x2 | 0.8 | 0.8 | 1 | 1.304 |
| x3 | 1.1 | 0.7 | 1 | 1.000 |
| x4 | 0.7 | 1.1 | 1 | 1.456 |
| x5 | 1.1 | 1.1 | 1 | 1.077 |
| x6 | 3.0 | 2.0 | 2 | 1.581 |
| x7 | 3.7 | 2.7 | 2 | 2.561 |
| x8 | 4.1 | 2.7 | 2 | 2.828 |
| x9 | 3.7 | 3.1 | 2 | 2.884 |
| x10 | 4.1 | 3.1 | 2 | 3.124 |
| x11 | 4.3 | 2.7 | 2 | 2.973 |
| x12 | 4.3 | 3.1 | 2 | 3.256 |
| x13 | 3.1 | 0.3 | 3 | 1.077 |
| x14 | 3.1 | 0.6 | 3 | 1.005 |
| x15 | 3.7 | 0.4 | 3 | 1.628 |
| x16 | 3.4 | 0.6 | 3 | 1.304 |
| x17 | 3.9 | 0.9 | 3 | 1.811 |
| x18 | 3.9 | 0.6 | 3 | 1.803 |
Assuming that the above figure is a visual representation of the KNN algorithm being applied to a test pattern T, if the value of k is set to 5, the five nearest neighbors of T are x3, x14, x5, x13 and x16 (as the distance table shows). Among these five patterns, the majority class is Class 3. By using this method of selecting the majority class label among the k nearest neighbors, the errors in classification can be reduced, especially when the training patterns are noisy. While the closest pattern to the test pattern may belong to a different class, considering a number of neighbors and taking the majority class label increases the likelihood of correct classification. The above figure visually demonstrates this phenomenon.
In the book's illustration of this situation, the test point T is closest to a point that is an outlier of Class 1; if the KNN algorithm is used, the point T is instead classified as part of Class 2. The selection of k is a crucial aspect of this algorithm. In the case of large data sets, k can be increased to decrease the error. The value of k can be determined through experimentation, by keeping aside a subset of the training data as validation data and classifying patterns from the validation set using different values of k (using the training patterns to compute the neighbors). The value of k can then be selected based on the lowest classification error observed.
EXAMPLE: Based on the above figure, if the pattern T = (2.1, 0.7), its nearest neighbor is x3, and it would be classified as part of Class 1 if the nearest neighbor algorithm were employed. However, if the five nearest neighbors are considered, they are x3 and x5, both belonging to Class 1, and x14, x13 and x16, belonging to Class 3. Following the majority class rule, the pattern T would be classified as part of Class 3.
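A sketch of the KNN classifier applied to the example data set above (assuming NumPy; note that x2 and x16 are tied at distance 1.304, and the result depends on how such ties are broken):

```python
import numpy as np
from collections import Counter

# Training patterns (X, Y) and labels from the example data set
train = np.array([[0.7, 0.7], [0.8, 0.8], [1.1, 0.7], [0.7, 1.1], [1.1, 1.1],
                  [3.0, 2.0], [3.7, 2.7], [4.1, 2.7], [3.7, 3.1], [4.1, 3.1],
                  [4.3, 2.7], [4.3, 3.1], [3.1, 0.3], [3.1, 0.6], [3.7, 0.4],
                  [3.4, 0.6], [3.9, 0.9], [3.9, 0.6]])
labels = np.array([1]*5 + [2]*7 + [3]*6)

def knn_classify(x, k=5):
    distances = np.linalg.norm(train - x, axis=1)
    neighbors = np.argsort(distances)[:k]
    # Majority class label among the k nearest neighbors
    return Counter(labels[neighbors]).most_common(1)[0][0]

# Class 3 (matches the worked example when the ~1.304 tie resolves to x16)
print(knn_classify(np.array([2.1, 0.7]), k=5))
```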
Radius Distance Nearest Neighbor Algorithm
This algorithm is an alternative to the KNN algorithm that considers all the neighbors within a specified distance r of the point of interest. This algorithm can be described as follows:
1. Given a point T, identify the subset of data points that fall within the ball of radius r centred at T, denoted by

Br(T) = { xi ∈ X : ||T − xi|| ≤ r }
2. If Br(T) is empty, output the majority class of the entire data set.
3. If Br(T) is not empty, output the majority class of the data points within Br(T)
This algorithm is useful for identifying outliers, as any pattern that does not have similarity with the patterns within the chosen radius can be identified as an outlier. The choice of the value of radius r is critical as it can affect the performance of the algorithm.
By considering all neighbors within the specified radius, this algorithm can be more effective than the traditional KNN algorithm, especially when the nearest neighbor is too far away to be relevant.
EXAMPLE: For the example shown in Fig. 2, from point T = (2.1, 0.7) the patterns which are located within a radius of 1.45 are x1, x2, x3, x5, x13, x14, and x16. The majority of these patterns belong to Class 1. T is therefore assigned to Class 1.
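A sketch of the radius-based rule for the same data, assuming NumPy; the training patterns are those of the example data set:

```python
import numpy as np
from collections import Counter

train = np.array([[0.7, 0.7], [0.8, 0.8], [1.1, 0.7], [0.7, 1.1], [1.1, 1.1],
                  [3.0, 2.0], [3.7, 2.7], [4.1, 2.7], [3.7, 3.1], [4.1, 3.1],
                  [4.3, 2.7], [4.3, 3.1], [3.1, 0.3], [3.1, 0.6], [3.7, 0.4],
                  [3.4, 0.6], [3.9, 0.9], [3.9, 0.6]])
labels = np.array([1]*5 + [2]*7 + [3]*6)

def radius_classify(x, r):
    distances = np.linalg.norm(train - x, axis=1)
    inside = labels[distances <= r]            # classes of points within radius r
    if inside.size == 0:
        # Empty ball: fall back to the majority class of the whole data set
        return Counter(labels).most_common(1)[0][0]
    return Counter(inside).most_common(1)[0][0]

# T = (2.1, 0.7) with r = 1.45 picks up x1, x2, x3, x5, x13, x14 and x16
print(radius_classify(np.array([2.1, 0.7]), 1.45))   # Class 1
```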
KNN REGRESSION
It is possible to use KNN for regression also. In this case, we are given a set X of n labelled examples, where

X = {(x1, y1), (x2, y2), …, (xn, yn)}

Here xi is a data vector and yi is a scalar, for i = 1, 2, …, n. It is possible for yi also to be a vector in some applications; however, we restrict our attention to scalar ys. The regression model needs to use X to find the value of y for a new vector x. In the case of regression based on KNN, we perform the following:
1. Find the k nearest neighbors of x from the n data vectors. Let them be x1, x2, …, xk.
2. Consider the y values associated with these xi s. Let them be y1, y2, …, yk.
3. Take the average of these ys and declare this average value to be the predicted value of y associated with x. So, the predicted value of y, call it ŷ, is

ŷ = (y1 + y2 + … + yk)/k
We will illustrate it using the following example.
EXAMPLE: Consider the data shown in Table.
TABLE: Example data for KNN regression
| Number (i) | Pattern (xi) | Target yi |
|---|---|---|
| 1 | (0.2, 0.4) | 8 |
| 2 | (0.4, 0.2) | 8 |
| 3 | (0.6, 0.4) | 12 |
| 4 | (0.8, 0.6) | 16 |
| 5 | (1.0, 0.7) | 19 |
| 6 | (0.8, 0.4) | 14 |
| 7 | (0.6, 0.2) | 10 |
| 8 | (0.5, 0.5) | 12 |
| 9 | (0.2, 0.6) | 10 |

TABLE: Distances of the training patterns from x = (0.3, 0.4)

| Number (i) | Pattern (xi) | Target yi | Distance from x = (0.3, 0.4) |
|---|---|---|---|
| 1 | (0.2, 0.4) | 8 | 0.100 |
| 2 | (0.4, 0.2) | 8 | 0.224 |
| 3 | (0.6, 0.4) | 12 | 0.300 |
| 4 | (0.8, 0.6) | 16 | 0.539 |
| 5 | (1.0, 0.7) | 19 | 0.762 |
| 6 | (0.8, 0.4) | 14 | 0.500 |
| 7 | (0.6, 0.2) | 10 | 0.361 |
| 8 | (0.5, 0.5) | 12 | 0.224 |
| 9 | (0.2, 0.6) | 10 | 0.224 |
Let the new pattern be x = (0.3, 0.4). Let us see how to predict the target value for x using KNN regression. Let k = 3; the 3 NNs of x from the patterns in the table are (0.2, 0.4), (0.4, 0.2) and (0.5, 0.5) (the patterns (0.4, 0.2), (0.5, 0.5) and (0.2, 0.6) are all at distance 0.224, so the tie is broken arbitrarily). The corresponding target values in the table are 8, 8 and 12, respectively. The average of these values is (8 + 8 + 12)/3 ≈ 9.33. So, the predicted target value for x = (0.3, 0.4) is 9.33.
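A sketch of KNN regression for this example, assuming NumPy; because three patterns are tied at distance 0.224, the third neighbor chosen by the code may differ from the one used in the worked example:

```python
import numpy as np

patterns = np.array([[0.2, 0.4], [0.4, 0.2], [0.6, 0.4], [0.8, 0.6], [1.0, 0.7],
                     [0.8, 0.4], [0.6, 0.2], [0.5, 0.5], [0.2, 0.6]])
targets = np.array([8, 8, 12, 16, 19, 14, 10, 12, 10], dtype=float)

def knn_regress(x, k=3):
    distances = np.linalg.norm(patterns - x, axis=1)
    neighbors = np.argsort(distances)[:k]        # k nearest training patterns
    return targets[neighbors].mean()             # average of their targets

pred = knn_regress(np.array([0.3, 0.4]), k=3)
# The worked example breaks the 0.224 tie as patterns 1, 2 and 8: (8 + 8 + 12)/3 = 9.33;
# argsort may pick a different tied neighbor, giving a slightly different average.
print(pred)
```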
PERFORMANCE MEASURES
There are several measures used to evaluate the performance of ML models.
Performance of Classifiers
Classification Accuracy: Let n be the total number of patterns that are classified by a classification algorithm. Let nc be the number of correctly classified patterns. Then the classification accuracy is
Classification accuracy = nc/n
Confusion Matrix: It captures the results in a compact form. This data can be compacted further to better analyse the results. We illustrate this with an example.
EXAMPLE: Let there be three classes C1, C2 and C3. Let there be 200, 100 and 50 patterns from classes C1, C2 and C3, respectively. Let a classifier classify these patterns into the three classes, as shown in the table below.
TABLE: Confusion matrix for a three-class problem
| True / Predicted | C1 | C2 | C3 |
|---|---|---|---|
| C1 | 180 | 15 | 5 |
| C2 | 5 | 85 | 10 |
| C3 | 3 | 2 | 45 |
The first row of the table shows how the 200 patterns from C1 are classified: 180 of them are assigned to C1, 15 are assigned to C2 and 5 are assigned to C3. Similarly, row 2 accounts for the 100 patterns from C2 and row 3 shows the assignment of the 50 patterns from C3.
Note that classification accuracy is obtained by considering the correctly classified patterns. Note that 180 patterns from C1, 85 patterns from C2 and 45 patterns from C3 are correctly classified, as indicated by the diagonal entries in the matrix. They add up to 310. So, classification accuracy is 310/350 ≈0.8857. The percentage accuracy is obtained by multiplying this number by 100. So, the accuracy is 88.57%.
It is possible to compact the confusion matrix by looking at the entries with respect to C1. The compact table of size 2 × 2 is shown below.
TABLE: Compact confusion matrix for C1
| True / Predicted | C1 | Not C1 |
|---|---|---|
| C1 | True Positive (TP) 180 | False Negative (FN) 20 |
| Not C1 | False Positive (FP) 8 | True Negative (TN) 142 |
This compact table gives us useful information that could be used in calculating other evaluation measures. Now we are concerned with C1 and not C1 (the classes other than C1). The first row of this table shows that 180 patterns of C1 are correctly classified. This entry corresponds to the number of C1 patterns correctly assigned to class C1. So, these are called true positives or TP; here TP = 180.
The second column in the first row has 20 patterns. These are patterns from C1 that are not classified as belonging to C1; they are assigned to not C1. So, they are called false negatives or FN. Here, FN = 20.
The first entry in the second row shows that 8 patterns from the other classes are assigned to class C1. So, they are called false positives or FP; here FP = 8. The second entry in the second row corresponds to the 142 patterns from not C1 that are assigned to not C1. These are called true negatives or TN. So, TN = 142.
We can use the confusion matrix entries, specifically TP, TN, FP and FN, to define additional evaluation measures. They are:
Precision: It is the ratio of TP to the total number of predicted positives (TP + FP). So, precision in this example is 180/188 ≈ 0.9574.
Recall: It is the ratio of TP to the total number of positives in the training data (TP + FN). So, in this example, it is 180/200 = 0.9.
F1 Score: It is the harmonic mean of precision and recall. More specifically, it is

F1 = (2 × Precision × Recall) / (Precision + Recall)

In this example, F1 Score ≈ 0.9278.
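These measures can be computed directly from the compact confusion matrix entries; a minimal sketch (plain Python) using the numbers for C1:

```python
TP, FN, FP, TN = 180, 20, 8, 142   # compact confusion matrix for C1

# Accuracy of the C1-vs-rest view (the overall 3-class accuracy is 310/350)
accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1        = 2 * precision * recall / (precision + recall)

print(round(precision, 4), round(recall, 4), round(f1, 4))
# 0.9574 0.9 0.9278
```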
We can have one more popular measure, called the area under the ROC (receiver operating characteristic) curve, based on some additional quantities derived from the table:

True positive rate or TPR is given by

TPR = TP / (TP + FN)

False positive rate or FPR is given by

FPR = FP / (FP + TN)

The area under the curve (AUC) is obtained by plotting a graph with FPR on the X-axis and TPR on the Y-axis. This plot is called the receiver operating characteristic (ROC) curve.
EXAMPLE: Consider the entries in the compact table above. The values are TP = 180, FP = 8, FN = 20 and TN = 142. So, TPR = 180/(180 + 20) = 0.9 and FPR = 8/(8 + 142) ≈ 0.053. So, (0.053, 0.9) gives us one point for the ROC plot.
We obtain multiple points by computing probabilities and thresholding them as explained next. Let there be two classes with labels 1 and 0. Let the training patterns from each class be given by
Class label 1: (1,1), (2,2), (3,3), (4,4), (5,5)
Class label 0: (4,1), (5,2), (6,3), (7,4), (8,5)
Let the validation patterns be given by
Class label 1: (2,1), (1,3)
Class label 0: (8,4), (7,5)
If we find the k = 5 nearest neighbors of each validation pattern from the training set, we obtain the details shown in the table below. Note that the table has 7 columns. The first column corresponds to the validation pattern considered. There are 4 rows in the table, corresponding to the 4 validation patterns. The second column gives the class label of these validation patterns. Note that the first two are from Class 1 and the remaining two are from Class 0.
TABLE: Computing probabilities and thresholding
| Pattern | Class | 5 NNs | Probability | Th > 0.3 | Th > 0.5 | Th > 0.7 |
|---|---|---|---|---|---|---|
| (2,1) | 1 | (1,1), (2,2), (4,1), (3,3), (5,2) | 0.6 | 1 | 1 | 0 |
| (1,3) | 1 | (2,2), (1,1), (3,3), (4,4), (4,1) | 0.8 | 1 | 1 | 1 |
| (8,4) | 0 | (8,5), (7,4), (6,3), (5,5), (5,2) | 0.2 | 0 | 0 | 0 |
| (7,5) | 0 | (8,5), (7,4), (5,5), (6,3), (4,4) | 0.4 | 1 | 0 | 0 |
The third column lists the (k =) 5 NNs, out of the 10 training patterns, of the validation pattern given in column 1. The fourth column gives the probability that the validation pattern belongs to Class 1. This probability is the ratio of the number of neighbors (out of 5) from Class 1 to the total number of NNs considered, that is, 5.
For example, for the validation pattern (2,1), three of its neighbors, namely (1,1), (2,2) and (3,3), are from Class 1. So, the probability is 3/5 = 0.6. Similarly, probabilities are calculated for the remaining three validation patterns.
In columns 5 to 7, we use thresholds (Th) to convert the probability into 1 or 0. For example, in column 5, if the probability is more than (Th =) 0.3, we put a 1; otherwise, a 0. So we get 1, 0 and 1 in rows 2, 3 and 4 based on the Th value of 0.3. Entries in columns 6 and 7 are obtained as in column 5 but with thresholds 0.5 and 0.7, respectively.
For each of the thresholdings, we have a possible assignment of the four validation patterns. By comparing with the true class label given in column 2 of the table, we get an (FPR, TPR) pair as follows:
For Th > 0.3, we get TP = 2, FN = 0, FP = 1, TN = 1. So, FPR = 1/(1 + 1) = 0.5 and TPR = 2/(2 + 0) = 1.
For Th > 0.5, we get TP = 2, FN = 0, FP = 0, TN = 2. So, FPR = 0/(0 + 2) = 0 and TPR = 2/(2 + 0) = 1.
For Th > 0.7, we get TP = 1, FN = 1, FP = 0, TN = 2. So, FPR = 0/(0 + 2) = 0 and TPR = 1/(1 + 1) = 0.5.
So, by thresholding, we obtain different (FPR, TPR) pairs that are used in plotting the ROC curve. The standard rule is to convert the class decisions into probabilities and then threshold them to get FPR and TPR pairs.
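The thresholding procedure above can be sketched as follows (plain Python); the probabilities and true labels are those of the four validation patterns:

```python
true_labels = [1, 1, 0, 0]            # validation patterns (2,1), (1,3), (8,4), (7,5)
probabilities = [0.6, 0.8, 0.2, 0.4]  # estimated probability of Class 1 from the 5 NNs

for th in (0.3, 0.5, 0.7):
    predictions = [1 if p > th else 0 for p in probabilities]
    TP = sum(p == 1 and t == 1 for p, t in zip(predictions, true_labels))
    FN = sum(p == 0 and t == 1 for p, t in zip(predictions, true_labels))
    FP = sum(p == 1 and t == 0 for p, t in zip(predictions, true_labels))
    TN = sum(p == 0 and t == 0 for p, t in zip(predictions, true_labels))
    fpr = FP / (FP + TN)
    tpr = TP / (TP + FN)
    print(th, (round(fpr, 2), round(tpr, 2)))
# 0.3 (0.5, 1.0)
# 0.5 (0.0, 1.0)
# 0.7 (0.0, 0.5)
```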
Note that schemes for computing probabilities can differ when other classifiers are used. However, a similar thresholding scheme can be used to obtain FPR and TPR pairs once the probabilities are available.
Performance of the Regression Algorithms
Mean squared error (MSE): It is the most popular metric used for regression. It is defined by
MSE = (1/n) Σ (yi − ŷi)²

where n is the number of patterns, yi is the target value for the ith pattern, ŷi is the value predicted by the regression model and the sum runs over i = 1 to n.
Mean absolute error (MAE): It is the average of the absolute difference between the target and the predicted values. It is given by

MAE = (1/n) Σ |yi − ŷi|
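A short sketch of both regression metrics, assuming NumPy; the target and predicted values below are illustrative:

```python
import numpy as np

y_true = np.array([8.0, 12.0, 16.0, 10.0])   # illustrative target values
y_pred = np.array([9.0, 11.5, 15.0, 10.5])   # illustrative model predictions

mse = np.mean((y_true - y_pred) ** 2)        # mean squared error
mae = np.mean(np.abs(y_true - y_pred))       # mean absolute error
print(mse, mae)                              # 0.625 0.75
```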