Chapter 3:

Chapter 4:

Chapter 5:

NOTE: For the missing topics and more information, refer to the TEXT BOOK and CLASS NOTES

Decision Tree Algorithm as a Classifier

A Decision Tree is a supervised machine learning algorithm used for both classification and regression tasks. It works by splitting the dataset into branches based on decision rules derived from feature values. The process continues until a stopping criterion is met, resulting in a tree structure where each leaf node represents a class label.

How It Works

  1. Root Node: This is the starting point, representing the entire dataset.

  2. Splitting: The algorithm selects the best feature and threshold to split the data, based on a criterion (e.g., Gini Index or Information Gain).

  3. Decision Nodes: Intermediate points where the dataset is further divided.

  4. Leaf Nodes: Terminal points that represent class labels.
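In practice such a classifier is rarely coded from scratch; a minimal sketch using scikit-learn's DecisionTreeClassifier (assuming scikit-learn is installed; the tiny label-encoded dataset below is purely illustrative):

from sklearn.tree import DecisionTreeClassifier

# Toy play-tennis style data, with categorical features label-encoded as integers:
# Weather (0=Sunny, 1=Overcast, 2=Rain), Temperature (0=Hot, 1=Mild, 2=Cool),
# Humidity (0=High, 1=Normal), Wind (0=Weak, 1=Strong)
X = [[0, 0, 0, 0],
     [0, 0, 0, 1],
     [1, 0, 0, 0],
     [2, 1, 0, 0]]
y = ["No", "No", "Yes", "Yes"]

clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(X, y)
print(clf.predict([[1, 2, 1, 1]]))   # e.g. Overcast, Cool, Normal, Strong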

Advantages

Disadvantages

Real-World Application

A Decision Tree Classifier can use Entropy and Information Gain to determine the best feature for splitting the data at each step.

Entropy Formula

Entropy measures the amount of disorder or impurity in the dataset. It is calculated as:

Entropy(S) = - Σ pi x log2(pi), summed over all classes i

Where pi is the proportion of samples in S that belong to class i.

Entropy ranges from 0 (a pure node, all samples from one class) up to log2(k) for k classes; for a two-class problem the maximum is 1 (classes equally mixed).

Information Gain Formula

Information gain measures the reduction in entropy after a split on feature A:

Gain(S, A) = Entropy(S) - Σ (|Sv| / |S|) x Entropy(Sv), summed over the values v of A

Where Sv is the subset of S for which feature A takes value v, and |Sv| / |S| is its weight.

Example Dataset

Weather | Temperature | Humidity | Wind | Play Tennis
Sunny | Hot | High | Weak | No
Sunny | Hot | High | Strong | No
Overcast | Hot | High | Weak | Yes
Rain | Mild | High | Weak | Yes
Rain | Cool | Normal | Weak | Yes
Rain | Cool | Normal | Strong | No
Overcast | Cool | Normal | Strong | Yes
Sunny | Mild | High | Weak | No

Step 1: Calculate Entropy of the Root Node

We have 8 samples: 4 labelled "Yes" and 4 labelled "No".

Entropy(S) = -(4/8) x log2 (4/8) – (4/8) x log2 (4/8)

= 1

Step 2: Entropy After Splitting on "Weather"

Subset 1 (Sunny): 3 samples, all "No"

Weather | Play Tennis
Sunny | No
Sunny | No
Sunny | No

Entropy(Sunny) = 0 (pure subset)

Subset 2 (Overcast): 2 samples, all "Yes"

Weather | Play Tennis
Overcast | Yes
Overcast | Yes

Entropy(Overcast) = 0 (pure subset)

Subset 3 (Rain): 2 "Yes", 1 "No"

Weather | Play Tennis
Rain | Yes
Rain | Yes
Rain | No

Entropy(Rain) = -(2/3) x log2 (2/3) – (1/3) x log2 (1/3)

= 0.389 + 0.528

= 0.917

Step 3: Weighted Entropy for the "Weather" Split

Entropy(Weather) = (3/8) x 0 + (2/8) x 0 + (3/8) x 0.917 = 0.344

Step 4: Information Gain for "Weather" Split

Gain(S, Weather) = 1 – 0.344

= 0.656

The Information Gain for splitting on "Weather" is 0.656. You would repeat this process for other features (Temperature, Humidity and Wind) and choose the feature with the highest information gain to be the root of the decision tree.
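These entropy and information gain computations can be reproduced with a few lines of plain Python; the sketch below (no external libraries) uses the 8-sample dataset from the table above:

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, feature_index, labels):
    # Entropy of the whole set minus the weighted entropy after the split
    n = len(labels)
    weighted = 0.0
    for value in set(r[feature_index] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[feature_index] == value]
        weighted += (len(subset) / n) * entropy(subset)
    return entropy(labels) - weighted

rows = [("Sunny", "Hot", "High", "Weak"), ("Sunny", "Hot", "High", "Strong"),
        ("Overcast", "Hot", "High", "Weak"), ("Rain", "Mild", "High", "Weak"),
        ("Rain", "Cool", "Normal", "Weak"), ("Rain", "Cool", "Normal", "Strong"),
        ("Overcast", "Cool", "Normal", "Strong"), ("Sunny", "Mild", "High", "Weak")]
labels = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No"]

for i, name in enumerate(["Weather", "Temperature", "Humidity", "Wind"]):
    print(name, round(information_gain(rows, i, labels), 3))
# Weather comes out highest (about 0.656), matching the hand computation.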

Step 5: Entropy and Information Gain for Other Features

(1) "Temperature" Feature

Temperature | Play Tennis
Hot | No, No, Yes
Mild | Yes, No
Cool | Yes, Yes, No

Entropy of Hot Subset

Entropy(Hot) = -(1/3) x log2 (1/3) – (2/3) x log2 (2/3) = 0.917

Entropy of Mild Subset

Entropy(Mild) = -(1/2) x log2 (1/2) – (1/2) x log2 (1/2) = 1

Entropy of Cool Subset

Entropy(Cool) = -(2/3) x log2 (2/3) – (1/3) x log2 (1/3) = 0.917

Weighted Entropy for "Temperature"

Entropy(Temperature) = (3/8) x 0.917 + (2/8) x 1 + (3/8) x 0.917 = 0.938

Information Gain for "Temperature"

= 1 – 0.938

= 0.062

(2) "Humidity" Feature

Humidity | Play Tennis
High | No, No, No, Yes, Yes
Normal | Yes, Yes, No

Entropy of High Subset

Entropy High = -(2/5) x log2 (2/5) – (3/5) x log2 (3/5)

                        = 0.4 x 1.322 + 0.6 x 0.737

                        = 0.971

Entropy of Normal Subset

EntropyNormal = -(2/3) x log2 (2/3) – (1/3) x log2 (1/3)

                        = 0.666 x 0.5858 + 0.333 x 1.585

                        = 0.91826

Weighted Entropy for "Humidity"

EntropyHumidity = (5/8) x 0.971 + (3/8) x 0.918

                        = 0.9508

Information Gain for "Humidity"

            = 1 – 0.9508

            = 0.0491

(3) "Wind" Feature

Wind | Play Tennis
Weak | No, Yes, Yes, Yes, No
Strong | No, Yes, No

Entropy of Weak Subset

Entropy Weak = -(3/5) x log2 (3/5) – (2/5) x log2 (2/5)

                        = 0.6 x 0.737 + 0.4 x 1.322

                        = 0.971

Entropy of Strong Subset

Entropy Strong = -(1/3) x log2 (1/3) – (2/3) x log2 (2/3)

                        = 0.528 + 0.389

                        = 0.917

Weighted Entropy for "Wind"

Entropy Wind = (5/8) x 0.971 + (3/8) x 0.917

                        = 0.951

Information Gain for "Wind"

                        = 1 – 0.951

                        = 0.0491

Step 6: Choosing the Best Split

Feature | Information Gain
Weather | 0.656
Temperature | 0.062
Humidity | 0.0491
Wind | 0.0491

Since "Weather" has the highest Information Gain (0.656), we choose it as the root node.

Final Decision Tree (Partial)

                 Weather
               /    |    \
          Sunny  Overcast  Rain
           (No)    (Yes)   Further split

Decision Tree Properties:

1. Hierarchical Structure

This hierarchical representation allows decision trees to efficiently handle complex decision-making processes.

2. Recursive Partitioning

For instance, in a loan approval model, the first split might be based on credit score, followed by splits on income level and employment status.

3. Splitting Criteria

Gini Impurity:

Gini = 1 - Σ pi^2, summed over all classes i

Where pi is the probability of class i. Lower Gini values indicate purer splits.

Entropy (Information Gain):

Information Gain measures the reduction in entropy after a split.

Mean Squared Error (MSE) for Regression: MSE = (1/n) x Σ (yi - ŷ)^2, where ŷ is the mean target value in the node.

These criteria guide the tree to create splits that improve prediction accuracy.

4. Non-parametric and Non-linear

5. Interpretability

The ability to visualize decision trees makes them particularly appealing in applications where model transparency is essential.

6. Prone to Overfitting

7. Feature Importance

8. Robustness to Outliers

9. Handle Mixed Data Types

10. Computational Complexity

Decision Tree as a Regressor

A Decision Tree Regressor is a variant of the Decision Tree algorithm used for predicting continuous, numerical target values. Instead of classifying data into categories, it aims to fit the data by dividing it into regions with constant target values.

How It Works

  1. Recursive Partitioning:

    • The decision tree splits the feature space into regions based on threshold values that minimize prediction error.
    • The objective is to find splits that reduce variance within each region.
  2. Prediction:

    • For a given input, traverse the tree from the root to a leaf node.
    • The value at the leaf node (usually the mean of the target values in that node) is the predicted output.

Splitting Criterion for Regression

Instead of using metrics like Gini impurity or entropy, regression trees rely on variance reduction: each split is chosen to minimize the weighted variance (equivalently, the mean squared error) of the target values in the resulting child nodes.

Example Process

Imagine predicting house prices based on features like the number of rooms and lot size:

  1. The root node contains all the training data.
  2. The first split might partition houses based on lot size (e.g., <5000 sqft and ≥5000 sqft).
  3. Subsequent splits might be based on the number of rooms or other features.
  4. At the leaf nodes, the average price of houses in each region becomes the predicted value.

Example: Predicting House Prices Based on Square Footage

We have a small dataset where the square footage of houses (X) and their corresponding prices (y) are known. We want to predict the price of a house given its square footage.

Square Footage (X) | House Price (y) ($)
1500 | 200,000
1700 | 220,000
2500 | 350,000
2700 | 370,000
3000 | 400,000

Step-by-Step Process

  1. Splitting the Data:
    The decision tree algorithm splits the feature space (square footage) to minimize the variance in each region.

    First Split:
    Let's say the algorithm decides the best first split is at 2500 sq ft. This creates two regions:

    • Region 1: Houses with square footage ≤ 2500 (mean price ≈ $240,000)
    • Region 2: Houses with square footage > 2500 (mean price ≈ $390,000)
  2. Further Splitting:
    The algorithm continues splitting until a stopping criterion is met (like minimum samples per node or maximum depth).

    After further splits, we might have:

    • Region 1a (1500 to 1700 sq ft): Mean price = $210,000
    • Region 1b (1700 to 2500 sq ft): Mean price = $285,000
    • Region 2a (2500 to 2700 sq ft): Mean price = $360,000
    • Region 2b (2700 to 3000 sq ft): Mean price = $400,000

Visualization

The splits can be visualized in a decision tree:

              [All Data]
              /        \
       X ≤ 2500        X > 2500
        /     \          /     \
  X ≤ 1700  X > 1700  X ≤ 2700  X > 2700

Prediction Example

Suppose we want to predict the price for a house with 2600 sq ft. It falls into Region 2a (2500 to 2700 sq ft), so the tree predicts the mean price of that region, $360,000. A code sketch follows below.
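A regression tree for this example can be fit with scikit-learn; a small sketch (assuming scikit-learn is installed, and noting that with only five points the thresholds the library picks may differ slightly from the hand-chosen ones above):

from sklearn.tree import DecisionTreeRegressor

X = [[1500], [1700], [2500], [2700], [3000]]        # square footage
y = [200_000, 220_000, 350_000, 370_000, 400_000]   # price

reg = DecisionTreeRegressor(max_depth=2, random_state=0)
reg.fit(X, y)
print(reg.predict([[2600]]))   # prediction for a 2600 sq ft house (a leaf mean)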

Key Observations

Advantages of Decision Tree Regression

  1. Non-linear Relationships: Can model complex, non-linear relationships between features and the target variable.
  2. No Assumptions About Data Distribution: Works without needing data transformations (like scaling).
  3. Interpretability: The decision rules can be visualized and easily interpreted.
  4. Robustness to Outliers: Splits are based on conditions, not averages, making it resilient to outliers.

Disadvantages

  1. Overfitting: The tree can become too complex and fit the training data perfectly but fail to generalize.
    • Solution: Use pruning techniques or limit the maximum depth.
  2. Instability: Small changes in data can lead to vastly different trees.
  3. Piecewise Constant Predictions: The model creates flat regions for predictions, which may not capture smooth relationships.

Random Forest

A Random Forest is a powerful machine learning algorithm that builds a collection (or "forest") of decision trees during training time. For both classification and regression tasks, it makes predictions by aggregating the outputs of multiple trees.

How Does Random Forest Work?

  1. Bootstrapping (Bagging Technique):

    • Random Forest uses bootstrap sampling, where random subsets of data points are selected with replacement to train each decision tree.
    • This ensures that trees are trained on slightly different datasets, making the model robust and less prone to overfitting.
  2. Feature Randomness:

    • At each split within a tree, only a random subset of features is considered for the best split instead of all features.
    • This decorrelates the trees and reduces the likelihood of all trees converging to the same solution.
  3. Aggregation:

    • For Classification: The final prediction is made based on majority voting (the most common class across all trees).
    • For Regression: The final output is the average of all the individual tree predictions.

Why Use Random Forest?

Detailed Example: Predicting House Prices

Suppose we want to predict house prices based on square footage and the number of bedrooms.

Square Footage | Bedrooms | Price ($)
1500 | 3 | 200,000
1700 | 4 | 240,000
2500 | 4 | 350,000
2700 | 5 | 400,000
3000 | 4 | 500,000

Step-by-Step Process

  1. Data Bootstrapping:

    • Assume we train 3 decision trees in the forest. Each tree receives a bootstrapped sample of the dataset:
      • Tree 1 may receive rows [1500, 3, 200,000], [2700, 5, 400,000], [3000, 4, 500,000]
      • Tree 2 may receive rows [1700, 4, 240,000], [2500, 4, 350,000], [1500, 3, 200,000]
      • Each tree builds independently based on different samples.
  2. Feature Randomness:

    • At each split, each tree might consider only square footage or bedrooms, not both.
    • For instance, Tree 1 might split first on Square Footage > 2500 while Tree 2 splits on Bedrooms > 3.
  3. Prediction Aggregation:
    Suppose we want to predict the price for a 2600 sq ft house with 4 bedrooms.

    • Tree 1 predicts $360,000
    • Tree 2 predicts $375,000
    • Tree 3 predicts $370,000
    • Final Prediction (Average): (360,000 + 375,000 + 370,000)/3 ≈ $368,333, as in the sketch below.
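A minimal sketch of the same workflow with scikit-learn's RandomForestRegressor (assuming scikit-learn is available; three trees mirror the illustration above, while real forests typically use hundreds):

from sklearn.ensemble import RandomForestRegressor

X = [[1500, 3], [1700, 4], [2500, 4], [2700, 5], [3000, 4]]   # sq ft, bedrooms
y = [200_000, 240_000, 350_000, 400_000, 500_000]             # price

forest = RandomForestRegressor(n_estimators=3, bootstrap=True, random_state=0)
forest.fit(X, y)

print(forest.predict([[2600, 4]]))    # average of the individual tree predictions
print(forest.feature_importances_)    # relative contribution of each feature

The feature_importances_ attribute is what a feature importance plot would be built from.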

Key Observations

  1. Final Prediction:
    The predicted prices for test houses are $210,500 and $390,000.

  2. Feature Importance:
    The plot shows whether Square Footage or Bedrooms contributed more to the decision-making process.

Advantages of Random Forest

Disadvantages

Introduction to The Bayes Classifier

The Bayes classifier is an optimal classifier. It operates based on the probability structure associated with the domain of application. It employs Bayes' rule to convert the a priori probability of a class into posterior probability with the help of probability distributions associated with the class. These posterior probabilities are used to classify the test patterns; the pattern is assigned to the class that has the largest posterior probability. 

Some of the important properties of Bayesian classifiers are as follows:

Probability, Conditional Probability and Bayes' Rule:

Let us start with a simple scenario where domain knowledge is used in the form of prior probabilities to classify patterns.

Example: Let us assume that 10,000 people in a community had undergone the COVID-19 test and 50 of them tested positive, while the remaining 9950 tested negative. In this two-class (binary class) problem, a simple frequency estimate will give us the following values for the probabilities for the two classes: 

P(positive)= 50/10000 = 0.005 and P(negative) = 9950/10000 = 0.995

These probabilities are called prior probabilities as they are obtained using domain knowledge.

Suppose a new person from the community, who has not undergone the test, needs to be classified. In the absence of any other information about the person, one would use the prior probabilities in decision making; so the new person is assigned a COVID-19-negative label as the probability P(negative) is significantly larger than P(positive) (0.995 >> 0.005). So, invariably, every other person in the community who has not undergone the test will be classified as COVID-19-negative using this rule of classification, which is influenced by the larger prior probability of being COVID-19-negative. Such a classification is erroneous if the new person is actually COVID-19-positive. This can occur with a probability of 0.005 (P(positive)); so, the probability of error is 0.005. Note that every person who is actually COVID-19-negative is correctly classified in this process.

In the KNN classifier, if the value of k = n, where n is the size of the training data set, and if we have k1 (out of k) from the COVID-19-positive class and k2 (= k - k1) from the COVID-19-negative class, then the probability estimates are

P(positive)= k1/n and P(negative) = k2/n

It is not difficult to see that KNN gives the same result as classification based on prior probabilities.

In a random experiment, we have a sample space, S, that is, the set of all outcomes. For example, tossing a coin gives us {H, T} as the sample space, where H stands for head and T stands for tail.

An event is a subset of the sample space. We associate probabilities with events. If A is an event, then its probability P(A) ∈ [0, 1]; that is, probability is non-negative and is upper bounded by (less than or equal to) 1.

If A and B are disjoint events (A ∩ B = ∅), where A ∩ B is the intersection of the sets A and B and ∅ is the null or empty set, then P(A ∪ B) = P(A) + P(B), where A ∪ B is the union of the sets A and B. This property holds for a countable union of events if they are pairwise disjoint.

Example: If a coin is tossed twice, the sample space is {HH, HT, TH, TT}. If A = {HH, HT} and B = {TT}, then A ∩ B = ∅. Further, P(A) = 2/4 as A has 2 out of the 4 outcomes, and P(B) = 1/4. Note that A ∪ B = {HH, HT, TT}, so

P(A ∪ B) = 3/4 = P(A) + P(B) = 2/4 + 1/4

However, if A = {HH, HT} and C = {HT, TT}, then A ∩ C = {HT} ≠ ∅. So, P(A ∪ C) = 3/4 whereas P(A) = 2/4 and P(C) = 2/4. So, here P(A ∪ C) ≠ P(A) + P(C). In general, it is possible to show that

P(A ∪ B) = P(A) + P(B) - P(A ∩ B)

If an event is described as at least one tail, then the event is {HT,TH,TT} and the probability of the event is 3/4 as out of 4 elements of the sample space, 3 elements are favourable to the event.

Given an event A, its complement Ac is given by the set difference S – A. The figure below shows regions corresponding to various related events, given two events A and B. Note that A ∩ Bc is the intersection of events A and Bc. The region A ∩ B is the intersection of events A and B. The event Ac ∩ B corresponds to the intersection of the events Ac and B. Finally, the remaining region indicates the intersection of the events Ac and Bc.

FIG.  Some events related to two given events A and B

Conditional Probability

We need to update the probability values if new information is given. Consider the following example.

Example: Consider tossing a coin twice. We have seen earlier that the probability of at least one tail is 3/4. The corresponding event is {HT, TH, TT}. If we are given additional information that one of the tosses has resulted in a head, then the sample space is constrained to {HT, HH, TH}. So, under the condition that one toss has resulted in a head, the sample space shrinks. Now, for at least one tail, the event in the new sample space is {HT, TH}. So, the probability has reduced from 3/4 to 2/3 for the event in the presence of more information. This computation is captured by using the notion of conditional probability.

If A and B are two events such that P(B) ≠ 0, then

P(A|B) = P(A ∩ B) / P(B)

It is the probability of A conditioned on B. It is undefined if P(B) = 0. In the current example, if A is the event specified by at least one tail and B is the event specified by one toss that has resulted in a head, then B = {HH, HT, TH}; so, P(B) = 3/4. Note that A ∩ B = {TH, HT}. So,

P(A|B)= (2/4)/(3/4) = 2/3

as computed earlier in this example.

We have another important concept called independent events. We say that two events A and B are independent if P(A ∩ B)= P(A) * P(B) . So, if A and B are independent, then 

P(A|B)= P(A ∩ B)/P(B) = (P(A) * P(B))/(P(B)) = P(A)

Total Probability

FIG. An example to illustrate total probability

Consider the Venn diagram shown in the above figure. Here, event A is represented by the elliptical region and there are 4 events C1, C2, C3 and C4 that overlap with A. We can represent A using these overlapping sets by

A = (A ∩ C1) ∪ (A ∩ C2) ∪ (A ∩ C3) ∪ (A ∩ C4)

The four overlapping sets in the union are disjoint as C1, C2, C3 and C4 are disjoint. Let 

Bi =A ∩ Ci for i = 1,2,3,4 

So, P(Bi) =P(A|Ci )P(Ci ) 

So, P(A) = P(B1) + P(B2) + P(B3) + P(B4) 

Hence, 

P(A) =P(A|C1 )P(C1 )+P(A|C2 )P(C2 )+P(A|C3 )P(C3 )+P(A|C4 )P(C4 ) 

We know that P(Ci |A)= P(A|Ci ) P(Ci)/P(A) . 

So, in general 

P(Ci|A)= [P(A|Ci)P(Ci)] / [P(A|C1)P(C1) + P(A|C2)P(C2 ) + P(A|C3)P(C3) +P(A|C4)P(C4)]

Bayes' Rule and Inferences

Bayes' theorem provides a way to update our beliefs (probabilities) about events based on new evidence. It is a fundamental concept in probability and statistics, widely used in machine learning, decision theory, and other fields.

The mathematical formula for Bayes' theorem is:

P(A|B) = P(B|A) x P(A) / P(B)

Where P(A|B) is the posterior probability of A given B, P(B|A) is the likelihood, P(A) is the prior probability of A, and P(B) is the evidence (a normalizing constant).

EXAMPLE: Let C1 and C2 be two chests such that C1 has 20 white balls (WB) and 10 red balls (RB); C2 has 15 WBs and 15 RBs.

If one of the two chests is picked with equal probability and a ball randomly picked from the chest is WB, what is the probability that it came from C1? We have the following information: 

Prior: P(C1) = P(C2) = 1/2

Given: P(WB | C1) = 2/3, P(RB | C1) = 1/3 and P(WB | C2) = P(RB | C2) = 1/2

Needed: P(C1 | WB) = [P(WB | C1) P(C1)] / [P(WB | C1) P(C1) + P(WB | C2) P(C2)]

= ((2/3)(1/2)) / ((2/3)(1/2) + (1/2)(1/2)) = 4/7
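The same posterior can be checked numerically; a short Python sketch of the computation:

# Chest example: posterior probability that a white ball came from C1
p_c1, p_c2 = 0.5, 0.5            # priors: each chest picked with equal probability
p_wb_c1 = 20 / 30                # C1 has 20 white balls out of 30
p_wb_c2 = 15 / 30                # C2 has 15 white balls out of 30

evidence = p_wb_c1 * p_c1 + p_wb_c2 * p_c2     # P(WB) via total probability
print((p_wb_c1 * p_c1) / evidence)             # 0.571... = 4/7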

The Bayes Classifier And Its Optimality

Suppose there are two classes: C1(Normal) and C2(Abnormal) 

Let P(C1) = 0.9 and P(C2) = 0.1 

Given a new pattern X, we compute the posterior using Bayes' rule:

P(C1 | X) = P(X | C1) P(C1) / P(X)

Similarly,

P(C2 | X) = P(X | C2) P(C2) / P(X)

Bayes' rule helps us in converting priors to posteriors.

Note that P(X) = P(X | C1) P(C1) + P(X | C2) P(C2) is the evidence for X and is also a normalizer to ensure that P(C1 | X) + P(C2 | X) = 1.

Now the responsibility shifts from the priors to the posteriors after observing X: assign X to C1 if P(C1 | X) > P(C2 | X).

Example:

Suppose that for an observed pattern X = 2, the class conditional densities take the values P(X | C1) = 0.5 and P(X | C2) = 0.2.

These are also called class conditional densities and are uniform in this example.

Let the prior probabilities be equal, that is P(C1) = P(C2) = 0.5

Then P(C1 | X) = (0.5 x 0.5)/(0.5 x 0.5 + 0.2 x 0.5) ≈ 0.7 and P(C2 | X) = (0.2 x 0.5)/(0.5 x 0.5 + 0.2 x 0.5) ≈ 0.3.

So, the pattern X (= 2) is assigned to class C1 as P(C1 | X) > P(C2 | X).

Multi-Class Classification

We have discussed the use of the Bayes classifier in the two-class case. It can be easily used to deal with multi-class cases, that is, when the number of classes is more than 2. It may be described as follows:

Let the classes be C1, C2, . . ., Cq, where q >= 2

Let the prior probabilities be P(C1), P(C2), . . ., P(Cq)

Let x be the test pattern to be classified as belonging to one of these q classes.

Compute the posterior probabilities using Bayes' rule 

P(Ci | x) = P(x | Ci) P(Ci) / [P(x | C1) P(C1) + P(x | C2) P(C2) + . . . + P(x | Cq) P(Cq)], for i = 1, 2, . . ., q

Assign the test pattern x to class Ct if P(Ct | x) >= P(Ci |x) for i = 1, 2 , . . ., q

In the case of a tie (two or more of the largest-valued posteriors are equal), assign arbitrarily to any one of the corresponding classes. In practice, breaking the tie arbitrarily is the prescription suggested for any ML model.

In this case, the probability of error is the sum of the posterior probabilities of the remaining q – 1 classes. We know that the posteriors across all the q classes add up to 1.

Applications of Bayes' Rule and Bayesian Inference

  1. Medical Diagnosis: Estimating disease probabilities based on test results
  2. Spam Detection: Filtering emails based on word patterns
  3. Natural Language Processing: Text categorization
  4. Finance: Risk assessment and portfolio optimization
  5. Machine Learning: Bayesian networks and probabilistic graphical models

Advantages

Disadvantages

Types of Bayes Classifiers

  1. Naive Bayes Classifier:
    Assumes independence between features for computational efficiency. Variants include:

    • Gaussian Naive Bayes (for continuous features)
    • Multinomial Naive Bayes (for count-based features like text)
    • Bernoulli Naive Bayes (for binary features)
  2. Optimal Bayes Classifier:
    Theoretically perfect classifier but computationally infeasible in practice.

Naive Bayes Classifier

The Naive Bayes Classifier is a simple yet powerful machine learning algorithm based on Bayes' theorem. It is called "naive" because it assumes that all features are independent of each other, which is often unrealistic but works well in practice.

Mathematical Formulation

Given a dataset with n features x1,x2, …, xn and class labels c1,c2, …, ck, the task is to predict the class c for a new instance with feature vector x.

Using Bayes' Theorem, the probability of class ci given the feature vector x is:

P(ci | x) = P(ci) x P(x1 | ci) x P(x2 | ci) x . . . x P(xn | ci) / P(x)

Where P(ci) is the prior probability of class ci, P(xj | ci) are the per-feature likelihoods (assumed independent given the class), and P(x) is the evidence, which is the same for every class.

The decision rule is: assign the class ci that maximizes P(ci) x P(x1 | ci) x . . . x P(xn | ci).

Example Dataset

Consider a dataset for predicting whether a person will buy a product (c ∈ {Yes, No}) based on three features:

Age | Income | Student | Buys Product
Young | High | No | No
Young | High | Yes | Yes
Middle-aged | Medium | No | Yes
Old | Low | No | No
Old | Medium | No | Yes
Young | Medium | Yes | Yes
Middle-aged | High | No | Yes
Old | High | No | No

How Bayes Classifier Works

  1. Compute Priors:
    Calculate the prior probabilities of each class P(C) from the training data.

  2. Compute Likelihoods:
    Estimate the likelihood P(X | C) for each feature and class.

  3. Apply Bayes' Theorem:
    Use Bayes' theorem to compute the posterior probabilities for each class.

  4. Prediction:
    Choose the class with the highest posterior probability.

Example: Spam Email Classification

Free Offer | Money | Label
1 | 1 | Spam
0 | 1 | Not Spam
1 | 0 | Spam
0 | 0 | Not Spam

Step-by-Step Process

Step 1: Calculate Prior Probabilities

We first determine the prior probabilities of spam and not spam.

P(Spam)=2/4=0.5

P(Not Spam)=2/4=0.5

Step 2: Calculate Likelihood Probabilities

For each feature ("Free Offer" and "Money"), we calculate the probability given spam and Not spam.

For Free Offer

Free Offer | Spam | Not Spam | P(Free Offer | Spam) | P(Free Offer | Not Spam)
0 | 0 | 2 | 0/2 | 2/2 = 1
1 | 2 | 0 | 2/2 = 1 | 0/2
Total | 2 | 2

For Money

Money | Spam | Not Spam | P(Money | Spam) | P(Money | Not Spam)
0 | 1 | 1 | 1/2 | 1/2
1 | 1 | 1 | 1/2 | 1/2
Total | 2 | 2

Step 3: Classify a New Email

Let's say we have a new email with Free Offer = 1 and Money = 1.

We calculate the posterior probabilities using Bayes' theorem:

P(Spam | Free Offer = 1, Money = 1)

P(Spam | Free Offer = 1, Money = 1) ∝ P(Free Offer = 1 | Spam) x P(Money = 1 | Spam) x P(Spam)

Substituting values:

P(Spam | 1, 1) ∝ (1) x (0.5) x (0.5) = 0.25

P(Not Spam | Free Offer = 1, Money = 1)

P(Not Spam | Free Offer = 1, Money = 1) ∝ P(Free Offer = 1 | Not Spam) x P(Money = 1 | Not Spam) x P(Not Spam)

Substituting values:

P(Not Spam | 1, 1) ∝ (0) x (0.5) x (0.5) = 0

Step 4: Decision

Since P(Spam | 1, 1) = 0.25 and P(Not Spam | 1, 1) = 0, the email is classified as Spam.
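The whole calculation can be verified with a few lines of Python; this sketch simply counts frequencies in the four-row table above (no smoothing is applied, which is why a zero likelihood can wipe out a class):

# Rows: (free_offer, money, label)
data = [(1, 1, "Spam"), (0, 1, "Not Spam"), (1, 0, "Spam"), (0, 0, "Not Spam")]

def score(free_offer, money, label):
    rows = [r for r in data if r[2] == label]
    prior = len(rows) / len(data)
    p_free = sum(r[0] == free_offer for r in rows) / len(rows)
    p_money = sum(r[1] == money for r in rows) / len(rows)
    return prior * p_free * p_money          # unnormalized posterior

print(score(1, 1, "Spam"))       # 0.25
print(score(1, 1, "Not Spam"))   # 0.0  -> classified as Spam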

Example: Data set to illustrate the difficulties associated with the Bayes classifier

Pattern | Feature1 | Feature2 | Feature3 | Class
1 | 0 | 0 | 0 | C0
2 | 1 | 0 | 1 | C1
3 | 1 | 0 | 0 | C0
4 | 1 | 1 | 1 | C1
5 | 0 | 1 | 1 | C1
6 | 0 | 1 | 1 | C0
7 | 1 | 0 | 1 | ?

Step-by-Step Process

Step 1: Calculate Prior Probabilities

We first determine the prior probabilities of C0 and C1.

P(C0) = 3/6=1/2

P(C1) = 3/6=1/2

Step 2: Calculate Likelihood Probabilities

For each feature ("Feature1", “Feature2” and " Feature3"), we calculate the probability given C0 and C1.

For Feature1

Feature1 | C0 | C1 | P(Feature1 | C0) | P(Feature1 | C1)
0 | 2 | 1 | 2/3 | 1/3
1 | 1 | 2 | 1/3 | 2/3
Total | 3 | 3

For Feature2

Feature2 | C0 | C1 | P(Feature2 | C0) | P(Feature2 | C1)
0 | 2 | 1 | 2/3 | 1/3
1 | 1 | 2 | 1/3 | 2/3
Total | 3 | 3

For Feature3

Feature3 | C0 | C1 | P(Feature3 | C0) | P(Feature3 | C1)
0 | 2 | 0 | 2/3 | 0/3
1 | 1 | 3 | 1/3 | 3/3
Total | 3 | 3

Step 3: Classify a New Pattern

Let's say we want to classify pattern 7, with Feature1 = 1, Feature2 = 0 and Feature3 = 1.

We calculate the (unnormalized) posterior probabilities using Bayes' theorem with the naive independence assumption:

P(C0 | Feature1 = 1, Feature2 = 0, Feature3 = 1)

P(C0 | 1, 0, 1) ∝ P(Feature1 = 1 | C0) x P(Feature2 = 0 | C0) x P(Feature3 = 1 | C0) x P(C0)

Substituting values:

P(C0 | 1, 0, 1) ∝ (1/3) x (2/3) x (1/3) x (1/2) ≈ 0.037

P(C1 | Feature1 = 1, Feature2 = 0, Feature3 = 1)

P(C1 | 1, 0, 1) ∝ P(Feature1 = 1 | C1) x P(Feature2 = 0 | C1) x P(Feature3 = 1 | C1) x P(C1)

Substituting values:

P(C1 | 1, 0, 1) ∝ (2/3) x (1/3) x (3/3) x (1/2) ≈ 0.111

Step 4: Decision

Since P(C1 | 1, 0, 1) > P(C0 | 1, 0, 1), pattern 7 is classified as C1. Note that, had the test pattern contained Feature3 = 0, the corresponding likelihood P(Feature3 = 0 | C1) = 0/3 would have forced the C1 posterior to zero; such zero estimates are one of the difficulties this data set illustrates.

Naive Bayes Classifier with Multiple Features: New Dataset

Let's consider a weather dataset for predicting whether a person will play tennis based on weather conditions. This dataset has the following features: Outlook, Temperature, Humidity and Wind, with the class label Play Tennis.

Training Dataset

Outlook | Temperature | Humidity | Wind | Play Tennis
Sunny | Hot | High | Weak | No
Sunny | Hot | High | Strong | No
Overcast | Hot | High | Weak | Yes
Rain | Mild | High | Weak | Yes
Rain | Cool | Normal | Weak | Yes
Rain | Cool | Normal | Strong | No
Overcast | Cool | Normal | Strong | Yes
Sunny | Mild | High | Weak | No
Sunny | Cool | Normal | Weak | Yes
Rain | Mild | Normal | Weak | Yes
Sunny | Mild | Normal | Strong | Yes
Overcast | Mild | High | Strong | Yes
Overcast | Hot | Normal | Weak | Yes
Rain | Mild | High | Strong | No

Step 1: Compute Priors

P(Yes)=9/14,

P(No)=5/14

Step 2: Compute Likelihoods

For each feature ("Outlook", “Temperature”, “Humidity” and " Wind"), we calculate the probability given Yes and No.

For Outlook

Outlook | Yes | No | P(Outlook | Yes) | P(Outlook | No)
Sunny | 2 | 3 | 2/9 | 3/5
Overcast | 4 | 0 | 4/9 | 0/5
Rain | 3 | 2 | 3/9 | 2/5
Total | 9 | 5

For Temperature

Temperature | Yes | No | P(Temperature | Yes) | P(Temperature | No)
Hot | 2 | 2 | 2/9 | 2/5
Mild | 4 | 2 | 4/9 | 2/5
Cool | 3 | 1 | 3/9 | 1/5
Total | 9 | 5

For Humidity

Humidity | Yes | No | P(Humidity | Yes) | P(Humidity | No)
High | 3 | 4 | 3/9 | 4/5
Normal | 6 | 1 | 6/9 | 1/5
Total | 9 | 5

For Wind

Wind | Yes | No | P(Wind | Yes) | P(Wind | No)
Weak | 6 | 2 | 6/9 | 2/5
Strong | 3 | 3 | 3/9 | 3/5
Total | 9 | 5

Step 3: Classification Example

Input:

 

Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong

Posterior for "Yes"

P(YesSunny, Cool, High, Strong)P(Yes)×P(SunnyYes)×P(CoolYes)×P(HighYes)×P(StrongYes)

9/14×2/9×3/9×3/9×3/9

            0.01587

Posterior for "No"

P(NoSunny, Cool, High, Strong)P(No)×P(SunnyNo)×P(CoolNo)×P(HighNo)×P(StrongNo)

5/14×3/5×1/5×4/5×3/5

0.03429

Classification Decision

Since P(No | x) > P(Yes | x), the instance is classified as No (won't play tennis).
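The arithmetic can be checked directly; a small Python sketch using the likelihood tables above:

p_yes = (9/14) * (2/9) * (3/9) * (3/9) * (3/9)   # P(Yes) x likelihoods
p_no  = (5/14) * (3/5) * (1/5) * (4/5) * (3/5)   # P(No) x likelihoods

print(round(p_yes, 4), round(p_no, 4))           # about 0.0053 vs 0.0206
print("No" if p_no > p_yes else "Yes")           # classified as No
print(round(p_no / (p_yes + p_no), 3))           # normalized P(No | x), about 0.795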

Why Naive Bayes Works Well

Limitations

 

Bias-Variance Tradeoff:

The bias-variance tradeoff is a fundamental concept in machine learning that describes the tradeoff between two sources of error that affect model performance: bias and variance.

Bias

Bias refers to errors due to incorrect assumptions in the learning algorithm. A high-bias model is too simplistic and fails to capture the underlying pattern of the data. This leads to underfitting.

Variance

Variance refers to errors due to sensitivity to small fluctuations in the training data. A high-variance model learns the noise along with the signal, leading to overfitting.

The Tradeoff

Solutions to Address Bias and Variance

To strike the right balance and address bias and variance issues, consider the following solutions:

  1. Regularization: Regularization techniques like L1 (Lasso) and L2 (Ridge) can help mitigate overfitting. These methods add penalty terms to the model’s cost function, discouraging it from becoming overly complex.

  2. Feature Engineering: Thoughtful feature selection and engineering can reduce both bias and variance. By including relevant features and excluding noisy ones, you can improve model performance.

  3. Cross-Validation: Utilize cross-validation to assess your model’s performance on different subsets of the data. This helps you gauge how well your model generalizes across various data splits, providing valuable insights into bias and variance.

  4. Ensemble Methods: Ensemble techniques such as Random Forests and Gradient Boosting combine multiple models to achieve better performance. They can effectively reduce overfitting while improving predictive accuracy.

  5. Collect More Data: If your model suffers from high bias (underfitting), acquiring more data can help it capture more complex patterns. Additional data can be especially beneficial when dealing with deep neural networks.

Linear Discriminant Function: 

A linear discriminant function is a mathematical function used for classification that helps determine which class a given input belongs to by defining a linear decision boundary 

A linear discriminant function is a function of the form:

g(w, x) = wTx+b 

where:

That is w=( w(1), w(2) ,...,w(l)) be an l-dimensional weight vector and let x=( x(1), x(2) ,...,x(l)) be a feature vector.

The classification decision is made as follows:

This means that the decision boundary is given by the hyperplane:

wTx+b=0 

The linear discriminant function can be written as 

g(w, x) = wTx+b 

where w is the weight vector and b the bias or threshold weight.

EXAMPLE: Let x1 + x2 > 0 or x1 + x2 - 0.5 > 0 be a linear discriminant function; here w = (1, 1) and b = 0 for the former, and w = (1, 1), b = -0.5 for the latter.

Both of them represent the Boolean OR function. Consider Fig 1, which shows the Boolean OR and AND functions. Here, x1 and x2 are binary features taking values of either 1 (True) or 0 (False). Note that OR(x1, x2) is 0 only when both x1 and x2 are 0, else it is 1 (True). Similarly, AND(x1, x2) is 1 only when both x1 and x2 are 1.

In the case of OR, the decision boundary (line) characterizes points that satisfy x1 + x2 = 0.5. Points that give output 1 satisfy the property that x1 + x2 > 0.5, and the point for which the output is 0 satisfies x1 + x2 < 0.5.

In the case of AND, a point x = (x1, x2) on the line satisfies x1 + x2 = 1.5. A point that gives 1 as output satisfies x1 + x2 > 1.5, and points that give 0 as output satisfy x1 + x2 < 1.5.

Fig 2: Truth table for Boolean OR and AND

Fig1: Boolean functions AND and OR

Perceptron Classifier:

Perceptron is the earliest ML model that was used in classifying linearly separable data. Some related observations are as follows:

Perceptron Learning Algorithm

The algorithm is very simple and it can be shown that the algorithm stops after a finite number of steps to give us a correct w if the classes are linearly separable.

Let the training data set be (x1, l1), (x2, l2), . . ., (xn, ln), where xj is the jth pattern and lj ∈ {-1, 1} is its class label, based on whether xj belongs to the negative class or the positive class.

The algorithm follows the steps given below:

1. Let i = 0 and let the initial augmented weight vector wi be the zero vector. Let j = 1. (Note that j is the pattern index and i is the weight vector index.)

2. If xj is incorrectly classified by wi, then set i = i + 1 and wi = wi-1 + lj xj, set j = (j + 1) mod n and go to step 2. Else set j = (j + 1) mod n and go to step 2.

3. Terminate the algorithm if there is no change in the weight vector wi for n successive steps.

Example: Let the training set be {((-1, 1), l1 = -1), ((-2, -2), l2 = -1), ((4, 4), l3 = 1)}

Compute the weight vector which classifies these feature vectors using perceptron learning.
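A compact way to carry out these updates is to loop over the augmented patterns until a full pass produces no change; the sketch below uses NumPy and an epoch-based version of the same update rule (the final weight vector may differ from a hand-run that cycles pattern by pattern, but any vector it returns separates this data):

import numpy as np

# Augmented patterns: a constant 1 is appended so the bias is learned inside w.
X = np.array([[-1.0,  1.0, 1.0],
              [-2.0, -2.0, 1.0],
              [ 4.0,  4.0, 1.0]])
labels = np.array([-1, -1, 1])

w = np.zeros(3)
changed = True
while changed:                        # stop when a full pass makes no update
    changed = False
    for x, l in zip(X, labels):
        if l * np.dot(w, x) <= 0:     # misclassified (or on the boundary)
            w = w + l * x             # perceptron update: w <- w + l * x
            changed = True

print(w)   # e.g. [5. 3. 0.] for this data, i.e. 5*x1 + 3*x2 > 0 separates the classes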

Assignment Example: Consider the truth table of Boolean OR shown in the below Table. The inputs are x1 and x2 and the output is either 0 or 1, so we have a two-class problem based on this output. There is one pattern from Class 0 and three patterns from Class 1. The augmented patterns (with a constant 1 prepended) are shown in the Table.

TABLE:  Augmented patterns for Boolean OR

1 | x1 | x2 | OR(x1, x2)
1 | 0 | 0 | 0
1 | 0 | 1 | 1
1 | 1 | 0 | 1
1 | 1 | 1 | 1

Compute the weight vector which correctly classifies these feature vectors using perceptron learning.

Support Vector Machine Algorithm:

Support Vector Machine or SVM is a Supervised Learning algorithm, which is used for Classification as well as Regression problems. However, primarily, it is used for Classification problems in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes so that we can easily put the new data point in the correct category in the future. This best decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed Support Vector Machine. Consider the below diagram in which there are two different categories that are classified using a decision boundary or hyperplane:

Support Vector Machine Algorithm

Example: Suppose we see a strange cat that also has some features of dogs, so if we want a model that can accurately identify whether it is a cat or dog, so such a model can be created by using the SVM algorithm. We will first train our model with lots of images of cats and dogs so that it can learn about different features of cats and dogs, and then we test it with this strange creature. So as support vector creates a decision boundary between these two data (cat and dog) and choose extreme cases (support vectors), it will see the extreme case of cat and dog. On the basis of the support vectors, it will classify it as a cat. Consider the below diagram:

Support Vector Machine Algorithm

SVM algorithm can be used for Face detection, image classification, text categorization, etc.

Types of SVM

Hyperplane and Support Vectors in the SVM algorithm:

Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-dimensional space, but we need to find out the best decision boundary that helps to classify the data points. This best boundary is known as the hyperplane of SVM.

The dimensions of the hyperplane depend on the features present in the dataset, which means if there are 2 features (as shown in the image), then the hyperplane will be a straight line. And if there are 3 features, then the hyperplane will be a 2-dimensional plane.

We always create a hyperplane that has a maximum margin, which means the maximum distance between the data points.

Support Vectors:

The data points or vectors that are closest to the hyperplane and which affect the position of the hyperplane are termed support vectors. Since these vectors support the hyperplane, they are called support vectors.

Linear SVM:

The working of the SVM algorithm can be understood by using an example. Suppose we have a dataset that has two tags (green and blue), and the dataset has two features x1 and x2. We want a classifier that can classify the pair(x1, x2) of coordinates in either green or blue. Consider the below image:

Support Vector Machine Algorithm

So as it is 2-d space so by just using a straight line, we can easily separate these two classes. But there can be multiple lines that can separate these classes. Consider the below image:

Support Vector Machine Algorithm

Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane. The SVM algorithm finds the closest points of the lines from both the classes. These points are called support vectors. The distance between the vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with maximum margin is called the optimal hyperplane.

Support Vector Machine Algorithm

Non-Linear SVM:

If data is linearly separable, then we can separate it by using a straight line, but for non-linear data, we cannot draw a single straight line. Consider the below image:

Support Vector Machine Algorithm

So to separate these data points, we need to add one more dimension. For linear data, we have used two dimensions x and y, so for non-linear data, we will add a third dimension z. It can be calculated as:

                               z = x^2 + y^2

By adding the third dimension, the sample space will become as below image:

Support Vector Machine Algorithm

So now, SVM will divide the datasets into classes in the following way. Consider the below image:

Support Vector Machine Algorithm

Since we are in 3-d Space, hence it is looking like a plane parallel to the x-axis. If we convert it in 2d space with z=1, then it will become as:

Support Vector Machine Algorithm

Hence we get a circumference of radius 1 in case of non-linear data.

Kernel Trick:

The most interesting feature of SVM is that it can even work with a non-linear dataset, and for this we use the “Kernel Trick”, which makes it easier to classify the points. Suppose we have a dataset like this:

Here we see we cannot draw a single line or say hyperplane which can classify the points correctly. So what we do is try converting this lower dimension space to a higher dimension space using some quadratic functions which will allow us to find a decision boundary that clearly divides the data points. These functions which help us do this are called Kernels and which kernel to use is purely determined by hyperparameter tuning.
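In practice the kernel is just a hyperparameter of the SVM implementation; a minimal scikit-learn sketch (assuming scikit-learn is installed; the tiny dataset is only there to show the API):

from sklearn.svm import SVC

# Class 0 sits near the origin, class 1 further out: not linearly separable.
X = [[0, 0], [1, 1], [-1, -1], [0, 1], [3, 3], [-3, -3], [0, 4], [4, 0], [-4, 0]]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel, gamma="scale")      # swap the kernel, keep the rest
    clf.fit(X, y)
    print(kernel, clf.predict([[0.5, 0.5], [5, 5]]))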

Different Kernel Functions

Some kernel functions which you can use in SVM are given below:

1. Polynomial Kernel

Following is the formula for the polynomial kernel:

K(x, y) = (x . y + c)^d

where d is the degree of the polynomial and c is a constant.

2. Sigmoid Kernel

We can use it as the proxy for neural networks. The equation is:

K(x, y) = tanh(γ (x . y) + r)

It is just taking your input, mapping them to a value of 0 and 1 so that they can be separated by a simple straight line.

3. RBF Kernel

What it actually does is create non-linear combinations of our features to lift the samples onto a higher-dimensional feature space, where we can use a linear decision boundary to separate the classes. It is the most used kernel in SVM classification; the following formula explains it mathematically:

K(x, y) = exp(-γ ||x - y||^2)

where ||x - y|| is the Euclidean distance between the two feature vectors and γ > 0 controls the width of the kernel (often written as γ = 1 / (2σ^2)).

4. Bessel function kernel

It is mainly used for eliminating the cross term in mathematical functions. Following is the formula of the Bessel function kernel:

5. Anova Kernel

It performs well on multidimensional regression problems. The formula for this kernel function is:

Linear Regression: 

Linear Regression is a supervised learning algorithm used for predicting a continuous target variable based on one or more input features. It assumes a linear relationship between the independent variable(s) X and the dependent variable Y.

Mathematical Representation

For a single-variable (simple) linear regression, the relationship is:

Y=mX+b+ϵ 

Where X is the input feature, Y is the target, m is the slope (weight), b is the intercept (bias), and ϵ is the error (noise) term.

For multiple linear regression with multiple features X1,X2,...,Xn​:

Y=b+w1X1+w2X2+...+wnXn+ϵ

where w1,w2,...,wn​ are the learned coefficients for each feature.

Linear Regression Using Linear Algebra

We can use linear algebra to calculate the weights (coefficients) efficiently. The key idea is to represent the problem in matrix form and solve for the optimal weights using the Normal Equation.

1. Matrix Representation of Linear Regression

The simple linear regression model:

Y= b+mX 

can be rewritten in matrix form for multiple data points:

Y = Xw

Where X is the design matrix with a leading column of 1s (to absorb the bias b) and one column per feature, w is the vector of coefficients (including the bias), and Y is the vector of target values.

For a dataset with n observations, each row of X corresponds to one observation.

2. Normal Equation (Optimal Solution)

To find the best weights that minimize the squared error, we use the Normal Equation:

w = (X^T X)^(-1) X^T Y

This equation provides the optimal solution for linear regression without needing gradient descent.

EXAMPLE: Let us consider a simple example of three vectors x1 = (1, 1), x2 = (1, 2) and x3 = (2, 2). Let y1 =3, y2=5, and y3 = 6. Find the linear regression coefficients.
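A sketch of solving this example with the Normal Equation in NumPy (a column of ones is prepended so the bias is estimated along with the two weights):

import numpy as np

X = np.array([[1.0, 1.0, 1.0],    # [bias, x(1), x(2)] for x1 = (1, 1)
              [1.0, 1.0, 2.0],    # x2 = (1, 2)
              [1.0, 2.0, 2.0]])   # x3 = (2, 2)
y = np.array([3.0, 5.0, 6.0])

w = np.linalg.inv(X.T @ X) @ X.T @ y
print(w)    # approximately [0. 1. 2.], i.e. y = 0 + 1*x(1) + 2*x(2)

(In practice np.linalg.lstsq or np.linalg.solve is preferred over an explicit inverse for numerical stability.)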

Assignment Example: Predicting House Prices

Imagine you have a dataset of house prices based on the size of the house. The goal is to predict the price given the size.

X (House Size in sq ft) | Y (Price in $1000s)
5 | 7
6 | 9
7 | 10
9 | 12
10 | ?

Advantages:

Disadvantages:

When to Use Linear Regression?

Logistic Regression:

Logistic Regression is a supervised learning algorithm used for binary classification problems (i.e., where the output is either 0 or 1). Unlike Linear Regression, which predicts continuous values, Logistic Regression predicts probabilities and applies a threshold to classify data points.

Logistic Regression is based on the sigmoid function (logistic function), which maps any real-valued number into the range (0,1), making it ideal for probability estimation.

In Linear Regression, we estimate w and b by minimizing the Mean Squared Error (MSE):

MSE = (1/N) x Σ (Yi - Ŷi)^2

where Ŷi = wXi + b is the predicted output.

However, in Logistic Regression, the output is a probability:

P(Y = 1 | X) = 1 / (1 + e^-(wX + b))

This non-linearity makes the least squares approach suboptimal, but we can still approximate w and b by transforming the logistic model into a linear equation.

Transforming the Logistic Regression Equation

We take the logit transformation of the sigmoid function:

Z = log(P / (1 - P)) = wX + b

This converts our model into a linear equation:

Z=wX+b 

where Z is the log-odds (logit) of the predicted probability, w is the weight and b is the bias.

Now, the problem is reduced to a linear regression problem, where we fit:

Z=wX+b 

using the Least Squares Method.

Estimating w and b Using the Normal Equation

For a dataset with N samples:

Z=Xw+b 

we can estimate w and b using the Normal Equation from Linear Regression:

w = (X^T X)^(-1) X^T Z

where X is the design matrix (with a column of 1s for the bias) and Z is the vector of logit-transformed targets.

Example: Let us consider a simple example of four vectors x1 = (1, 2), x2 = (2, 1), x3 = (6, 7) and x4 = (7, 6). Let y1 =0, y2= 0, y3 = 1 and y4 = 1. Find the discriminant function.

Assignment Example:

Let's say we have the dataset:

Study Hours (X) | Pass (Y)
1 | 0.1
2 | 0.2
3 | 0.3
4 | 0.7

Step 1: Compute the Logit (Z) Values

We first transform Y using the logit function:

X | Y | Z = log(Y / (1 - Y))
1 | 0.1 | log(0.1/0.9)
2 | 0.2 | log(0.2/0.8)
3 | 0.3 | log(0.3/0.7)
4 | 0.7 | log(0.7/0.3)

Step 2: Solve for w and b

Using the Least Squares Normal Equation:

w = (X^T X)^(-1) X^T Z

Compute w and b.
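A sketch of this two-step procedure in NumPy (logit transform followed by the Normal Equation; the resulting w and b are an approximation, not the maximum likelihood fit that standard logistic regression solvers produce):

import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])   # [bias, study hours]
Y = np.array([0.1, 0.2, 0.3, 0.7])

Z = np.log(Y / (1 - Y))                      # logit transform of the targets
b, w = np.linalg.inv(X.T @ X) @ X.T @ Z      # normal equation gives [b, w]
print(b, w)

# Predicted pass probability for 5 study hours, mapped back through the sigmoid:
print(1 / (1 + np.exp(-(w * 5 + b))))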

Multilayer Perceptron Learning:

Multilayer Perceptron (MLP) is a type of neural network that consists of multiple layers, allowing it to solve more complex problems than a single-layer perceptron. Structurally, an MLP has an input layer to receive data, one or more hidden layers to process the information, and an output layer that provides the final prediction. This arrangement makes MLPs a feedforward neural network, meaning the data moves in one direction—from input to output.

One of the key aspects of MLPs is backpropagation, a method used to train the network. Backpropagation adjusts the network’s weights based on errors from predictions, helping the model learn over time by minimizing mistakes. This process allows the MLP to improve its accuracy through multiple training cycles, making it suitable for tasks where precision is essential.

A Multilayer Perceptron (MLP) processes data by passing it through a series of layers, each contributing to the network’s ability to learn and make predictions. Here’s a breakdown of each layer’s role:

In a multilayer perceptron, neurons process information in a step-by-step manner, performing computations that involve weighted sums and nonlinear transformations. Let's walk layer by layer to see the magic that goes within.

Input layer

Hidden layers

Output layer

During the training process, the network learns to adjust the weights associated with each neuron's inputs to minimize the discrepancy between the predicted outputs and the true target values in the training data. By adjusting the weights and learning the appropriate activation functions, the network learns to approximate complex patterns and relationships in the data, enabling it to make accurate predictions on new, unseen samples.

This adjustment is guided by an optimization algorithm, such as stochastic gradient descent (SGD), which computes the gradients of a loss function with respect to the weights and updates the weights iteratively.
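A minimal sketch of such a network with scikit-learn's MLPClassifier (assuming scikit-learn is installed; XOR is used because it is the classic problem a single-layer perceptron cannot solve, though such a tiny network can occasionally need a different random seed to converge):

from sklearn.neural_network import MLPClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]                       # XOR labels: not linearly separable

mlp = MLPClassifier(hidden_layer_sizes=(8,), activation="relu",
                    solver="lbfgs", max_iter=2000, random_state=1)
mlp.fit(X, y)
print(mlp.predict(X))                  # ideally [0 1 1 0]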

Back Propagation for Training MLP: 

The back propagation algorithm is used in a Multilayer Perceptron neural network to increase the accuracy of the output by reducing the error between the predicted output and the actual output.

According to this algorithm,  

Advantages:  

Disadvantages: 

Introduction To Clustering:

Clustering refers to the process of arranging or organizing objects according to specific criteria. It plays a crucial role in uncovering concealed knowledge in large data sets. Clustering involves dividing or grouping the data into smaller data sets based on similarities/dissimilarities. Depending on the requirements, this grouping can lead to various outcomes, such as partitioning of data, data re-organization, compression of data and data summarization.

Partitioning of Data

Clustering of data is a crucial aspect of efficient data access in database-based applications.

Example: The number of records in EMPLOYEE data table =30,000

The distinct values for DEPT_CODE = 1000

If this data table is clustered by DEPT_CODE,

Accessing all the employees of the department with DEPT CODE = '15' requires log2 (1000) +30 accesses. 

Without clustering, accessing 30 employee records from a department would require, on average, 30 x 30000/2 accesses.

Data Re-organization

This data set comprises four binary patterns, each with six dimensions or features labelled f1 to f6 and assigned a unique ID ranging from 1 to 4.

Pattern | f1 | f2 | f3 | f4 | f5 | f6
1 | 1 | 0 | 1 | 0 | 1 | 0
2 | 0 | 1 | 0 | 1 | 0 | 1
3 | 1 | 0 | 1 | 0 | 1 | 0
4 | 0 | 1 | 0 | 1 | 0 | 1

By clustering the rows, we can obtain the re-organized data set shown in below table. This process groups similar patterns together based on their row-wise similarities.

Pattern | f1 | f2 | f3 | f4 | f5 | f6
1 | 1 | 0 | 1 | 0 | 1 | 0
3 | 1 | 0 | 1 | 0 | 1 | 0
2 | 0 | 1 | 0 | 1 | 0 | 1
4 | 0 | 1 | 0 | 1 | 0 | 1

TABLE: Data re-organization using row as the criterion

Further, by clustering the re-organized data set from above table by column and grouping similar columns together, we obtain the data set displayed in below table.

Pattern | f1 | f3 | f5 | f2 | f4 | f6
1 | 1 | 1 | 1 | 0 | 0 | 0
3 | 1 | 1 | 1 | 0 | 0 | 0
2 | 0 | 0 | 0 | 1 | 1 | 1
4 | 0 | 0 | 0 | 1 | 1 | 1

TABLE: Data re-organization using rows and columns

This table reveals the hidden structure in the data. The cluster comprising patterns 1 and 3 can be described using the conjunction f1^f3^f5. Likewise, the second cluster comprising patterns 2 and 4 can be described using f2^f4^f6.

Data Compression

Clustering can also assist in data compression by reducing the time complexity of accessing the data features and minimizing the space required to store them.

Consider the data set shown in above table. Through clustering, we can compress the data set, resulting in the table shown in below table. Note that the frequent pattern (FP) tree and pattern count (PC) tree structures are some examples for achieving such compression of data.

Pattern | f1 | f3 | f5 | f2 | f4 | f6 | Count
1, 3 | 1 | 1 | 1 | 0 | 0 | 0 | 2
2, 4 | 0 | 0 | 0 | 1 | 1 | 1 | 2

TABLE: Compressed data

Summarization

The objective of data summarization is to extract a representative subset of samples from a large data set. The purpose is to simplify and expedite analysis on a smaller data set.

Example: When dealing with a data set of the marks of 100 students in a subject, statistical measures like mean (the numerical average of the marks), mode (the most frequently repeated marks) and median (the value in the middle of all the marks when the marks are ranked in order) can provide summarized information. Such summarized information can be derived through clustering.

Matrix Factorization

It is possible to view clustering as matrix factorization.

Let there be n data points in an l-dimensional space. We can represent them as a matrix X of size n x l. It is possible to approximate X as a product of two matrices B of size n x k and C of size k x l. So, X ≈ BC, where B is the cluster assignment matrix and C is the representatives matrix.

Example: Consider the data matrix 

X = [    [6, 6, 6], 

[6, 6, 8], 

[2, 4, 2], 

[2, 2, 2]

]

There are four patterns in a three-dimensional space. If we use the leader algorithm with a threshold of 3 units, we get two clusters: (6, 6, 6) and (6, 6, 8) belong to cluster 1 with (6, 6, 6) as its leader, and (2, 4, 2) and (2, 2, 2) belong to cluster 2 with (2, 4, 2) as its leader.

So, we have 

B = [    [1, 0], 

[1, 0], 

[0, 1], 

[0, 1]

]

It indicates that the first pattern, (6, 6, 6), is assigned to cluster 1; correspondingly, there is a 1 in the first row and first column of B. Matrix B has four rows corresponding to the four patterns in X; B has two columns corresponding to the two clusters. Matrix C will have K rows if there are K clusters, and the ith row is the representative of the ith cluster.

In the current example, 

C = [    [6, 6, 6], 

[2, 4, 2]

]

So the resulting matrix factorization is given by X ≈ BC.

The leader algorithm is a hard clustering, where each pattern is assigned fully, that is, with a value 1, to a single cluster. So, each row of the B matrix will have one 1 and the rest as 0s.

It is possible to have soft clustering. In this case, we can have multiple non-zero entries in the range [0,1] in each row, indicating that a pattern is assigned to more than one cluster. The sum of the entries in a row is 1.

Instead of the leader, if we use the centroid of points in the cluster as its representative, then C = [[6, 6, 7], [2, 3, 2]]. There is no change in B. Note that the centroid of {(6,6,6), (6,6,8)} is (6,6,7) and the centroid of  the second cluster{ (2, 4, 2), (2, 2, 2) } is (2,3,2).
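The factorization can be verified numerically; a short NumPy sketch using the centroid representatives:

import numpy as np

X = np.array([[6, 6, 6], [6, 6, 8], [2, 4, 2], [2, 2, 2]], dtype=float)
B = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)   # cluster assignments
C = np.array([[6, 6, 7], [2, 3, 2]], dtype=float)             # centroid representatives

print(B @ C)             # reconstruction of X from the factorization
print(np.abs(X - B @ C)) # per-entry approximation error (at most 1 here)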

Clustering Of Patterns

Clustering is a technique that involves grouping a set of patterns, resulting in the creation of cohesive clusters or groups from a given collection of patterns. The process of clustering aims to group similar patterns together while keeping dissimilar patterns separate. Below figure illustrates the clustering of a two-dimensional data set (represented by f1 and f2), where three clusters are visually identifiable.

FIG.  Inter- and intra-cluster distances

The clustering process may ensure that the distance between any two points within the same cluster (intra-cluster distance), as measured by a dissimilarity measure such as Euclidean distance, is smaller than the distance between any two points belonging to different clusters (inter-cluster distance).

This indicates that the similarity between points within a cluster is higher than the similarity between clusters. 

Example: Let us consider a two-dimensional data set with 11 data points having two features f1 and f2 as listed in below table.

Data Point | f1 | f2
x1 | 1 | 1
x2 | 1 | 3
x3 | 2 | 2
x4 | 3 | 1
x5 | 3 | 3
x6 | 4 | 7
x7 | 5 | 7
x8 | 6 | 9
x9 | 7 | 1
x10 | 7 | 3
x11 | 9 | 1

TABLE d1 : Data set with two features

Any two points are placed in the same cluster if the distance between them is lower than a certain threshold. In this example, the squared Euclidean distance is used to measure the distance between the points, and a threshold of 10 units is set to cluster them. The squared Euclidean distance (d) between two points, xi and xj, is calculated as follows:

d(xi, xj) = Σ (xi(k) - xj(k))^2, summed over k = 1 to l

where l represents the dimensionality of the points.

In this example, since we are dealing with two-dimensional data points, l equals 2. Using this formula, the squared Euclidean distance between all pairs of points can be found and is presented in Table d2.

In Table d2, the clusters are clearly visible within the matrix itself. The table contains three sub-matrices of sizes 5 x 5, 3 x 3 and 3 x 3 that meet the condition of having a maximum value of 10 in any entry.

Cluster A = {x1, x2, x3, x4, x5}

Cluster B = {x6, x7, x8}

Cluster C = {x9, x10, x11}

One can see that none of the data points is present in more than one cluster. Such a clustering is called hard clustering; otherwise, it is known as soft or overlapping clustering.

TABLE d2: Squared Euclidean distance between pairs of data points listed in Table d1
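The same three clusters can be produced programmatically; the sketch below uses a simple leader-style pass with the squared Euclidean distance and the threshold of 10 units (a simplification of the pairwise criterion described above, but it yields the same clusters A, B and C for this data):

points = [(1, 1), (1, 3), (2, 2), (3, 1), (3, 3), (4, 7),
          (5, 7), (6, 9), (7, 1), (7, 3), (9, 1)]

def sq_dist(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

leaders, clusters = [], []
for p in points:
    for i, leader in enumerate(leaders):
        if sq_dist(p, leader) <= 10:      # close enough to an existing leader
            clusters[i].append(p)
            break
    else:                                 # no leader within the threshold: new cluster
        leaders.append(p)
        clusters.append([p])

for leader, members in zip(leaders, clusters):
    print(leader, members)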

Data Abstraction

Clustering is a useful method for data abstraction, and it can be applied to generate clusters of data points that can be represented by their centroid, medoid, leader or some other suitable entity.

The centroid is computed as the sample mean of the data points in a cluster; for a cluster C it is given by µC = (1/nC) Σ x, summed over the points x in C, where nC is the number of patterns in the cluster C.

FIG. Visualization of centroid and medoid

The medoid is the point that minimizes the sum of distances to all other points in the cluster. The above figure illustrates this, with an outlier that is located far away from the other data points in the cluster. The centroid can shift dramatically based on the position of the outlier, while the medoid remains stable within the boundaries of the original cluster. Therefore, clustering algorithms that utilize medoids are more resilient to noisy patterns or outliers.

EXAMPLE: Consider the data set shown in Table and the cluster thus generated given in Table d2. The centroid of the clusters, Cluster A (that is, µA), Cluster B (that is, µB) and Cluster C (that is, µC), are given below:

µA = 1/5 x [(1, 1) + (1, 3) + (2, 2) + (3, 1) + (3, 3)] = ((1 + 1 + 2 + 3 + 3)/5, (1 + 3 + 2 + 1 + 3)/5) = (2, 2) 

µB = 1/3 * [(4, 7) + (5, 7) + (6, 9)] = ((4 + 5 + 6)/3, (7 + 7 + 9)/3) = (5, 7.67)

µC = 1/3 * [(7, 1) + (7, 3) + (9, 1)] = ((7 + 7 + 9)/3, (1 + 3 + 1)/3) = (7.67, 1.67) 

For the same clusters, the medoids are given below: 

mA = (2, 2), mB = (5, 7) and mC = (7, 1)

We know that the medoid of a group of points is defined as a data point within the cluster that is located at the centre, where the total distance from the points in the cluster is at its minimum. Using the squared Euclidean distance as the metric for calculating distance, the distance of the points from mA, mB and mC can be determined as follows:

For Cluster A: Considering x3 as the medoid (mA):

d(x1, mA) = 2

d(x2, mA) = 2

d(x3, mA) = 0

d(x4, mA) = 2

d(x5, mA) = 2

The sum of the distances is 8. Note that if you choose any other data point, other than x3 as the medoid, the corresponding sum of the distances is greater than 8.

For Cluster B: Considering x7 as medoid (mB): 

d(x6, mB) = 1 

d(x7, mB) = 0 

d(x8, mB) = 5

The sum of the distances is 6. Note that if you choose any other data point, other than x7 as the medoid, the corresponding sum of the distances is greater than 6.

For cluster C: Considering x9 as medoid (mC):

d(x9, mC) = 0

d(x10, mC) = 4

d(x11, mC) = 4

The sum of the distances is 8. Note that if you choose any other data point, other than x9 as the medoid, the corresponding sum of the distances is greater than 8.

Example: Consider the cluster of three data points x1=(4,1), x2=(5,1) and x3=(6,3)

The centroid for this cluster is µ = ((4+5+6)/3, (1+1+3)/3) = (5, 1.67)

The medoid for this cluster can be obtained from the below table.

 | x1 | x2 | x3 | Sum
x1 | 0 | 1 | 8 | 9
x2 | 1 | 0 | 5 | 6
x3 | 8 | 5 | 0 | 13

The data point x2 has the minimum distance with all other data points. So, x2 is the medoid.

If we add the outlier point, x4 = (20, 1), then the new centroid is

µ = ((4 + 5 + 6 + 20)/4, (1 + 1 + 3 + 1)/4) = (8.75, 1.5)

The medoid for this cluster can be obtained from the below table.

 | x1 | x2 | x3 | x4 | Sum
x1 | 0 | 1 | 8 | 256 | 265
x2 | 1 | 0 | 5 | 225 | 231
x3 | 8 | 5 | 0 | 200 | 213
x4 | 256 | 225 | 200 | 0 | 681

The data point x3 has the minimum distance with all other data points. So, x3 is the medoid.
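The effect of the outlier on the centroid versus the medoid can be reproduced with a short NumPy sketch:

import numpy as np

def medoid(points):
    # Point with the smallest total squared Euclidean distance to all others
    d = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    return points[d.sum(axis=1).argmin()]

cluster = np.array([[4, 1], [5, 1], [6, 3]], dtype=float)
print(cluster.mean(axis=0), medoid(cluster))            # centroid (5, 1.67), medoid x2 = (5, 1)

with_outlier = np.vstack([cluster, [20, 1]])
print(with_outlier.mean(axis=0), medoid(with_outlier))  # centroid jumps to (8.75, 1.5), medoid x3 = (6, 3)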

Note that in the above examples, we have used centroid and medoid as the representatives or descriptions of the cluster. However, it is feasible to have more than one or multiple representative elements per cluster.


Clusters are useful in several decision-making situations like classification, prediction, etc. However, when the number of cluster representatives obtained is smaller than the number of input patterns, there is a reduction in computation while performing classification or prediction activity. The following example demonstrates this.

EXAMPLE: Consider the data set shown in Table d3 below. Let (2, 4) be a test pattern that needs to be classified using the nearest neighbor classifier on these 11 labelled patterns.

Data Point    f1    f2    Class Label
x1             1     1    A
x2             1     3    A
x3             2     2    A
x4             3     1    A
x5             3     3    A
x6             4     7    B
x7             5     7    B
x8             6     9    B
x9             7     1    B
x10            7     3    B
x11            9     1    B

TABLE d3 Data set with two features and corresponding labels

Note that (3, 3) is the nearest neighbor of (2, 4), with a squared Euclidean distance of 2 units between them. This is found by computing 11 distance values, that is, the squared Euclidean distance between the test pattern (2, 4) and each of the 11 labelled patterns. So, (2, 4) is classified as belonging to class "A" because its nearest neighbor (3, 3) belongs to class "A".

However, by clustering the 11 patterns using a suitable clustering algorithm and representing each resulting cluster by its centroid, we can reduce the number of distance values to be computed from the number of labelled patterns to the number of cluster representatives. This may be illustrated as follows.

Let the 11 patterns be clustered using the same criterion as the one used in Example 7, that is, the squared Euclidean distance between any two patterns placed in the same cluster should be within a threshold of 10 units. This results in three clusters: one from class "A" and two from class "B". 

The cluster corresponding to class "A" is:

Cluster A = {(1, 1), (1, 3), (2, 2), (3, 1), (3, 3)} with centroid µA = (2, 2)

The two clusters corresponding to class "B" are:

Cluster B = {(4, 7), (5, 7), (6, 9)} with centroid µB = (5, 7.67)

Cluster C = {(7, 1), (7, 3), (9, 1)} with centroid µC = (7.67, 1.67)

So, instead of using the 11 labelled patterns, if one uses the centroids of the three clusters as the labelled data, the amount of data to be analysed shrinks from 11 patterns to 3 representatives, a reduction by a factor of more than 3. The reduced labelled data is given in Table 7.10.

Cluster Representative (Centroid)    Class Label
(2, 2)                               A
(5, 7.67)                            B
(7.67, 1.67)                         B

TABLE 7.10 Reduced data set with only cluster representatives

By referring to Table 7.10, the nearest neighbor of the test pattern (2, 4) is found to be the centroid (2, 2), which belongs to class "A". As a result, the test pattern is classified as belonging to class "A" using only three distance computations, one between the test pattern and each of the three centroids. Moreover, only the three centroids need to be stored for these computations.

In this example, since we compute only three distances instead of 11 and store only three centroids instead of 11 patterns, there is a reduction in both time and space.
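The reduction can be seen in a small sketch (illustrative only; the variable names are not from the text) that classifies the test pattern (2, 4) with a 1-nearest-neighbor rule, first against all 11 labelled patterns of Table d3 and then against only the three centroids of Table 7.10.

    import numpy as np

    # 11 labelled patterns from Table d3
    patterns = np.array([(1, 1), (1, 3), (2, 2), (3, 1), (3, 3),
                         (4, 7), (5, 7), (6, 9), (7, 1), (7, 3), (9, 1)], dtype=float)
    labels   = ["A"] * 5 + ["B"] * 6

    # 3 cluster representatives from Table 7.10
    centroids       = np.array([(2, 2), (5, 7.67), (7.67, 1.67)])
    centroid_labels = ["A", "B", "B"]

    def nn_classify(x, data, data_labels):
        """1-NN classification with squared Euclidean distance.
        Also returns the number of distances computed."""
        d = ((data - x) ** 2).sum(axis=1)
        return data_labels[int(d.argmin())], len(d)

    test = np.array([2, 4])
    print(nn_classify(test, patterns, labels))            # -> ('A', 11): 11 distance computations
    print(nn_classify(test, centroids, centroid_labels))  # -> ('A', 3): only 3 computations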

It could be argued that clustering the labelled data and computing the centroids may require a significant amount of time and space. However, it is possible to perform the clustering process as soon as the labelled data becomes available, prior to the classification stage. Furthermore, computation of the cluster centroid is a one-time affair. After obtaining the representative centroids, we can use them as labelled data to classify new test patterns, resulting in a significant reduction in both time and space requirements.

This clustering-based classification approach is particularly valuable for the large-scale pattern classification problems that commonly arise in machine learning, and for dealing with class imbalance, where the patterns of the majority class are clustered so that the majority class is described by a smaller number of representatives.

Clustering Algorithms:

Clustering is an unsupervised learning technique used to group similar data points into clusters based on their inherent characteristics. It helps uncover hidden patterns in data when there are no predefined labels.

Why Clustering?

Clustering algorithms can be classified into hard and soft clustering, depending on whether clusters share data points or not. Among hard clustering algorithms, partitional algorithms generate a single partition of the given data set, while hierarchical algorithms generate a nested sequence of partitions. Soft clustering algorithms, on the other hand, utilize fuzzy sets, rough sets or evolutionary algorithms.

FIG. Classification of clustering algorithms

Hierarchical Algorithms:

The sequence of data partitions generated by hierarchical algorithms can be represented using a tree structure called a dendrogram. There are two types of hierarchical algorithms: divisive and agglomerative.

Divisive algorithms follow a top-down approach, in which a single cluster containing all the patterns is split into smaller clusters at each step, until every cluster contains only one pattern, that is, until a collection of singleton clusters is obtained. Agglomerative algorithms, on the other hand, follow a bottom-up approach: n singleton clusters are created for the n input patterns, and at each level the two most similar clusters are merged, reducing the size of the partition by 1.

Once two patterns are placed in the same cluster in agglomerative algorithms, they remain in the same cluster in subsequent levels. Similarly, once two patterns are placed in different clusters in divisive algorithms, they remain in different clusters in subsequent levels.

Divisive Clustering (Top-Down Hierarchical Clustering)

Divisive Clustering, also known as Top-Down Hierarchical Clustering, is a type of hierarchical clustering method where the algorithm starts with a single large cluster containing all data points. It then recursively splits the cluster into smaller sub-clusters until each point becomes its own cluster (or meets a termination condition).

How Divisive Clustering Works

  1. Start with a single cluster: All data points are treated as one large cluster.

  2. Recursive splitting:

    • Choose a criterion to split the cluster (e.g., distance or similarity measures).

    • Divide the cluster into two or more sub-clusters.

  3. Continue splitting:

    • Repeat the process for each sub-cluster until a stopping condition is met, such as:

      • Each point forms its own cluster.

      • A predefined number of clusters is reached.

      • A similarity threshold is met.

Algorithm Steps

  1. Begin with all data points in one cluster.

  2. Choose a splitting criterion (such as K-Means or maximum variance).

  3. Split the cluster into two or more sub-clusters.

  4. Evaluate sub-clusters and recursively divide them if necessary.

  5. Stop when a termination condition is met.

Example

Suppose we have four points A, B, C and D in a 2D space.

Step 1: Start with a single cluster containing all points: [A,B,C,D]

Step 2: Split into two clusters based on distance: [A,B] and [C,D]

Step 3: Recursively evaluate each cluster. If no further meaningful splits can be made, stop.

Final Clusters: [A,B] and [C,D]
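A minimal sketch of divisive (top-down) clustering is given below. Since the example above does not list coordinates for A-D, the sketch assumes the same four points used in the agglomerative example later (A = (1, 1), B = (2, 2), C = (5, 5), D = (6, 6)) and splits each cluster with a small hand-rolled 2-means; this is just one possible splitting criterion.

    import numpy as np

    def two_means_split(points, n_iter=10):
        """Split one cluster into two with a tiny 2-means (illustrative only)."""
        pts = np.asarray(points, dtype=float)
        centers = pts[[0, -1]].copy()                 # crude initialisation: first and last point
        for _ in range(n_iter):
            d = ((pts[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            assign = d.argmin(axis=1)                 # nearest centre for each point
            for k in range(2):
                if np.any(assign == k):
                    centers[k] = pts[assign == k].mean(axis=0)
        return pts[assign == 0], pts[assign == 1]

    def divisive(points, min_size=2):
        """Top-down clustering: keep splitting until clusters are small enough."""
        if len(points) <= min_size:
            return [points]
        left, right = two_means_split(points)
        if len(left) == 0 or len(right) == 0:         # no meaningful split possible
            return [points]
        return divisive(left, min_size) + divisive(right, min_size)

    data = np.array([(1, 1), (2, 2), (5, 5), (6, 6)], dtype=float)  # assumed A, B, C, D
    for i, c in enumerate(divisive(data)):
        print(f"cluster {i + 1}: {c.tolist()}")       # -> [A, B] and [C, D]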

Advantages

Disadvantages

Use Cases

Agglomerative Clustering (Bottom-Up Hierarchical Clustering)

Agglomerative Clustering is a hierarchical clustering method where the algorithm starts by treating each data point as its own cluster. It then merges the closest clusters iteratively until all points belong to a single cluster or a stopping criterion is met.

How Agglomerative Clustering Works

  1. Start with each data point as its own cluster.

  2. Compute distances: Measure the pairwise distances between clusters using a distance metric (Euclidean, Manhattan, etc.).

  3. Merge closest clusters: Merge the two clusters that have the smallest distance.

  4. Update distance matrix: Recompute the distances between the new cluster and the remaining clusters.

  5. Repeat until one cluster remains or a stopping condition is met.

Linkage Methods (Distance Measures)

The choice of linkage affects how the distance between clusters is computed:

  1. Single Linkage:

    • Distance between the closest points of two clusters.

    • Tends to form elongated clusters.

  2. Complete Linkage:

    • Distance between the farthest points of two clusters.

    • Tends to form compact clusters.

  3. Average Linkage:

    • Average distance between all points in two clusters.

  4. Ward's Linkage:

    • Minimizes the increase in total variance when merging clusters.

    • Often produces better results.


Example

Consider the following points in 2D space:

Point    Coordinates
A        (1, 1)
B        (2, 2)
C        (5, 5)
D        (6, 6)

Step-by-Step Process (Using Single Linkage):

  1. Start with each point as its own cluster: [A],[B],[C],[D]

  2. Compute distances between clusters:

    • Distance(A, B) = 1.41

    • Distance(A, C) = 5.66

    • Distance(B, C) = 4.24

    • And so on.

  3. Merge the closest pair [A],[B] into a new cluster [A,B].

  4. Update the distance matrix and continue merging until all points form a single cluster.

Final Clustering (stopping at two clusters): [A,B] and [C,D]

Dendrogram Visualization

Agglomerative clustering can be visualized using a dendrogram, which shows the hierarchical merging process.
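Assuming SciPy and Matplotlib are available, the single-linkage example above can be reproduced and its dendrogram drawn with the following sketch:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

    # Points A, B, C, D from the example above
    X = np.array([(1, 1), (2, 2), (5, 5), (6, 6)], dtype=float)

    Z = linkage(X, method="single")        # single linkage on Euclidean distances
    print(Z)                               # each row: the two clusters merged, their distance, new cluster size

    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
    print(labels)                          # e.g. [1 1 2 2] -> {A, B} and {C, D}

    dendrogram(Z, labels=["A", "B", "C", "D"])
    plt.title("Single-linkage dendrogram")
    plt.show()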

Advantages

Disadvantages

Use Cases

K-Means Clustering:

K-Means clustering is an unsupervised learning algorithm used for partitioning data into K distinct clusters based on feature similarity. The goal is to minimize intra-cluster variance (i.e., how similar points within a cluster are) while maximizing inter-cluster distance.

Intuition Behind Clustering

Imagine you own a retail store, and you want to group customers based on their shopping behavior. Some customers buy only budget items, some focus on high-end products, and others are somewhere in between. K-Means can help automatically find these groups (clusters) without you specifying categories in advance.

How K-Means Works (Step-by-Step)

Step 1: Select the Number of Clusters K

Step 2: Initialize Centroids Randomly

Step 3: Cluster Assignment (E-step)

Step 4: Update Centroids (M-step)

Step 5: Repeat Steps 3 and 4

Example with Visualization

Dataset

Imagine the following 2D points representing customer spending behavior:

(1,1),(2,1),(4,3),(5,4)

Step 1: Initial Centroids

Choose two random centroids:

Step 2: Distance Calculation

Compute the Euclidean distance from each point to the centroids:

Cluster Assignment

After comparing distances, the points are divided into:

Step 3: Centroid Update

Compute new centroids:

Step 4: Repeat Until Convergence

After a few iterations, centroids stabilize at optimal positions.
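A minimal hand-rolled K-Means on the four points above is sketched below. The initial centroids are assumed (for illustration only) to be the first and last data points; the assignment step corresponds to Step 3 (E-step) and the centroid update to Step 4 (M-step).

    import numpy as np

    points = np.array([(1, 1), (2, 1), (4, 3), (5, 4)], dtype=float)
    K = 2
    centroids = points[[0, -1]].copy()     # assumed initial centroids: (1, 1) and (5, 4)

    for _ in range(10):                    # converges after a couple of iterations here
        # E-step: assign each point to its nearest centroid (squared Euclidean distance)
        d = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = d.argmin(axis=1)
        # M-step: move each centroid to the mean of the points assigned to it
        new_centroids = np.array([points[assign == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centroids, centroids):
            break                          # centroids stabilized, i.e. convergence
        centroids = new_centroids

    print("assignments:", assign)          # -> [0 0 1 1]
    print("centroids:  ", centroids)       # -> approximately [[1.5, 1.0], [4.5, 3.5]]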

Important Concepts in K-Means

How to Choose the Right K for K-Means Clustering:

Choosing the right number of clusters (K) in K-Means Clustering is crucial for meaningful and interpretable results. Selecting an inappropriate K can lead to poorly formed clusters. Below are several common methods to determine the optimal K:

1. The Elbow Method (Most Popular)

Steps:

  1. Run K-Means for a range of K values (e.g., from 1 to 10).

  2. Compute the Sum of Squared Errors (SSE), also called inertia, for each value of K:

     SSE = Σ (i = 1 to K) Σ (x in Ci) ||x - μi||²

     where:

    • Ci is the set of points in cluster i

    • μi is the centroid of cluster i

  3. Plot K against the SSE values.

  4. Look for the "elbow" point, where the SSE starts decreasing at a slower rate; that value of K is a good choice (see the sketch below).

Fuzzy C-Means Clustering

In this type, each data point xi can be assigned to a cluster Cj with a membership value µij, which represents the degree to which xi belongs to Cj. The membership value is assumed to range between 0 and 1, and for each data point xi, the sum of its membership values across all clusters is equal to 1.

The fuzzy c-means clustering algorithm follows an iterative scheme similar to that of the k-means algorithm. It begins by choosing c initial cluster centres from the data set, or by assigning initial membership values between each data point and the clusters. The following steps are then performed iteratively:

NOTE: for Fuzzy C-Means Clustering refer the text book.
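As a rough illustration only, the iterative steps of the standard fuzzy c-means formulation (membership update followed by centre update; this is the usual textbook form, not taken from these notes) can be sketched as follows:

    import numpy as np

    def fcm_memberships(X, centres, m=2.0, eps=1e-9):
        """Standard fuzzy c-means membership update:
        u[i, j] = 1 / sum_k (d(x_i, c_j) / d(x_i, c_k)) ** (2 / (m - 1)).
        Each row of the returned matrix sums to 1."""
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2) + eps
        ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
        return 1.0 / ratio.sum(axis=2)

    def fcm_centres(X, U, m=2.0):
        """Weighted-mean centre update: c_j = sum_i u_ij^m x_i / sum_i u_ij^m."""
        W = U ** m
        return (W.T @ X) / W.sum(axis=0)[:, None]

    # Tiny illustration with c = 2 clusters on made-up points
    X = np.array([(1, 1), (2, 1), (4, 3), (5, 4)], dtype=float)
    centres = X[[0, -1]].copy()                 # assumed initial centres
    for _ in range(20):
        U = fcm_memberships(X, centres)
        centres = fcm_centres(X, U)

    print(np.round(U, 2))                       # membership of each point in each cluster, values in [0, 1]
    print(U.sum(axis=1))                        # each row sums to 1, as stated above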