Chapter 3:
Chapter 4:
Chapter 5:
Agglomerative Clustering (Bottom-Up Hierarchical Clustering)
Soft Clustering
NOTE: For the missing topics and more information, refer to the TEXTBOOK and CLASS NOTES.
Decision Tree Algorithm as a Classifier
A Decision Tree is a supervised machine learning algorithm used for both classification and regression tasks. It works by splitting the dataset into branches based on decision rules derived from feature values. The process continues until a stopping criterion is met, resulting in a tree structure where each leaf node represents a class label.
Root Node: This is the starting point, representing the entire dataset.
Splitting: The algorithm selects the best feature and threshold to split the data, based on a criterion (e.g., Gini Index or Information Gain).
Decision Nodes: Intermediate points where the dataset is further divided.
Leaf Nodes: Terminal points that represent class labels.
Advantages
Easy to Understand: The logic is simple and intuitive.
No Need for Feature Scaling: Works well with both categorical and continuous features.
Handles Non-linear Relationships: Can model complex decision boundaries.
Disadvantages
Overfitting: Can create overly complex trees that fit the training data perfectly but generalize poorly.
Bias toward Features with More Levels: May prefer features with more unique values.
Instability: Small changes in the data can lead to a completely different tree.
Real-World Application
Medical Diagnosis: Classifying diseases based on symptoms
Credit Risk Assessment: Classifying loan applicants as high or low risk
Customer Segmentation: Classifying users for targeted marketing campaigns
Manufacturing: Quality control checks
A Decision Tree Classifier can use Entropy and Information Gain to determine the best feature for splitting the data at each step.
Entropy measures the amount of disorder or impurity in the dataset. It is calculated as:
Entropy(S) = - Σ (i = 1 to c) pi log2(pi)
Where:
c is the number of classes
pi is the proportion of samples belonging to class i
Entropy ranges from:
0: Pure set (all samples belong to one class)
1: Maximum disorder (equal probability for each class)
Information gain measures the reduction in entropy after a split:
Gain(S, A) = Entropy(S) - Σ (k = 1 to n) (|Sk| / |S|) x Entropy(Sk)
Where:
n is the number of child nodes
Sk is the size of the child subset
S is the total size of the parent set
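To make these definitions concrete, here is a minimal Python sketch (not from the textbook) that computes entropy and information gain for a candidate split; applied to the Play Tennis example below, it reproduces the hand-worked value of about 0.656 for the "Weather" split.

```python
import math

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the child subsets."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Play Tennis labels from the 8-sample table below, split on "Weather"
parent = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No"]
sunny, overcast, rain = ["No"] * 3, ["Yes"] * 2, ["Yes", "Yes", "No"]
print(round(information_gain(parent, [sunny, overcast, rain]), 3))  # -> 0.656
```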
Example: Consider the following Play Tennis dataset with 8 samples.
| Weather | Temperature | Humidity | Wind | Play Tennis |
| Sunny | Hot | High | Weak | No |
| Sunny | Hot | High | Strong | No |
| Overcast | Hot | High | Weak | Yes |
| Rain | Mild | High | Weak | Yes |
| Rain | Cool | Normal | Weak | Yes |
| Rain | Cool | Normal | Strong | No |
| Overcast | Cool | Normal | Strong | Yes |
| Sunny | Mild | High | Weak | No |
Step 1: Calculate Entropy of the Root Node
We have 8 samples: Yes: 4, No: 4.
Entropy(S) = -(4/8) x log2(4/8) - (4/8) x log2(4/8)
= 1
Step 2: Calculate Entropy of Each Weather Subset
Subset Weather = Sunny:
| Weather | Play Tennis |
| Sunny | No |
| Sunny | No |
| Sunny | No |
Total: 3 samples (No: 3, Yes: 0)
EntropySunny = 0 (the subset is pure)
Subset Weather = Overcast:
| Weather | Play Tennis |
| Overcast | Yes |
| Overcast | Yes |
Total: 2 samples (Yes: 2, No: 0)
EntropyOvercast = 0 (the subset is pure)
Subset Weather = Rain:
| Weather | Play Tennis |
| Rain | Yes |
| Rain | Yes |
| Rain | No |
Total: 3 samples (Yes: 2, No: 1)
EntropyRain = -(2/3) x log2(2/3) - (1/3) x log2(1/3)
= 0.389 + 0.528
= 0.917
Step 3: Weighted Entropy for the "Weather" Split
EntropyWeather = (3/8) x 0 + (2/8) x 0 + (3/8) x 0.917 = 0.344
Step 4: Information Gain for the "Weather" Split
Gain(S, Weather) = 1 - 0.344
= 0.656
The Information Gain for splitting on "Weather" is 0.656. You would repeat this process for the other features (Temperature, Humidity and Wind) and choose the feature with the highest information gain as the root of the decision tree.
Step 5: Repeat for the Other Features
For Temperature:
| Temperature | Play Tennis |
| Hot | No, No, Yes |
| Mild | Yes, No |
| Cool | Yes, Yes, No |
Hot: 3 samples (Yes: 1, No: 2), EntropyHot = 0.918
Mild: 2 samples (Yes: 1, No: 1), EntropyMild = 1
Cool: 3 samples (Yes: 2, No: 1), EntropyCool = 0.918
Weighted Entropy for "Temperature"
EntropyTemperature = (3/8) x 0.918 + (2/8) x 1 + (3/8) x 0.918 = 0.938
Information Gain for "Temperature"
= 1 - 0.938
= 0.062
For Humidity:
| Humidity | Play Tennis |
| High | No, No, No, Yes, Yes |
| Normal | Yes, Yes, No |
High: 5 samples (Yes: 2, No: 3)
EntropyHigh = -(2/5) x log2(2/5) - (3/5) x log2(3/5)
= 0.4 x 1.322 + 0.6 x 0.737
= 0.971
Normal: 3 samples (Yes: 2, No: 1)
EntropyNormal = -(2/3) x log2(2/3) - (1/3) x log2(1/3)
= 0.667 x 0.585 + 0.333 x 1.585
= 0.918
Weighted Entropy for "Humidity"
EntropyHumidity = (5/8) x 0.971 + (3/8) x 0.918
= 0.951
Information Gain for "Humidity"
= 1 - 0.951
= 0.049
For Wind:
| Wind | Play Tennis |
| Weak | No, Yes, Yes, Yes, No |
| Strong | No, Yes, No |
Weak: 5 samples (Yes: 3, No: 2)
EntropyWeak = -(3/5) x log2(3/5) - (2/5) x log2(2/5)
= 0.6 x 0.737 + 0.4 x 1.322
= 0.971
Strong: 3 samples (Yes: 1, No: 2)
EntropyStrong = -(1/3) x log2(1/3) - (2/3) x log2(2/3)
= 0.528 + 0.389
= 0.917
Weighted Entropy for "Wind"
EntropyWind = (5/8) x 0.971 + (3/8) x 0.917
= 0.951
Information Gain for "Wind"
= 1 - 0.951
= 0.049
Step 6: Choosing the Best Split
| Feature | Information Gain |
| Weather | 0.656 |
| Temperature | 0.062 |
| Humidity | 0.049 |
| Wind | 0.049 |
Since "Weather" has the highest Information Gain (0.656), we choose it as the root node.
Final Decision Tree (Partial)
Weather
/ | \
Sunny Overcast Rain
(No) (Yes) Further Split
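The same tree can also be grown automatically. Below is a hedged sketch assuming scikit-learn and pandas are available; the categorical features are one-hot encoded (so the learned splits are on indicator columns such as Weather_Overcast rather than the raw categories), and criterion="entropy" corresponds to the information-gain computation above.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# The 8-sample Play Tennis table used in the worked example
data = pd.DataFrame({
    "Weather":     ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast", "Sunny"],
    "Temperature": ["Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool", "Mild"],
    "Humidity":    ["High", "High", "High", "High", "Normal", "Normal", "Normal", "High"],
    "Wind":        ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong", "Weak"],
    "PlayTennis":  ["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No"],
})

X = pd.get_dummies(data.drop(columns="PlayTennis"))   # one-hot encode the categories
y = data["PlayTennis"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))  # textual view of the learned splits
```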
Decision Tree Properties:
1. Hierarchical Structure
This hierarchical representation allows decision trees to efficiently handle complex decision-making processes.
2. Recursive Partitioning
For instance, in a loan approval model, the first split might be based on credit score, followed by splits on income level and employment status.
3. Splitting Criteria
Gini Impurity:
Gini = 1 - Σ (i = 1 to c) pi²
where pi is the probability of class i. Lower Gini values indicate purer splits.
Entropy (Information Gain):
Information Gain measures the reduction in entropy after a split.
Mean Squared Error (MSE) for Regression:
MSE = (1/n) Σ (yi - ȳ)², where ȳ is the mean target value in the node.
These criteria guide the tree to create splits that improve prediction accuracy.
4. Non-parametric and Non-linear
5. Interpretability
The ability to visualize decision trees makes them particularly appealing in applications where model transparency is essential.
6. Prone to Overfitting
7. Feature Importance
8. Robustness to Outliers
9. Handle Mixed Data Types
10. Computational Complexity
A Decision Tree Regressor is a variant of the Decision Tree algorithm used for predicting continuous, numerical target values. Instead of classifying data into categories, it aims to fit the data by dividing it into regions with constant target values.
Recursive Partitioning:
Prediction:
Instead of using metrics like Gini impurity or entropy, regression trees rely on variance reduction techniques:
Mean Squared Error (MSE):
The algorithm chooses the split that minimizes the MSE of the target values in the child nodes.
Mean Absolute Error (MAE):
Sometimes used as a robust alternative.
Imagine predicting house prices based on features like the number of rooms and lot size: the tree might, for example, split lot size into two regions (< 5000 sq ft and ≥ 5000 sq ft).
We have a small dataset where the square footage of houses (X) and their corresponding prices (y) are known. We want to predict the price of a house given its square footage.
| Square Footage (X) | House Price (y) ($) |
| 1500 | 200,000 |
| 1700 | 220,000 |
| 2500 | 350,000 |
| 2700 | 370,000 |
| 3000 | 400,000 |
Splitting the Data:
The decision tree algorithm splits the feature space (square footage) to minimize the variance in each region.
First Split:
Let's say the algorithm decides the best first split is at 2500 sq ft. This creates two regions: X ≤ 2500 sq ft (mean price ≈ $240,000) and X > 2500 sq ft (mean price ≈ $390,000).
Further Splitting:
The algorithm continues splitting until a stopping criterion is met (like minimum samples per node or maximum depth).
After further splits, we might have additional thresholds at 1700 sq ft and 2700 sq ft, as visualized below.
The splits can be visualized in a decision tree:
                 [All Data]
                /          \
         X ≤ 2500          X > 2500
          /     \            /     \
   X ≤ 1700   X > 1700  X ≤ 2700   X > 2700
Suppose we want to predict the price for a house with 2600 sq ft. It falls into the region 2500 < X ≤ 2700, which has a mean price of $360,000, so the tree predicts $360,000.
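As an illustration only (assuming scikit-learn is installed), the sketch below fits a depth-limited regression tree to the five-house table and queries it for a 2600 sq ft house; the prediction is the mean price of the leaf that 2600 sq ft falls into.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1500], [1700], [2500], [2700], [3000]])          # square footage
y = np.array([200_000, 220_000, 350_000, 370_000, 400_000])     # house prices

reg = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
print(reg.predict([[2600]]))   # mean price of the leaf containing 2600 sq ft
```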
A Random Forest is a powerful machine learning algorithm that builds a collection (or "forest") of decision trees during training time. For both classification and regression tasks, it makes predictions by aggregating the outputs of multiple trees.
Bootstrapping (Bagging Technique): Each tree is trained on a random sample of the training data drawn with replacement.
Feature Randomness: At each split, only a random subset of the features is considered as split candidates.
Aggregation: The final prediction is the majority vote of the trees (classification) or the average of their outputs (regression).
Suppose we want to predict house prices based on square footage and the number of bedrooms.
| Square Footage | Bedrooms | Price ($) |
| 1500 | 3 | 200,000 |
| 1700 | 4 | 240,000 |
| 2500 | 4 | 350,000 |
| 2700 | 5 | 400,000 |
| 3000 | 4 | 500,000 |
Data Bootstrapping:
Suppose there are 3 decision trees in the forest. Each tree receives a bootstrapped sample of the dataset, for example:
Tree 1: [1500, 3, 200,000], [2700, 5, 400,000], [3000, 4, 500,000]
Tree 2: [1700, 4, 240,000], [2500, 4, 350,000], [1500, 3, 200,000]
Feature Randomness:
At a given split, a tree may be allowed to consider only square footage or only bedrooms, not both. For example, Tree 1 might split on Square Footage > 2500 while Tree 2 splits on Bedrooms > 3.
Prediction Aggregation:
Suppose we want to predict the price for a 2600 sq ft house with 4 bedrooms. Each tree produces its own prediction, and the forest averages them.
Final Prediction:
The predicted prices for the test houses are $210,500 and $390,000.
Feature Importance:
The plot shows whether Square Footage or Bedrooms contributed more to the decision-making process.
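A hedged scikit-learn sketch of the same workflow is shown below; n_estimators=3, max_features=1 and the random seed are illustrative choices, and the exact prediction varies with the random bootstraps.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.array([[1500, 3], [1700, 4], [2500, 4], [2700, 5], [3000, 4]])  # sq ft, bedrooms
y = np.array([200_000, 240_000, 350_000, 400_000, 500_000])

# max_features=1 means each split considers a single randomly chosen feature
forest = RandomForestRegressor(n_estimators=3, max_features=1, random_state=0).fit(X, y)
print(forest.predict([[2600, 4]]))   # average of the three trees' predictions

# Relative contribution of each feature to the splits
print(dict(zip(["SquareFootage", "Bedrooms"], forest.feature_importances_)))
```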
Introduction to The Bayes Classifier
The Bayes classifier is an optimal classifier. It operates based on the probability structure associated with the domain of application. It employs Bayes' rule to convert the a priori probability of a class into posterior probability with the help of probability distributions associated with the class. These posterior probabilities are used to classify the test patterns; the pattern is assigned to the class that has the largest posterior probability.
Some of the important properties of Bayesian classifiers are as follows:
It minimizes the probability of error associated with classification.
It can deal with data that employs both categorical and numerical features.
It is more of a benchmark classifier having sound theoretical properties. However, in most practical applications, the underlying probability structure is not readily available.
Some additional constraints on the probability structure are applied to make estimation of the probability structure simpler. For example, the naïve Bayes classifier (NBC) is one that assumes that features are independent of each other, given that the data points belong to a class.
Primarily, there are two schemes for estimating the probabilities associated. One of them depends solely on the data; it is called the maximum likelihood estimate or the frequency-based estimate. The other scheme is more general and it combines application domain knowledge with the data in estimation; it is called the Bayesian estimation or Bayesian learning of the probability structure.
Probability, Conditional Probability and Bayes' Rule:
Let us start with a simple scenario where domain knowledge is used in the form of prior probabilities to classify patterns.
Example: Let us assume that 10,000 people in a community had undergone the COVID-19 test and 50 of them tested positive, while the remaining 9950 tested negative. In this two-class (binary class) problem, a simple frequency estimate will give us the following values for the probabilities for the two classes:
P(positive)= 50/10000 = 0.005 and P(negative) = 9950/10000 = 0.995
These probabilities are called prior probabilities as they are obtained using domain knowledge.
Suppose a new person from the community, who has not undergone the test, needs to be classified. In the absence of any other information about the person, one would use the prior probabilities in decision making; so the new person is assigned a COVID-19-negative label, as the probability P(negative) is significantly larger than P(positive) (0.995 >> 0.005). So, invariably, every other person in the community who has not undergone the test will be classified as COVID-19-negative using this rule of classification, which is driven by the larger prior probability of being COVID-19-negative. Such a classification is erroneous if the new person is actually COVID-19-positive. This can occur with a probability of 0.005 (P(positive)); so, the probability of error is 0.005. Note that every person who is actually COVID-19-negative is correctly classified in this process.
In the KNN classifier, if the value of k = n, where n is the size of the training data set, and if we have k1 (out of k) neighbours from the COVID-19-positive class and k2 (= k - k1) from the COVID-19-negative class, then the probability estimates are
P(positive)= k1/n and P(negative) = k2/n
It is not difficult to see that KNN gives the same result as classification based on prior probabilities.
In a random experiment, we have a sample space, S, that is, the set of all outcomes. For example, tossing a coin gives us {H, T} as the sample space, where H stands for head and T stands for tail.
An event is a subset of the sample space. We associate probabilities with events. If A is an event, then its probability P(A) ∈ [0, 1]; that is, a probability is non-negative and upper bounded by (less than or equal to) 1.
If A and B are disjoint events (A ∩ B = ∅, where A ∩ B is the intersection of the sets A and B and ∅ is the null or empty set), then P(A ∪ B) = P(A) + P(B), where A ∪ B is the union of the sets A and B. This property holds for a countable union of events if they are pairwise disjoint.
Example: If a coin is tossed twice, the sample space is {HH, HT, TH, TT}. If A = {HH, HT} and B = {TT}, then A ∩ B = ∅. Further, P(A) = 2/4 as A contains 2 of the 4 outcomes, and P(B) = 1/4. Note that A ∪ B = {HH, HT, TT}, so
P(A ∪ B)= 3/4 = P(A) + P(B) = 2/4 + 1/4
However, if A = {HH, HT} and C = {HT, TT}, then A ∩ C = {HT} ≠ ∅. So, P(A ∪ C) = 3/4, whereas P(A) = 2/4 and P(C) = 2/4; here P(A ∪ C) ≠ P(A) + P(C). In general, it is possible to show that
P(A ∪ B)= P(A) + P(B) - P(A ∩ B)
If an event is described as at least one tail, then the event is {HT,TH,TT} and the probability of the event is 3/4 as out of 4 elements of the sample space, 3 elements are favourable to the event.
Given an event A, its complement Ac is given by the set difference S - A. The figure below shows the regions corresponding to various related events, given two events A and B. Note that A ∩ Bc is the intersection of the events A and Bc, the region for A ∩ B is the intersection of the events A and B, and the event Ac ∩ B corresponds to the intersection of the events Ac and B. Finally, the remaining region indicates the intersection of the events Ac and Bc.
FIG. Some events related to two given events A and B
Conditional Probability
We need to update the probability values if new information is given. Consider the following example.
Example: Consider tossing a coin twice. We have seen earlier that the probability of at least one tail is 3/4; the corresponding event is {HT, TH, TT}. If we are given the additional information that one of the tosses has resulted in a head, then the sample space is constrained to {HT, HH, TH}. So, under the condition that one toss has resulted in a head, the sample space shrinks. Now, for at least one tail, the event in the new sample space is {HT, TH}. So, the probability has reduced from 3/4 to 2/3 in the presence of more information. This computation is captured by the notion of conditional probability.
If A and B are two events such that P(B) ≠ 0, then
P(A|B)= P(A ∩ B)/ P(B)
It is the probability of A conditioned on B; it is undefined if P(B) = 0. In the current example, if A is the event specified by at least one tail and B is the event specified by one toss resulting in a head, then B = {HH, HT, TH}, so P(B) = 3/4. Note that A ∩ B = {TH, HT}. So,
P(A|B)= (2/4)/(3/4) = 2/3
as computed earlier in this example.
We have another important concept called independent events. We say that two events A and B are independent if P(A ∩ B)= P(A) * P(B) . So, if A and B are independent, then
P(A|B)= P(A ∩ B)/P(B) = (P(A) * P(B))/(P(B)) = P(A)
Total Probability
FIG. An example to illustrate total probability
Consider the Venn diagram shown in the figure above. Here, event A is represented by the elliptical region and there are 4 events C1, C2, C3 and C4 that overlap with A. We can represent A using these overlapping sets by
A =(A ∩ C 1 ) ∪ (A ∩ C 2 ) ∪ (A ∩ C 3 ) ∪ (A ∩ C 4 )
The four overlapping sets in the union are disjoint as C1, C2, C3 and C4 are disjoint. Let
Bi =A ∩ Ci for i = 1,2,3,4
So, P(Bi) =P(A|Ci )P(Ci )
So, P(A) = P(B1) + P(B2) + P(B3) + P(B4)
Hence,
P(A) =P(A|C1 )P(C1 )+P(A|C2 )P(C2 )+P(A|C3 )P(C3 )+P(A|C4 )P(C4 )
We know that P(Ci |A)= P(A|Ci ) P(Ci)/P(A) .
So, in general
P(Ci|A)= [P(A|Ci)P(Ci)] / [P(A|C1)P(C1) + P(A|C2)P(C2 ) + P(A|C3)P(C3) +P(A|C4)P(C4)]
Bayes' theorem provides a way to update our beliefs (probabilities) about events based on new evidence. It is a fundamental concept in probability and statistics, widely used in machine learning, decision theory, and other fields.
The mathematical formula for Bayes' theorem is:
P(A | B) = P(B | A) P(A) / P(B)
Where:
P(A | B) is the posterior probability of A given B, P(B | A) is the likelihood of B given A, P(A) is the prior probability of A, and P(B) is the evidence (the total probability of B).
EXAMPLE: Let C1 and C2 be two chests such that C1 has 20 white balls (WB) and 10 red balls (RB); C2 has 15 WBs and 15 RBs.
If one of the two chests is picked with equal probability and a ball randomly picked from the chest is WB, what is the probability that it came from C1? We have the following information:
Prior: P(C1) = P(C2) = 1/2
Given: P(WB | C1) = 2/3, P(RB | C1) = 1/3 and P(WB | C2) = P(RB | C2) = 1/2
Needed: P(C1 | WB) = [ P(WB | C1) P(C1) ] / [ P(WB | C1) P(C1) + P(WB | C2) P(C2) ]
= ((2/3)(1/2)) / ((2/3)(1/2) + (1/2)(1/2)) = 4/7
The Bayes Classifier And Its Optimality
Suppose there are two classes: C1(Normal) and C2(Abnormal)
Let P(C1) = 0.9 and P(C2) = 0.1
Given a new pattern X, we compute the posterior using Bayes' rule:
P(C1 | X) = P(X | C1) P(C1) / P(X)
Similarly,
P(C2 | X) = P(X | C2) P(C2) / P(X)
Bayes' rule helps us in converting priors to posteriors.
Note that P(X) = P(X | C1) P(C1) + P(X | C2) P(C2) is the evidence for X and is also a normalizer that ensures P(C1 | X) + P(C2 | X) = 1.
After observing X, the responsibility shifts from the priors to the posteriors: assign X to C1 if P(C1 | X) > P(C2 | X).
Example: Suppose that, for the observed pattern X = 2, the density conditioned on C1 takes the value P(X | C1) = 0.5 and the density conditioned on C2 takes the value P(X | C2) = 0.2.
These are also called class conditional densities, and both are uniform in this example.
Let the prior probabilities be equal, that is P(C1) = P(C2) = 0.5
Then P(C1 | X) = (0.5 × 0.5) / (0.5 × 0.5 + 0.2 × 0.5) ≈ 0.71 and P(C2 | X) = (0.2 × 0.5) / (0.5 × 0.5 + 0.2 × 0.5) ≈ 0.29.
So, the pattern X (= 2) is assigned to class C 1 as P(C1 | X) > P(C2 | X).
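A minimal Python check of this two-class computation, using the priors and class-conditional values quoted above, is shown below.

```python
# Class-conditional density values at the observed X and equal priors
likelihood = {"C1": 0.5, "C2": 0.2}
prior = {"C1": 0.5, "C2": 0.5}

evidence = sum(likelihood[c] * prior[c] for c in prior)                # P(X)
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}    # Bayes' rule
print(posterior)                              # C1 ~ 0.71, C2 ~ 0.29
print(max(posterior, key=posterior.get))      # assign X to the class with the larger posterior
```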
Multi-Class Classification
We have discussed the use of the Bayes classifier in the two-class case. It can be easily used to deal with multi-class cases, that is, when the number of classes is more than 2. It may be described as follows:
Let the classes be C1, C2 , . . ., C q , where q >= 2
Let the prior probabilities be P(C1), P(C2), . . ., P(Cq)
Let x be the test pattern to be classified as belonging to one of these q classes.
Compute the posterior probabilities using Bayes' rule:
P(Ci | x) = P(x | Ci) P(Ci) / [ P(x | C1) P(C1) + P(x | C2) P(C2) + ... + P(x | Cq) P(Cq) ]
Assign the test pattern x to class Ct if P(Ct | x) ≥ P(Ci | x) for i = 1, 2, ..., q
In the case of a tie (two or more of the largest-valued posteriors are equal), assign arbitrarily to any one of the corresponding classes. In practice, breaking the tie arbitrarily is the prescription suggested for any ML model.
In this case, the probability of error is the sum of the posterior probabilities of the remaining q - 1 classes. Since the posteriors across all q classes add up to 1, this error probability equals 1 - P(Ct | x).
Naive Bayes Classifier:
Assumes independence between features for computational efficiency. Variants include Gaussian, Multinomial and Bernoulli Naive Bayes (for continuous, count and binary features, respectively).
Optimal Bayes Classifier:
Theoretically perfect classifier but computationally infeasible in practice.
The Naive Bayes Classifier is a simple yet powerful machine learning algorithm based on Bayes' theorem. It is called "naive" because it assumes that all features are independent of each other, which is often unrealistic but works well in practice.
Given a dataset with n features x1,x2, …, xn and class labels c1,c2, …, ck, the task is to predict the class c for a new instance with feature vector x.
Using Bayes' Theorem together with the naive independence assumption, the probability of class ci given the feature vector x is:
P(ci | x) = [ P(ci) × P(x1 | ci) × P(x2 | ci) × ... × P(xn | ci) ] / P(x)
Where:
P(ci) is the prior probability of class ci.
P(xj∣ci) is the likelihood of feature xj given class ci.
The naive assumption simplifies the computation by assuming that each feature is conditionally independent.
The decision rule is to predict the class with the largest posterior:
c = argmax over ci of P(ci) × P(x1 | ci) × P(x2 | ci) × ... × P(xn | ci)
Consider a dataset for predicting whether a person will buy a product (c ∈ {Yes, No}) based on three features:
Age: Young, Middle-aged, Old
Income: Low, Medium, High
Student Status: Yes, No
| Age | Income | Student | Buys Product |
| Young | High | No | No |
| Young | High | Yes | Yes |
| Middle-aged | Medium | No | Yes |
| Old | Low | No | No |
| Old | Medium | No | Yes |
| Young | Medium | Yes | Yes |
| Middle-aged | High | No | Yes |
| Old | High | No | No |
Compute Priors:
Calculate the prior probabilities of each class P(C) from the training data.
Compute Likelihoods:
Estimate the likelihood P(X | C) for each feature and class.
Apply Bayes' Theorem:
Use Bayes' theorem to compute the posterior probabilities for each class.
Prediction:
Choose the class with the highest posterior probability.
Example: Classifying emails as Spam or Not Spam based on two binary features, "Free Offer" and "Money".
| Free Offer | Money | Label |
| 1 | 1 | Spam |
| 0 | 1 | Not Spam |
| 1 | 0 | Spam |
| 0 | 0 | Not Spam |
Step 1: Calculate Prior Probabilities
We first determine the prior probabilities of spam and not spam.
P(Spam)=2/4=0.5
P(Not Spam)=2/4=0.5
For each feature ("Free Offer" and "Money"), we calculate the probability given spam and Not spam.
For Free Offer:
| Free Offer | Spam | Not Spam | P(Free Offer | Spam) | P(Free Offer | Not Spam) |
| 0 | 0 | 2 | 0/2 = 0 | 2/2 = 1 |
| 1 | 2 | 0 | 2/2 = 1 | 0/2 = 0 |
| Total | 2 | 2 | | |
For Money:
| Money | Spam | Not Spam | P(Money | Spam) | P(Money | Not Spam) |
| 0 | 1 | 1 | 1/2 | 1/2 |
| 1 | 1 | 1 | 1/2 | 1/2 |
| Total | 2 | 2 | | |
Step 3: Classify a New Email
Let's say we have a new email with Free Offer = 1 and Money = 1.
We calculate the posterior probabilities using Bayes' theorem:
P(Spam∣Free Offer=1,Money=1)∝P(Free Offer=1∣Spam)P(Money=1∣Spam)P(Spam)
Substituting values:
P(Spam∣1,1)∝(1)×(0.5)×(0.5)=0.25
P(Not Spam∣Free Offer=1,Money=1)∝P(Free Offer=1∣Not Spam)P(Money=1∣Not Spam)P(Not Spam)
Substituting values:
P(Not Spam∣1,1)∝(0)×(0.5)×(0.5)=0
Since P(Spam∣1,1)=0.25 and P(Not Spam∣1,1)=0, the email is classified as Spam.
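The same counting can be scripted. The following minimal sketch (an illustration, not a library implementation) reproduces the hand calculation directly from the four training rows, without any smoothing.

```python
# Training rows: ((Free Offer, Money), label)
rows = [((1, 1), "Spam"), ((0, 1), "Not Spam"), ((1, 0), "Spam"), ((0, 0), "Not Spam")]

def naive_bayes_score(x, label):
    """Unnormalized posterior: prior times the product of per-feature likelihoods."""
    in_class = [features for features, y in rows if y == label]
    score = len(in_class) / len(rows)                       # prior P(label)
    for j, value in enumerate(x):
        score *= sum(1 for f in in_class if f[j] == value) / len(in_class)
    return score

new_email = (1, 1)   # Free Offer = 1, Money = 1
for label in ("Spam", "Not Spam"):
    print(label, naive_bayes_score(new_email, label))       # 0.25 vs 0.0 -> Spam
```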
Example: Classifying a pattern described by three binary features into class C0 or C1.
| Pattern | Feature1 | Feature2 | Feature3 | Class |
| 1 | 0 | 0 | 0 | C0 |
| 2 | 1 | 0 | 1 | C1 |
| 3 | 1 | 0 | 0 | C0 |
| 4 | 1 | 1 | 1 | C1 |
| 5 | 0 | 1 | 1 | C1 |
| 6 | 0 | 1 | 1 | C0 |
| 7 | 1 | 0 | 1 | ? |
Step 1: Calculate Prior Probabilities
We first determine the prior probabilities of C0 and C1.
P(C0) = 3/6=1/2
P(C1) = 3/6=1/2
For each feature ("Feature1", “Feature2” and " Feature3"), we calculate the probability given C0 and C1.
For Feature1:
| Feature1 | C0 | C1 | P(Feature1 | C0) | P(Feature1 | C1) |
| 0 | 2 | 1 | 2/3 | 1/3 |
| 1 | 1 | 2 | 1/3 | 2/3 |
| Total | 3 | 3 | | |
For Feature2:
| Feature2 | C0 | C1 | P(Feature2 | C0) | P(Feature2 | C1) |
| 0 | 2 | 1 | 2/3 | 1/3 |
| 1 | 1 | 2 | 1/3 | 2/3 |
| Total | 3 | 3 | | |
For Feature3:
| Feature3 | C0 | C1 | P(Feature3 | C0) | P(Feature3 | C1) |
| 0 | 2 | 0 | 2/3 | 0/3 |
| 1 | 1 | 3 | 1/3 | 3/3 |
| Total | 3 | 3 | | |
Let's say we have a new pattern (Pattern 7) with Feature1 = 1, Feature2 = 0 and Feature3 = 1.
We calculate the posterior probabilities using Bayes' theorem:
P(C0 | 1, 0, 1) ∝ P(Feature1=1 | C0) × P(Feature2=0 | C0) × P(Feature3=1 | C0) × P(C0)
Substituting values:
P(C0 | 1, 0, 1) ∝ (1/3) × (2/3) × (1/3) × (1/2) = 2/54 ≈ 0.037
P(C1 | 1, 0, 1) ∝ P(Feature1=1 | C1) × P(Feature2=0 | C1) × P(Feature3=1 | C1) × P(C1)
Substituting values:
P(C1 | 1, 0, 1) ∝ (2/3) × (1/3) × (3/3) × (1/2) = 2/18 ≈ 0.111
Since P(C1 | 1, 0, 1) > P(C0 | 1, 0, 1), the pattern is classified as C1.
Let's consider a weather dataset for predicting whether a person will play tennis based on weather conditions. This dataset has the following features:
Outlook: Sunny, Overcast, Rain
Temperature: Hot, Mild, Cool
Humidity: High, Normal
Wind: Weak, Strong
Target: Play Tennis (Yes or No)
| Outlook | Temperature | Humidity | Wind | Play Tennis |
| Sunny | Hot | High | Weak | No |
| Sunny | Hot | High | Strong | No |
| Overcast | Hot | High | Weak | Yes |
| Rain | Mild | High | Weak | Yes |
| Rain | Cool | Normal | Weak | Yes |
| Rain | Cool | Normal | Strong | No |
| Overcast | Cool | Normal | Strong | Yes |
| Sunny | Mild | High | Weak | No |
| Sunny | Cool | Normal | Weak | Yes |
| Rain | Mild | Normal | Weak | Yes |
| Sunny | Mild | Normal | Strong | Yes |
| Overcast | Mild | High | Strong | Yes |
| Overcast | Hot | Normal | Weak | Yes |
| Rain | Mild | High | Strong | No |
Step 1: Calculate Prior Probabilities
P(Yes) = 9/14, P(No) = 5/14
Step 2: Calculate Likelihoods
For each feature ("Outlook", "Temperature", "Humidity" and "Wind"), we calculate the probability given Yes and No.
For Outlook:
| Outlook | Yes | No | P(Outlook | Yes) | P(Outlook | No) |
| Sunny | 2 | 3 | 2/9 | 3/5 |
| Overcast | 4 | 0 | 4/9 | 0/5 |
| Rain | 3 | 2 | 3/9 | 2/5 |
| Total | 9 | 5 | | |
For Temperature:
| Temperature | Yes | No | P(Temperature | Yes) | P(Temperature | No) |
| Hot | 2 | 2 | 2/9 | 2/5 |
| Mild | 4 | 2 | 4/9 | 2/5 |
| Cool | 3 | 1 | 3/9 | 1/5 |
| Total | 9 | 5 | | |
For Humidity:
| Humidity | Yes | No | P(Humidity | Yes) | P(Humidity | No) |
| High | 3 | 4 | 3/9 | 4/5 |
| Normal | 6 | 1 | 6/9 | 1/5 |
| Total | 9 | 5 | | |
For Wind:
| Wind | Yes | No | P(Wind | Yes) | P(Wind | No) |
| Weak | 6 | 2 | 6/9 | 2/5 |
| Strong | 3 | 3 | 3/9 | 3/5 |
| Total | 9 | 5 | | |
Step 3: Classify a New Instance
Input: Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong
P(Yes | Sunny, Cool, High, Strong) ∝ P(Yes) × P(Sunny | Yes) × P(Cool | Yes) × P(High | Yes) × P(Strong | Yes)
∝ 9/14 × 2/9 × 3/9 × 3/9 × 3/9
∝ 0.0053
P(No | Sunny, Cool, High, Strong) ∝ P(No) × P(Sunny | No) × P(Cool | No) × P(High | No) × P(Strong | No)
∝ 5/14 × 3/5 × 1/5 × 4/5 × 3/5
∝ 0.0206
Since P(No | X) > P(Yes | X), the instance is classified as No (won't play tennis).
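For comparison, here is a hedged scikit-learn sketch of the same prediction using CategoricalNB; alpha is set close to zero only so that the probabilities match the unsmoothed hand calculation (in practice some Laplace smoothing would normally be kept).

```python
import pandas as pd
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

rows = [["Sunny","Hot","High","Weak","No"], ["Sunny","Hot","High","Strong","No"],
        ["Overcast","Hot","High","Weak","Yes"], ["Rain","Mild","High","Weak","Yes"],
        ["Rain","Cool","Normal","Weak","Yes"], ["Rain","Cool","Normal","Strong","No"],
        ["Overcast","Cool","Normal","Strong","Yes"], ["Sunny","Mild","High","Weak","No"],
        ["Sunny","Cool","Normal","Weak","Yes"], ["Rain","Mild","Normal","Weak","Yes"],
        ["Sunny","Mild","Normal","Strong","Yes"], ["Overcast","Mild","High","Strong","Yes"],
        ["Overcast","Hot","Normal","Weak","Yes"], ["Rain","Mild","High","Strong","No"]]
df = pd.DataFrame(rows, columns=["Outlook", "Temperature", "Humidity", "Wind", "PlayTennis"])

enc = OrdinalEncoder()                                   # map categories to integer codes
X = enc.fit_transform(df.drop(columns="PlayTennis").values)
y = df["PlayTennis"]

model = CategoricalNB(alpha=1e-9).fit(X, y)
query = enc.transform([["Sunny", "Cool", "High", "Strong"]])
print(dict(zip(model.classes_, model.predict_proba(query)[0])))   # "No" has the larger probability
```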
Why Naive Bayes Works Well
Efficient: Simple and fast to compute.
Robust to irrelevant features (they cancel out in the probability product).
Performs surprisingly well in text classification tasks like spam detection.
Limitations
Assumes feature independence, which may not hold in many real-world problems.
Poor performance with correlated features.
Struggles with small datasets.
The bias-variance tradeoff is a fundamental concept in machine learning that describes the tradeoff between two sources of error that affect model performance: bias and variance.
Bias
Bias refers to errors due to incorrect assumptions in the learning algorithm. A high-bias model is too simplistic and fails to capture the underlying pattern of the data. This leads to underfitting.
Characteristics of high bias:
Simple models (e.g., linear regression for a complex dataset)
Poor performance on both training and test data
Ignores relevant patterns in data
Example:
Trying to fit a linear model to a dataset with a quadratic relationship. The model cannot capture the curvature, leading to high error.
Variance
Variance refers to errors due to sensitivity to small fluctuations in the training data. A high-variance model learns the noise along with the signal, leading to overfitting.
Characteristics of high variance:
Complex models (e.g., deep neural networks with excessive parameters)
Good performance on training data but poor generalization to test data
Captures noise as if it were signal
Example:
A deep decision tree that perfectly memorizes training data but fails to generalize to unseen examples.
The Tradeoff
Low bias, high variance → Overfitting
High bias, low variance → Underfitting
Optimal model → A balance between bias and variance, minimizing total error
To strike the right balance and address bias and variance issues, consider the following solutions:
Regularization: Regularization techniques like L1 (Lasso) and L2 (Ridge) can help mitigate overfitting. These methods add penalty terms to the model’s cost function, discouraging it from becoming overly complex.
Feature Engineering: Thoughtful feature selection and engineering can reduce both bias and variance. By including relevant features and excluding noisy ones, you can improve model performance.
Cross-Validation: Utilize cross-validation to assess your model’s performance on different subsets of the data. This helps you gauge how well your model generalizes across various data splits, providing valuable insights into bias and variance.
Ensemble Methods: Ensemble techniques such as Random Forests and Gradient Boosting combine multiple models to achieve better performance. They can effectively reduce overfitting while improving predictive accuracy.
Collect More Data: If your model suffers from high bias (underfitting), acquiring more data can help it capture more complex patterns. Additional data can be especially beneficial when dealing with deep neural networks.
Linear Discriminant Function:
A linear discriminant function is a mathematical function used for classification that helps determine which class a given input belongs to by defining a linear decision boundary
A linear discriminant function is a function of the form:
g(w, x) = wTx+b
where:
That is, let w = (w(1), w(2), ..., w(l)) be an l-dimensional weight vector and let x = (x(1), x(2), ..., x(l)) be a feature vector.
The classification decision is made as follows: assign x to the positive class if g(w, x) > 0 and to the negative class if g(w, x) < 0.
This means that the decision boundary is given by the hyperplane:
wTx+b=0
The linear discriminant function can be written as
g(w, x) = wTx + b = w(1)x(1) + w(2)x(2) + ... + w(l)x(l) + b
where w is the weight vector and b the bias or threshold weight.
EXAMPLE: Let x1 + x2 > 0 or x1 + x2 - 0.5 > 0 be linear discriminant functions; here w = (1, 1) and b = 0 for the former, and w = (1, 1), b = -0.5 for the latter.
Both of them represent the Boolean OR function. Consider Fig. 1, which shows the Boolean OR and AND functions. Here, x1 and x2 are binary features taking values of either 1 (True) or 0 (False). Note that OR(x1, x2) is 0 only when both x1 and x2 are 0, else it is 1 (True). Similarly, AND(x1, x2) is 1 only when both x1 and x2 are 1.
In the case of OR, the decision boundary (line) characterizes points that satisfy x1 + x2= 0.5. Points that give output 1 satisfy the property that x1 + x2 > 0.5 and a point for which output is 0 satisfies x1 + x2 < 0.5.
In the case of AND, a point x = (x1, x2) on the decision line satisfies x1 + x2 = 1.5. A point that gives 1 as output satisfies x1 + x2 > 1.5, and points that give 0 as output satisfy x1 + x2 < 1.5.
Fig 2: Truth table for Boolean OR and AND
Fig1: Boolean functions AND and OR
Perceptron Classifier:
Perceptron is the earliest ML model that was used in classifying linearly separable data. Some related observations are as follows:
We learn the augmented weight vector w with the help of some two-class training data.
If the classes are linearly separable, then the perceptron learning algorithm will learn the augmented weight vector in a finite number of steps.
The obtained weight vector w will assign an augmented pattern x to the negative class if wᵀx < 0 and to the positive class if wᵀx > 0. We assume that the final w classifies every training pattern into one of the two classes and that there is no x for which wᵀx = 0.
It is likely that there are infinite possible solutions to the learning problem; that is, there could be possibly infinite w vectors. The perceptron learning algorithm will find one of them. It is based on starting with an initial weight vector and updating it until all the training patterns are correctly classified.
If the weight vector at step k, wk, misclassifies an x from the negative class, that is, wkᵀx > 0, then it is updated as wk+1 = wk - x. Consider wk+1ᵀx = wkᵀx - xᵀx. Since xᵀx ≥ 0, we have wk+1ᵀx ≤ wkᵀx, so wk+1ᵀx can become negative even though wkᵀx is positive.
If the weight vector at step k, wk, misclassifies an x from the positive class, that is, wkᵀx < 0, then it is updated as wk+1 = wk + x. Consider wk+1ᵀx = wkᵀx + xᵀx. Since xᵀx ≥ 0, we have wk+1ᵀx ≥ wkᵀx, so wk+1ᵀx can become positive even though wkᵀx is negative.
So, this update scheme to get wk+1 from wk is justified because wk+1 is more likely than wk to classify x correctly.
There could be different ways of selecting the initial weight vector w0. A simple choice for theoretical analysis is to let w0 = 0, that is, the zero vector.
Perceptron Learning Algorithm
The algorithm is very simple and it can be shown that the algorithm stops after a finite number of steps to give us a correct w if the classes are linearly separable.
Let the training data set be (x1, l1), (x2, l2), ..., (xn, ln), where xj is the jth pattern and lj ∈ {-1, +1} is its class label.
The algorithm follows the steps given below:
1. Let i = 0. Let the initial augmented weight vector wi be the zero vector. Let j = 1. (Note that j is the pattern index and i is the weight vector index.)
2. If xj is incorrectly classified by wi, then set i = i + 1 and wi = wi-1 + lj xj, set j = (j + 1) mod n and go to step 2. Else set j = (j + 1) mod n and go to step 2.
3. Terminate the algorithm if there is no change in the weight vector wi for n successive steps (that is, a full pass through the training data).
Example: Let the training set be {((- 1, 1), L1 = - 1), ((- 2, - 2), L2 = - 1), ((4, 4), L3 = 1)}
Compute the weight vector which classifies these feature vectors using perceptron learning.
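A minimal Python sketch of this procedure, using the update rule described above with augmented patterns and w0 = 0, is given below; it converges to one of the many possible separating weight vectors for this data.

```python
import numpy as np

# Augmented patterns (a constant 1 is appended so the bias is part of w)
patterns = np.array([[-1.0,  1.0, 1.0],    # x1, label -1
                     [-2.0, -2.0, 1.0],    # x2, label -1
                     [ 4.0,  4.0, 1.0]])   # x3, label +1
labels = np.array([-1, -1, 1])

w = np.zeros(3)                 # w0 = 0
changed = True
while changed:                  # repeat until a full pass makes no update
    changed = False
    for x, l in zip(patterns, labels):
        if l * np.dot(w, x) <= 0:      # pattern misclassified (or on the boundary)
            w = w + l * x              # perceptron update
            changed = True
print(w)                        # a weight vector with l * w^T x > 0 for every training pattern
```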
Assignment Example: Consider the truth table of Boolean OR shown in the first three columns of below Table. The inputs are x1 and x2 and the output is either a 0 or 1; we have a two-class problem based on these two. So, there is one pattern from Class 0 and three patterns from Class 1. The augmented patterns are shown in Table.
TABLE: Augmented patterns for Boolean OR
| 1 | x1 | x2 | OR(x1, x2) |
| 1 | 0 | 0 | 0 |
| 1 | 0 | 1 | 1 |
| 1 | 1 | 0 | 1 |
| 1 | 1 | 1 | 1 |
Compute the weight vector which correctly classifies these feature vectors using perceptron learning.
Support Vector Machine Algorithm:
Support Vector Machine (SVM) is a supervised learning algorithm that can be used for both classification and regression problems. However, it is primarily used for classification problems in machine learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes so that we can easily put the new data point in the correct category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed Support Vector Machine. Consider the below diagram, in which there are two different categories that are classified using a decision boundary or hyperplane:
Example: Suppose we see a strange cat that also has some features of dogs. If we want a model that can accurately identify whether it is a cat or a dog, such a model can be created using the SVM algorithm. We first train our model with many images of cats and dogs so that it can learn the different features of cats and dogs, and then we test it with this strange creature. The SVM builds a decision boundary between the two classes (cat and dog) and chooses the extreme cases (support vectors) from each class. On the basis of the support vectors, it will classify the creature as a cat. Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text categorization, etc.
Types of SVM
Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means that if a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
Hyperplane and Support Vectors in the SVM algorithm:
Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-dimensional space, but we need to find out the best decision boundary that helps to classify the data points. This best boundary is known as the hyperplane of SVM.
The dimensions of the hyperplane depend on the number of features present in the dataset: if there are 2 features (as shown in the image), then the hyperplane will be a straight line, and if there are 3 features, then the hyperplane will be a two-dimensional plane.
We always create a hyperplane that has a maximum margin, which means the maximum distance between the data points.
Support Vectors:
The data points or vectors that are closest to the hyperplane and which affect the position of the hyperplane are termed support vectors. Since these vectors support the hyperplane, they are called support vectors.
Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we have a dataset that has two tags (green and blue), and the dataset has two features x1 and x2. We want a classifier that can classify the pair(x1, x2) of coordinates in either green or blue. Consider the below image:
So as it is 2-d space so by just using a straight line, we can easily separate these two classes. But there can be multiple lines that can separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called as a hyperplane. SVM algorithm finds the closest point of the lines from both the classes. These points are called support vectors. The distance between the vectors and the hyperplane is called as margin. And the goal of SVM is to maximize this margin. The hyperplane with maximum margin is called the optimal hyperplane.
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear data, we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data, we have used two dimensions x and y, so for non-linear data, we will add a third dimension z. It can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the below image:
Since we are in 3-d Space, hence it is looking like a plane parallel to the x-axis. If we convert it in 2d space with z=1, then it will become as:
Hence we get a circumference of radius 1 in case of non-linear data.
Here we see that we cannot draw a single line (hyperplane) that classifies the points correctly. So we try converting this lower-dimensional space to a higher-dimensional space using some non-linear (for example, quadratic) functions, which allows us to find a decision boundary that clearly divides the data points. The functions that help us do this are called kernels, and which kernel to use is typically determined by hyperparameter tuning.
Some kernel functions which you can use in SVM are given below:
Polynomial Kernel: Following is the formula for the polynomial kernel:
K(X1, X2) = (X1 · X2 + 1)^d
Here d is the degree of the polynomial, which we need to specify manually. For two feature vectors X1 and X2, the kernel implicitly computes all polynomial combinations of the original features up to degree d.
Sigmoid Kernel: We can use it as a proxy for neural networks. The equation is:
K(X1, X2) = tanh(γ · X1ᵀX2 + r), where γ and r are kernel parameters.
It essentially passes the input through a sigmoid-like squashing function so that the classes can be separated by a simple straight line in the transformed space.
Radial Basis Function (RBF) Kernel: What it actually does is create non-linear combinations of the features to lift the samples onto a higher-dimensional feature space where a linear decision boundary can separate the classes. It is the most used kernel in SVM classification. The following formula explains it mathematically:
K(X1, X2) = exp( -||X1 - X2||² / (2σ²) )
where:
'σ' is the variance, our hyperparameter
||X1 - X2|| is the Euclidean distance between the two points X1 and X2
It is mainly used for eliminating the cross term in mathematical functions. Following is the formula of the Bessel function kernel:
It performs well on multidimensional regression problems. The formula for this kernel function is:
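As a hedged illustration (assuming scikit-learn), the sketch below generates a ring-shaped two-class dataset like the one described above and compares a linear-kernel SVM with an RBF-kernel SVM; only the non-linear kernel separates the classes well.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
inner = rng.normal(0.0, 0.5, size=(50, 2))                      # class 0: points near the origin
angles = rng.uniform(0.0, 2 * np.pi, 50)
outer = np.c_[3 * np.cos(angles), 3 * np.sin(angles)] + rng.normal(0.0, 0.2, (50, 2))  # class 1: a ring
X = np.vstack([inner, outer])
y = np.array([0] * 50 + [1] * 50)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel, C=1.0).fit(X, y)
    print(kernel, "training accuracy:", clf.score(X, y))        # rbf should reach ~1.0
```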
Linear Regression:
Linear Regression is a supervised learning algorithm used for predicting a continuous target variable based on one or more input features. It assumes a linear relationship between the independent variable(s) X and the dependent variable Y.
Mathematical Representation
For a single-variable (simple) linear regression, the relationship is:
Y=mX+b+ϵ
Where:
Y is the predicted output (dependent variable),
X is the input (independent variable),
m is the slope of the line (coefficient/weight),
b is the intercept (bias term),
ϵ is the error term (unaccounted variation in data).
For multiple linear regression with multiple features X1,X2,...,Xn:
Y=b+w1X1+w2X2+...+wnXn+ϵ
where w1,w2,...,wn are the learned coefficients for each feature.
Linear Regression Using Linear Algebra
We can use linear algebra to calculate the weights (coefficients) efficiently. The key idea is to represent the problem in matrix form and solve for the optimal weights using the Normal Equation.
1. Matrix Representation of Linear Regression
The simple linear regression model:
Y= b+mX
can be rewritten in matrix form for multiple data points:
Y=Xw
Where:
Y is the vector of target values,
X is the matrix of input features (with a column of ones for the bias),
W is the vector of coefficients (m and b).
For a dataset with n observations:
Y = [y1, y2, ..., yn]ᵀ,  X = [[1, x1], [1, x2], ..., [1, xn]],  w = [b, m]ᵀ
where:
The first column of 1s accounts for the intercept b,
The second column represents the values of X,
The vector w contains b and m.
2. Normal Equation (Optimal Solution)
To find the best weights that minimize the squared error, we use the Normal Equation:
w = (XᵀX)⁻¹ XᵀY
This equation provides the optimal (least squares) solution for linear regression without needing gradient descent.
EXAMPLE: Let us consider a simple example of three vectors x1 = (1, 1), x2 = (1, 2) and x3 = (2, 2). Let y1 =3, y2=5, and y3 = 6. Find the linear regression coefficients.
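A minimal numpy sketch of the Normal Equation applied to this example is shown below; the column of ones in X produces the intercept b, and the solution works out to b = 0, w1 = 1, w2 = 2, i.e. y = x(1) + 2 x(2).

```python
import numpy as np

X = np.array([[1.0, 1.0, 1.0],    # [1, x(1), x(2)] for x1 = (1, 1)
              [1.0, 1.0, 2.0],    # x2 = (1, 2)
              [1.0, 2.0, 2.0]])   # x3 = (2, 2)
y = np.array([3.0, 5.0, 6.0])

# Normal Equation: w = (X^T X)^{-1} X^T y, solved without an explicit inverse
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)    # [b, w1, w2] -> approximately [0, 1, 2]
```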
Assignment Example: Predicting House Prices
Imagine you have a dataset of house prices based on the size of the house. The goal is to predict the price given the size.
| X (House Size in sq ft) | Y (Price in $1000s) |
| 5 | 7 |
| 6 | 9 |
| 7 | 10 |
| 9 | 12 |
| 10 | ? |
Advantages:
Simple and interpretable.
Efficient for small to medium-sized datasets.
Works well with linearly correlated data.
Disadvantages:
Assumes linearity (does not work well with nonlinear data).
Sensitive to outliers.
Can be impacted by multicollinearity in multiple regression.
When to Use Linear Regression?
When you need an interpretable and simple model.
When the relationship between variables is approximately linear.
When computational efficiency is required.
Logistic Regression:
Logistic Regression is a supervised learning algorithm used for binary classification problems (i.e., where the output is either 0 or 1). Unlike Linear Regression, which predicts continuous values, Logistic Regression predicts probabilities and applies a threshold to classify data points.
Logistic Regression is based on the sigmoid function (logistic function), which maps any real-valued number into the range (0,1), making it ideal for probability estimation.
In Linear Regression, we estimate w and b by minimizing the Mean Squared Error (MSE):
MSE = (1/N) Σ (i = 1 to N) (Yi - Ŷi)²
where Ŷi = wXi + b is the predicted output.
However, in Logistic Regression, the output is a probability:
P(Y = 1 | X) = 1 / (1 + e^-(wX + b))
This non-linearity makes the least squares approach suboptimal, but we can still approximate w and b by transforming the logistic model into a linear equation.
Transforming the Logistic Regression Equation
We take the logit transformation of the sigmoid function:
Z = log( P / (1 - P) )
This converts our model into a linear equation:
Z=wX+b
where Z is the log-odds (logit) of the predicted probability.
Now, the problem is reduced to a linear regression problem, where we fit:
Z=wX+b
using the Least Squares Method.
Estimating w and b Using the Normal Equation
For a dataset with N samples:
Z=Xw+b
we can estimate w and b using the Normal Equation from Linear Regression:
w = (XᵀX)⁻¹ XᵀZ
where:
X is the matrix of input features.
Z is the transformed output using the logit function.
Example: Let us consider a simple example of four vectors x1 = (1, 2), x2 = (2, 1), x3 = (6, 7) and x4 = (7, 6). Let y1 =0, y2= 0, y3 = 1 and y4 = 1. Find the discriminant function.
Assignment Example:
Let's say we have the dataset:
| Study Hours (X) | Pass (Y) |
| 1 | 0.1 |
| 2 | 0.2 |
| 3 | 0.3 |
| 4 | 0.7 |
Step 1: Compute the Logit (Z) Values
We first transform Y using the logit function:
| X | Y | Z |
| 1 | 0.1 | log(0.1/0.9) |
| 2 | 0.2 | log(0.2/0.8) |
| 3 | 0.3 | log(0.3/0.7) |
| 4 | 0.7 | log(0.7/0.3) |
Step 2: Solve for w and b
Using the Least Squares Normal Equation:
w = (XᵀX)⁻¹ XᵀZ
Compute w and b.
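A small numpy sketch of this least-squares-on-logits procedure is given below; note that it is only an approximation to true logistic regression (which is normally fitted by maximum likelihood), and the query value of 3.5 study hours is purely illustrative.

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])          # study hours
Y = np.array([0.1, 0.2, 0.3, 0.7])          # observed pass "probabilities"
Z = np.log(Y / (1 - Y))                      # logit transform

A = np.c_[X, np.ones_like(X)]                # design matrix [X, 1]
w, b = np.linalg.lstsq(A, Z, rcond=None)[0]  # least-squares fit of Z = wX + b
print(w, b)

# Predicted pass probability for a new value via the sigmoid
x_new = 3.5
print(1.0 / (1.0 + np.exp(-(w * x_new + b))))
```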
Multilayer Perceptron Learning:
A Multilayer Perceptron (MLP) is a type of neural network that consists of multiple layers, allowing it to solve more complex problems than a single-layer perceptron. Structurally, an MLP has an input layer to receive data, one or more hidden layers to process the information, and an output layer that provides the final prediction. This arrangement makes MLPs a feedforward neural network, meaning the data moves in one direction—from input to output.
One of the key aspects of MLPs is backpropagation, a method used to train the network. Backpropagation adjusts the network’s weights based on errors from predictions, helping the model learn over time by minimizing mistakes. This process allows the MLP to improve its accuracy through multiple training cycles, making it suitable for tasks where precision is essential.
A Multilayer Perceptron (MLP) processes data by passing it through a series of layers, each contributing to the network’s ability to learn and make predictions. Here’s a breakdown of each layer’s role:
In a multilayer perceptron, neurons process information in a step-by-step manner, performing computations that involve weighted sums and nonlinear transformations. Let's walk layer by layer to see the magic that goes within.
The input layer of an MLP receives input data, which could be features extracted from the input samples in a dataset. Each neuron in the input layer represents one feature.
Neurons in the input layer do not perform any computations; they simply pass the input values to the neurons in the first hidden layer.
The hidden layers of an MLP consist of interconnected neurons that perform computations on the input data.
Each neuron in a hidden layer receives input from all neurons in the previous layer. The inputs are multiplied by corresponding weights, denoted as w. The weights determine how much influence the input from one neuron has on the output of another.
In addition to weights, each neuron in the hidden layer has an associated bias, denoted as b. The bias provides an additional input to the neuron, allowing it to adjust its output threshold. Like weights, biases are learned during training.
For each neuron in a hidden layer or the output layer, the weighted sum of its inputs is computed. This involves multiplying each input by its corresponding weight, summing up these products, and adding the bias:
z = w1 x1 + w2 x2 + ... + wn xn + b
where n is the total number of input connections, wi is the weight for the i-th input, and xi is the i-th input value.
The weighted sum is then passed through an activation function, denoted as f, so the neuron's output is a = f(z). The activation function introduces nonlinearity into the network, allowing it to learn and represent complex relationships in the data. The activation function determines the output range of the neuron and its behavior in response to different input values. The choice of activation function depends on the nature of the task and the desired properties of the network.
The output layer of an MLP produces the final predictions or outputs of the network. The number of neurons in the output layer depends on the task being performed (e.g., binary classification, multi-class classification, regression).
Each neuron in the output layer receives input from the neurons in the last hidden layer and applies an activation function. This activation function is usually different from those used in the hidden layers and produces the final output value or prediction.
During the training process, the network learns to adjust the weights associated with each neuron's inputs to minimize the discrepancy between the predicted outputs and the true target values in the training data. By adjusting the weights and learning the appropriate activation functions, the network learns to approximate complex patterns and relationships in the data, enabling it to make accurate predictions on new, unseen samples.
This adjustment is guided by an optimization algorithm, such as stochastic gradient descent (SGD), which computes the gradients of a loss function with respect to the weights and updates the weights iteratively.
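To make the layer computations concrete, here is a minimal numpy sketch of a single forward pass through a one-hidden-layer MLP; the weights and biases are arbitrary illustrative values rather than trained parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 0.8])                                   # input layer: two features
W1 = np.array([[0.2, -0.4], [0.7, 0.1], [-0.5, 0.3]])      # 3 hidden neurons x 2 inputs
b1 = np.array([0.1, -0.2, 0.05])
W2 = np.array([[0.6, -0.3, 0.8]])                          # 1 output neuron x 3 hidden units
b2 = np.array([0.0])

h = sigmoid(W1 @ x + b1)          # hidden layer: weighted sum plus bias, then activation
y_hat = sigmoid(W2 @ h + b2)      # output layer prediction
print(h, y_hat)
```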
Back Propagation for Training MLP:
The back propagation algorithm is used in a multilayer perceptron neural network to increase the accuracy of the output by reducing the error between the predicted output and the actual output.
According to this algorithm,
Calculate the error after calculating the output from the Multilayer perceptron neural network.
This error is the difference between the output generated by the neural network and the actual output. The calculated error is fed back to the network, from the output layer to the hidden layer.
Now, the output becomes the input to the network.
The model reduces error by adjusting the weights in the hidden layer.
Calculate the predicted output with adjusted weight and check the error. The process is recursively used till there is minimum or no error.
This algorithm helps in increasing the accuracy of the neural network.
Advantages:
MultiLayer Perceptron Neural Networks can easily work with non-linear problems.
It can handle complex problems while dealing with large datasets.
It has a higher accuracy rate and reduces prediction error by using backpropagation.
After training the model, the Multilayer Perceptron Neural Network quickly predicts the output.
Disadvantages:
This Neural Network consists of large computation, which sometimes increases the overall cost of the model.
The model will perform well only when it is trained perfectly.
Due to this model’s tight connections, the number of parameters and node redundancy increases.
Introduction To Clustering:
Clustering refers to the process of arranging or organizing objects according to specific criteria. It plays a crucial role in uncovering concealed knowledge in large data sets. Clustering involves dividing or grouping the data into smaller data sets based on similarities/dissimilarities. Depending on the requirements, this grouping can lead to various outcomes, such as partitioning of data, data re-organization, compression of data and data summarization.
Partitioning of Data
Clustering of data is a crucial aspect of efficient data access in database-based applications.
Example: The number of records in the EMPLOYEE data table = 30,000, and the number of distinct values for DEPT_CODE = 1000 (so each department has about 30 employees).
If this data table is clustered (grouped) by DEPT_CODE, accessing all the employees of the department with DEPT_CODE = '15' requires about log2(1000) + 30 accesses: a binary search to locate the cluster, followed by a sequential read of its 30 records.
Without clustering, accessing the 30 employee records of a department would require, on average, 30 x 30000/2 accesses.
Data Re-organization
This data set comprises four binary patterns, each with six dimensions or features labelled f1 to f6 and assigned a unique ID ranging from 1 to 4.
| Pattern | f1 | f2 | f3 | f4 | f5 | f6 |
| 1 | 1 | 0 | 1 | 0 | 1 | 0 |
| 2 | 0 | 1 | 0 | 1 | 0 | 1 |
| 3 | 1 | 0 | 1 | 0 | 1 | 0 |
| 4 | 0 | 1 | 0 | 1 | 0 | 1 |
By clustering the rows, we can obtain the re-organized data set shown in below table. This process groups similar patterns together based on their row-wise similarities.
| Pattern | f1 | f2 | f3 | f4 | f5 | f6 |
| 1 | 1 | 0 | 1 | 0 | 1 | 0 |
| 3 | 1 | 0 | 1 | 0 | 1 | 0 |
| 2 | 0 | 1 | 0 | 1 | 0 | 1 |
| 4 | 0 | 1 | 0 | 1 | 0 | 1 |
TABLE: Data re-organization using row as the criterion
Further, by clustering the re-organized data set from above table by column and grouping similar columns together, we obtain the data set displayed in below table.
| Pattern | f1 | f3 | f5 | f2 | f4 | f6 |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| 3 | 1 | 1 | 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 1 | 1 | 1 |
| 4 | 0 | 0 | 0 | 1 | 1 | 1 |
TABLE: Data re-organization using rows and columns
This table reveals the hidden structure in the data. The cluster comprising patterns 1 and 3 can be described using the conjunction f1^f3^f5. Likewise, the second cluster comprising patterns 2 and 4 can be described using f2^f4^f6.
Data Compression
Clustering can also assist in data compression by reducing the time complexity of accessing the data features and minimizing the space required to store them.
Consider the data set shown in above table. Through clustering, we can compress the data set, resulting in the table shown in below table. Note that the frequent pattern (FP) tree and pattern count (PC) tree structures are some examples for achieving such compression of data.
| Pattern | f1 | f3 | f5 | f2 | f4 | f6 | Count |
| 1, 3 | 1 | 1 | 1 | 0 | 0 | 0 | 2 |
| 2, 4 | 0 | 0 | 0 | 1 | 1 | 1 | 2 |
TABLE: Compressed data
Summarization
The objective of data summarization is to extract a representative subset of samples from a large data set. The purpose is to simplify and expedite analysis on a smaller data set.
Example: When dealing with a data set of the marks of 100 students in a subject, statistical measures like mean (the numerical average of the marks), mode (the most frequently repeated marks) and median (the value in the middle of all the marks when the marks are ranked in order) can provide summarized information. Such summarized information can be derived through clustering.
Matrix Factorization
It is possible to view clustering as matrix factorization.
Let there be n data points in an l-dimensional space. We can represent them as a matrix X of size n × l. It is possible to approximate X as the product of two matrices B (n × k) and C (k × l). So, X ≈ BC, where B is the cluster assignment matrix and C is the representatives matrix.
Example: Consider the data matrix
X = [ [6, 6, 6],
[6, 6, 8],
[2, 4, 2],
[2, 2, 2]
]
There are four patterns in a three-dimensional space. If we use the leader algorithm with a threshold of 3 units, we get two clusters: (6, 6, 6) and (6, 6, 8) belong to cluster 1 with (6, 6, 6) as its leader, and (2, 4, 2) and (2, 2, 2) belong to cluster 2 with (2, 4, 2) as its leader.
So, we have
B = [ [1, 0],
[1, 0],
[0, 1],
[0, 1]
]
It indicates that the first pattern, (6, 6, 6), is assigned to cluster 1; correspondingly, there is a 1 in the first row and first column. Matrix B has four rows corresponding to the four patterns in X and two columns corresponding to the two clusters. Matrix C will have K rows if there are K clusters, and its ith row is the representative of the ith cluster.
In the current example,
C = [ [6, 6, 6],
[2, 4, 2]
]
So the resulting matrix factorization is given by X ≈ BC.
Leader is a hard clustering scheme where each pattern is assigned fully, that is, with a value of 1, to a single cluster. So, each row of the B matrix will have one 1 and the rest 0s.
It is possible to have soft clustering. In this case, we can have multiple non-zero entries in the range [0,1] in each row, indicating that a pattern is assigned to more than one cluster. The sum of the entries in a row is 1.
Instead of the leader, if we use the centroid of points in the cluster as its representative, then C = [[6, 6, 7], [2, 3, 2]]. There is no change in B. Note that the centroid of {(6,6,6), (6,6,8)} is (6,6,7) and the centroid of the second cluster{ (2, 4, 2), (2, 2, 2) } is (2,3,2).
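A small numpy check of this factorization, using the leader representatives from the example above, is shown below; BC reconstructs X exactly for the leaders and approximately for the remaining cluster members.

```python
import numpy as np

X = np.array([[6, 6, 6], [6, 6, 8], [2, 4, 2], [2, 2, 2]])   # data matrix
B = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])               # hard cluster assignments
C = np.array([[6, 6, 6], [2, 4, 2]])                         # leaders as representatives

print(B @ C)                       # approximation of X
print(np.abs(X - B @ C).max())     # largest reconstruction error (2 in this example)
```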
Clustering Of Patterns
Clustering is a technique that involves grouping a set of patterns, resulting in the creation of cohesive clusters or groups from a given collection of patterns. The process of clustering aims to group similar patterns together while keeping dissimilar patterns separate. Below figure illustrates the clustering of a two-dimensional data set (represented by f1 and f2), where three clusters are visually identifiable.
FIG. Inter- and intra-cluster distances
The clustering process may ensure that the distance between any two points within the same cluster (intra-cluster distance), as measured by a dissimilarity measure such as Euclidean distance, is smaller than the distance between any two points belonging to different clusters (inter-cluster distance).
This indicates that the similarity between points within a cluster is higher than the similarity between clusters.
Example: Let us consider a two-dimensional data set with 11 data points having two features f1 and f2 as listed in below table.
| Data Point | f1 | f2 |
| x1 | 1 | 1 |
| x2 | 1 | 3 |
| x3 | 2 | 2 |
| x4 | 3 | 1 |
| x5 | 3 | 3 |
| x6 | 4 | 7 |
| x7 | 5 | 7 |
| x8 | 6 | 9 |
| x9 | 7 | 1 |
| x10 | 7 | 3 |
| x11 | 9 | 1 |
TABLE d1 : Data set with two features
Any two points are placed in the same cluster if the distance between them is lower than a certain threshold. In this example, the squared Euclidean distance is used to measure the distance between the points, and a threshold of 10 units is set to cluster them. The squared Euclidean distance (d) between two points xi and xj is calculated as follows:
d(xi, xj) = Σ (m = 1 to l) ( xi(m) - xj(m) )²
where l represents the dimensionality of the points.
In this example, since we are dealing with two-dimensional data points, l equals 2. Using this formula, the squared Euclidean distance between all pairs of points can be found; the results are presented in Table d2.
In Table d2, the clusters are clearly visible within the matrix itself. The table contains three sub-matrices of sizes 5 x 5, 3 x 3 and 3 x 3 that meet the condition of having a maximum value of 10 in any entry.
Cluster A = {x1, x2, x3, x4, x5}
Cluster B = {x6, x7, x8}
Cluster C = {x9, x10, x11}
One can see that none of the data points is present in more than one cluster. Such a clustering is called hard clustering; if clusters are allowed to share data points, it is known as soft or overlapping clustering.
TABLE d2: Squared Euclidean distance between pairs of data points listed in Table d1
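As a quick check of Table d2, the pairwise squared Euclidean distances and the three blocks that satisfy the 10-unit threshold can be computed as in the following sketch (using SciPy's pdist; the variable names are ours):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Data points x1..x11 from Table d1
points = np.array([[1, 1], [1, 3], [2, 2], [3, 1], [3, 3],
                   [4, 7], [5, 7], [6, 9],
                   [7, 1], [7, 3], [9, 1]])

# Pairwise squared Euclidean distances (Table d2)
D = squareform(pdist(points, metric="sqeuclidean"))
print(D.astype(int))

# x1..x5, x6..x8 and x9..x11 form blocks in which every
# pairwise distance is at most the threshold of 10 units.
threshold = 10
for name, idx in [("A", range(0, 5)), ("B", range(5, 8)), ("C", range(8, 11))]:
    block = D[np.ix_(idx, idx)]
    print(f"Cluster {name}: max intra-cluster distance = {block.max():.0f} (<= {threshold})")
```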
Data Abstraction
Clustering is a useful method for data abstraction, and it can be applied to generate clusters of data points that can be represented by their centroid, medoid, leader or some other suitable entity.
The centroid is computed as the sample mean of the data points in a cluster; for a cluster C with n_C points it is given by \mu_C = \frac{1}{n_C} \sum_{x_i \in C} x_i.
FIG. Visualization of centroid and medoid
The medoid is the point that minimizes the sum of distances to all other points in the cluster. The figure above illustrates this, showing a case where an outlier located far away from the rest of the cluster is present. The centroid can shift dramatically depending on the position of the outlier, while the medoid remains stable within the boundaries of the original cluster. Therefore, clustering algorithms that use medoids are more resilient to noisy patterns or outliers.
EXAMPLE: Consider the data set shown in Table d1 and the clusters generated from it, shown in Table d2. The centroids of the clusters, Cluster A (that is, µA), Cluster B (that is, µB) and Cluster C (that is, µC), are given below:
µA = 1/5 x [(1, 1) + (1, 3) + (2, 2) + (3, 1) + (3, 3)] = ((1 + 1 + 2 + 3 + 3)/5, (1 + 3 + 2 + 1 + 3)/5) = (2, 2)
µB = 1/3 * [(4, 7) + (5, 7) + (6, 9)] = ((4 + 5 + 6)/3, (7 + 7 + 9)/3) = (5, 7.67)
µC = 1/3 * [(7, 1) + (7, 3) + (9, 1)] = ((7 + 7 + 9)/3, (1 + 3 + 1)/3) = (7.67, 1.67)
For the same clusters, the medoids are given below:
mA = (2, 2), mB = (5, 7) and mC = (7, 1)
We know that the medoid of a group of points is defined as a data point within the cluster that is located at the centre, where the total distance from the points in the cluster is at its minimum. Using the squared Euclidean distance as the metric for calculating distance, the distance of the points from mA, mB and mC can be determined as follows:
For Cluster A: Considering x3 as the medoid (mA):
d(x1, mA) = 2
d(x2, mA) = 2
d(x3, mA) = 0
d(x4, mA) = 2
d(x5, mA) = 2
The sum of the distances is 8. Note that if you choose any other data point, other than x3 as the medoid, the corresponding sum of the distances is greater than 8.
For Cluster B: Considering x7 as medoid (mB):
d(x6, mB) = 1
d(x7, mB) = 0
d(x8, mB) = 5
The sum of the distances is 6. Note that if you choose any other data point, other than x7 as the medoid, the corresponding sum of the distances is greater than 6.
For Cluster C: Considering x9 as the medoid (mC):
d(x9, mC) = 0
d(x10, mC) = 4
d(x11, mC) = 4
The sum of the distances is 8. Note that if you choose any other data point, other than x9 as the medoid, the corresponding sum of the distances is greater than 8.
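The medoid computations above can be verified with a small sketch like the following (assuming squared Euclidean distances; the helper medoid() is ours):

```python
import numpy as np

def medoid(cluster):
    """Return the point in `cluster` whose total squared Euclidean
    distance to all other points in the cluster is minimum."""
    cluster = np.array(cluster)
    diffs = cluster[:, None, :] - cluster[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)          # pairwise squared distances
    return cluster[sq_dists.sum(axis=1).argmin()]

cluster_A = [(1, 1), (1, 3), (2, 2), (3, 1), (3, 3)]
cluster_B = [(4, 7), (5, 7), (6, 9)]
cluster_C = [(7, 1), (7, 3), (9, 1)]

print(medoid(cluster_A))   # [2 2]  -> x3
print(medoid(cluster_B))   # [5 7]  -> x7
print(medoid(cluster_C))   # [7 1]  -> x9
```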
Example: Consider the cluster of three data points x1=(4,1), x2=(5,1) and x3=(6,3)
Calculate the centroid and medoid
Observe the effect of Outlier point x4=(20,1)
The centroid for this cluster is µ = ((4+5+6)/3, (1+1+3)/3) = (5, 1.67)
The medoid for this cluster can be obtained from the below table.
|
x1 |
x2 |
x3 |
|
x1 |
0 |
1 |
8 |
9 |
x2 |
1 |
0 |
5 |
6 |
x3 |
8 |
5 |
0 |
13 |
The data point x2 has the minimum total distance to all the other data points, so x2 is the medoid.
If we add the outlier point x4 = (20, 1), then the new centroid is
µ = ((4 + 5 + 6 + 20)/4, (1 + 1 + 3 + 1)/4) = (8.75, 1.5)
The medoid for this cluster can be obtained from the below table.
|  | x1 | x2 | x3 | x4 | Sum |
| x1 | 0 | 1 | 8 | 256 | 265 |
| x2 | 1 | 0 | 5 | 225 | 231 |
| x3 | 8 | 5 | 0 | 200 | 213 |
| x4 | 256 | 225 | 200 | 0 | 681 |
The data point x3 now has the minimum total distance to all the other data points, so x3 is the medoid. Note that the medoid only moves from x2 = (5, 1) to x3 = (6, 3) and stays inside the original cluster, while the centroid is dragged from (5, 1.67) to (8.75, 1.5) by the outlier.
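A brief sketch of this centroid shift (illustrative only; it reuses the idea of the medoid() helper sketched earlier):

```python
import numpy as np

cluster = np.array([(4, 1), (5, 1), (6, 3)], dtype=float)
print(cluster.mean(axis=0))              # centroid (5.0, 1.67)

with_outlier = np.vstack([cluster, [(20, 1)]])
print(with_outlier.mean(axis=0))         # centroid jumps to (8.75, 1.5)

# The medoid (computed with the medoid() helper sketched earlier) moves only
# from x2 = (5, 1) to x3 = (6, 3), staying inside the original cluster.
```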
Note that in the above examples, we have used the centroid and the medoid as the representative or description of a cluster. However, it is also feasible to have more than one representative element per cluster.
Clusters are useful in several decision-making situations such as classification and prediction. When the number of cluster representatives is smaller than the number of input patterns, there is a reduction in the computation required for classification or prediction. The following example demonstrates this.
EXAMPLE: Consider the data set shown in the table below (Table d3). Let (2, 4) be a test pattern which needs to be classified using the nearest neighbor classifier on the 11 labelled patterns of Table d3.
| Data Point | f1 | f2 | Class Label |
| x1 | 1 | 1 | A |
| x2 | 1 | 3 | A |
| x3 | 2 | 2 | A |
| x4 | 3 | 1 | A |
| x5 | 3 | 3 | A |
| x6 | 4 | 7 | B |
| x7 | 5 | 7 | B |
| x8 | 6 | 9 | B |
| x9 | 7 | 1 | B |
| x10 | 7 | 3 | B |
| x11 | 9 | 1 | B |
TABLE d3: Data set with two features and corresponding labels
Note that (3, 3) is the nearest neighbor of (2, 4), with a squared Euclidean distance of 2 units between them. This can be determined by computing 11 distance values, that is, the squared Euclidean distance between the test pattern (2, 4) and each of the 11 labelled patterns. So, (2, 4) is classified as belonging to class "A" because its nearest neighbor (3, 3) belongs to class "A".
However, by clustering the 11 patterns using a suitable clustering algorithm and representing each resulting cluster by its centroid, we can reduce the number of distance values to be computed from the number of labelled patterns to the number of cluster representatives. This may be illustrated as follows.
Let the 11 patterns be clustered using the same criterion as the one used in Example 7, that is, the squared Euclidean distance between any two patterns placed in the same cluster should be within a threshold of 10 units. This results in three clusters: one from class "A" and two from class "B".
The cluster corresponding to class "A" is:
cluster A = {(1, 1), (1, 3), (2, 2), (3, 1), (3, 3)} with centroid µA = (2, 2)
The two clusters corresponding to class "B" are:
Cluster B = {(4, 7), (5, 7), (6, 9)} with centroid µB = (5, 7.67)
Cluster C = {(7, 1), (7, 3), (9, 1)} with centroid µC = (7.67, 1.67)
So, instead of using the 11 labelled patterns, if one uses the centroids of the three clusters as the labelled data, then there is a reduction in the data to be analysed by a factor of 3. The reduced labelled data is given in Table 7.10.
| Representative | f1 | f2 | Class Label |
| µA | 2 | 2 | A |
| µB | 5 | 7.67 | B |
| µC | 7.67 | 1.67 | B |
TABLE 7.10: Reduced data set with only cluster representatives
By referring to Table 7.10, it is possible to determine that the nearest neighbor of the test point (2, 4) is the representative (2, 2), which belongs to class "A". As a result, the test pattern is classified as belonging to class "A" by making only three distance computations, one between the test pattern and each of the three centroids, instead of 11. Moreover, only the three centroids need to be stored rather than the 11 labelled patterns, so there is a reduction in both time and space.
It could be argued that clustering the labelled data and computing the centroids may require a significant amount of time and space. However, it is possible to perform the clustering process as soon as the labelled data becomes available, prior to the classification stage. Furthermore, computation of the cluster centroid is a one-time affair. After obtaining the representative centroids, we can use them as labelled data to classify new test patterns, resulting in a significant reduction in both time and space requirements.
This clustering-based classification approach is particularly valuable for solving large-scale pattern classification problems that commonly arise in machine learning and also in dealing with class imbalance, where patterns in the majority class are clustered to have a smaller number of representatives from the majority class.
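A minimal sketch of the nearest-centroid classification of the test pattern (2, 4) using the three representatives of Table 7.10 (the variable names are ours):

```python
import numpy as np

# Cluster representatives from Table 7.10 (centroids of Clusters A, B and C)
centroids = np.array([[2.0, 2.0], [5.0, 7.67], [7.67, 1.67]])
labels = ["A", "B", "B"]

test = np.array([2.0, 4.0])
sq_dists = ((centroids - test) ** 2).sum(axis=1)   # 3 distances instead of 11
print(sq_dists.round(2))                           # [ 4.   22.47 37.58]
print(labels[sq_dists.argmin()])                   # "A"
```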
Clustering is an unsupervised learning technique used to group similar data points into clusters based on their inherent characteristics. It helps uncover hidden patterns in data when there are no predefined labels. Typical applications include:
Market segmentation
Anomaly detection
Image compression
Document categorization
Clustering algorithms can be classified into hard and soft clustering, depending on whether or not clusters share data points. Hard clustering algorithms either generate a single partition of the given data set (partitional algorithms) or a nested sequence of partitions (hierarchical algorithms). Soft clustering algorithms, on the other hand, utilize fuzzy sets, rough sets or evolutionary algorithms.
FIG. Classification of clustering algorithms
Hierarchical Algorithms:
The process of generating a sequence of data partitions using hierarchical algorithms can be represented using a tree structure called a dendrogram. There are two types of hierarchical algorithms: divisive and agglomerative.
Divisive algorithms follow a top-down approach, where a single cluster containing all the patterns is split into smaller clusters at each step, until each cluster contains only one pattern, that is, until we have a collection of singleton clusters. On the other hand, agglomerative algorithms follow a bottom-up approach, where n singleton clusters are created for n input patterns. At each level, the two most similar clusters are merged, reducing the size of the partition by 1.
Once two patterns are placed in the same cluster in agglomerative algorithms, they remain in the same cluster in subsequent levels. Similarly, once two patterns are placed in different clusters in divisive algorithms, they remain in different clusters in subsequent levels.
Divisive Clustering (Top-Down Hierarchical Clustering)
Divisive Clustering, also known as Top-Down Hierarchical Clustering, is a type of hierarchical clustering method where the algorithm starts with a single large cluster containing all data points. It then recursively splits the cluster into smaller sub-clusters until each point becomes its own cluster (or meets a termination condition).
How Divisive Clustering Works
Start with a single cluster: All data points are treated as one large cluster.
Recursive splitting:
Choose a criterion to split the cluster (e.g., distance or similarity measures).
Divide the cluster into two or more sub-clusters.
Continue splitting:
Repeat the process for each sub-cluster until a stopping condition is met, such as:
Each point forms its own cluster.
A predefined number of clusters is reached.
A similarity threshold is met.
Algorithm Steps
Begin with all data points in one cluster.
Choose a splitting criterion (such as K-Means or maximum variance).
Split the cluster into two or more sub-clusters.
Evaluate sub-clusters and recursively divide them if necessary.
Stop when a termination condition is met.
Example
Suppose we have the following points in a 2D space:
Points: A(1,1),B(2,2),C(10,10),D(11,11)
Step 1: Start with a single cluster containing all points: [A,B,C,D]
Step 2: Split into two clusters based on distance:
Cluster 1: [A,B] (close together)
Cluster 2: [C,D] (close together)
Step 3: Recursively evaluate each cluster. If no further meaningful splits can be made, stop.
Final Clusters: [A,B] and [C,D]
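A minimal sketch of this top-down process, using 2-means as the splitting criterion (one of the options mentioned above); the helper divisive_clustering and the minimum-cluster-size stopping rule are our assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def divisive_clustering(X, min_size=2):
    """Toy top-down (bisecting) clustering: recursively split each cluster in two
    with 2-means until clusters reach `min_size` points or fewer."""
    clusters = [np.arange(len(X))]     # start with one cluster holding all indices
    final = []
    while clusters:
        idx = clusters.pop()
        if len(idx) <= min_size:
            final.append(idx)
            continue
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[idx])
        clusters.append(idx[labels == 0])
        clusters.append(idx[labels == 1])
    return final

X = np.array([[1, 1], [2, 2], [10, 10], [11, 11]])   # points A, B, C, D
print([X[idx].tolist() for idx in divisive_clustering(X)])
# e.g. [[[10, 10], [11, 11]], [[1, 1], [2, 2]]]
```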
Advantages
Captures hierarchical relationships among data.
Useful for discovering nested structures.
Can produce a dendrogram to visualize the hierarchy.
Disadvantages
Computationally expensive for large datasets.
Sensitive to the choice of splitting criterion.
Difficult to decide the stopping condition.
Use Cases
Document categorization: Dividing documents into broader categories, then into subcategories.
Gene expression analysis: Identifying hierarchical relationships between gene clusters.
Customer segmentation: Grouping customers based on behavior patterns.
Agglomerative Clustering (Bottom-Up Hierarchical Clustering)
Agglomerative Clustering is a hierarchical clustering method where the algorithm starts by treating each data point as its own cluster. It then merges the closest clusters iteratively until all points belong to a single cluster or a stopping criterion is met.
How Agglomerative Clustering Works
Start with each data point as its own cluster.
Compute distances: Measure the pairwise distances between clusters using a distance metric (Euclidean, Manhattan, etc.).
Merge closest clusters: Merge the two clusters that have the smallest distance.
Update distance matrix: Recompute the distances between the new cluster and the remaining clusters.
Repeat until one cluster remains or a stopping condition is met.
Linkage Methods (Distance Measures)
The choice of linkage affects how the distance between clusters is computed:
Single Linkage:
Distance between the closest points of two clusters.
Tends to form elongated clusters.
Complete Linkage:
Distance between the farthest points of two clusters.
Tends to form compact clusters.
Average Linkage:
Average distance between all points in two clusters.
Ward's Linkage:
Minimizes the increase in total variance when merging clusters.
Often produces better results.
Example
Consider the following points in 2D space:
| Point | Coordinates |
| A | (1, 1) |
| B | (2, 2) |
| C | (5, 5) |
| D | (6, 6) |
Step-by-Step Process (Using Single Linkage):
Start with each point as its own cluster: [A],[B],[C],[D]
Compute distances between clusters:
Distance(A, B) = 1.41
Distance(A, C) = 5.66
Distance(B, C) = 4.24
And so on.
Merge the closest pair [A],[B] into a new cluster [A,B].
Update the distance matrix and continue merging until all points form a single cluster.
Final Clustering (if we stop at two clusters): [A, B] and [C, D]
Dendrogram Visualization
Agglomerative clustering can be visualized using a dendrogram, which shows the hierarchical merging process.
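A short sketch of agglomerative clustering and its dendrogram for the points A, B, C and D above, using SciPy's single-linkage implementation (the two-cluster cut via fcluster is our choice):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Points A, B, C, D from the example above
X = np.array([[1, 1], [2, 2], [5, 5], [6, 6]])

Z = linkage(X, method="single")                  # single-linkage agglomerative clustering
print(fcluster(Z, t=2, criterion="maxclust"))    # two-cluster cut -> [1 1 2 2]

dendrogram(Z, labels=["A", "B", "C", "D"])       # visualize the merge hierarchy
plt.title("Single-linkage dendrogram")
plt.show()
```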
Advantages
Captures hierarchical relationships between data points.
Does not require specifying the number of clusters in advance.
Can handle non-spherical clusters.
Disadvantages
Computationally expensive for large datasets (O(n³) time complexity for a naive implementation).
Sensitive to noise and outliers.
Choice of linkage method affects the clustering result.
Use Cases
Bioinformatics: Understanding gene expression patterns.
Image Segmentation: Grouping pixels based on color or texture.
Social Network Analysis: Identifying communities.
K-Means clustering is an unsupervised learning algorithm used for partitioning data into K distinct clusters based on feature similarity. The goal is to minimize intra-cluster variance (i.e., how similar points within a cluster are) while maximizing inter-cluster distance.
Imagine you own a retail store, and you want to group customers based on their shopping behavior. Some customers buy only budget items, some focus on high-end products, and others are somewhere in between. K-Means can help automatically find these groups (clusters) without you specifying categories in advance.
Assume we want to cluster customers into K = 2 groups (for simplicity).
Randomly select K points (centroids) from the dataset.
These centroids will act as initial cluster centers.
For each data point, calculate its distance from each centroid.
Assign the point to the nearest centroid. The distance metric is typically the Euclidean distance:
d(x, \mu_j) = \sqrt{\sum_{k=1}^{d} (x_k - \mu_{jk})^2}
Compute the new centroid for each cluster by taking the mean of all points assigned to that cluster:
\mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x
Continue updating centroids and reassigning points until convergence (when centroids no longer change significantly).
Imagine the following 2D points representing customer spending behavior:
(1,1),(2,1),(4,3),(5,4)
Choose two random centroids:
Centroid 1: (1,1)
Centroid 2: (5,4)
Compute the Euclidean distance from each point to the centroids:
Distance between (1,1) and centroid (1,1) is 0
Distance between (1,1) and centroid (5,4) is 5
Distance between (2,1) and centroid (1,1) is 1
And so on...
After comparing distances, the points are divided into:
Cluster 1: (1,1),(2,1)
Cluster 2: (4,3),(5,4)
Compute new centroids:
Cluster 1 centroid = (1+2)/2, (1+1)/2 = (1.5, 1)
Cluster 2 centroid = (4+5)/2, (3+4)/2 = (4.5, 3.5)
After a few iterations, centroids stabilize at optimal positions.
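The same example can be reproduced with scikit-learn's KMeans (a sketch; the random_state and n_init settings are our choices, and the cluster numbering may differ between runs):

```python
import numpy as np
from sklearn.cluster import KMeans

# Customer spending points from the example above
X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # e.g. [0 0 1 1] -- (1,1),(2,1) vs (4,3),(5,4)
print(kmeans.cluster_centers_)  # approximately [1.5, 1.0] and [4.5, 3.5], in some order
```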
Choosing the right number of clusters (K) in K-Means Clustering is crucial for meaningful and interpretable results. Selecting an inappropriate K can lead to poorly formed clusters. Below are several common methods to determine the optimal K:
Elbow Method:
Run K-Means for a range of K values (e.g., from 1 to 10).
Compute the Sum of Squared Errors (SSE), also called inertia, for each value of K:
SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2
where:
Ci is the set of points in cluster i
μi is the centroid of cluster i
Plot K against the SSE values.
Look for the "elbow" point where the SSE starts decreasing at a slower rate; that value of K is a good choice (see the sketch below).
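A minimal sketch of the Elbow Method on synthetic data with three natural groups (the data-generation step is an assumption made purely for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Hypothetical 2-D data with three well-separated groups (for illustration only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in [(0, 0), (5, 5), (10, 0)]])

ks = range(1, 11)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), sse, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("SSE (inertia)")
plt.title("Elbow method")
plt.show()   # the curve typically bends (the 'elbow') at K = 3 for this data
```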
Fuzzy C-Means Clustering
In this type, each data point xi can be assigned to a cluster Cj with a membership value µij, which represents the degree to which xi belongs to Cj. The membership value ranges between 0 and 1, and for each data point xi the sum of its membership values across all clusters is equal to 1.
The fuzzy c-means clustering algorithm follows an iterative scheme similar to the k-means algorithm. It begins by choosing c initial cluster centres from the data set (or initial membership values between each data point and each cluster). The following steps are then performed iteratively:
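The note below defers the full algorithm to the textbook. Purely as a rough illustration, here is a sketch of the commonly used fuzzy c-means iteration (the membership and centre update formulas shown are the standard ones and may differ from the textbook's notation; the fuzzifier m = 2 and the sample points are our assumptions):

```python
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, n_iter=100, eps=1e-9):
    """Standard fuzzy c-means iteration. Returns the cluster centres and
    the membership matrix U, whose rows each sum to 1."""
    rng = np.random.default_rng(0)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)            # each row sums to 1
    for _ in range(n_iter):
        W = U ** m                               # fuzzified memberships
        centres = (W.T @ X) / W.sum(axis=0)[:, None]   # weighted means
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2) + eps
        inv = d ** (-2.0 / (m - 1.0))            # standard membership update
        U_new = inv / inv.sum(axis=1, keepdims=True)
        if np.allclose(U, U_new):
            break
        U = U_new
    return centres, U

X = np.array([[1.0, 1.0], [2.0, 1.0], [4.0, 3.0], [5.0, 4.0]])
centres, U = fuzzy_c_means(X, c=2)
print(U.round(2))       # soft memberships in [0, 1]
print(U.sum(axis=1))    # [1. 1. 1. 1.]
```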
NOTE: for Fuzzy C-Means Clustering refer the text book.