EduLearn - NLP Chapter 1

Introduction to Natural Language Processing

Bag of Words (BoW)

Definition:

The Bag of Words model is one of the simplest and most widely used techniques for converting text into a numerical representation. It turns a collection of text documents into a matrix of word counts or frequencies so that machine learning algorithms can be applied to textual data. Each document is treated as a “bag” (multiset) of its words: the presence and count of each word are captured, while grammar and word order are discarded and only multiplicity (frequency) is preserved.

Working:

  1. Vocabulary Creation
    • Compile a list of all unique words that appear across the entire corpus (collection of documents).
    • This forms the vocabulary for encoding documents.
  2. Vector Representation
    • Each document is transformed into a fixed-length vector, where:
      • Each dimension corresponds to a word in the vocabulary.
      • The value represents the frequency (count) of the word in the respective document.

Example:

Let’s say we have two documents:

Doc1: “NLP is fun”
Doc2: “NLP is powerful”

Vocabulary = [“NLP”, “is”, “fun”, “powerful”]

Word        Doc1   Doc2
NLP         1      1
is          1      1
fun         1      0
powerful    0      1

So:

Doc1 → [1, 1, 1, 0]
Doc2 → [1, 1, 0, 1]
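
The same table can be reproduced in plain Python. Below is a minimal sketch (independent of scikit-learn) that follows the two steps described above, using Doc1 and Doc2 from this example:

from collections import Counter

# Doc1 and Doc2 from the example above
documents = ["NLP is fun", "NLP is powerful"]

# Step 1: Vocabulary creation (listed here in the same order as the table above)
vocabulary = ["NLP", "is", "fun", "powerful"]

# Step 2: Vector representation - each document becomes a fixed-length vector of word counts
vectors = [[Counter(doc.split())[word] for word in vocabulary] for doc in documents]

print(vectors)   # [[1, 1, 1, 0], [1, 1, 0, 1]]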

Advantages:

Disadvantages:

Applications of BoW:

Python Example: BoW using CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer

# Sample documents
documents = [
    "NLP is fun",
    "NLP is powerful"
]

# Initialize CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the documents
X = vectorizer.fit_transform(documents)

# Display the vocabulary
print("Vocabulary:", vectorizer.vocabulary_)

# Display the feature names
print("Feature Names:", vectorizer.get_feature_names_out())

# Convert the matrix to an array for better readability
print("Document-Term Matrix:\n", X.toarray())

Output:

Vocabulary: {'nlp': 2, 'is': 1, 'fun': 0, 'powerful': 3}
Feature Names: ['fun' 'is' 'nlp' 'powerful']
Document-Term Matrix:
 [[1 1 1 0]
 [0 1 1 1]]

TF-IDF (Term Frequency – Inverse Document Frequency)

TF-IDF is a weighting scheme that reflects how important a word is to a document in a collection of documents (corpus).
It balances two factors: how often a term appears within a document (term frequency) and how rare that term is across the entire corpus (inverse document frequency).

This weighting scheme is widely used in Information Retrieval (IR), text mining, and Natural Language Processing (NLP) tasks like document classification and keyword extraction.

Formula:

TF-IDF(t,d) = TF(t,d) × log(N / DF(t))

Term Frequency (TF):

Measures how often a word appears in a document. A higher frequency suggests greater importance. If a term appears frequently in a document, it is likely relevant to the document’s content.

Formula:
TF(t,d) = (The number of times the term t appears in document d) / (The Total number of terms in document d)

Example:

Assume 3 documents:

Doc1: “NLP is fun”
Doc2: “NLP is powerful”
Doc3: “NLP is fun and powerful”

Term Frequency of “fun” in Doc1 (“NLP is fun”) = 1 / 3 ≈ 0.333

Limitations of TF Alone:

Inverse Document Frequency (IDF)

Reduces the weight of common words across multiple documents while increasing the weight of rare words. If a term appears in fewer documents, it is more likely to be meaningful and specific.

Formula:

IDF(t, D) = log((The total number of documents in corpus D) / (The number of documents containing the term t))

Limitations of IDF Alone:

Example:

Assume the same 3 documents as above (Doc1: “NLP is fun”, Doc2: “NLP is powerful”, Doc3: “NLP is fun and powerful”).

Step 1: Calculate Term Frequency (TF)

For Document 1: "fun" appears 1 time out of 3 → TF = 1/3

For Document 2: "fun" = 0

For Document 3: "fun" appears 1 time out of 5 → TF = 1/5

TF for "fun": Doc1 = 1/3 ≈ 0.333, Doc2 = 0, Doc3 = 1/5 = 0.2

Step 2: Calculate Inverse Document Frequency (IDF)

Total documents = 3; Documents with "fun" = 2 → IDF = log(3/2) ≈ 0.176

Step 3: Calculate TF-IDF

TF-IDF(t,d) = TF(t,d) × log(N / DF(t))

TF-IDF("fun", Doc1) = (1/3) × 0.176 ≈ 0.059
TF-IDF("fun", Doc2) = 0 × 0.176 = 0
TF-IDF("fun", Doc3) = (1/5) × 0.176 ≈ 0.035

"NLP" appears in all documents → IDF = log(3/3) = 0, so it receives little weight
"fun" appears only in Doc1 & Doc3 → higher IDF, so it is weighted more heavily in those documents

Advantages:

Disadvantages:

Example: TF-IDF using Scikit-learn

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Sample documents
documents = [
    "NLP is fun",
    "NLP is powerful",
    "NLP is fun and powerful"
]

# Initialize the TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the documents
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

# Get feature names (vocabulary)
feature_names = tfidf_vectorizer.get_feature_names_out()
print("Vocabulary (Features):", feature_names)

# Convert the TF-IDF matrix to a dense array and display as a DataFrame
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)
print("\nTF-IDF Matrix:")
print(tfidf_df)

Output:

Vocabulary (Features): ['and' 'fun' 'is' 'nlp' 'powerful']
TF-IDF Matrix:
        and       fun        is       nlp  powerful
0  0.000000  0.673255  0.522842  0.522842  0.000000
1  0.000000  0.000000  0.522842  0.522842  0.673255
2  0.591887  0.450145  0.349578  0.349578  0.450145

Language Modeling: Unigrams, Bigrams, and N-gram Models

What is Language Modeling?

A language model in natural language processing (NLP) is a statistical or machine learning model that is used to predict the next word in a sequence given the previous words. Language models play a crucial role in various NLP tasks such as machine translation, speech recognition, text generation, and sentiment analysis. They analyze and understand the structure and use of human language, enabling machines to process and generate text that is contextually appropriate and coherent.

Language Modeling (LM) is a core task in Natural Language Processing (NLP) that involves assigning probabilities to sequences of words. The goal is to predict the likelihood of a given word sequence, which is crucial for applications like machine translation, speech recognition, and text generation.

Formally, given a sequence of words w1,w2,...,wn, the language model estimates:
P(w1,w2,...,wn)

Or for a generative model (using the chain rule):
P(w1)⋅P(w2∣w1)⋅P(w3∣w1,w2)⋯P(wn∣w1,...,wn−1)

However, computing these full probabilities is computationally intractable due to the curse of dimensionality. Hence, we use the N-gram approximation.

N-gram Models

N-gram models predict the probability of a word given the previous n−1 words. For example, a trigram model uses the preceding two words to predict the next word, i.e., P(wn | wn−2, wn−1).

Goal: Calculate p(w∣h), the probability that the next word is w, given context/history h.

Example: For the phrase: “This article is on…”, if we want to predict the likelihood of “NLP” as the next word:
p("NLP"∣"This","article","is","on")

Chain Rule of Probability

The probability of a sequence of words is computed as:
P(w1, w2, …, wn) = ∏_{i=1}^{n} P(wi | w1, w2, …, wi−1)

Markov Assumption

To reduce complexity, N-gram models assume the probability of a word depends only on the previous n−1 words:
P(wi∣w1,…,wi−1)≈P(wi∣wi−(n−1),…,wi−1)

N-gram models are simple, easy to implement, and computationally efficient, making them suitable for applications with limited computational resources. However, they have significant limitations. They struggle with capturing long-range dependencies due to their limited context window. As n increases, the number of possible n-grams grows exponentially, leading to sparsity issues where many sequences are never observed in the training data. This sparsity makes it difficult to accurately estimate the probabilities of less common sequences.

Unigram Model

In the Unigram Model, each word is assumed to be independent of any other word:

P(w1, w2, ..., wn) = ∏_{i=1}^{n} P(wi)

Example:
Sentence: “I love NLP”
Probability: P(I)⋅P(love)⋅P(NLP)

Use:
• Baseline models
• Text classification with Naive Bayes

Limitation:
• Ignores word order and context completely

Bigram Model

Definition:
In a Bigram Model, the probability of a word depends only on the immediately previous word:

P(wn∣wn−1) = (Count(wn-1,wn)) / (Count(wn-1))

Example:
Sentence: “I love NLP”
Probability: P(I)⋅P(love∣I)⋅P(NLP∣love)

Use:
• Captures basic word dependencies
• Used in predictive text and speech recognition

Limitation:
• Only captures local dependencies; misses longer relationships.
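
A minimal sketch of how such bigram probabilities are estimated from counts (maximum-likelihood estimation on a tiny, made-up corpus; real systems train on far larger text):

from collections import Counter

# Tiny hypothetical training corpus
corpus = [
    "I love NLP",
    "I love natural language processing",
    "I study NLP",
]

tokens = [sentence.split() for sentence in corpus]
unigram_counts = Counter(word for sent in tokens for word in sent)
bigram_counts = Counter((sent[i], sent[i + 1]) for sent in tokens for i in range(len(sent) - 1))

def bigram_prob(prev, word):
    """P(word | prev) = Count(prev, word) / Count(prev)"""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("I", "love"))     # 2/3 ≈ 0.667 ("love" follows "I" in 2 of 3 sentences)
print(bigram_prob("love", "NLP"))   # 1/2 = 0.5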

Trigram Model

Definition:
In a Trigram Model, the probability of a word depends only on the immediately previous two words:

P(wn∣wn−2,wn−1) = (Count(wn-2,wn-1,wn)) / (Count(wn-2,wn-1))

Example (Trigram):
Sentence: “I love natural language processing”
Probability: P(I)⋅P(love∣I)⋅P(natural∣I,love)⋅P(language∣love,natural)⋅P(processing∣natural,language)

Use:
• Improves prediction accuracy
• Common in traditional NLP systems

Limitation:
• Data sparsity: Higher N leads to many unseen word combinations
• Requires smoothing techniques (e.g., Laplace, Kneser-Ney)
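
A minimal sketch of the Laplace (add-one) smoothing mentioned above, applied to a bigram estimate: every count is incremented by 1 and the vocabulary size is added to the denominator, so unseen word pairs receive a small non-zero probability (the corpus is again made up for illustration):

from collections import Counter

corpus = ["I love NLP", "I love natural language processing", "I study NLP"]
tokens = [sentence.split() for sentence in corpus]
unigram_counts = Counter(word for sent in tokens for word in sent)
bigram_counts = Counter((sent[i], sent[i + 1]) for sent in tokens for i in range(len(sent) - 1))
V = len(unigram_counts)   # vocabulary size (7 distinct words here)

def laplace_bigram_prob(prev, word):
    # Add-one smoothing: (Count(prev, word) + 1) / (Count(prev) + V)
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

print(laplace_bigram_prob("love", "NLP"))      # seen pair: (1 + 1) / (2 + 7) ≈ 0.222
print(laplace_bigram_prob("NLP", "language"))  # unseen pair: (0 + 1) / (2 + 7) ≈ 0.111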

Example:

import nltk
nltk.download('punkt')

from nltk import ngrams
from nltk.tokenize import word_tokenize

# Example sentence
sentence = "N-grams enhance language processing tasks."
# Tokenize the sentence
tokens = word_tokenize(sentence)
# Generate bigrams
bigrams = list(ngrams(tokens, 2))
# Generate trigrams
trigrams = list(ngrams(tokens, 3))
# Print the results
print("Bigrams:", bigrams)
print("Trigrams:", trigrams)

Output:

Bigrams: [('N-grams', 'enhance'), ('enhance', 'language'), ('language', 'processing'), ('processing', 'tasks'), ('tasks', '.')]
Trigrams: [('N-grams', 'enhance', 'language'), ('enhance', 'language', 'processing'), ('language', 'processing', 'tasks'), ('processing', 'tasks', '.')]

Word Embeddings and Similarity Measures

In Natural Language Processing (NLP), converting words into vectors - commonly referred to as word embeddings - is fundamental. These embeddings serve as the foundation for numerous NLP applications, enabling computers to understand and interpret human language.

Word Embeddings Overview

Word embeddings are numerical vector representations of words in a high-dimensional space that capture their semantic and syntactic relationships. Unlike sparse models such as Bag-of-Words or TF-IDF, embeddings are dense: they reduce dimensionality while preserving meaning.

Word2Vec

Definition:

Word2Vec is a prediction-based model developed by Google that learns word embeddings using a shallow neural network. It belongs to the family of neural word embedding techniques and, more specifically, to the class of distributed representation models, and it remains a popular technique in natural language processing (NLP).

There are two neural embedding methods for Word2Vec: Continuous Bag of Words (CBOW) and Skip-gram.

Approaches:

Continuous Bag of Words (CBOW)

Continuous Bag of Words (CBOW) is a type of neural network architecture used in the Word2Vec model. The primary objective of CBOW is to predict a target word based on its context, which consists of the surrounding words in a given window. Given a sequence of words in a context window, the model is trained to predict the target word at the center of the window.

The hidden layer of the CBOW network contains the continuous vector representations (word embeddings) of the input words. The weights between the input layer and the hidden layer are learned during training, and the dimensionality of the hidden layer determines the size of the word embeddings (the continuous vector space).

Skip-Gram

The Skip-Gram model learns distributed representations of words in a continuous vector space. Its main objective is to predict the context words (the words surrounding a target word) given the target word itself, which is the opposite of the Continuous Bag of Words (CBOW) model, where the objective is to predict the target word from its context. Skip-gram generally produces more meaningful embeddings, particularly for rare words, though it is slower to train than CBOW; the choice between the two depends on the data and the task.

Example:
Sentence: “I enjoy natural language processing.”
With a context window of size 2 and the target word “natural”, CBOW predicts “natural” from the surrounding context (I, enjoy, language, processing), while Skip-gram predicts each of those context words from “natural”; a short sketch follows below.
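
A minimal sketch of how such training pairs are generated (a window size of 2 is an assumption for illustration; the actual Word2Vec training objective, with negative sampling or hierarchical softmax, is not shown here):

sentence = ["I", "enjoy", "natural", "language", "processing"]
window = 2

for i, target in enumerate(sentence):
    # Context = up to `window` words on each side of the target word
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    print(f"CBOW:      predict {target!r} from context {context}")
    print(f"Skip-gram: predict each word in {context} from {target!r}")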

Properties:

Example:

from gensim.models import Word2Vec

# Sample data
sentences = [
    ['this', 'is', 'the', 'first', 'document'],
    ['this', 'document', 'is', 'the', 'second', 'document'],
    ['and', 'this', 'is', 'the', 'third', 'one'],
    ['is', 'this', 'the', 'first', 'document']
]

# Initialize the Word2Vec model
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Continue training for additional epochs (the Word2Vec constructor above has already trained the model once)
model.train(sentences, total_examples=len(sentences), epochs=10)

# Get vector for a word
print(model.wv['document'])
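
Once trained, the same model object can be queried for word similarities. On this toy corpus the numbers are essentially arbitrary (the data is far too small for meaningful embeddings), but the calls illustrate the API:

# Words most similar to 'document' (by cosine similarity in the learned vector space)
print(model.wv.most_similar('document', topn=3))

# Cosine similarity between two specific words
print(model.wv.similarity('this', 'is'))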

GloVe (Global Vectors for Word Representation)

Definition: GloVe (Global Vectors for Word Representation) is an unsupervised learning algorithm designed to generate dense vector representations, also known as embeddings. Its primary objective is to capture semantic relationships between words by analyzing their co-occurrence patterns in a large text corpus.

GloVe is a count-based model developed at Stanford that uses global co-occurrence statistics of words. It constructs a word-word co-occurrence matrix X, where

X_ij = the number of times word j appears in the context of word i,

and factorizes it to learn vector representations.

The creation of a word co-occurrence matrix is the fundamental component of GloVe. This matrix provides a quantitative measure of the semantic affinity between words by capturing how frequently they appear together in a given context. GloVe then optimises the word vectors by minimising the difference between the dot product of two word vectors and the logarithm of their co-occurrence count, producing dense vector representations that capture syntactic and semantic relationships.

Let's see how the matrix is created. Corpus:

It is a nice evening.
Good Evening!
Is it a nice evening?

           it         is         a          nice    evening   good
it         0
is         1+1        0
a          1/2+1      1+1/2      0
nice       1/3+1/2    1/2+1/3    1+1        0
evening    1/4+1/3    1/3+1/4    1/2+1/2    1+1     0
good       0          0          0          0       1         0

The upper half of the matrix will be a reflection of the lower half. We can consider a window frame as well to calculate the co-occurrences by shifting the frame till the end of the corpus. This helps gather information about the context in which the word is used.
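
A minimal sketch of building such a distance-weighted co-occurrence table in Python (the 1/distance weighting and a window spanning the whole sentence are assumptions chosen to reproduce the fractions in the matrix above; this is only the counting step, not GloVe training itself):

from collections import defaultdict

# The corpus above, lower-cased and with punctuation removed
corpus = [
    ["it", "is", "a", "nice", "evening"],
    ["good", "evening"],
    ["is", "it", "a", "nice", "evening"],
]

window = 4   # large enough to cover each whole sentence
cooc = defaultdict(float)

for sentence in corpus:
    for i, word in enumerate(sentence):
        # Look at the words to the right of `word`, weighting each pair by 1/distance
        for j in range(i + 1, min(i + 1 + window, len(sentence))):
            pair = tuple(sorted((word, sentence[j])))
            cooc[pair] += 1.0 / (j - i)

print(cooc[("is", "it")])         # 2.0  -> the "1+1" cell in the matrix
print(cooc[("evening", "good")])  # 1.0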

Example:

• If “king” and “queen” appear in similar contexts, GloVe will place them close in vector space.

Example:

import gensim.downloader as api

# Download pre-trained GloVe model (choose the size you need - 50, 100, 200, or 300 dimensions)
glove_vectors = api.load("glove-wiki-gigaword-100")  # Example: 100-dimensional GloVe

# Get word vectors (embeddings)
word1 = "king"
word2 = "queen"
vector1 = glove_vectors[word1]
vector2 = glove_vectors[word2]

# Compute cosine similarity between the two word vectors
similarity = glove_vectors.similarity(word1, word2)

print(f"Word vectors for '{word1}': {vector1}")
print(f"Word vectors for '{word2}': {vector2}")
print(f"Cosine similarity between '{word1}' and '{word2}': {similarity}")

Advantages:

FastText

Definition:

FastText is an extension of Word2Vec developed by Facebook. It treats each word as a bag of character n-grams, which helps capture morphology.

For example, with n = 3 the word “apple” (padded with boundary markers as <apple>) is represented by the character trigrams <ap, app, ppl, ple, le>, in addition to the whole word itself.
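
A minimal sketch of extracting these character trigrams (the real FastText implementation uses a range of n-gram lengths, typically 3 to 6, and hashes them into buckets; only the basic idea is shown):

word = "apple"
padded = "<" + word + ">"                                     # add boundary markers
trigrams = [padded[i:i + 3] for i in range(len(padded) - 2)]

print(trigrams)   # ['<ap', 'app', 'ppl', 'ple', 'le>']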

Advantages:

Example:

Even if the word “playfulness” is rare, it can still be embedded effectively using its subword components: "play", "ful", "ness".

Example:

from gensim.models import FastText

# Sample data
sentences = [
    ['this', 'is', 'the', 'first', 'document'],
    ['this', 'document', 'is', 'the', 'second', 'document'],
    ['and', 'this', 'is', 'the', 'third', 'one'],
    ['is', 'this', 'the', 'first', 'document']
]

# Initialize the FastText model
model = FastText(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Continue training for additional epochs (the FastText constructor above has already trained the model once)
model.train(sentences, total_examples=len(sentences), epochs=10)

# Get vector for a word
print(model.wv['document'])
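
Because its vectors are built from character n-grams, FastText can also return an embedding for a word that never appeared in training. For example, continuing from the toy corpus above (the word 'documents' is a hypothetical out-of-vocabulary example chosen for illustration):

# 'documents' was never seen in training, yet a vector is still produced from its subword n-grams
print(model.wv['documents'])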

Cosine Similarity and Distance Measures

Cosine Similarity:

Cosine similarity measures the cosine of the angle between two vectors. It ranges from -1 (opposite direction) to 1 (same direction) and is scale-invariant, which makes it well suited to measuring word similarity.

Cosine Similarity = (A⋅B) / (||A|| ⋅ ||B||)
Where:
A⋅B is the dot product of the vectors
||A|| and ||B|| are magnitudes

Cosine similarity ranges from -1 to 1. A cosine similarity of 1 means the vectors are perfectly aligned (no angle between them), indicating maximum similarity, whereas a value of -1 implies they are diametrically opposite, reflecting maximum dissimilarity. Values near zero indicate orthogonality.

Calculating and Visualizing Cosine Distance

Let’s take an example with two vectors: Vector A (2,4) and Vector B (4,2), and calculate their cosine distance:

  1. Calculate the dot product of A and B:
    A · B = (2 × 4) + (4 × 2) = 8 + 8 = 16
  2. Calculate the magnitudes of A and B:
    ||A|| = √(2² + 4²) = √20
    ||B|| = √(4² + 2²) = √20
  3. Divide the dot product by the product of the magnitudes:
    ( A · B ) / ( ||A|| × ||B|| ) = 16 / (√20 × √20) = 16 / 20 = 0.8
  4. Calculate the cosine distance:
    Cosine Distance = 1 - 0.8 = 0.2

With a cosine distance of 0.2, the result falls quite low on the 0 to 2 scale. This suggests that vectors A and B are quite similar in terms of the direction they point to.
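
The same calculation can be verified in a few lines of NumPy (a minimal sketch of the steps above):

import numpy as np

A = np.array([2, 4])
B = np.array([4, 2])

cosine_similarity = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
cosine_distance = 1 - cosine_similarity

print(cosine_similarity)   # 0.8 (up to floating-point rounding)
print(cosine_distance)     # 0.2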

In this geometric picture, the cosine similarity (0.8) is the cosine of the angle θ between A and B. Because the two vectors point in similar directions, the angle is small, giving a high cosine similarity and a low cosine distance (0.2). Since cosine distance is defined as 1 - cos(θ), where cos(θ) is the cosine similarity, this leads directly to the formula below.

Cosine Distance Formula

Cosine distance measures the dissimilarity between two vectors by calculating the cosine of the angle between them. It can be defined as one minus cosine similarity:

Cosine Distance = 1 - Cosine Similarity

Example:

cosine_similarity("doctor", "nurse") ≈ 0.8
cosine_similarity("doctor", "banana") ≈ 0.1

Applications of Cosine Distance

Cosine similarity and distance are widely used, for example, to rank documents against a query in information retrieval, to compare word, sentence, or document embeddings in semantic search, and to group similar items in clustering and recommendation systems.

Euclidean Distance

Measures the straight-line distance between two vectors:

Euclidean Distance(A, B) = √( Σ (Ai - Bi)² )

It is less commonly used for word embeddings because it is sensitive to vector magnitude (scale), whereas cosine similarity is not.
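
A minimal sketch of this scale sensitivity: B below points in exactly the same direction as A but is twice as long, so cosine similarity treats them as identical while the Euclidean distance does not (the vectors are made up for illustration):

import numpy as np

A = np.array([2, 4])
B = np.array([4, 8])   # same direction as A, twice the magnitude

euclidean_distance = np.linalg.norm(A - B)                                  # ≈ 4.47
cosine_similarity = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))  # 1.0

print(euclidean_distance, cosine_similarity)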

Evaluation Metrics in NLP: Accuracy, Precision, Recall, F1 Score

In NLP and other machine learning tasks, especially in classification problems (e.g., spam detection, sentiment analysis), it is critical to evaluate model performance using appropriate metrics. The most commonly used metrics are Accuracy, Precision, Recall, and the F1 Score.

These metrics are derived from the confusion matrix, which breaks down predictions into:

                     Predicted Positive      Predicted Negative
Actual Positive      True Positive (TP)      False Negative (FN)
Actual Negative      False Positive (FP)     True Negative (TN)

1. Accuracy

Definition:
Accuracy is the ratio of correct predictions to the total number of predictions.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Useful When: The dataset is balanced
Not Reliable When: The dataset is imbalanced (e.g., 95% non-spam, 5% spam → predicting “non-spam” always gives 95% accuracy)

2. Precision

Definition:
Precision is the ratio of correctly predicted positives to the total predicted positives.

Precision = TP / (TP + FP)

Interpretation: Out of all documents predicted as “positive” (e.g., spam), how many were truly spam?
Useful When: False positives are costly (e.g., marking important emails as spam)

3. Recall (Sensitivity)

Definition:
Recall is the ratio of correctly predicted positives to the total actual positives.

Recall = TP / (TP + FN)

Interpretation: Out of all actual spam emails, how many were correctly identified?
Useful When: False negatives are costly (e.g., missing spam or failing to detect hate speech)

4. F1 Score

Definition:
F1 score is the harmonic mean of precision and recall, balancing both in one metric.

F1 Score = 2 × (Precision × Recall) / (Precision + Recall)

Useful When: You want a single score that balances false positives and false negatives, or the dataset is imbalanced

Example:

Assume a sentiment classifier is tested on 100 reviews, giving TP = 60, FP = 20, FN = 10, and TN = 10:

Metric      Formula                   Result (%)
Accuracy    (TP + TN) / Total         (60 + 10) / 100 = 70%
Precision   TP / (TP + FP)            60 / 80 = 75%
Recall      TP / (TP + FN)            60 / 70 ≈ 85.7%
F1 Score    2 × (P × R) / (P + R)     2 × (0.75 × 0.857) / (0.75 + 0.857) = 80%
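
The same numbers can be reproduced with scikit-learn (a minimal sketch; the label vectors below are just one arrangement of 100 reviews that yields TP = 60, FP = 20, FN = 10, TN = 10):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# 1 = positive review, 0 = negative review
y_true = [1] * 60 + [1] * 10 + [0] * 20 + [0] * 10   # 70 actual positives, 30 actual negatives
y_pred = [1] * 60 + [0] * 10 + [1] * 20 + [0] * 10   # TP = 60, FN = 10, FP = 20, TN = 10

print("Accuracy :", accuracy_score(y_true, y_pred))    # 0.70
print("Precision:", precision_score(y_true, y_pred))   # 0.75
print("Recall   :", recall_score(y_true, y_pred))      # ≈ 0.857
print("F1 Score :", f1_score(y_true, y_pred))          # 0.80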

Summary Table

Metric      Best When                      Key Strength                      Formula
Accuracy    Balanced data                  Overall correctness               (TP + TN) / Total
Precision   False positives are costly     Trust in positive predictions     TP / (TP + FP)
Recall      False negatives are costly     Capturing all true positives      TP / (TP + FN)
F1 Score    Balance needed                 Compromise between P and R        2 × (P × R) / (P + R)