EduLearn - NLP Chapter 1

Introduction to Natural Language Processing

Bag of Words (BoW)

Definition:

The Bag of Words model is one of the simplest and most widely used techniques for converting text into a numerical representation. It turns a collection of text documents into a matrix of word counts or frequencies so that machine learning algorithms can be applied to textual data. Each document is treated as a “bag” (multiset) of its words: the presence and count of each word are captured, while grammar and word order are discarded and only multiplicity (frequency) is preserved.

Working:

  1. Vocabulary Creation
    • Compile a list of all unique words that appear across the entire corpus (collection of documents).
    • This forms the vocabulary for encoding documents.
  2. Vector Representation
    • Each document is transformed into a fixed-length vector, where:
      • Each dimension corresponds to a word in the vocabulary.
      • The value represents the frequency (count) of the word in the respective document.

Example:

Let’s say we have two documents:

Doc1: “NLP is fun”
Doc2: “NLP is powerful”

Vocabulary = [“NLP”, “is”, “fun”, “powerful”]

Word        Doc1   Doc2
NLP         1      1
is          1      1
fun         1      0
powerful    0      1

So:

Doc1 → [1, 1, 1, 0]
Doc2 → [1, 1, 0, 1]
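
The same table can be reproduced in plain Python. Below is a minimal sketch (independent of scikit-learn) that follows the two steps described above, using Doc1 and Doc2 from this example:

from collections import Counter

# Doc1 and Doc2 from the example above
documents = ["NLP is fun", "NLP is powerful"]

# Step 1: Vocabulary creation (listed here in the same order as the table above)
vocabulary = ["NLP", "is", "fun", "powerful"]

# Step 2: Vector representation - each document becomes a fixed-length vector of word counts
vectors = [[Counter(doc.split())[word] for word in vocabulary] for doc in documents]

print(vectors)   # [[1, 1, 1, 0], [1, 1, 0, 1]]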

Advantages:

Disadvantages:

Applications of BoW:

Python Example: BoW using CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer

# Sample documents
documents = [
    "NLP is fun",
    "NLP is powerful"
]

# Initialize CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the documents
X = vectorizer.fit_transform(documents)

# Display the vocabulary
print("Vocabulary:", vectorizer.vocabulary_)

# Display the feature names
print("Feature Names:", vectorizer.get_feature_names_out())

# Convert the matrix to an array for better readability
print("Document-Term Matrix:\n", X.toarray())

Output:

Vocabulary: {'nlp': 2, 'is': 1, 'fun': 0, 'powerful': 3}
Feature Names: ['fun' 'is' 'nlp' 'powerful']
Document-Term Matrix:
 [[1 1 1 0]
 [0 1 1 1]]

TF-IDF (Term Frequency – Inverse Document Frequency)

TF-IDF is a weighting scheme that reflects how important a word is to a document in a collection of documents (corpus).
It balances two factors: how often a term appears within a document (term frequency) and how rare that term is across the entire corpus (inverse document frequency).

This weighting scheme is widely used in Information Retrieval (IR), text mining, and Natural Language Processing (NLP) tasks like document classification and keyword extraction.

Formula:

TF-IDF(t,d) = TF(t,d) × log(N / DF(t))

Term Frequency (TF):

Measures how often a word appears in a document. A higher frequency suggests greater importance. If a term appears frequently in a document, it is likely relevant to the document’s content.

Formula:
TF(t,d) = (The number of times the term t appears in document d) / (The Total number of terms in document d)

Example:

Assume 3 documents:

Doc1: “NLP is fun”
Doc2: “NLP is powerful”
Doc3: “NLP is fun and powerful”

Term Frequency of “fun” in Doc1 (“NLP is fun”) = 1 / 3 ≈ 0.333

Limitations of TF Alone:

Inverse Document Frequency (IDF)

Reduces the weight of common words across multiple documents while increasing the weight of rare words. If a term appears in fewer documents, it is more likely to be meaningful and specific.

Formula:

IDF(t, D) = log((The total number of documents in corpus D) / (The number of documents containing the term t))

Limitations of IDF Alone:

Example:

Assume the same 3 documents as above (Doc1: “NLP is fun”, Doc2: “NLP is powerful”, Doc3: “NLP is fun and powerful”).

Step 1: Calculate Term Frequency (TF)

For Document 1: "fun" appears 1 time out of 3 → TF = 1/3

For Document 2: "fun" = 0

For Document 3: "fun" appears 1 time out of 5 → TF = 1/5

TF for "fun": Doc1 = 1/3 ≈ 0.333, Doc2 = 0, Doc3 = 1/5 = 0.2

Step 2: Calculate Inverse Document Frequency (IDF)

Total documents = 3; Documents with "fun" = 2 → IDF = log(3/2) ≈ 0.176

Step 3: Calculate TF-IDF

TF-IDF(t,d) = TF(t,d) × log(N / DF(t))

TF-IDF("fun", Doc1) = (1/3) × 0.176 ≈ 0.059
TF-IDF("fun", Doc2) = 0 × 0.176 = 0
TF-IDF("fun", Doc3) = (1/5) × 0.176 ≈ 0.035

"NLP" appears in all documents → IDF = log(3/3) = 0, so it receives little weight
"fun" appears only in Doc1 & Doc3 → higher IDF, so it is weighted more heavily in those documents

Advantages:

Disadvantages:

Example: TF-IDF using Scikit-learn

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Sample documents
documents = [
    "NLP is fun",
    "NLP is powerful",
    "NLP is fun and powerful"
]

# Initialize the TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the documents
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

# Get feature names (vocabulary)
feature_names = tfidf_vectorizer.get_feature_names_out()
print("Vocabulary (Features):", feature_names)

# Convert the TF-IDF matrix to a dense array and display as a DataFrame
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)
print("\nTF-IDF Matrix:")
print(tfidf_df)

Output:

Vocabulary (Features): ['and' 'fun' 'is' 'nlp' 'powerful']
TF-IDF Matrix:
        and       fun        is       nlp  powerful
0  0.000000  0.673255  0.522842  0.522842  0.000000
1  0.000000  0.000000  0.522842  0.522842  0.673255
2  0.591887  0.450145  0.349578  0.349578  0.450145

Language Modeling: Unigrams, Bigrams, and N-gram Models

What is Language Modeling?

A language model in natural language processing (NLP) is a statistical or machine learning model that is used to predict the next word in a sequence given the previous words. Language models play a crucial role in various NLP tasks such as machine translation, speech recognition, text generation, and sentiment analysis. They analyze and understand the structure and use of human language, enabling machines to process and generate text that is contextually appropriate and coherent.

Language Modeling (LM) is a core task in Natural Language Processing (NLP) that involves assigning probabilities to sequences of words. The goal is to predict the likelihood of a given word sequence, which is crucial for applications like machine translation, speech recognition, and text generation.

Formally, given a sequence of words w1,w2,...,wn, the language model estimates:
P(w1,w2,...,wn)

Or for a generative model (using the chain rule):
P(w1)⋅P(w2∣w1)⋅P(w3∣w1,w2)⋯P(wn∣w1,...,wn−1)

However, computing these full probabilities is computationally intractable due to the curse of dimensionality. Hence, we use the N-gram approximation.

N-gram Models

N-gram models predict the probability of a word given the previous n−1 words. For example, a trigram model uses the preceding two words to predict the next word, i.e., P(wn | wn−2, wn−1).

Goal: Calculate p(w∣h), the probability that the next word is w, given context/history h.

Example: For the phrase: “This article is on…”, if we want to predict the likelihood of “NLP” as the next word:
p("NLP"∣"This","article","is","on")

Chain Rule of Probability

The probability of a sequence of words is computed as:
P(w1, w2, …, wn) = ∏_{i=1}^{n} P(wi | w1, w2, …, wi−1)

Markov Assumption

To reduce complexity, N-gram models assume the probability of a word depends only on the previous n−1 words:
P(wi∣w1,…,wi−1)≈P(wi∣wi−(n−1),…,wi−1)

N-gram models are simple, easy to implement, and computationally efficient, making them suitable for applications with limited computational resources. However, they have significant limitations. They struggle with capturing long-range dependencies due to their limited context window. As n increases, the number of possible n-grams grows exponentially, leading to sparsity issues where many sequences are never observed in the training data. This sparsity makes it difficult to accurately estimate the probabilities of less common sequences.

Unigram Model

In the Unigram Model, each word is assumed to be independent of any other word:

P(w1, w2, ..., wn) = ∏_{i=1}^{n} P(wi)

Example:
Sentence: “I love NLP”
Probability: P(I)⋅P(love)⋅P(NLP)

Use:
• Baseline models
• Text classification with Naive Bayes

Limitation:
• Ignores word order and context completely

Bigram Model

Definition:
In a Bigram Model, the probability of a word depends only on the immediately previous word:

P(wn∣wn−1) = (Count(wn-1,wn)) / (Count(wn-1))

Example:
Sentence: “I love NLP”
Probability: P(I)⋅P(love∣I)⋅P(NLP∣love)

Use:
• Captures basic word dependencies
• Used in predictive text and speech recognition

Limitation:
• Only captures local dependencies; misses longer relationships.
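
A minimal sketch of how such bigram probabilities are estimated from counts (maximum-likelihood estimation on a tiny, made-up corpus; real systems train on far larger text):

from collections import Counter

# Tiny hypothetical training corpus
corpus = [
    "I love NLP",
    "I love natural language processing",
    "I study NLP",
]

tokens = [sentence.split() for sentence in corpus]
unigram_counts = Counter(word for sent in tokens for word in sent)
bigram_counts = Counter((sent[i], sent[i + 1]) for sent in tokens for i in range(len(sent) - 1))

def bigram_prob(prev, word):
    """P(word | prev) = Count(prev, word) / Count(prev)"""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("I", "love"))     # 2/3 ≈ 0.667 ("love" follows "I" in 2 of 3 sentences)
print(bigram_prob("love", "NLP"))   # 1/2 = 0.5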

Trigram Model

Definition:
In a Trigram Model, the probability of a word depends only on the immediately previous two words:

P(wn∣wn−2,wn−1) = (Count(wn-2,wn-1,wn)) / (Count(wn-2,wn-1))

Example (Trigram):
Sentence: “I love natural language processing”
Probability: P(I)⋅P(love∣I)⋅P(natural∣I,love)⋅P(language∣love,natural)⋅P(processing∣natural,language)

Use:
• Improves prediction accuracy
• Common in traditional NLP systems

Limitation:
• Data sparsity: Higher N leads to many unseen word combinations
• Requires smoothing techniques (e.g., Laplace, Kneser-Ney)
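
A minimal sketch of the Laplace (add-one) smoothing mentioned above, applied to a bigram estimate: every count is incremented by 1 and the vocabulary size is added to the denominator, so unseen word pairs receive a small non-zero probability (the corpus is again made up for illustration):

from collections import Counter

corpus = ["I love NLP", "I love natural language processing", "I study NLP"]
tokens = [sentence.split() for sentence in corpus]
unigram_counts = Counter(word for sent in tokens for word in sent)
bigram_counts = Counter((sent[i], sent[i + 1]) for sent in tokens for i in range(len(sent) - 1))
V = len(unigram_counts)   # vocabulary size (7 distinct words here)

def laplace_bigram_prob(prev, word):
    # Add-one smoothing: (Count(prev, word) + 1) / (Count(prev) + V)
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

print(laplace_bigram_prob("love", "NLP"))      # seen pair: (1 + 1) / (2 + 7) ≈ 0.222
print(laplace_bigram_prob("NLP", "language"))  # unseen pair: (0 + 1) / (2 + 7) ≈ 0.111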

Example:

import nltk
nltk.download('punkt')

from nltk import ngrams
from nltk.tokenize import word_tokenize

# Example sentence
sentence = "N-grams enhance language processing tasks."
# Tokenize the sentence
tokens = word_tokenize(sentence)
# Generate bigrams
bigrams = list(ngrams(tokens, 2))
# Generate trigrams
trigrams = list(ngrams(tokens, 3))
# Print the results
print("Bigrams:", bigrams)
print("Trigrams:", trigrams)

Output:

Bigrams: [('N-grams', 'enhance'), ('enhance', 'language'), ('language', 'processing'), ('processing', 'tasks'), ('tasks', '.')]
Trigrams: [('N-grams', 'enhance', 'language'), ('enhance', 'language', 'processing'), ('language', 'processing', 'tasks'), ('processing', 'tasks', '.')]

Word Embeddings and Similarity Measures

In Natural Language Processing (NLP), converting words into vectors - commonly referred to as word embeddings - is fundamental. These embeddings serve as the foundation for numerous NLP applications, enabling computers to understand and interpret human language.

Word Embeddings Overview

Word embeddings are numerical vector representations of words in a high-dimensional space that capture their semantic and syntactic relationships. Unlike sparse models such as Bag-of-Words or TF-IDF, embeddings are dense: they reduce dimensionality while preserving meaning.

Word2Vec

Definition:

Word2Vec is a prediction-based model developed by Google that learns word embeddings using a shallow neural network. It belongs to the family of neural word embedding techniques and, more specifically, to the class of distributed representation models, and it remains a popular technique in natural language processing (NLP).

There are two neural embedding methods for Word2Vec: Continuous Bag of Words (CBOW) and Skip-gram.

Approaches:

Continuous Bag of Words (CBOW)

Continuous Bag of Words (CBOW) is a type of neural network architecture used in the Word2Vec model. The primary objective of CBOW is to predict a target word based on its context, which consists of the surrounding words in a given window. Given a sequence of words in a context window, the model is trained to predict the target word at the center of the window.

The hidden layer of the CBOW network contains the continuous vector representations (word embeddings) of the input words. The weights between the input layer and the hidden layer are learned during training, and the dimensionality of the hidden layer determines the size of the word embeddings (the continuous vector space).

Skip-Gram

The Skip-Gram model learns distributed representations of words in a continuous vector space. Its main objective is to predict the context words (the words surrounding a target word) given the target word itself, which is the opposite of the Continuous Bag of Words (CBOW) model, where the objective is to predict the target word from its context. Skip-gram generally produces more meaningful embeddings, particularly for rare words, though it is slower to train than CBOW; the choice between the two depends on the data and the task.

Example:
Sentence: “I enjoy natural language processing.”
With a context window of size 2 and the target word “natural”, CBOW predicts “natural” from the surrounding context (I, enjoy, language, processing), while Skip-gram predicts each of those context words from “natural”; a short sketch follows below.
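
A minimal sketch of how such training pairs are generated (a window size of 2 is an assumption for illustration; the actual Word2Vec training objective, with negative sampling or hierarchical softmax, is not shown here):

sentence = ["I", "enjoy", "natural", "language", "processing"]
window = 2

for i, target in enumerate(sentence):
    # Context = up to `window` words on each side of the target word
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    print(f"CBOW:      predict {target!r} from context {context}")
    print(f"Skip-gram: predict each word in {context} from {target!r}")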

Properties:

Example:

from gensim.models import Word2Vec

# Sample data
sentences = [
    ['this', 'is', 'the', 'first', 'document'],
    ['this', 'document', 'is', 'the', 'second', 'document'],
    ['and', 'this', 'is', 'the', 'third', 'one'],
    ['is', 'this', 'the', 'first', 'document']
]

# Initialize the Word2Vec model
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Continue training for additional epochs (the Word2Vec constructor above has already trained the model once)
model.train(sentences, total_examples=len(sentences), epochs=10)

# Get vector for a word
print(model.wv['document'])
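
Once trained, the same model object can be queried for word similarities. On this toy corpus the numbers are essentially arbitrary (the data is far too small for meaningful embeddings), but the calls illustrate the API:

# Words most similar to 'document' (by cosine similarity in the learned vector space)
print(model.wv.most_similar('document', topn=3))

# Cosine similarity between two specific words
print(model.wv.similarity('this', 'is'))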

GloVe (Global Vectors for Word Representation)

Definition: GloVe (Global Vectors for Word Representation) is an unsupervised learning algorithm designed to generate dense vector representations, also known as embeddings. Its primary objective is to capture semantic relationships between words by analyzing their co-occurrence patterns in a large text corpus.

GloVe is a count-based model developed at Stanford that uses global co-occurrence statistics of words. It constructs a word-word co-occurrence matrix X, where

X_ij = the number of times word j appears in the context of word i,

and factorizes it to learn vector representations.

The creation of a word co-occurrence matrix is the fundamental component of GloVe. This matrix provides a quantitative measure of the semantic affinity between words by capturing how frequently they appear together in a given context. GloVe then optimises the word vectors by minimising the difference between the dot product of two word vectors and the logarithm of their co-occurrence count, producing dense vector representations that capture syntactic and semantic relationships.

Let's see how the matrix is created. Corpus:

It is a nice evening.
Good Evening!
Is it a nice evening?

           it         is         a          nice    evening   good
it         0
is         1+1        0
a          1/2+1      1+1/2      0
nice       1/3+1/2    1/2+1/3    1+1        0
evening    1/4+1/3    1/3+1/4    1/2+1/2    1+1     0
good       0          0          0          0       1         0

The upper half of the matrix will be a reflection of the lower half. We can consider a window frame as well to calculate the co-occurrences by shifting the frame till the end of the corpus. This helps gather information about the context in which the word is used.
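
A minimal sketch of building such a distance-weighted co-occurrence table in Python (the 1/distance weighting and a window spanning the whole sentence are assumptions chosen to reproduce the fractions in the matrix above; this is only the counting step, not GloVe training itself):

from collections import defaultdict

# The corpus above, lower-cased and with punctuation removed
corpus = [
    ["it", "is", "a", "nice", "evening"],
    ["good", "evening"],
    ["is", "it", "a", "nice", "evening"],
]

window = 4   # large enough to cover each whole sentence
cooc = defaultdict(float)

for sentence in corpus:
    for i, word in enumerate(sentence):
        # Look at the words to the right of `word`, weighting each pair by 1/distance
        for j in range(i + 1, min(i + 1 + window, len(sentence))):
            pair = tuple(sorted((word, sentence[j])))
            cooc[pair] += 1.0 / (j - i)

print(cooc[("is", "it")])         # 2.0  -> the "1+1" cell in the matrix
print(cooc[("evening", "good")])  # 1.0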

Example:

• If “king” and “queen” appear in similar contexts, GloVe will place them close in vector space.

Example:

import gensim.downloader as api

# Download pre-trained GloVe model (choose the size you need - 50, 100, 200, or 300 dimensions)
glove_vectors = api.load("glove-wiki-gigaword-100")  # Example: 100-dimensional GloVe

# Get word vectors (embeddings)
word1 = "king"
word2 = "queen"
vector1 = glove_vectors[word1]
vector2 = glove_vectors[word2]

# Compute cosine similarity between the two word vectors
similarity = glove_vectors.similarity(word1, word2)

print(f"Word vectors for '{word1}': {vector1}")
print(f"Word vectors for '{word2}': {vector2}")
print(f"Cosine similarity between '{word1}' and '{word2}': {similarity}")

Advantages:

FastText

Definition:

FastText is an extension of Word2Vec developed by Facebook. It treats each word as a bag of character n-grams, which helps capture morphology.

For example, with n = 3 the word “apple” (padded with boundary markers as <apple>) is represented by the character trigrams <ap, app, ppl, ple, le>, in addition to the whole word itself.
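
A minimal sketch of extracting these character trigrams (the real FastText implementation uses a range of n-gram lengths, typically 3 to 6, and hashes them into buckets; only the basic idea is shown):

word = "apple"
padded = "<" + word + ">"                                     # add boundary markers
trigrams = [padded[i:i + 3] for i in range(len(padded) - 2)]

print(trigrams)   # ['<ap', 'app', 'ppl', 'ple', 'le>']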

Advantages:

Example:

Even if the word “playfulness” is rare, it can still be embedded effectively using its subword components: "play", "ful", "ness".

Example:

from gensim.models import FastText

# Sample data
sentences = [
    ['this', 'is', 'the', 'first', 'document'],
    ['this', 'document', 'is', 'the', 'second', 'document'],
    ['and', 'this', 'is', 'the', 'third', 'one'],
    ['is', 'this', 'the', 'first', 'document']
]

# Initialize the FastText model
model = FastText(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Continue training for additional epochs (the FastText constructor above has already trained the model once)
model.train(sentences, total_examples=len(sentences), epochs=10)

# Get vector for a word
print(model.wv['document'])
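
Because its vectors are built from character n-grams, FastText can also return an embedding for a word that never appeared in training. For example, continuing from the toy corpus above (the word 'documents' is a hypothetical out-of-vocabulary example chosen for illustration):

# 'documents' was never seen in training, yet a vector is still produced from its subword n-grams
print(model.wv['documents'])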

Cosine Similarity and Distance Measures

Cosine Similarity:

Cosine similarity measures the cosine of the angle between two vectors. It ranges from -1 (opposite direction) to 1 (same direction) and is scale-invariant, which makes it well suited to measuring word similarity.

Cosine Similarity = (A⋅B) / (||A|| ⋅ ||B||)
Where:
A⋅B is the dot product of the vectors
||A|| and ||B|| are magnitudes

Cosine similarity ranges from -1 to 1. A cosine similarity of 1 means the vectors are perfectly aligned (no angle between them), indicating maximum similarity, whereas a value of -1 implies they are diametrically opposite, reflecting maximum dissimilarity. Values near zero indicate orthogonality.

Calculating and Visualizing Cosine Distance

Let’s take an example with two vectors: Vector A (2,4) and Vector B (4,2), and calculate their cosine distance:

  1. Calculate the dot product of A and B:
    A · B = (2 × 4) + (4 × 2) = 8 + 8 = 16
  2. Calculate the magnitudes of A and B:
    ||A|| = √(2² + 4²) = √20
    ||B|| = √(4² + 2²) = √20
  3. Divide the dot product by the product of the magnitudes:
    ( A · B ) / ( ||A|| × ||B|| ) = 16 / (√20 × √20) = 16 / 20 = 0.8
  4. Calculate the cosine distance:
    Cosine Distance = 1 - 0.8 = 0.2

With a cosine distance of 0.2, the result falls quite low on the 0 to 2 scale. This suggests that vectors A and B are quite similar in terms of the direction they point to.
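
The same calculation can be verified in a few lines of NumPy (a minimal sketch of the steps above):

import numpy as np

A = np.array([2, 4])
B = np.array([4, 2])

cosine_similarity = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
cosine_distance = 1 - cosine_similarity

print(cosine_similarity)   # 0.8 (up to floating-point rounding)
print(cosine_distance)     # 0.2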

In this geometric picture, the cosine similarity (0.8) is the cosine of the angle θ between A and B. Because the two vectors point in similar directions, the angle is small, giving a high cosine similarity and a low cosine distance (0.2). Since cosine distance is defined as 1 - cos(θ), where cos(θ) is the cosine similarity, this leads directly to the formula below.

Cosine Distance Formula

Cosine distance measures the dissimilarity between two vectors by calculating the cosine of the angle between them. It can be defined as one minus cosine similarity:

Cosine Distance = 1 - Cosine Similarity

Example:

cosine_similarity("doctor", "nurse") ≈ 0.8
cosine_similarity("doctor", "banana") ≈ 0.1

Applications of Cosine Distance

Cosine similarity and distance are widely used, for example, to rank documents against a query in information retrieval, to compare word, sentence, or document embeddings in semantic search, and to group similar items in clustering and recommendation systems.

Euclidean Distance

Measures the straight-line distance between two vectors:

Euclidean Distance(A, B) = √( Σ (Ai - Bi)² )

It is less commonly used for word embeddings because it is sensitive to vector magnitude (scale), whereas cosine similarity is not.
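
A minimal sketch of this scale sensitivity: B below points in exactly the same direction as A but is twice as long, so cosine similarity treats them as identical while the Euclidean distance does not (the vectors are made up for illustration):

import numpy as np

A = np.array([2, 4])
B = np.array([4, 8])   # same direction as A, twice the magnitude

euclidean_distance = np.linalg.norm(A - B)                                  # ≈ 4.47
cosine_similarity = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))  # 1.0

print(euclidean_distance, cosine_similarity)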

Evaluation Metrics in NLP: Accuracy, Precision, Recall, F1 Score

In NLP and other machine learning tasks, especially in classification problems (e.g., spam detection, sentiment analysis), it is critical to evaluate model performance using appropriate metrics. The most commonly used metrics are Accuracy, Precision, Recall, and the F1 Score.

These metrics are derived from the confusion matrix, which breaks down predictions into:

                     Predicted Positive      Predicted Negative
Actual Positive      True Positive (TP)      False Negative (FN)
Actual Negative      False Positive (FP)     True Negative (TN)

1. Accuracy

Definition:
Accuracy is the ratio of correct predictions to the total number of predictions.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Useful When: The dataset is balanced
Not Reliable When: The dataset is imbalanced (e.g., 95% non-spam, 5% spam → predicting “non-spam” always gives 95% accuracy)

2. Precision

Definition:
Precision is the ratio of correctly predicted positives to the total predicted positives.

Precision = TP / (TP + FP)

Interpretation: Out of all documents predicted as “positive” (e.g., spam), how many were truly spam?
Useful When: False positives are costly (e.g., marking important emails as spam)

3. Recall (Sensitivity)

Definition:
Recall is the ratio of correctly predicted positives to the total actual positives.

Recall = TP / (TP + FN)

Interpretation: Out of all actual spam emails, how many were correctly identified?
Useful When: False negatives are costly (e.g., missing spam or failing to detect hate speech)

4. F1 Score

Definition:
F1 score is the harmonic mean of precision and recall, balancing both in one metric.

F1 Score = 2 × (Precision × Recall) / (Precision + Recall)

Useful When: You want a single score that balances false positives and false negatives, or the dataset is imbalanced

Example:

Assume a sentiment classifier is tested on 100 reviews, giving TP = 60, FP = 20, FN = 10, and TN = 10:

Metric      Formula                   Result (%)
Accuracy    (TP + TN) / Total         (60 + 10) / 100 = 70%
Precision   TP / (TP + FP)            60 / 80 = 75%
Recall      TP / (TP + FN)            60 / 70 ≈ 85.7%
F1 Score    2 × (P × R) / (P + R)     2 × (0.75 × 0.857) / (0.75 + 0.857) = 80%
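
The same numbers can be reproduced with scikit-learn (a minimal sketch; the label vectors below are just one arrangement of 100 reviews that yields TP = 60, FP = 20, FN = 10, TN = 10):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# 1 = positive review, 0 = negative review
y_true = [1] * 60 + [1] * 10 + [0] * 20 + [0] * 10   # 70 actual positives, 30 actual negatives
y_pred = [1] * 60 + [0] * 10 + [1] * 20 + [0] * 10   # TP = 60, FN = 10, FP = 20, TN = 10

print("Accuracy :", accuracy_score(y_true, y_pred))    # 0.70
print("Precision:", precision_score(y_true, y_pred))   # 0.75
print("Recall   :", recall_score(y_true, y_pred))      # ≈ 0.857
print("F1 Score :", f1_score(y_true, y_pred))          # 0.80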

Summary Table

Metric      Best When                      Key Strength                      Formula
Accuracy    Balanced data                  Overall correctness               (TP + TN) / Total
Precision   False positives are costly     Trust in positive predictions     TP / (TP + FP)
Recall      False negatives are costly     Capturing all true positives      TP / (TP + FN)
F1 Score    Balance needed                 Compromise between P and R        2 × (P × R) / (P + R)