Unit 3

Overview of Text Classification Tasks

1. What is Text Classification

Text Classification is the process of automatically assigning text documents to predefined categories.

In simple words, it means a computer reads text and decides which group it belongs to.

Example:
Email → Spam / Not Spam
Review → Positive / Negative


2. Why Text Classification is Needed

Today there is a huge amount of digital text such as emails, reviews, messages, and articles.
It is not possible for humans to manually classify all this data.

So, machines are used to classify text automatically.


3. Basic Example

Message: "Win money now"
Category: Spam

Message: "Meeting at 5 pm"
Category: Not Spam

The computer learns from such examples and classifies new messages.

4. Steps in Text Classification

Step 1: Data Collection
Collect labeled text data.

Example: Spam and non-spam emails.

Step 2: Text Preprocessing
Clean the text.

Includes:
Removing symbols
Removing extra spaces
Converting to lowercase
Removing common words (stop words such as "I", "the", "is")

Example:
"I Love AI!!!" → "love ai"

Step 3: Feature Extraction
Convert text into numbers.

Methods:
Bag of Words
TF-IDF
Word Embeddings
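Steps 2 and 3 can be sketched in a few lines of plain Python. This is only an illustration: the stop-word list, the three sample messages, and the function names are assumptions made for this example, and the TF-IDF variant used here (tf = count / document length, idf = log(N / df)) is one common form among several.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny stop-word list for illustration; real lists are much longer.
STOP_WORDS = {"i", "a", "an", "the", "is", "at"}

def preprocess(text):
    """Step 2: lowercase, remove symbols and extra spaces, drop stop words."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return [t for t in text.split() if t not in STOP_WORDS]

# Made-up sample messages.
docs = ["Win money now!!!", "Meeting at 5 pm", "Win a prize now"]
tokenized = [preprocess(d) for d in docs]

# Vocabulary: every distinct word, sorted so that column positions are stable.
vocab = sorted({w for doc in tokenized for w in doc})

def bag_of_words(tokens):
    """Step 3 (Bag of Words): one count per vocabulary word."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def tf_idf(tokens):
    """Step 3 (TF-IDF): tf = count / len(doc), idf = log(N / df)."""
    counts = Counter(tokens)
    vec = []
    for w in vocab:
        tf = counts[w] / len(tokens)
        df = sum(1 for doc in tokenized if w in doc)  # documents containing w
        vec.append(tf * math.log(len(tokenized) / df))
    return vec

print(preprocess("I Love AI!!!"))    # ['love', 'ai']
print(vocab)                         # ['5', 'meeting', 'money', 'now', 'pm', 'prize', 'win']
print(bag_of_words(tokenized[0]))    # [0, 0, 1, 1, 0, 0, 1]
```

In the TF-IDF vector of the first message, "money" gets a higher weight than "win" because "money" appears in only one message while "win" appears in two.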

Step 4: Model Training
Train machine learning algorithms.

Examples:
Naive Bayes
Logistic Regression
SVM
Neural Networks

Step 5: Prediction
New text is given to the model.
The model predicts the category.


5. Common Algorithms Used

Naive Bayes: Fast and simple, good for spam detection
Logistic Regression: Easy to understand, good accuracy
SVM: High accuracy for complex data


6. Applications of Text Classification

Used in:

Email filtering
Product reviews
News categorization
Chatbots
Search engines
Social media monitoring


Naive Bayes for Text Classification

Basic Idea

Naive Bayes is a probability-based classifier.

It uses past data to calculate the probability of a document belonging to a class.

It assumes that, given the class, every word in a document is independent of the other words.

This assumption is called the naive assumption.


Bayes Formula (Using a and b)

Let:

a = Class (Spam, Positive, etc.)
b = Document

Main formula:

P(a | b) = ( P(b | a) × P(a) ) / P(b)

Meaning:

Probability of class a when document b is given.

In practice, P(b) is the same for all classes, so we use:

P(a | b) ∝ P(b | a) × P(a)


Word Independence Formula

If document b has words w1, w2, w3:

P(b | a) = P(w1 | a) × P(w2 | a) × P(w3 | a)

Final formula:

P(a | b) ∝ P(a) × ∏ P(wi | a)


Decision Rule

The class with the highest probability is selected:

Class = argmax over a [ P(a) × ∏ P(wi | a) ]
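The formulas above can be tried on a toy example. The sketch below is a minimal illustration, not a production classifier. Two things in it are assumptions that the notes do not cover: add-one (Laplace) smoothing, so that an unseen word does not make the whole product zero, and summing logarithms instead of multiplying probabilities, to avoid very small numbers. The training messages beyond the two from the basic example are made up.

```python
import math
from collections import Counter, defaultdict

# Toy training set; the first and third messages come from the basic example,
# the other two are made up for illustration.
train = [
    ("win money now", "Spam"),
    ("win a free prize", "Spam"),
    ("meeting at 5 pm", "Not Spam"),
    ("project meeting tomorrow", "Not Spam"),
]

class_docs = Counter()               # number of documents per class
word_counts = defaultdict(Counter)   # word frequencies per class
vocab = set()
for text, label in train:
    class_docs[label] += 1
    for w in text.split():
        word_counts[label][w] += 1
        vocab.add(w)

def log_score(text, label):
    """log P(a) + sum of log P(wi | a), with add-one smoothing (an assumption)."""
    score = math.log(class_docs[label] / len(train))
    total = sum(word_counts[label].values())
    for w in text.split():
        p = (word_counts[label][w] + 1) / (total + len(vocab))
        score += math.log(p)
    return score

def classify(text):
    """Decision rule: return the class with the highest score."""
    return max(class_docs, key=lambda label: log_score(text, label))

print(classify("win money"))      # Spam
print(classify("meeting at 5"))   # Not Spam
```

Because the logarithm is an increasing function, the class with the highest log score is also the class with the highest P(a) × ∏ P(wi | a).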


Advantages and Limitations

Advantages:

Limitations:


Support Vector Machine (SVM) for Text Classification

Basic Idea

SVM is a margin-based classifier.

It draws a boundary between two classes such that the distance from the boundary to the nearest points of each class is maximum.

This distance is called the margin.

Only the closest points decide the boundary.

These points are called support vectors.


Hyperplane Formula

The separating line or plane is:

w · x + b = 0

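Where:

w = weight vector (one weight per feature)
x = feature vector of the document
b = bias (offset) term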


Classification Formula

For a new document x:

f(x) = sign(w · x + b)

If the value is positive → Class +1
If the value is negative → Class −1
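The classification formula is short enough to write out directly. In this sketch the weights, the bias, and the three-word vocabulary are hypothetical values chosen by hand, not learned from data. The sign function returns +1 or −1, so the second class is reported as −1 here, and a score of exactly zero is assigned to −1 by convention (an assumption).

```python
def dot(w, x):
    """Dot product w · x of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))

def svm_predict(w, b, x):
    """Return +1 if w · x + b is positive, otherwise -1."""
    return 1 if dot(w, x) + b > 0 else -1

# Hypothetical weights for the vocabulary ["win", "money", "meeting"].
w = [1.2, 0.8, -1.5]
b = -0.5

print(svm_predict(w, b, [1, 1, 0]))   # 1   (score = 1.2 + 0.8 - 0.5 = 1.5)
print(svm_predict(w, b, [0, 0, 1]))   # -1  (score = -1.5 - 0.5 = -2.0)
```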


Optimization Formula

SVM tries to minimize:

(1/2) ||w||²

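With the condition, for every training document xi with label yi (+1 or −1):

yi (w · xi + b) ≥ 1

For noisy data, a slack ξi ≥ 0 is allowed for each document and the condition is relaxed to yi (w · xi + b) ≥ 1 − ξi. SVM then minimizes:

(1/2) ||w||² + C Σ ξi

Here C controls how heavily violations of the margin are penalized.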
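To see what the noisy-data objective measures, the sketch below evaluates it for a fixed, hand-picked w and b. It does not train an SVM. It assumes labels of +1 and −1 and uses the smallest slack that satisfies each relaxed condition, ξi = max(0, 1 − yi (w · xi + b)). The data points and parameter values are made up.

```python
def dot(w, x):
    """Dot product w · x of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))

def soft_margin_objective(w, b, X, y, C):
    """Return ((1/2)||w||^2 + C * sum of slacks, list of slacks)."""
    margin_term = 0.5 * dot(w, w)
    # Slack is 0 when y(w·x + b) >= 1, otherwise the size of the shortfall.
    slacks = [max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y)]
    return margin_term + C * sum(slacks), slacks

# Made-up 2-feature documents; the third one lies exactly on the boundary.
X = [[2.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
y = [1, -1, 1]
w, b = [1.0, -1.0], 0.0

value, slacks = soft_margin_objective(w, b, X, y, C=1.0)
print(slacks)   # [0.0, 0.0, 1.0]
print(value)    # 2.0
```

The first two points satisfy the margin condition and cost nothing. The third point violates it by 1, so it adds C × 1 to the objective.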


Advantages and Limitations

Advantages:

Limitations:


Logistic Regression for Text Classification

Basic Idea

Logistic Regression is a probability-based linear classifier.

It first computes a probability, then converts it into a class.

It is mainly used for binary classification.


Linear Formula

First, a linear value is calculated:

z = w · x + b


Sigmoid Formula

Sigmoid converts z into probability:

P = 1 / (1 + e⁻ᶻ)

So final formula is:

P = 1 / (1 + e^(−(w · x + b)))


Decision Rule

If P ≥ 0.5 → Class = 1
If P < 0.5 → Class = 0
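The linear formula, the sigmoid, and the decision rule fit together as shown in this small sketch. The weights and bias are hypothetical values chosen by hand, and the function names are made up for this example.

```python
import math

def sigmoid(z):
    """Convert a linear value z into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """P = sigmoid(w · x + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

def predict(w, b, x, threshold=0.5):
    """Decision rule: class 1 if P >= threshold, otherwise class 0."""
    return 1 if predict_proba(w, b, x) >= threshold else 0

w, b = [2.0, -1.0], -0.5   # hypothetical weights and bias

print(predict(w, b, [1, 0]))   # 1  (z = 1.5, so P is above 0.5)
print(predict(w, b, [0, 1]))   # 0  (z = -1.5, so P is below 0.5)
```

Since sigmoid(0) = 0.5, the rule P ≥ 0.5 is the same as checking whether z ≥ 0.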


Loss Formula

Logistic Regression uses log loss:

L = −[ y log(p) + (1 − y) log(1 − p) ]
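A quick numeric check shows how log loss behaves. This sketch assumes natural logarithms and a predicted probability p strictly between 0 and 1. The function name is chosen for this example.

```python
import math

def log_loss(y, p):
    """Log loss for one example: y is the true label (0 or 1), p is P(class 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(round(log_loss(1, 0.9), 3))   # 0.105  (confident and correct: small loss)
print(round(log_loss(1, 0.1), 3))   # 2.303  (confident and wrong: large loss)
```

The loss grows quickly as the model becomes confident in the wrong class, which is why training pushes probabilities toward the true labels.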


Advantages and Limitations

Advantages:

Limitations:


Comparison of Three Algorithms

Key Differences

Feature     Naive Bayes    SVM        Logistic Regression
Method      Probability    Margin     Probability
Speed       Very Fast      Medium     Fast
Accuracy    Medium         High       High
Boundary    Simple         Complex    Linear
Output      Class          Class      Probability

Evaluation Metrics for Classification

In classification tasks, we measure how well a model is performing using specific metrics. These metrics help assess the quality of predictions and compare models. (Google for Developers)

Confusion Matrix

Before defining metrics, the confusion matrix is fundamental. It shows counts of correct and incorrect predictions. (GeeksforGeeks)

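The matrix has four components for binary classification:

True Positive (TP): a positive case predicted as positive
True Negative (TN): a negative case predicted as negative
False Positive (FP): a negative case predicted as positive
False Negative (FN): a positive case predicted as negative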
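The four counts (true positives TP, true negatives TN, false positives FP, false negatives FN) can be tallied directly from the true and predicted labels. The sketch below uses made-up labels, with 1 as the positive class. The function name is an assumption for this example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for binary labels."""
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1          # predicted positive, actually positive
        elif p != positive and t != positive:
            tn += 1          # predicted negative, actually negative
        elif p == positive:
            fp += 1          # predicted positive, actually negative
        else:
            fn += 1          # predicted negative, actually positive
    return tp, tn, fp, fn

y_true = [1, 1, 1, 0, 0, 0, 0, 1]   # made-up true labels
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]   # made-up predictions

print(confusion_counts(y_true, y_pred))   # (3, 3, 1, 1)
```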


Accuracy

Accuracy measures overall correctness of the model. It is the proportion of all correct predictions. (Wikipedia)

Formula:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Accuracy is intuitive but can be misleading when classes are imbalanced. (Wikipedia)
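A small made-up example shows why accuracy can mislead on imbalanced data. Here 95 of 100 messages are negative, and a "model" that always predicts negative still scores 95% accuracy while finding none of the positive cases.

```python
# Made-up imbalanced data: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100            # a model that always predicts the negative class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
positives_found = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print(accuracy)          # 0.95
print(positives_found)   # 0
```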


Precision

Precision measures how many of the positive predictions are actually correct. It focuses on positive predictions. (Wikipedia)

Formula:

Precision = TP / (TP + FP)

High precision means the model’s positive predictions are reliable.


Recall

Recall measures how many of the actual positive cases the model correctly identifies. It focuses on true positive cases. (Wikipedia)

Formula:

Recall = TP / (TP + FN)

High recall means the model finds most of the positive cases.


F1 Score

F1 score balances precision and recall into a single metric. It is useful when you want a trade-off between these two metrics, especially in imbalanced datasets. (Google for Developers)

Formula:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

F1 score ranges from 0 to 1. Higher values mean better performance.
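Precision, recall, and F1 can be computed together from the confusion-matrix counts (TP, FP, FN). In this sketch the counts are made up, and returning 0.0 when a denominator is zero is a convention assumed for this example.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up counts: 8 true positives, 2 false positives, 8 false negatives.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)

print(precision)      # 0.8
print(recall)         # 0.5
print(round(f1, 3))   # 0.615
```

F1 (0.615) sits closer to the lower of the two values, so a model cannot get a high F1 by doing well on only one of precision or recall.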


When to Use Which Metric


Multi-Class Classification

For more than two classes, precision and recall can be averaged across classes using macro or micro averaging. (evidentlyai.com)
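The difference between the two averages is easiest to see on a small example. In this sketch, macro averaging computes precision for each class and takes the plain mean, while micro averaging pools the counts of all classes first. The labels are made up, and treating a class with no predictions as precision 0.0 is an assumed convention.

```python
def macro_micro_precision(y_true, y_pred):
    """Return (macro precision, micro precision) for multi-class labels."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class = []
    total_tp = total_fp = 0
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if p == c and t == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if p == c and t != c)
        per_class.append(tp / (tp + fp) if tp + fp else 0.0)
        total_tp += tp
        total_fp += fp
    macro = sum(per_class) / len(per_class)       # mean of per-class precision
    micro = total_tp / (total_tp + total_fp)      # precision on pooled counts
    return macro, micro

# Made-up labels: "sports" is the majority class.
y_true = ["sports"] * 6 + ["politics"] * 2 + ["tech"] * 2
y_pred = ["sports"] * 6 + ["sports", "politics"] + ["sports", "tech"]

macro, micro = macro_micro_precision(y_true, y_pred)
print(round(macro, 3))   # 0.917
print(micro)             # 0.8
```

Macro averaging treats every class equally, so the two small classes with perfect precision pull the result up. Micro averaging weights every prediction equally, so the majority class dominates.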


Summary of Formulas

Accuracy = (TP + TN) / Total
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × (Precision × Recall) / (Precision + Recall)

Introduction to Topic Modeling

What is Topic Modeling

Topic Modeling is a technique used to automatically discover hidden topics in a collection of documents.

In simple words:

It helps a computer answer the question “What is this document mainly about?” without a person reading the document manually.

Example:

From many articles, the model may find topics like:


Why Topic Modeling is Needed

Large amounts of text are generated every day:

Reading all of it manually is not possible.

So topic modeling helps to:


Basic Idea of Topic Modeling

Topic modeling assumes that:

Example:

A document about “Cricket World Cup” may have:

Topic 1: Sports
Topic 2: Team
Topic 3: Tournament

So one document can belong to many topics.


Latent Dirichlet Allocation (LDA)

What is LDA

LDA stands for Latent Dirichlet Allocation.

It is one of the most popular algorithms for topic modeling.

LDA is a probabilistic model that finds topics from text data.

“Latent” means hidden
“Dirichlet” is a probability distribution
“Allocation” means assigning topics

So:

LDA finds hidden topics and assigns them to documents.


Main Idea of LDA

LDA assumes that:

  1. Each document is a mixture of topics

  2. Each topic is a mixture of words

In simple words:

Example:

Document: “AI and Data Science”

Topic 1 (AI): model, learning, neural
Topic 2 (Data): data, analysis, mining

Document = 60% Topic 1 + 40% Topic 2


Probabilistic View of LDA

LDA works using probability.

It tries to find:

So it learns:

P(Topic | Document)
P(Word | Topic)


Generative Process of LDA (How LDA Thinks)

LDA assumes that documents are created like this:

Step 1: Choose a topic distribution for the document
Step 2: Choose a topic from that distribution
Step 3: Choose a word from that topic
Step 4: Repeat Steps 2 and 3 for every word in the document

This is called the generative process.

It means LDA imagines how text was generated.
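The generative steps can be simulated with the standard library. This is only a sketch of the story LDA assumes, not an implementation of LDA training. The two topics and their word probabilities are fixed by hand (in full LDA they are also drawn from a Dirichlet distribution), and the function names and parameter values are assumptions for this example.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one probability vector from a Dirichlet distribution (via gamma draws)."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

# Hypothetical topic-word distributions, fixed by hand for illustration.
topics = {
    "AI":   {"model": 0.4, "learning": 0.4, "neural": 0.2},
    "Data": {"data": 0.5, "analysis": 0.3, "mining": 0.2},
}

def generate_document(n_words, alpha=0.5, seed=0):
    """Generate one document by following the four generative steps."""
    rng = random.Random(seed)
    names = list(topics)
    # Step 1: choose the topic distribution for the document.
    theta = sample_dirichlet([alpha] * len(names), rng)
    words = []
    for _ in range(n_words):                       # Step 4: repeat for all words
        # Step 2: choose a topic from that distribution.
        topic = rng.choices(names, weights=theta, k=1)[0]
        # Step 3: choose a word from that topic.
        vocab = list(topics[topic])
        weights = list(topics[topic].values())
        words.append(rng.choices(vocab, weights=weights, k=1)[0])
    return theta, words

theta, words = generate_document(8)
print(theta)   # topic proportions for this document (they sum to 1)
print(words)   # 8 words drawn from the two topics
```

Training LDA runs this story in reverse: given only the words, it estimates the topic proportions and the topic-word probabilities that most plausibly produced them.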


Important Terms in LDA

Document-Topic Distribution (θ)

Shows how much each topic appears in a document.

Example:

Document 1:


Topic-Word Distribution (φ)

Shows how probable each word is within a topic.

Example:

Topic: Sports

cricket: 0.2
match: 0.15
player: 0.1


Dirichlet Distribution

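Used as the prior that controls how topic and word probabilities are spread.

Two parameters:

α (alpha): controls the document-topic distribution
β (beta): controls the topic-word distribution

High value → more spread (a document mixes many topics, a topic uses many words)
Low value → more focused (a document has few topics, a topic has few main words)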
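The effect of a high or low parameter value can be checked by sampling. The sketch below draws many topic distributions from a symmetric Dirichlet with three topics and reports the average share of the largest topic. The values 0.1 and 10.0, the number of trials, and the function names are assumptions for this example.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one k-topic probability vector from a symmetric Dirichlet(alpha)."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    if total == 0.0:                 # guard against numerical underflow
        return [1.0 / k] * k
    return [d / total for d in draws]

def average_largest_share(alpha, k=3, trials=2000, seed=0):
    """Average, over many draws, of the biggest topic's share."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(trials)) / trials

low = average_largest_share(0.1)     # low alpha: one topic tends to dominate
high = average_largest_share(10.0)   # high alpha: topics share the document

print(low > high)   # True
```

With a low value, the largest topic usually takes most of the document. With a high value, its share stays close to an even split of one third.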


Applications of LDA

LDA is used in:


Advantages of LDA

Advantages:


Limitations of LDA

Limitations:


Comparison: Topic Modeling vs Text Classification

Key Differences

Feature Topic Modeling (LDA) Text Classification
Labels Not required Required
Type Unsupervised Supervised
Output Topics Classes
Purpose Discover themes Predict category