Unit 4: Advanced NLP Techniques and Transformer Models
Recurrent Neural Networks (RNNs)
What is RNN
A Recurrent Neural Network (RNN) is a type of neural network designed to work with sequence data.
Sequence data means data where order matters.
Examples:
- Sentences
- Speech
- Time-series data
RNN remembers previous information while processing new input.
So, it is useful for language tasks.
Main Idea of RNN
RNN processes data step by step.
At each step, it uses:
- Current input
- Previous memory
This helps RNN understand context.
Example:
In the sentence
"I am learning AI"
To understand "AI", the model uses what it remembers of "I am learning".
Working of RNN
RNN has a loop structure: the hidden state produced at one step is fed back in at the next step.
Output depends on:
- Current word
- Previous hidden state
So information flows from past to present.
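The loop described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a trained model: the scalar weights `w_x`, `w_h`, `b` and the numeric codes used for the words are assumptions chosen only to show how the hidden state is carried from step to step.

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One recurrent step: mix the current input with the previous hidden state."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def run_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Process a sequence step by step, carrying the hidden state forward."""
    h = 0.0  # initial memory is empty
    states = []
    for x_t in inputs:
        h = rnn_step(x_t, h, w_x, w_h, b)
        states.append(h)
    return states

# Toy numeric codes standing in for the words "I", "am", "learning", "AI"
states = run_rnn([1.0, 2.0, 3.0, 4.0])
```

Changing the first input changes the last hidden state, which is the sense in which the RNN "remembers" earlier words.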
Advantages and Limitations
Advantages:
- Handles sequential data
- Remembers past information
- Useful for language modeling
Limitations:
- Cannot remember long sentences well
- Suffers from vanishing gradient problem
- Training is slow because the steps must be processed one after another
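A rough numeric sketch of the vanishing gradient problem: during training, the error signal is multiplied by some factor at every step it travels back in time. Assuming, for illustration only, that this factor is a constant 0.5 (in a real RNN it depends on the weights and activations), the signal from distant words shrinks almost to zero.

```python
def gradient_through_time(per_step_factor, steps):
    """Multiply the same per-step factor once for every time step travelled back."""
    grad = 1.0
    for _ in range(steps):
        grad *= per_step_factor
    return grad

short_range = gradient_through_time(0.5, 5)   # signal from 5 steps back
long_range = gradient_through_time(0.5, 50)   # signal from 50 steps back
```

Here `short_range` is still a usable number, while `long_range` is smaller than 1e-14, so words far in the past barely influence learning.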
Long Short-Term Memory (LSTM)
What is LSTM
Long Short-Term Memory (LSTM) is a special type of RNN.
It is designed to solve the memory problem of RNN.
LSTM can remember important information for a long time.
Why LSTM is Needed
Normal RNN forgets old information quickly.
Example:
In a long paragraph, RNN may forget the starting words.
LSTM solves this by using memory cells and gates.
Main Components of LSTM
LSTM has three main gates:
- Forget Gate: decides what information to remove
- Input Gate: decides what new information to store
- Output Gate: decides what information to send out
These gates control memory flow.
Working of LSTM
LSTM keeps a memory cell.
It updates this memory using gates.
Important information is kept.
Unimportant information is removed.
This makes LSTM powerful for long texts.
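The gate mechanism can be sketched with scalar values as follows. This is a simplified illustration of the standard LSTM update, not Keras code: the parameter dictionary `params` and its hand-picked values are assumptions, set so that the forget gate stays close to 1 and the input gate close to 0.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step with scalar weights; p maps gate name -> (w_x, w_h, b)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x_t + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # how much old memory to keep
    i = sigmoid(pre("input"))         # how much new information to store
    o = sigmoid(pre("output"))        # how much memory to send out
    g = math.tanh(pre("candidate"))   # candidate new content
    c = f * c_prev + i * g            # updated memory cell
    h = o * math.tanh(c)              # hidden state passed to the next step
    return h, c

params = {
    "forget": (0.0, 0.0, 10.0),     # large bias: forget gate near 1 (keep memory)
    "input": (0.0, 0.0, -10.0),     # large negative bias: input gate near 0
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}
h, c = 0.0, 1.0  # start with the value 1.0 stored in the memory cell
for x_t in [5.0, -3.0, 2.0]:
    h, c = lstm_step(x_t, h, c, params)
```

With these gate settings the memory cell still holds a value close to 1.0 after three steps, regardless of the inputs: important information is kept.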
Advantages and Limitations
Advantages:
- Remembers long sequences
- Handles long-range dependencies better than a simple RNN
- High accuracy in NLP tasks
Limitations:
- More complex
- Slower than RNN
- Needs more computation
Implementation using Keras and TensorFlow
Keras and TensorFlow
TensorFlow is an open-source deep learning framework.
Keras is a high-level API for building neural networks. It is included in TensorFlow as tf.keras.
They are used to build neural networks easily.
Purpose of Using Keras and TensorFlow
They help to:
- Build models quickly
- Train on large datasets
- Test and deploy models
- Reduce coding complexity
Basic Steps of Implementation
- Import libraries
- Prepare text data
- Convert text to numbers (tokenization)
- Build RNN/LSTM model
- Compile model
- Train model
- Test model
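The "convert text to numbers" step is the one that most often confuses beginners, so here is a minimal sketch of it in plain Python. The helper names `build_vocab` and `encode` are hypothetical and written only for this illustration; Keras provides its own preprocessing utilities for the same job.

```python
def build_vocab(texts):
    """Assign an integer id to every distinct word; 0 is padding, 1 is unknown."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for text in texts:
        for word in text.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def encode(text, vocab, max_len):
    """Convert text to a fixed-length list of ids, truncating or padding with 0."""
    ids = [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]
    ids = ids[:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))

train_texts = ["I love NLP", "I love machine learning"]
vocab = build_vocab(train_texts)
encoded = [encode(t, vocab, max_len=5) for t in train_texts]
```

Every sentence becomes a list of the same length, which is the form an RNN/LSTM model expects as input. Words not seen during vocabulary building map to the unknown id.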
Example Use
Using Keras, we can easily create:
- RNN for text prediction
- LSTM for sentiment analysis
- Language translation models
Introduction to Transformers (BERT, GPT)
What is Transformer
A Transformer is a modern deep learning model used in NLP.
It does not use RNN or LSTM.
Instead, it uses an attention mechanism.
Attention helps the model focus on important words.
Main Idea of Transformers
Transformers process all words at the same time.
They find relationships between words using attention.
So they are:
- Faster to train, because words are processed in parallel
- Often more accurate
- Better at understanding context
Attention Mechanism
Attention tells the model:
“Which words are important for this word?”
Example:
"I went to the bank to deposit money"
Attention links "bank" to "deposit" and "money", so the model understands that "bank" means a financial bank.
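A small sketch of how attention weights are computed, using scaled dot-product attention for a single word. The 2-dimensional vectors below are hand-made assumptions for illustration (a real model learns them), with the second dimension loosely standing for "finance".

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)  # how much each word matters for the query word
    output = [sum(w * v[j] for w, v in zip(weights, values))
              for j in range(len(values[0]))]
    return weights, output

# Hypothetical vectors for three context words
words = ["went", "deposit", "money"]
keys = [[1.0, 0.0], [0.0, 2.0], [0.0, 2.0]]
values = keys
query_bank = [0.0, 3.0]  # query vector for "bank"
weights, context = attend(query_bank, keys, values)
```

The weights sum to 1, and "deposit" and "money" receive almost all of the weight, so the context vector for "bank" is dominated by the finance-related words.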
BERT and GPT
BERT:
- Reads text from both sides (left and right)
- Good for understanding meaning
GPT:
- Reads text from left to right
- Good for generating text
Both are based on transformers.
Advantages of Transformers
- Handle long text easily
- High accuracy
- Parallel processing
- Better context understanding
Pre-trained Models and Fine-Tuning
Pre-trained Models
Pre-trained models are models trained on huge datasets.
They already know:
- Grammar
- Vocabulary
- Language patterns
Examples:
- BERT
- GPT
- RoBERTa
Why Pre-trained Models are Used
Training from scratch needs:
- Huge data
- High cost
- Long time
Pre-trained models save time and money.
Fine-Tuning
Fine-tuning means:
Taking a pre-trained model and training it further on your own, usually much smaller, task-specific data.
The pre-trained weights change only slightly.
So the model learns your specific task.
Process of Fine-Tuning
- Load pre-trained model
- Add task-specific layer
- Train on new dataset
- Adjust weights slightly
- Test performance
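The process above can be imitated on a toy scale. In this sketch the `PRETRAINED` word vectors are invented stand-ins for a real pre-trained model, and only the newly added layer (`w`, `b`) is trained while the vectors stay frozen. That is a simplification: full fine-tuning also adjusts the pre-trained weights slightly.

```python
import math

# Hypothetical "pre-trained" word vectors; kept frozen in this sketch.
PRETRAINED = {
    "good": [1.0, 0.2], "great": [0.9, 0.1],
    "bad": [-1.0, 0.3], "awful": [-0.8, 0.2],
    "movie": [0.0, 1.0],
}

def features(text):
    """Average the frozen vectors of the known words in the text."""
    vecs = [PRETRAINED[w] for w in text.lower().split() if w in PRETRAINED]
    return [sum(v[j] for v in vecs) / len(vecs) for j in range(2)]

def predict(text, w, b):
    """Task-specific layer: logistic regression on top of the frozen features."""
    z = sum(wi * xi for wi, xi in zip(w, features(text))) + b
    return 1.0 / (1.0 + math.exp(-z))

def fine_tune(data, epochs=200, lr=0.5):
    """Train only the new layer (w, b) on the small labelled dataset."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for text, label in data:
            x = features(text)
            error = predict(text, w, b) - label  # gradient of the log loss
            w = [wi - lr * error * xi for wi, xi in zip(w, x)]
            b -= lr * error
    return w, b

train = [("good movie", 1), ("bad movie", 0)]  # 1 = positive, 0 = negative
w, b = fine_tune(train)
```

Although only two labelled examples were used, `predict("great movie", w, b)` comes out above 0.5 and `predict("awful movie", w, b)` below 0.5, because the pre-trained vectors already place similar words close together.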
Advantages of Fine-Tuning
- Requires less data than training from scratch
- Faster training
- Better accuracy
- Easy implementation
Comparison of RNN, LSTM, and Transformers
Key Differences
| Feature | RNN | LSTM | Transformer |
|---|---|---|---|
| Memory | Short | Long | Very long (within its context window) |
| Speed | Slow | Slower | Fast (parallel training) |
| Structure | Sequential | Sequential | Parallel |
| Accuracy | Medium | High | Very High |
| Used Today | Rare | Limited | Very Common |
One-Line Summary for Exam
RNN processes sequential data using memory, LSTM improves RNN by storing long-term information, transformers use attention for better context understanding, and pre-trained models with fine-tuning provide high accuracy with less data.
Memory Shortcut
RNN → Basic memory
LSTM → Strong memory
Transformer → Attention power
Pre-trained → Ready model
Fine-tuning → Customize model
BERT and GPT in Detail (Transformer-Based Models)
Both BERT and GPT are advanced NLP models based on the Transformer architecture.
They are often called Large Language Models because they have very many parameters and are trained on very large text data.
They understand language using the attention mechanism instead of RNN or LSTM.
BERT (Bidirectional Encoder Representations from Transformers)
What is BERT
BERT is a Transformer-based model designed mainly for understanding text.
Full form:
Bidirectional Encoder Representations from Transformers
Meaning:
- Bidirectional → reads text from both sides
- Encoder → uses only the encoder part of the Transformer
- Representations → creates meaning-based vectors
So, BERT understands words using left and right context together.
Key Idea of BERT (Bidirectional Reading)
Traditional models read text in one direction.
Example sentence:
"I went to the bank to deposit money"
Unidirectional model:
When it reaches "bank", it has seen only the words on the left ("I went to the").
BERT:
Uses the words on both sides of "bank".
So it knows:
"bank" is related to "deposit" and "money"
This gives better understanding.
Architecture of BERT
BERT uses:
- Transformer Encoder blocks
- Multi-head attention
- Feed-forward layers
Structure:
Input → Embedding → Encoder Layers → Output
It does not use a decoder.
So BERT is mainly for analysis and understanding, not generation.
Input Format of BERT
Before sending text to BERT, it is converted into a special format.
Example:
[CLS] I love machine learning [SEP]
Where:
[CLS] → Classification token
[SEP] → Separator token
BERT uses the final vector of [CLS] for classification tasks and [SEP] to mark where a sentence ends.
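A minimal sketch of this input format. The function name `format_bert_input` is hypothetical, and splitting on spaces is a simplification: the real BERT tokenizer breaks text into subword pieces. The segment ids (0 for the first sentence, 1 for the second) show how a sentence pair is marked.

```python
def format_bert_input(sentence_a, sentence_b=None):
    """Wrap whitespace-split tokens with [CLS]/[SEP] and build segment ids."""
    tokens = ["[CLS]"] + sentence_a.split() + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if sentence_b is not None:
        tokens_b = sentence_b.split() + ["[SEP]"]
        tokens += tokens_b
        segment_ids += [1] * len(tokens_b)
    return tokens, segment_ids

tokens, segments = format_bert_input("I love machine learning")
pair_tokens, pair_segments = format_bert_input(
    "I am studying NLP", "It is very interesting"
)
```

For a single sentence, `tokens` is `[CLS] I love machine learning [SEP]`. For a pair, each sentence ends with its own `[SEP]`.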
Pre-Training of BERT
BERT is trained using two main tasks.
1. Masked Language Model (MLM)
Some words (about 15% of the tokens) are hidden.
Example:
"I love [MASK] learning"
Model predicts missing word.
Output:
"machine"
Because the missing word must be guessed from the words on both sides, this helps BERT learn deep meaning.
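How the masked training examples are prepared can be sketched as below. The function `mask_tokens` is a hypothetical helper, and it always substitutes `[MASK]`, which is a simplification: the original BERT recipe sometimes keeps the chosen word or swaps in a random one instead.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens and remember the originals as targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            targets[position] = token  # the model must predict this word
        else:
            masked.append(token)
    return masked, targets

tokens = "I love machine learning and natural language processing".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3, seed=1)
```

`masked` is what the model sees, and `targets` maps each hidden position to the word the model should predict there.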
2. Next Sentence Prediction (NSP)
Two sentences are given.
Model predicts:
Does Sentence B really follow Sentence A in the original text?
Example:
Sentence A: I am studying NLP
Sentence B: It is very interesting
B follows A → Yes
This was designed to help with sentence-pair tasks such as question answering.
Working Principle of BERT
- Input sentence is tokenized
- Converted into embeddings
- Passed through encoder layers
- Attention connects related words
- Final vectors represent meaning
Each word gets a context-aware vector.
Applications of BERT
BERT is mainly used for:
- Question Answering
- Text Classification
- Named Entity Recognition
- Sentiment Analysis
- Document Search
- Text Similarity
It is best for tasks where understanding is important.
Advantages of BERT
- Deep understanding of context
- Reads both sides
- High accuracy
- Works well for analysis tasks
Limitations of BERT
- Cannot generate long text well
- Large size
- High memory usage
- Less suited to generation than GPT, since it is not trained to predict the next word
GPT (Generative Pre-trained Transformer)
What is GPT
GPT is a Transformer-based model designed mainly for text generation.
Full form:
Generative Pre-trained Transformer
Meaning:
- Generative → creates text
- Pre-trained → trained on huge data
- Transformer → uses transformer architecture
So GPT is mainly used for writing and generating language.
Key Idea of GPT (Unidirectional Reading)
GPT reads text from left to right only.
Example:
"I am learning artificial intelligence"
GPT predicts:
"I" → "am" → "learning" → "artificial" → "intelligence"
One word at a time.
This is called autoregressive modeling.
Architecture of GPT
GPT uses:
- Transformer Decoder blocks
- Masked self-attention layers (each word sees only the words before it)
- Feed-forward layers
Structure:
Input → Embedding → Decoder Layers → Output
It does not use an encoder.
So GPT focuses on generation.
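The left-to-right restriction in the decoder is usually enforced with a causal mask. A minimal sketch follows, with hypothetical helper names: position i may look at position j only when j is not later than i.

```python
def causal_mask(n):
    """mask[i][j] is True when position i is allowed to attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]

def visible_context(tokens):
    """List, for each position, the tokens it may look at (itself and earlier ones)."""
    mask = causal_mask(len(tokens))
    return [[tok for tok, ok in zip(tokens, row) if ok] for row in mask]

tokens = ["I", "am", "learning"]
context = visible_context(tokens)
```

Here `context` is `[["I"], ["I", "am"], ["I", "am", "learning"]]`: no word can see the words that come after it. BERT's encoder uses no such mask, which is why it can read both sides.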
Working Principle of GPT
GPT learns:
Given previous words, predict next word.
Example:
Input: "India is a"
Output: "country"
Then:
"India is a country"
Next prediction: "in"
This continues.
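The predict-append-repeat loop can be sketched as follows. The `NEXT_WORD` table is an invented stand-in for a trained model, and it looks only at the last word, which is a strong simplification: a real GPT conditions on all previous words. The loop itself (pick the most likely next word, append it, repeat) is greedy autoregressive decoding.

```python
# Hypothetical next-word probabilities standing in for a trained model.
NEXT_WORD = {
    "a": {"country": 0.7, "city": 0.3},
    "country": {"in": 0.6, "with": 0.4},
    "in": {"Asia": 0.8, "Europe": 0.2},
}

def generate(prompt, max_new_words=3):
    """Greedy autoregressive decoding: append the most likely next word, repeat."""
    words = prompt.split()
    for _ in range(max_new_words):
        candidates = NEXT_WORD.get(words[-1])
        if not candidates:
            break  # no known continuation for this word
        words.append(max(candidates, key=candidates.get))
    return " ".join(words)

text = generate("India is a")
```

With this table, `text` becomes "India is a country in Asia", built one word at a time.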
Pre-Training of GPT
GPT is trained using:
Language Modeling Task
Formula:
Predict next word:
P(wₙ | w₁, w₂, ..., wₙ₋₁)
Meaning:
Probability of next word depends on previous words.
It reads billions of sentences and learns patterns.
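For reference, the next-word probabilities combine into the probability of a whole sentence through the chain rule of probability. Written out in the same notation as the formula above:

```latex
P(w_1, w_2, \dots, w_n) = \prod_{t=1}^{n} P(w_t \mid w_1, \dots, w_{t-1})
```

So a model that predicts every next word well also assigns high probability to whole sentences that look like its training text.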
Training Method of GPT
Step 1: Read huge text data
Step 2: Learn grammar and structure
Step 3: Learn writing style
Step 4: Learn reasoning patterns
These are not separate training stages; they all come from the single next-word prediction task. This makes GPT a general-purpose model.
Fine-Tuning of GPT
After pre-training, GPT is fine-tuned for:
- Chatbots
- Coding
- Essay writing
- Question answering
- Translation
Fine-tuning adapts GPT to specific tasks.
Applications of GPT
GPT is used for:
- Chat systems
- Content writing
- Story generation
- Code generation
- Summarization
- Virtual assistants
It is best for creative and interactive tasks.
Advantages of GPT
- Generates human-like text
- Good at conversation
- Flexible usage
- Works for many tasks
Limitations of GPT
- Can generate wrong information
- May be biased
- Needs large resources
- Not always reliable
Comparison of BERT and GPT
Key Differences
| Feature | BERT | GPT |
|---|---|---|
| Direction | Both sides | Left to right |
| Architecture | Encoder only | Decoder only |
| Main Use | Understanding | Generation |
| Best For | Analysis tasks | Writing tasks |
| Output | Labels, answers | Text |
BERT vs GPT in Simple Words
BERT is good at:
Reading and understanding
GPT is good at:
Writing and generating
BERT = Reader
GPT = Writer
Role in Modern NLP
Today, most advanced NLP systems use:
- BERT-type models for analysis
- GPT-type models for generation
Many hybrid models combine both ideas by pairing an encoder with a decoder.
Example:
T5, BART
(PaLM, in contrast, is a decoder-only model like GPT.)
One-Line Summary for Exam
BERT is a bidirectional transformer model used for understanding text, while GPT is a unidirectional transformer model used for generating human-like language.