GenAI Learning Series | Week 2

Generative AI with Large Language Models — Week 2 Study Notes

Course: AWS & DeepLearning.AI (Coursera) | Focus: Fine-tuning, Evaluation, PEFT and LLM Optimisation

June 2, 2026

Learning in Public — Study notes from a 24-week Generative AI course, written from an infrastructure engineer's perspective.


1. The GenAI Project Lifecycle

The lifecycle maps out every stage from idea to production deployment.

1. Scope           → Define what the LLM needs to do (narrow = cheaper)
2. Select Model    → Existing foundation model or train from scratch
3. Adapt & Align   → Prompt engineering → Fine-tuning → RLHF (iterative)
4. Evaluate        → Metrics, benchmarks, human evaluation
5. Deploy & Optimise → Quantisation, distillation, pruning
6. Augment         → RAG, agents, tool use (overcome LLM limitations)

Key insight: Pre-training is the most complex and expensive stage — most teams skip it entirely and start with an existing foundation model. Prompt engineering is tried first (no training required), then fine-tuning if needed.

Stage	Complexity	Typical Duration
Pre-training	Very High	Months
Prompt Engineering	Low	Days
Fine-tuning	Medium	Days to weeks
RLHF	Medium-High	Weeks (depends on human feedback availability)
Optimisation	Medium	Days

2. In-Context Learning (ICL)

Teaching the model via examples in the prompt — no weight updates.

Type	Description	Example
Zero-shot	No examples — just instruction	"Classify sentiment: [text]"
One-shot	One example provided	One labelled example + new input
Few-shot	Multiple examples (2–5)	Multiple labelled examples + new input

Limitation: Large models handle zero/few-shot well. Smaller models may still need fine-tuning even with good examples. ICL uses up context window space.
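To make the three types concrete, here is a minimal sketch of how such prompts might be assembled. The `build_prompt` helper, the "Review/Sentiment" template and the example reviews are my own illustrations, not something from the course.

```python
def build_prompt(instruction, examples, new_input):
    """Assemble a zero-, one-, or few-shot prompt from labelled examples."""
    parts = [instruction]
    for text, label in examples:
        parts.append(f"Review: {text}\nSentiment: {label}")
    # The final block leaves the label empty for the model to complete.
    parts.append(f"Review: {new_input}\nSentiment:")
    return "\n\n".join(parts)


# Hypothetical labelled examples.
examples = [
    ("I loved this movie.", "Positive"),
    ("The plot made no sense.", "Negative"),
]

zero_shot = build_prompt("Classify the sentiment.", [], "Great soundtrack.")
one_shot = build_prompt("Classify the sentiment.", examples[:1], "Great soundtrack.")
few_shot = build_prompt("Classify the sentiment.", examples, "Great soundtrack.")

print(few_shot)
```

Each added example makes the prompt longer, which is exactly the context window cost mentioned in the limitation above.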

Context Window

The context window is the maximum amount of text, measured in tokens, that the model can "see" at once: both your prompt and the model's response have to fit inside it. I kept seeing this term without having a clean definition for it. Its size is set by the model's architecture and training setup rather than by parameter count alone, so it varies from model to model. When you add examples for few-shot learning, those examples eat into that space. So there's a real trade-off between giving the model more examples and leaving room for the actual input and output.
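The trade-off is just arithmetic, so here is a rough sketch of it. All the token counts below are made-up numbers for illustration, and `tokens_left_for_output` is a hypothetical helper.

```python
def tokens_left_for_output(context_window, instruction_tokens,
                           tokens_per_example, n_examples, input_tokens):
    """Return how many tokens remain for the model's response."""
    used = instruction_tokens + n_examples * tokens_per_example + input_tokens
    return context_window - used


# Hypothetical budget: a 512-token window, 20-token instruction,
# 60 tokens per few-shot example, 150-token input.
for n in (0, 1, 5):
    left = tokens_left_for_output(512, 20, 60, n, 150)
    print(f"{n} examples -> {left} tokens left for the response")
```

With these assumed numbers, five examples leave only a few dozen tokens for the answer, which is the point where fine-tuning starts to look more attractive than stuffing the prompt.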

3. Prompt Engineering vs Prompt Tuning

	Prompt Engineering	Prompt Tuning
What changes	The language/wording of the prompt	Trainable embedding vectors (soft prompts)
Training required	No	Yes — supervised learning
Human-readable	Yes	No — vectors, not words
Cost	Low	Medium
Performance	Good for large models	Approaches fine-tuning for large models

Hard Prompts vs Soft Prompts

Hard prompts: human-written text, interpretable, flexible. Used in prompt engineering.
Soft prompts: learned embedding vectors prepended to the input. Each vector has the same dimension as a token embedding. Task-specific. Not interpretable. Trained with supervised learning while the model weights stay frozen.
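A toy sketch of what "prepended to the input" means in practice. The dimensions, the random vectors and the variable names are all assumptions for illustration; a real setup would learn the soft prompt vectors by gradient descent inside a deep learning framework.

```python
import random

EMBED_DIM = 4         # hypothetical; real models use hundreds or thousands
NUM_SOFT_TOKENS = 3   # hypothetical number of "virtual tokens"

random.seed(0)


def random_vector(dim):
    return [random.uniform(-1, 1) for _ in range(dim)]


# Stand-in for the embeddings of five real input tokens (model weights frozen).
token_embeddings = [random_vector(EMBED_DIM) for _ in range(5)]

# The only trainable part: soft prompt vectors with the same dimension
# as a token embedding, so they can sit in the same sequence.
soft_prompt = [random_vector(EMBED_DIM) for _ in range(NUM_SOFT_TOKENS)]

# The model receives the soft prompt first, then the real tokens.
model_input = soft_prompt + token_embeddings
print(len(model_input), "vectors of dimension", len(model_input[0]))
```

Because the vectors are not tied to any vocabulary entry, there is no word you could print for them, which is why soft prompts are not human-readable.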

Prompt Optimization

A term I came across that I want to make sure I don't confuse with prompt engineering. Prompt engineering is about manually choosing better words. Prompt optimization is more systematic — it's about addressing the inconsistency and unreliability of outputs. The question it tries to answer: how do you make the model give you a good answer reliably, not just occasionally? That might involve structuring prompts differently, adding constraints, or using techniques like chain-of-thought. I'll dig into this more in later weeks.

4. Fine-Tuning

Supervised learning that updates model weights using labelled examples. Uses a much smaller dataset than pre-training (hundreds to thousands of examples vs billions of tokens).

Pre-training is self-supervised learning, something I hadn't connected before. The model learns by predicting the next token from unlabelled text. No human labels required. That's what makes it possible to train on internet-scale text. Fine-tuning after that is supervised: you need labelled prompt-completion pairs.

4a. Instruction Fine-Tuning

Trains the model to follow instructions rather than just predict next tokens. Uses prompt-completion pairs. Improves generalisation across tasks. Dataset is instruction-formatted examples — much smaller than pre-training data.

4b. Single-Task Fine-Tuning

Fine-tune on only one task (e.g. sentiment analysis, summarisation). Can achieve good results with just 500–1,000 examples.

Catastrophic Forgetting

The model forgets how to do other tasks because full fine-tuning modifies all weights. Example: after fine-tuning for sentiment analysis, the model may lose its ability to do named entity recognition, a task it could handle before fine-tuning.

How to avoid it:

Multi-task fine-tuning — train on 50,000–100,000 examples across many tasks simultaneously
PEFT — only update a small subset of weights, preserving original model capabilities

4c. Multi-Task Instruction Fine-Tuning

Fine-tune on many tasks at once. Requires significantly more data (50k–100k examples). Avoids catastrophic forgetting. FLAN models (e.g. FLAN-T5) use this approach.

Cross Entropy Loss

This came up when the course explained how the model actually learns during training. I'm still not fully comfortable with the maths here, but my understanding is: the model outputs a probability distribution across all tokens in its vocabulary. Cross entropy loss measures how wrong that distribution is compared to the actual next token. The model is trying to minimise this loss across all the training examples, and that's what drives the weight updates. Cross entropy fits here because the loss is simply the negative log of the probability the model assigned to the correct token: the less probability it gave the right answer, the bigger the penalty.
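To check my understanding I wrote a tiny sketch for a single next-token prediction. The three-word vocabulary and the logits are invented for illustration.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log probability assigned to the correct next token."""
    probs = softmax(logits)
    return -math.log(probs[target_index])


# Hypothetical vocabulary and model scores for "the cat sat on the ___".
vocab = ["mat", "dog", "moon"]
logits = [2.0, 0.5, -1.0]

loss_if_mat = cross_entropy(logits, vocab.index("mat"))
loss_if_moon = cross_entropy(logits, vocab.index("moon"))
print(f"target 'mat':  loss = {loss_if_mat:.3f}")
print(f"target 'moon': loss = {loss_if_moon:.3f}")
```

The loss is small when the true token is the one the model already favoured and large when the true token was given little probability.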

5. PEFT — Parameter-Efficient Fine-Tuning

Full fine-tuning updates every parameter. PEFT updates only a small subset: in some cases 15–20% of the parameters, and with methods like LoRA often far fewer. Most or all of the original weights are frozen.

Dramatically lower memory requirements (often fits on a single GPU)
Less prone to catastrophic forgetting
Small trained weights (~MBs) can be swapped in/out per task — no need to store full model copies
Multiple task adapters can share one base model

Compute budget matters too, not just memory. The course asked: "besides memory, what key resource must be sufficient for full fine-tuning?" The answer is compute. Even if you can technically fit the model in memory, full fine-tuning requires enough GPU compute to run all the forward and backward passes. PEFT helps on both fronts — less memory AND less compute because fewer parameters need gradients.

Category	How It Works	Example / Notes
Selective	Fine-tune a subset of existing parameters (specific layers)	Mixed results — trade-off between parameter and compute efficiency
Reparameterisation	Reduce trainable parameters via low-rank transformations	LoRA
Additive	Keep original weights frozen, add new trainable components	Adapters, Soft Prompts (Prompt Tuning)

6. LoRA — Low-Rank Adaptation

The most widely used PEFT technique. Falls under reparameterisation.

How It Works

Freeze all original model weights
Inject two small rank-decomposition matrices (A and B) alongside each weight matrix
Train only the small matrices A and B
At inference: multiply B × A (the product has the same shape as W), add the result to the frozen original weights, and replace them in the model

Original weight matrix W: dimensions 512 × 64 = 32,768 parameters

LoRA with rank = 8:
  Matrix A: 8 × 64   = 512 parameters
  Matrix B: 512 × 8  = 4,096 parameters
  Total LoRA params:   4,608 parameters

Reduction: 86% fewer trainable parameters
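The arithmetic above can be reproduced with a few lines of Python. The helper names are mine, and the tiny 4 × 3 merge at the end is only a shape check with made-up values, not a real adapter.

```python
def lora_param_counts(d, k, r):
    """Compare trainable parameters for a d x k weight matrix at LoRA rank r."""
    full = d * k              # original weight matrix W
    lora = r * k + d * r      # A is r x k, B is d x r
    return full, lora, 1 - lora / full


def matmul(X, Y):
    """Plain-Python matrix multiply for small lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


full, lora, reduction = lora_param_counts(d=512, k=64, r=8)
print(full, lora, f"{reduction:.0%}")

# Shape check on a toy example: B (d x r) times A (r x k) gives d x k,
# so the update can be added straight onto W.
d, k, r = 4, 3, 2
W = [[0.0] * k for _ in range(d)]
B = [[1.0] * r for _ in range(d)]
A = [[0.5] * k for _ in range(r)]
delta = matmul(B, A)
W_merged = [[w + dw for w, dw in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]
print(len(W_merged), "x", len(W_merged[0]))
```

Note the order of the product: with A as 8 × 64 and B as 512 × 8, only B × A yields a 512 × 64 matrix that matches W.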

Key Properties

No inference latency increase — same number of parameters as original after merging
Multiple task adapters — train different A/B pairs per task, swap at inference time
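Typically applied to the self-attention weight matrices, which the course singles out as giving the biggest savings. It can also be applied to other layers such as the feed-forward blocks.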
Rank range 4–32 — good trade-off between reduction and performance. Ranks above 16 show diminishing returns.

7. Model Evaluation Metrics

The challenge: LLM output is non-deterministic and language-based. You cannot use simple accuracy metrics. Two sentences with only one word difference can have opposite meanings.

ROUGE — For Summarisation

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) compares generated summaries to human reference summaries.

Terminology: Unigram = 1 word | Bigram = 2 consecutive words | N-gram = N consecutive words | LCS = Longest Common Subsequence

ROUGE-1 (Unigram)

Generated:  "the cat sat on the big red mat"    (8 words)
Reference:  "the cat sat on the mat"            (6 words)
Matches:     the, cat, sat, on, the, mat        (6 matches — "big" and "red" not in reference)

Recall    = matches / reference words = 6/6 = 1.000
Precision = matches / generated words = 6/8 = 0.750
F1        = 2 × (1.000 × 0.750) / (1.000 + 0.750) = 0.857

ROUGE-2 (Bigram)

Generated bigrams:  the cat, cat sat, sat on, on the, the big, big red, red mat
Reference bigrams:  the cat, cat sat, sat on, on the, the mat
Matching bigrams:   the cat, cat sat, sat on, on the  (4 matches)

Recall    = 4/5 = 0.800
Precision = 4/7 = 0.571
F1        = 2 × (0.800 × 0.571) / (0.800 + 0.571) = 0.667

ROUGE-L (Longest Common Subsequence)

LCS = "the cat sat on the mat" (length = 6)

Recall    = LCS length / reference length = 6/6 = 1.000
Precision = LCS length / generated length = 6/8 = 0.750
F1 (ROUGE-L) = 0.857
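To double-check the three worked examples, here is a small from-scratch sketch. It assumes whitespace tokenisation and a single reference; the function names are my own, and real evaluations would normally use an established ROUGE library instead.

```python
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def recall_precision_f1(matches, ref_total, gen_total):
    recall = matches / ref_total
    precision = matches / gen_total
    f1 = 0.0 if matches == 0 else 2 * recall * precision / (recall + precision)
    return recall, precision, f1


def rouge_n(generated, reference, n):
    gen = Counter(ngrams(generated.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    # Counter intersection keeps the smaller count per n-gram.
    matches = sum((gen & ref).values())
    return recall_precision_f1(matches, sum(ref.values()), sum(gen.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


def rouge_l(generated, reference):
    gen, ref = generated.split(), reference.split()
    return recall_precision_f1(lcs_length(gen, ref), len(ref), len(gen))


generated = "the cat sat on the big red mat"
reference = "the cat sat on the mat"

for name, scores in [
    ("ROUGE-1", rouge_n(generated, reference, 1)),
    ("ROUGE-2", rouge_n(generated, reference, 2)),
    ("ROUGE-L", rouge_l(generated, reference)),
]:
    print(name, [round(s, 3) for s in scores])
```

Rounded to three decimals, the recall, precision and F1 values agree with the hand calculations above.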

ROUGE Pitfall — Repetition

Repetition like "cold cold cold cold" can score highly. Fix: apply a clipping function capping unigram matches at their count in the reference. Also note: ROUGE scores are only comparable across the same task — don't compare summarisation ROUGE to translation ROUGE.
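A quick sketch of the clipping fix, assuming whitespace tokenisation. The reference sentence "it is cold outside" and the `unigram_precision` helper are my own example, chosen to make the effect obvious.

```python
from collections import Counter


def unigram_precision(generated, reference, clip):
    gen = Counter(generated.split())
    ref = Counter(reference.split())
    if clip:
        # Each word counts at most as often as it appears in the reference.
        matches = sum(min(count, ref[word]) for word, count in gen.items())
    else:
        # Naive version: every generated word found in the reference counts.
        matches = sum(count for word, count in gen.items() if word in ref)
    return matches / sum(gen.values())


reference = "it is cold outside"
generated = "cold cold cold cold"

unclipped = unigram_precision(generated, reference, clip=False)
clipped = unigram_precision(generated, reference, clip=True)
print(f"unclipped precision = {unclipped:.2f}, clipped precision = {clipped:.2f}")
```

Without clipping the degenerate output gets perfect precision; with clipping only one of the four "cold" tokens is credited.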

BLEU — For Translation

BLEU (Bilingual Evaluation Understudy) evaluates machine translation quality. Key difference from ROUGE: BLEU is built on n-gram precision. It has no recall term and instead uses a brevity penalty to stop very short outputs from scoring well.

Generated:  "the cat sat on the big red mat"    (8 words)
Reference:  "the cat sat on the mat"            (6 words)

Unigram precision = 6/8 = 0.750
Bigram precision  = 4/7 = 0.571  (the cat, cat sat, sat on, on the — 4 of 7 bigrams match)
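BLEU = brevity penalty × geometric mean of the unigram, bigram, trigram and 4-gram precisions (the penalty is 1 here because the generated text is longer than the reference)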
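For completeness, a minimal sentence-level sketch of the standard BLEU definition: clipped n-gram precisions for n = 1 to 4, combined with a geometric mean and multiplied by a brevity penalty. It assumes a single reference, whitespace tokens and no smoothing, so it is an illustration rather than a substitute for a proper BLEU implementation.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(gen, ref, n):
    gen_counts, ref_counts = ngram_counts(gen, n), ngram_counts(ref, n)
    clipped = sum((gen_counts & ref_counts).values())
    return clipped / sum(gen_counts.values())


def bleu(generated, reference, max_n=4):
    gen, ref = generated.split(), reference.split()
    precisions = [modified_precision(gen, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # no smoothing in this sketch
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Penalise outputs shorter than the reference; longer ones get 1.0.
    if len(gen) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(gen))
    return brevity_penalty * geo_mean


generated = "the cat sat on the big red mat"
reference = "the cat sat on the mat"
score = bleu(generated, reference)
print(f"BLEU = {score:.3f}")
```

For this pair the four precisions are 6/8, 4/7, 3/6 and 2/5, which gives a score of roughly 0.54.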

Metric	Measures	Use For
ROUGE	Recall — how much of the reference is captured	Summarisation
BLEU	Precision — how much of the generated text appears in the reference	Translation

Perplexity

This was listed in the course evaluation metrics alongside BLEU and ROUGE, and I want to be upfront that I haven't fully internalised it yet. My rough understanding: perplexity measures how "surprised" the model is by a piece of text. A low perplexity score means the model found the text predictable — so it's a measure of how well the model's probability distribution matches real language. Lower is generally better. It's used more for evaluating the base language model quality rather than specific task performance like summarisation or translation. I'll come back to this one.
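Writing it out helped me: perplexity is the exponential of the average negative log probability the model assigned to each actual token. A small sketch, with made-up per-token probabilities:

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log probability of the observed tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# Hypothetical probabilities a model gave to each true next token.
confident = perplexity([0.9, 0.8, 0.95])
unsure = perplexity([0.2, 0.1, 0.25])
uniform_over_four = perplexity([0.25, 0.25, 0.25, 0.25])

print(f"confident model: {confident:.2f}")
print(f"unsure model:    {unsure:.2f}")
print(f"uniform over 4:  {uniform_over_four:.2f}")
```

A handy intuition: a model that is always choosing uniformly among four options has a perplexity of 4, so the number reads like an "effective number of choices" per token.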

8. Evaluation Benchmarks

For overall model evaluation, use standardised benchmarks rather than task-specific metrics alone:

Benchmark	Full Name	What It Tests
GLUE	General Language Understanding Evaluation	Range of NLU tasks
SuperGLUE	Super GLUE	Harder NLU tasks, introduced after models exceeded the human baseline on GLUE
MMLU	Massive Multitask Language Understanding	57 subjects across STEM, humanities, social sciences
HELM	Holistic Evaluation of Language Models	7 metrics: accuracy, calibration, robustness, fairness, bias, toxicity, efficiency
SQuAD	Stanford Question Answering Dataset	Reading comprehension + Q&A

9. Transformer Architecture Quick Reference

Type	Also Known As	Architecture	Best For	Examples
Encoder only	Autoencoding	Bidirectional — sees full sequence	Classification, NER, sentiment	BERT, RoBERTa, DistilBERT
Encoder-Decoder	Seq2seq	Encoder processes input, decoder generates output	Translation, summarisation, Q&A	T5, BART
Decoder only	Autoregressive	Unidirectional — predicts next token	Text generation, chatbots	GPT family, LLaMA

Why Transformers Replaced RNNs and CNNs

Before transformers, sequence tasks were handled by RNNs (Recurrent Neural Networks) and CNNs (Convolutional Neural Networks). The course touched on why transformers won, and the answer that clicked for me was: RNNs process tokens one at a time, sequentially. That means they struggle with long-range dependencies — by the time the model gets to the end of a long sentence, information about the beginning has faded. CNNs were faster but had limited receptive fields. Transformers introduced the self-attention mechanism, which lets every token attend to every other token in the sequence simultaneously. That's the key shift — parallel, not sequential. Which also makes them much faster to train on modern GPU hardware.

Activation Functions — Softmax, Sigmoid, Tanh, ReLU

These came up in the architecture discussion. I'll be honest — I'm going to need to revisit this properly. My current rough understanding:

Softmax — converts raw scores into a probability distribution (all values sum to 1). Used in the output layer of the transformer to produce token probabilities.
Sigmoid — squashes values to 0–1. Useful for binary classification.
Tanh — squashes values to -1 to 1. Similar to sigmoid but centred at zero.
ReLU (Rectified Linear Unit) — outputs 0 for negative values, the value itself for positive. Very common in feed-forward layers inside transformers.
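Until I revisit this properly, here are the four functions as plain Python, which at least pins down the definitions. The sample inputs are arbitrary.

```python
import math


def softmax(xs):
    """Map a list of scores to probabilities that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Squash a value into the range (0, 1)."""
    return 1 / (1 + math.exp(-x))


def tanh(x):
    """Squash a value into the range (-1, 1), centred at zero."""
    return math.tanh(x)


def relu(x):
    """Zero for negative inputs, identity for positive inputs."""
    return max(0.0, x)


print([round(p, 3) for p in softmax([2.0, 1.0, 0.1])])
print(sigmoid(0.0), tanh(0.0), relu(-2.0), relu(3.0))
```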

Interesting stat from the course: The new Moore's Law for LLMs — parameter count roughly doubles every 9.8 months. This was cited to show how fast the field is moving. The original Moore's Law was about transistor count on chips. The LLM equivalent tracks model scale. Worth keeping in mind when reading papers with dates on them — a 2022 "large" model may be medium-sized by 2024 standards.

Key Concepts

Tokenisation — converts raw text to token IDs
Embeddings — maps token IDs to dense vectors (meaning in vector space)
Positional encoding — adds position information to embeddings (transformers have no inherent sequence order)
Self-attention — each token attends to every other token in the sequence
Multi-head attention — multiple attention mechanisms run in parallel, each learning different relationships
Cross-attention — decoder attends to encoder output (in encoder-decoder models)
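To see what "each token attends to every other token" looks like numerically, here is a stripped-down sketch of scaled dot-product self-attention. As a simplifying assumption it uses the input vectors directly as queries, keys and values; a real transformer first applies learned projection matrices, and the toy vectors are invented.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Each output is a weighted average of all input vectors."""
    d = len(vectors[0])
    outputs, all_weights = [], []
    for q in vectors:
        # Similarity of this token to every token, scaled by sqrt(dimension).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in vectors]
        weights = softmax(scores)
        all_weights.append(weights)
        outputs.append([sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(d)])
    return outputs, all_weights


# Three hypothetical 2-dimensional token embeddings.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Every row of weights is computed independently of the others, which is why the whole thing can run in parallel rather than token by token.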

What Is Inference?

Inference is the act of using a trained model to generate a response. When you type a message to ChatGPT and it replies — that's inference. The model is running your input through all its learned weights and producing an output.

The term that trips people up is the difference between inference and a prompt:

Prompt — the text you send in. Your question, instruction, or context. It's just a string.
Inference — the process of the model computing a response from that prompt. It's the neural network doing work.

Quick example:

Prompt: "Translate 'hello' to French."
Inference: the model runs that prompt through its layers, samples from a probability distribution over tokens, and produces "Bonjour."
The prompt is your input. Inference is everything that happens in between.

This also clears up why "inference cost" is different from "training cost." Training is done once and updates the weights. Inference happens every single time someone sends a prompt — which at scale is millions of times a day. That's why model optimisation (quantisation, distillation) focuses so heavily on making inference faster and cheaper.

Inference Parameters

Parameter	Effect
Temperature	Controls randomness. Low (e.g. 0.1) = near-deterministic. High (e.g. 1.5) = creative/random.
Top-p (nucleus)	Sample from smallest set of tokens whose probabilities sum to p
Top-k	Sample from the k most likely tokens only
Max tokens	Maximum output length
Repetition penalty	Reduces probability of repeating tokens already in output

Note: Inference parameters ≠ training parameters. Training parameters (weights, biases) are learned during training. Inference parameters are set at runtime and influence how the model samples from its probability distribution.
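Here is a rough sketch of how temperature, top-k and top-p reshape the distribution before a token is sampled. The logits are invented and the function names are my own; real inference libraries implement the same ideas with more care.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def renormalise(probs, keep):
    """Zero out tokens not in `keep` and rescale the rest to sum to 1."""
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalise(probs, order[:k])


def top_p_filter(probs, p):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:  # smallest set whose probabilities reach p
            break
    return renormalise(probs, keep)


logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four tokens
sharp = softmax_with_temperature(logits, 0.1)
base = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 1.5)

random.seed(0)
token = random.choices(range(len(logits)), weights=top_p_filter(base, 0.8))[0]
print("sampled token index:", token)
```

None of this touches the weights: the same trained model produces the logits, and these settings only change how one token is picked from them.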

10. LLM Optimisation for Deployment

Quantisation — Reduce Memory Footprint

Format	Bytes per parameter	Notes
FP32	4 bytes	Baseline full precision
FP16	2 bytes	Half precision
BFLOAT16	2 bytes	Same exponent range as FP32 with fewer precision bits; popular for training
Int8	1 byte	Good compression, minimal quality loss
Int4	0.5 bytes	Aggressive — reduces size and speeds inference
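The table translates directly into a back-of-the-envelope calculator. This sketch counts weights only and uses 1 GB = 10^9 bytes; gradients, optimiser states and activations during training would add a lot on top. The 7B model size is just an example.

```python
BYTES_PER_PARAM = {
    "FP32": 4,
    "FP16": 2,
    "BFLOAT16": 2,
    "INT8": 1,
    "INT4": 0.5,
}


def weight_memory_gb(num_params, fmt):
    """Memory needed to store the weights alone, in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[fmt] / 1e9


num_params = 7_000_000_000  # hypothetical 7B-parameter model
for fmt in BYTES_PER_PARAM:
    print(f"{fmt:>8}: {weight_memory_gb(num_params, fmt):5.1f} GB")
```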

CUDA

CUDA keeps getting mentioned in the context of GPU training and it finally made sense to note it down. CUDA stands for Compute Unified Device Architecture. It's NVIDIA's platform and API for using GPUs for general-purpose computing (not just graphics). Basically, every time you hear about training on an NVIDIA GPU, CUDA is the underlying thing making that possible. PyTorch, TensorFlow, and most deep learning frameworks use CUDA under the hood when they run on NVIDIA hardware. As an infra person this is the layer I sit above, but it's good to know what it is when it comes up in errors or performance discussions.

Other Optimisation Techniques

Distillation — train a smaller "student" model to mimic a larger "teacher" model
Pruning — remove low-importance weights (near-zero values) to reduce model size
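As a toy illustration of the pruning idea, here is magnitude-based pruning on a flat list of weights. The threshold and the weight values are made up, and real pruning methods are more involved (and often need retraining afterwards).

```python
def magnitude_prune(weights, threshold):
    """Set weights whose absolute value is below the threshold to zero."""
    return [0.0 if abs(w) < threshold else w for w in weights]


def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


weights = [0.8, -0.01, 0.003, -0.6, 0.02, 0.4]  # hypothetical values
pruned = magnitude_prune(weights, threshold=0.05)
print(pruned, f"sparsity = {sparsity(pruned):.0%}")
```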

Distributed Training

Method	How It Works	When to Use
DDP (Distributed Data Parallel)	Full model replicated on each GPU	Model fits on single GPU
FSDP (Fully Sharded Data Parallel)	Model sharded across GPUs	Model too large for single GPU — reduces memory up to 80%

11. RLHF — Reinforcement Learning from Human Feedback

Aligns the model with human preferences: helpful, harmless, honest (HHH).

1. Collect human feedback on model outputs (rank completions)
2. Train a Reward Model on human rankings
3. Use PPO (Proximal Policy Optimisation) to fine-tune the LLM
   → Maximise reward score while staying close to original model
4. KL Divergence penalty: constrains how far the updated policy can deviate
   from the reference model (guards against reward hacking and excessive drift)

PPO (Proximal Policy Optimisation) — the "proximal" refers to a constraint that limits the distance between old and new policy, preventing large, destabilising weight updates.
KL Divergence — measures difference between two probability distributions. Used in PPO to ensure the fine-tuned LLM doesn't drift too far from the original.
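A small sketch of how a KL penalty can be folded into the reward. The subtraction form `reward - beta * KL` is a common way to express it, but the exact formulation, the `beta` value and the toy distributions here are my assumptions, not something the course specified.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def penalised_reward(reward, policy_probs, reference_probs, beta):
    """Reward model score minus a penalty for drifting from the reference."""
    return reward - beta * kl_divergence(policy_probs, reference_probs)


reference = [0.5, 0.3, 0.2]     # original model's next-token distribution
close = [0.55, 0.27, 0.18]      # updated policy, small drift
drifted = [0.95, 0.03, 0.02]    # updated policy, large drift

for name, policy in [("close", close), ("drifted", drifted)]:
    kl = kl_divergence(policy, reference)
    print(f"{name}: KL = {kl:.3f}, penalised reward = "
          f"{penalised_reward(1.0, policy, reference, beta=0.2):.3f}")
```

With the same raw reward, the policy that strays further from the reference ends up with the lower effective reward.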

Constitutional AI (RLAIF)

AI-generated feedback instead of human feedback. Reduces cost of RLHF.

Red teaming — prompt model to generate harmful responses
Critique — model critiques its own harmful output using constitutional principles
Revision — model revises output to be safe
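Fine-tune the model on the revised outputs, then have the model rank responses against the constitutional principles and use those AI-generated preferences to train the reward model (replacing human feedback)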

Constitutional principles = a short text file with rules the model should follow (analogous to a constitution for the AI's behaviour).

12. Advanced Techniques — Overcoming LLM Limitations

RAG — Retrieval-Augmented Generation

Reduces hallucination and works around the model's knowledge cutoff by retrieving relevant context from an external knowledge base and providing it to the model as additional input.

User query → Retriever → External knowledge base → Retrieved context
                                                          ↓
                                              LLM generates answer
                                              grounded in retrieved facts
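The flow above can be sketched end to end with a toy retriever. This version ranks documents by bag-of-words cosine similarity, which is a stand-in I chose for simplicity; real systems typically use embedding models and a vector store. The documents, the query and the prompt template are all made up, and the final LLM call is left out.

```python
import math
from collections import Counter


def bag_of_words(text):
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, documents, top_n=1):
    """Return the top_n documents most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(documents, key=lambda d: cosine(q, bag_of_words(d)), reverse=True)
    return ranked[:top_n]


def build_rag_prompt(query, documents):
    context = "\n".join(retrieve(query, documents))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


# Hypothetical knowledge base.
docs = [
    "LoRA trains two small low-rank matrices and freezes the original weights.",
    "ROUGE compares generated summaries to human reference summaries.",
    "FSDP shards the model across GPUs.",
]

prompt = build_rag_prompt("How does ROUGE compare summaries", docs)
print(prompt)
```

The model never needs retraining for new facts here: updating the knowledge base changes what gets retrieved and therefore what the model is grounded in.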

ReAct — Reasoning + Acting

Framework combining reasoning traces and action steps. The model reasons about which action to take, takes the action (e.g. search, calculate), observes the result, reasons again. Interleaves reasoning and acting to solve multi-step problems that require external tools.

Chain of Thought Prompting

Include step-by-step reasoning in the prompt to guide the model through complex reasoning tasks. Improves performance on maths, logic, and multi-step problems.

PAL — Program-Aided Language Models

Model generates code to solve problems rather than reasoning in natural language. The code is executed and the result returned.

LangChain

Framework for building LLM-powered applications. Chains together prompts, models, memory, tools, and agents. Memory in LangChain stores conversation history to maintain context across turns.

13. Responsible AI

Issue	Description
Hallucination	Model generates false or fabricated information confidently
Toxicity	Harmful, offensive, or dangerous outputs
Bias	Reflects and amplifies biases present in training data
Fairness	Inconsistent performance across demographic groups
IP/Copyright	Model may reproduce copyrighted training data

Plausibility vs Accuracy: A model can generate text that sounds plausible but is factually wrong. This is the core challenge of LLM evaluation — outputs can be fluent and confident while being incorrect.

Quick Reference — Key Formulas

ROUGE-1 Recall    = unigram matches / reference unigrams
ROUGE-1 Precision = unigram matches / generated unigrams
ROUGE-1 F1        = 2 × (Recall × Precision) / (Recall + Precision)

ROUGE-L           = uses LCS length in place of unigram matches

BLEU              = brevity penalty × geometric mean of n-gram precisions, n = 1 to 4

LoRA param reduction example (rank=8, W=512×64):
  Original W: 32,768 params
  LoRA A+B:    4,608 params  →  86% reduction

Quantisation memory (7B parameter model):
  FP32:  7B × 4 bytes = 28 GB
  FP16:  7B × 2 bytes = 14 GB
  Int8:  7B × 1 byte  =  7 GB
  Int4:  7B × 0.5 bytes = 3.5 GB

Key Distinctions to Remember

Comparison	First term	Second term
Fine-tuning vs Prompt Tuning	Updates model weights using labelled data	Trains soft prompt vectors only — model weights stay frozen
ROUGE vs BLEU	Recall-based. How much of the reference is captured. Use for summarisation.	Precision-based. How much of the generated text appears in the reference. Use for translation.
LoRA vs Full Fine-tuning	Far fewer trainable parameters (86% fewer in the worked example), same inference latency, task adapters swappable	All weights updated, full model copies per task, higher memory cost
Training params vs Inference params	Learned weights and biases — not changed at inference	Temperature, top-p, top-k — set at runtime, not learned
RLHF vs Constitutional AI	Uses human feedback to train reward model	Uses AI-generated feedback guided by written principles — cheaper and more scalable

The Core Lesson

"Full fine-tuning is rarely the right first move. Start with prompt engineering: no training cost, immediate feedback. If the model needs task-specific improvement, LoRA trains far fewer parameters (86% fewer in the worked example above) with minimal performance loss and full inference compatibility. The goal is the smallest intervention that achieves the required behaviour, because smaller interventions are cheaper, faster to iterate, and preserve what the model already knows."

Tags: GenAI LLM Fine-tuning LoRA PEFT ROUGE BLEU AWS DeepLearning.AI

Ankush Panday

Specializing in highly scalable AWS infrastructure and automated quality engineering.
