GenAI Learning Series | Week 2

Generative AI with Large Language Models — Week 2 Study Notes

Course: AWS & DeepLearning.AI (Coursera)  |  Focus: Fine-tuning, Evaluation, PEFT and LLM Optimisation

June 2, 2026  |  15 min read  |  GenAI, LLM, Fine-tuning, LoRA, PEFT, AWS

Learning in Public — Study notes from a 24-week Generative AI course, written from an infrastructure engineer's perspective.

Week 2 — GenAI with LLMs: Fine-tuning, PEFT, LoRA, ROUGE and model evaluation

1. The GenAI Project Lifecycle

The lifecycle maps out every stage from idea to production deployment.

1. Scope           → Define what the LLM needs to do (narrow = cheaper)
2. Select Model    → Existing foundation model or train from scratch
3. Adapt & Align   → Prompt engineering → Fine-tuning → RLHF (iterative)
4. Evaluate        → Metrics, benchmarks, human evaluation
5. Deploy & Optimise → Quantisation, distillation, pruning
6. Augment         → RAG, agents, tool use (overcome LLM limitations)
Key insight: Pre-training is the most complex and expensive stage — most teams skip it entirely and start with an existing foundation model. Prompt engineering is tried first (no training required), then fine-tuning if needed.
Stage | Complexity | Typical Duration
Pre-training | Very High | Months
Prompt Engineering | Low | Days
Fine-tuning | Medium | Days to weeks
RLHF | Medium-High | Weeks (depends on human feedback availability)
Optimisation | Medium | Days

2. In-Context Learning (ICL)

Teaching the model via examples in the prompt — no weight updates.

Type | Description | Example
Zero-shot | No examples, just the instruction | "Classify sentiment: [text]"
One-shot | One example provided | One labelled example + new input
Few-shot | Multiple examples (2–5) | Multiple labelled examples + new input
Limitation: Large models handle zero/few-shot well. Smaller models may still need fine-tuning even with good examples. ICL uses up context window space.
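
To make the three ICL types concrete, here is a minimal sketch of how a prompt could be assembled. The `build_prompt` helper, the "Review:/Sentiment:" layout and the example reviews are my own illustration, not something from the course. The only difference between zero-, one- and few-shot is how many labelled examples get placed before the new input.

```python
def build_prompt(instruction, examples, new_input):
    """Assemble a zero-, one-, or few-shot prompt from labelled examples."""
    parts = [instruction]
    for text, label in examples:
        parts.append(f"Review: {text}\nSentiment: {label}")
    # The final block leaves the label blank for the model to complete.
    parts.append(f"Review: {new_input}\nSentiment:")
    return "\n\n".join(parts)


examples = [
    ("I loved this film.", "Positive"),
    ("The plot was dull.", "Negative"),
]
zero_shot = build_prompt("Classify the sentiment.", [], "Great soundtrack!")
few_shot = build_prompt("Classify the sentiment.", examples, "Great soundtrack!")
print(few_shot)
```

The few-shot prompt is noticeably longer than the zero-shot one, which is exactly the context-window cost mentioned in the limitation.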

Context Window

The context window is the maximum amount of text, measured in tokens, that the model can "see" at once. Both your prompt and the model's response have to fit inside it. I kept seeing this term and not really having a clean definition for it. Its size is fixed by the model's architecture and training rather than by parameter count, although newer and larger models often ship with longer windows. When you add examples for few-shot learning, those examples eat into that space. So there's a real trade-off between giving the model more examples and leaving room for the actual input and output.

3. Prompt Engineering vs Prompt Tuning

Aspect | Prompt Engineering | Prompt Tuning
What changes | The language/wording of the prompt | Trainable embedding vectors (soft prompts)
Training required | No | Yes (supervised learning)
Human-readable | Yes | No (vectors, not words)
Cost | Low | Medium
Performance | Good for large models | Approaches fine-tuning for large models

Hard Prompts vs Soft Prompts

  • Hard prompts: human-written text, interpretable, flexible. Used in prompt engineering.
  • Soft prompts: learned embedding vectors prepended to the input. Each vector has the same dimensionality as a token embedding. Task-specific and not interpretable. Trained with supervised learning while the model's own weights stay frozen.
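
A rough sketch of what "prepended to the input" means mechanically. The sizes (`EMBED_DIM`, `NUM_VIRTUAL_TOKENS`) and the stand-in embedding values are hypothetical and tiny; real models use embedding widths in the thousands. The point is only that the soft prompt is a handful of extra vectors, and those are the only values that would be trained.

```python
import random

EMBED_DIM = 4           # hypothetical embedding width
NUM_VIRTUAL_TOKENS = 3  # hypothetical soft-prompt length

random.seed(0)
# Frozen token embeddings for a 2-token input (stand-in values).
input_embeddings = [[0.1] * EMBED_DIM, [0.2] * EMBED_DIM]
# Trainable soft prompt: one vector per virtual token, same width as a token embedding.
soft_prompt = [
    [random.gauss(0, 0.02) for _ in range(EMBED_DIM)]
    for _ in range(NUM_VIRTUAL_TOKENS)
]
# The model consumes the concatenation; only soft_prompt would receive gradient updates.
model_input = soft_prompt + input_embeddings
trainable_params = NUM_VIRTUAL_TOKENS * EMBED_DIM
print(len(model_input), "vectors,", trainable_params, "trainable values")
```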

Prompt Optimization

A term I came across that I want to make sure I don't confuse with prompt engineering. Prompt engineering is about manually choosing better words. Prompt optimization is more systematic — it's about addressing the inconsistency and unreliability of outputs. The question it tries to answer: how do you make the model give you a good answer reliably, not just occasionally? That might involve structuring prompts differently, adding constraints, or using techniques like chain-of-thought. I'll dig into this more in later weeks.

4. Fine-Tuning

Supervised learning that updates model weights using labelled examples. Uses a much smaller dataset than pre-training (hundreds to thousands of examples vs billions of tokens).

Pre-training is self-supervised learning — something I hadn't connected before. The model learns by predicting the next token from unlabelled text. No human labels required. That's what makes it possible to train on the entire internet. Fine-tuning after that is supervised — you need labelled prompt-completion pairs.

4a. Instruction Fine-Tuning

Trains the model to follow instructions rather than just predict next tokens. Uses prompt-completion pairs. Improves generalisation across tasks. Dataset is instruction-formatted examples — much smaller than pre-training data.

4b. Single-Task Fine-Tuning

Fine-tune on only one task (e.g. sentiment analysis, summarisation). Can achieve good results with just 500–1,000 examples.

Catastrophic Forgetting

The model forgets how to do other tasks because full fine-tuning modifies all weights. Example: after fine-tuning for sentiment analysis, the model may lose its ability to do named entity recognition, even though it handled that task correctly before fine-tuning.

How to avoid it:
  • Multi-task fine-tuning — train on 50,000–100,000 examples across many tasks simultaneously
  • PEFT — only update a small subset of weights, preserving original model capabilities

4c. Multi-Task Instruction Fine-Tuning

Fine-tune on many tasks at once. Requires significantly more data (50k–100k examples). Avoids catastrophic forgetting. FLAN models (e.g. FLAN-T5) use this approach.

Cross Entropy Loss

This came up when the course explained how the model actually learns during training. I'm still not fully comfortable with the maths here, but my understanding is: the model outputs a probability distribution across all tokens in its vocabulary. Cross entropy loss measures how wrong that distribution is compared to the actual next token: it is the negative log of the probability the model assigned to the correct token, so a confident correct prediction gives a loss near zero and a confident wrong one gives a large loss. The model is trying to minimise this loss across all the training examples, and that's what drives the weight updates. Cross entropy fits here because next-token prediction is effectively a classification problem over the whole vocabulary, with the model producing a probability for every candidate token.
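
Working it through with numbers helped me. This is a minimal sketch with a made-up three-token vocabulary and made-up scores (logits); `softmax` and `cross_entropy` are my own helper names. The loss for a single position is the negative log of the probability given to the token that actually came next.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the correct next token."""
    return -math.log(softmax(logits)[target_index])


vocab = ["cat", "dog", "mat"]  # toy vocabulary (assumption)
logits = [2.0, 0.5, 0.1]       # made-up model scores, highest for "cat"
loss_if_cat = cross_entropy(logits, vocab.index("cat"))
loss_if_mat = cross_entropy(logits, vocab.index("mat"))
print(round(loss_if_cat, 3), round(loss_if_mat, 3))
```

The loss is smaller when the true next token is the one the model favoured ("cat") and larger when it is one the model thought unlikely ("mat").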

5. PEFT — Parameter-Efficient Fine-Tuning

Full fine-tuning updates every parameter. PEFT updates only a small subset: the course cites 15–20% of the original weights in some cases, and methods such as LoRA typically train far fewer. Most original weights are frozen.

  • Dramatically lower memory requirements (often fits on a single GPU)
  • Less prone to catastrophic forgetting
  • Small trained weights (~MBs) can be swapped in/out per task — no need to store full model copies
  • Multiple task adapters can share one base model
Compute budget matters too, not just memory. The course asked: "besides memory, what key resource must be sufficient for full fine-tuning?" The answer is compute. Even if you can technically fit the model in memory, full fine-tuning requires enough GPU compute to run all the forward and backward passes. PEFT helps on both fronts — less memory AND less compute because fewer parameters need gradients.
Category | How It Works | Example / Notes
Selective | Fine-tune a subset of existing parameters (specific layers) | Mixed results; trade-off between parameter and compute efficiency
Reparameterisation | Reduce trainable parameters via low-rank transformations | LoRA
Additive | Keep original weights frozen, add new trainable components | Adapters, Soft Prompts (Prompt Tuning)

6. LoRA — Low-Rank Adaptation

The most widely used PEFT technique. Falls under reparameterisation.

How It Works

  1. Freeze all original model weights
  2. Inject two small rank-decomposition matrices (A and B) alongside each weight matrix
  3. Train only the small matrices A and B
  4. At inference: multiply B × A (the product has the same dimensions as W), add the result to the frozen original weights, and use the merged matrix in the model
Original weight matrix W: dimensions 512 × 64 = 32,768 parameters

LoRA with rank = 8:
  Matrix A: 8 × 64   = 512 parameters
  Matrix B: 512 × 8  = 4,096 parameters
  Total LoRA params:   4,608 parameters

Reduction: 86% fewer trainable parameters
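
The arithmetic above is easy to check in code. This is a sketch only: `lora_param_counts` and `matmul` are hypothetical helpers, and the 3 × 2 merge example uses made-up values purely to show that B × A has the same shape as W and can be added to it.

```python
def lora_param_counts(d, k, r):
    """Return (full, lora) trainable-parameter counts for a d x k weight matrix."""
    full = d * k
    lora = r * k + d * r  # A is r x k, B is d x r
    return full, lora


def matmul(X, Y):
    """Plain-Python matrix product, fine for tiny illustrative shapes."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


full, lora = lora_param_counts(d=512, k=64, r=8)
reduction = 1 - lora / full
print(full, lora, f"{reduction:.0%}")

# Tiny merge example: W is 3 x 2, rank r = 1, so B is 3 x 1 and A is 1 x 2.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
B = [[0.5], [0.0], [-0.5]]
A = [[0.2, 0.4]]
delta = matmul(B, A)  # same shape as W
W_merged = [[w + dw for w, dw in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]
```

Because the merged matrix has exactly the shape of the original, nothing extra is computed at inference time.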

Key Properties

  • No inference latency increase: after merging, the model has the same number of parameters as the original
  • Multiple task adapters: train different A/B pairs per task, swap at inference time
  • Typically applied to self-attention layers: the course presents this as often sufficient and as where the largest savings are, though in many transformer architectures the feed-forward layers hold more of the parameters
  • Rank range 4–32: good trade-off between reduction and performance. Ranks above 16 show diminishing returns.

7. Model Evaluation Metrics

The challenge: LLM output is non-deterministic and language-based. You cannot use simple accuracy metrics. Two sentences with only one word difference can have opposite meanings.

ROUGE — For Summarisation

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) compares generated summaries to human reference summaries.

Terminology: Unigram = 1 word  |  Bigram = 2 consecutive words  |  N-gram = N consecutive words  |  LCS = Longest Common Subsequence

ROUGE-1 (Unigram)

Generated:  "the cat sat on the big red mat"    (8 words)
Reference:  "the cat sat on the mat"            (6 words)
Matches:     the, cat, sat, on, the, mat        (6 matches — "big" and "red" not in reference)

Recall    = matches / reference words = 6/6 = 1.000
Precision = matches / generated words = 6/8 = 0.750
F1        = 2 × (1.000 × 0.750) / (1.000 + 0.750) = 0.857

ROUGE-2 (Bigram)

Generated bigrams:  the cat, cat sat, sat on, on the, the big, big red, red mat
Reference bigrams:  the cat, cat sat, sat on, on the, the mat
Matching bigrams:   the cat, cat sat, sat on, on the  (4 matches)

Recall    = 4/5 = 0.800
Precision = 4/7 = 0.571
F1        = 2 × (0.800 × 0.571) / (0.800 + 0.571) = 0.667

ROUGE-L (Longest Common Subsequence)

LCS = "the cat sat on the mat" (length = 6)

Recall    = LCS length / reference length = 6/6 = 1.000
Precision = LCS length / generated length = 6/8 = 0.750
F1 (ROUGE-L) = 0.857

ROUGE Pitfall — Repetition

Repetition like "cold cold cold cold" can score highly. Fix: apply a clipping function capping unigram matches at their count in the reference. Also note: ROUGE scores are only comparable across the same task — don't compare summarisation ROUGE to translation ROUGE.
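
To check the worked ROUGE numbers above, here is a small self-contained sketch. The function names are mine and this is not a replacement for an established ROUGE library (no stemming, no multiple references). Using `Counter` intersection gives the clipping behaviour for free: each n-gram can match at most as many times as it appears in the reference.

```python
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def prf(matches, ref_total, gen_total):
    """Recall, precision and F1 from a match count."""
    recall = matches / ref_total
    precision = matches / gen_total
    f1 = 2 * recall * precision / (recall + precision) if matches else 0.0
    return recall, precision, f1


def rouge_n(generated, reference, n):
    gen, ref = Counter(ngrams(generated, n)), Counter(ngrams(reference, n))
    # Counter intersection clips each n-gram's matches at its reference count.
    matches = sum((gen & ref).values())
    return prf(matches, sum(ref.values()), sum(gen.values()))


def lcs_length(a, b):
    # Classic dynamic-programming table for longest common subsequence.
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            table[i][j] = table[i - 1][j - 1] + 1 if x == y else max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def rouge_l(generated, reference):
    return prf(lcs_length(generated, reference), len(reference), len(generated))


generated = "the cat sat on the big red mat".split()
reference = "the cat sat on the mat".split()
for name, scores in [("ROUGE-1", rouge_n(generated, reference, 1)),
                     ("ROUGE-2", rouge_n(generated, reference, 2)),
                     ("ROUGE-L", rouge_l(generated, reference))]:
    print(name, [round(s, 3) for s in scores])
```

With clipping, "cold cold cold cold" scored against "it is cold outside" gets a unigram precision of 1/4 rather than 4/4.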

BLEU — For Translation

BLEU (Bilingual Evaluation Understudy) evaluates machine translation quality. Key difference from ROUGE: BLEU is precision-based. It does not measure recall directly, and instead applies a brevity penalty to outputs shorter than the reference.

Generated:  "the cat sat on the big red mat"    (8 words)
Reference:  "the cat sat on the mat"            (6 words)

Unigram precision = 6/8 = 0.750
Bigram precision  = 4/7 = 0.571  (the cat, cat sat, sat on, on the — 4 of 7 bigrams match)
BLEU = brevity penalty × geometric mean of the unigram, bigram, trigram and 4-gram precisions
Metric | Measures | Use For
ROUGE | Recall: how much of the reference is captured | Summarisation
BLEU | Precision: how much of the generated text appears in the reference | Translation
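
A minimal sketch of sentence-level BLEU for the same cat/mat example, assuming a single reference, uniform weights over n = 1 to 4 and no smoothing. The helper names are mine; real evaluations normally use an established implementation. It combines the clipped n-gram precisions with a geometric mean and then applies the brevity penalty.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(generated, reference, n):
    gen, ref = ngram_counts(generated, n), ngram_counts(reference, n)
    clipped = sum((gen & ref).values())  # clip matches at reference counts
    return clipped / max(sum(gen.values()), 1)


def bleu(generated, reference, max_n=4):
    precisions = [modified_precision(generated, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # without smoothing, any zero precision zeroes the score
    # Geometric mean of the n-gram precisions (uniform weights).
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: only penalises outputs shorter than the reference.
    if len(generated) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(generated))
    return bp * geo_mean


generated = "the cat sat on the big red mat".split()
reference = "the cat sat on the mat".split()
score = bleu(generated, reference)
print(round(score, 3))
```

Here the generated sentence is longer than the reference, so the brevity penalty is 1 and the score is just the geometric mean of 6/8, 4/7, 3/6 and 2/5.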

Perplexity

This was listed in the course evaluation metrics alongside BLEU and ROUGE, and I want to be upfront that I haven't fully internalised it yet. My rough understanding: perplexity measures how "surprised" the model is by a piece of text. A low perplexity score means the model found the text predictable — so it's a measure of how well the model's probability distribution matches real language. Lower is generally better. It's used more for evaluating the base language model quality rather than specific task performance like summarisation or translation. I'll come back to this one.
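
One thing that made perplexity click for me is the standard definition: the exponential of the average negative log-probability the model assigned to each actual token. The sketch below uses made-up per-token probabilities, and `perplexity` is my own helper name.

```python
import math


def perplexity(token_probs):
    """exp of the average negative log-probability given to each actual token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


confident = perplexity([0.9, 0.8, 0.95])  # model found the text predictable
uncertain = perplexity([0.2, 0.1, 0.25])  # model was "surprised" at each step
uniform = perplexity([0.25] * 4)          # guessing uniformly among 4 options
print(round(confident, 2), round(uncertain, 2), round(uniform, 2))
```

A useful intuition: a model that guesses uniformly among 4 tokens at every step has a perplexity of 4, so perplexity reads roughly as "how many options the model is effectively choosing between".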

8. Evaluation Benchmarks

For overall model evaluation, use standardised benchmarks rather than task-specific metrics alone:

Benchmark | Full Name | What It Tests
GLUE | General Language Understanding Evaluation | Range of NLU tasks
SuperGLUE | Super GLUE | Harder NLU; models exceeded human baseline on GLUE
MMLU | Massive Multitask Language Understanding | 57 subjects across STEM, humanities, social sciences
HELM | Holistic Evaluation of Language Models | 7 metrics: accuracy, calibration, robustness, fairness, bias, toxicity, efficiency
SQuAD | Stanford Question Answering Dataset | Reading comprehension + Q&A

9. Transformer Architecture Quick Reference

Type | Also Known As | Architecture | Best For | Examples
Encoder only | Autoencoding | Bidirectional; sees full sequence | Classification, NER, sentiment | BERT, RoBERTa, DistilBERT
Encoder-Decoder | Seq2seq | Encoder processes input, decoder generates output | Translation, summarisation, Q&A | T5, BART
Decoder only | Autoregressive | Unidirectional; predicts next token | Text generation, chatbots | GPT family, LLaMA

Why Transformers Replaced RNNs and CNNs

Before transformers, sequence tasks were handled by RNNs (Recurrent Neural Networks) and CNNs (Convolutional Neural Networks). The course touched on why transformers won, and the answer that clicked for me was: RNNs process tokens one at a time, sequentially. That means they struggle with long-range dependencies — by the time the model gets to the end of a long sentence, information about the beginning has faded. CNNs were faster but had limited receptive fields. Transformers introduced the self-attention mechanism, which lets every token attend to every other token in the sequence simultaneously. That's the key shift — parallel, not sequential. Which also makes them much faster to train on modern GPU hardware.

Activation Functions — Softmax, Sigmoid, Tanh, ReLU

These came up in the architecture discussion. I'll be honest — I'm going to need to revisit this properly. My current rough understanding:

  • Softmax — converts raw scores into a probability distribution (all values sum to 1). Used in the output layer of the transformer to produce token probabilities.
  • Sigmoid — squashes values to 0–1. Useful for binary classification.
  • Tanh — squashes values to -1 to 1. Similar to sigmoid but centred at zero.
  • ReLU (Rectified Linear Unit) — outputs 0 for negative values, the value itself for positive. Very common in feed-forward layers inside transformers.
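
Writing the four functions out made the differences easier to remember. This is a plain-Python sketch using only the `math` module; the input values are arbitrary.

```python
import math


def softmax(xs):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Squash a single value into the range 0 to 1."""
    return 1 / (1 + math.exp(-x))


def tanh(x):
    """Squash a single value into the range -1 to 1, centred at zero."""
    return math.tanh(x)


def relu(x):
    """Zero for negative inputs, the input itself otherwise."""
    return max(0.0, x)


for x in (-2.0, 0.0, 3.0):
    print(x, round(sigmoid(x), 3), round(tanh(x), 3), relu(x))
print([round(p, 3) for p in softmax([2.0, 1.0, 0.1])])
```

Softmax is the odd one out: it works on a whole vector of scores at once, while the other three are applied to each value independently.
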
Interesting stat from the course: The new Moore's Law for LLMs — parameter count roughly doubles every 9.8 months. This was cited to show how fast the field is moving. The original Moore's Law was about transistor count on chips. The LLM equivalent tracks model scale. Worth keeping in mind when reading papers with dates on them — a 2022 "large" model may be medium-sized by 2024 standards.

Key Concepts

  • Tokenisation — converts raw text to token IDs
  • Embeddings — maps token IDs to dense vectors (meaning in vector space)
  • Positional encoding — adds position information to embeddings (transformers have no inherent sequence order)
  • Self-attention — each token attends to every other token in the sequence
  • Multi-head attention — multiple attention mechanisms run in parallel, each learning different relationships
  • Cross-attention — decoder attends to encoder output (in encoder-decoder models)
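
For self-attention, a sketch of the standard scaled dot-product formulation (the formula itself is not spelled out in these notes, so treat it as my addition). In a real model Q, K and V come from learned projections of the token embeddings; here they are small hand-picked lists so the result is easy to verify.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(Q, K, V):
    """Scaled dot-product attention over lists of vectors (one row per token)."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Score this token's query against every token's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Output is the attention-weighted mix of all value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return outputs


Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 1.0], [1.0, 1.0]]  # identical keys, so attention weights are uniform
V = [[0.0, 2.0], [4.0, 6.0]]
out = self_attention(Q, K, V)
print(out)
```

Because both keys are identical, every token attends equally to both positions and each output row is simply the average of the two value vectors.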

What Is Inference?

Inference is the act of using a trained model to generate a response. When you type a message to ChatGPT and it replies — that's inference. The model is running your input through all its learned weights and producing an output.

The term that trips people up is the difference between inference and a prompt:

  • Prompt — the text you send in. Your question, instruction, or context. It's just a string.
  • Inference — the process of the model computing a response from that prompt. It's the neural network doing work.

Quick example:

Prompt: "Translate 'hello' to French."
Inference: the model runs that prompt through its layers, samples from a probability distribution over tokens, and produces "Bonjour."
The prompt is your input. Inference is everything that happens in between.

This also clears up why "inference cost" is different from "training cost." Training is done once and updates the weights. Inference happens every single time someone sends a prompt — which at scale is millions of times a day. That's why model optimisation (quantisation, distillation) focuses so heavily on making inference faster and cheaper.

Inference Parameters

Parameter | Effect
Temperature | Controls randomness. Low (0.1) = near-deterministic. High (1.5) = creative/random.
Top-p (nucleus) | Sample from the smallest set of most likely tokens whose cumulative probability reaches p
Top-k | Sample from the k most likely tokens only
Max tokens | Maximum output length
Repetition penalty | Reduces probability of repeating tokens already in output
Note: Inference parameters ≠ training parameters. Training parameters (weights, biases) are learned during training. Inference parameters are set at runtime and influence how the model samples from its probability distribution.
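
Here is a rough sketch of how temperature, top-k and top-p could interact when picking the next token. `sample_next_token` is a hypothetical helper of mine and real inference libraries differ in details (ordering of the filters, tie handling), so treat it as an illustration of the idea only. Temperature must be greater than zero in this sketch.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Return the index of a sampled token after applying the runtime settings."""
    # Temperature rescales the scores before softmax: low sharpens, high flattens.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank token indices from most to least likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]  # keep only the k most likely tokens
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in ranked:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:  # smallest set reaching probability mass p
                break
        ranked = kept
    # random.choices renormalises the surviving probabilities for us.
    return rng.choices(ranked, weights=[probs[i] for i in ranked], k=1)[0]


logits = [1.0, 3.0, 2.0]  # made-up scores; index 1 is the most likely token
print(sample_next_token(logits, temperature=0.7, top_k=2))
```

With `top_k=1` the function always returns the most likely token, which is the greedy decoding case.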

10. LLM Optimisation for Deployment

Quantisation — Reduce Memory Footprint

Format | Bytes per parameter | Notes
FP32 | 4 bytes | Baseline full precision
FP16 | 2 bytes | Half precision
BFLOAT16 | 2 bytes | Keeps FP32's exponent range with fewer precision bits than FP16; widely used in training
Int8 | 1 byte | Good compression, minimal quality loss
Int4 | 0.5 bytes | Aggressive; reduces size and speeds inference
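
As an infra person, the first thing I want from this table is a memory estimate. A small sketch, assuming 1 GB = 10^9 bytes and counting only the weights (no activations, optimiser state or KV cache); `weight_memory_gb` is my own helper name.

```python
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "BFLOAT16": 2, "INT8": 1, "INT4": 0.5}


def weight_memory_gb(num_params, fmt):
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[fmt] / 1e9


for fmt in BYTES_PER_PARAM:
    print(fmt, weight_memory_gb(7e9, fmt), "GB")
```

For a 7B-parameter model this gives 28 GB at FP32 down to 3.5 GB at Int4, which is the difference between needing multiple GPUs and fitting on one.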

CUDA

CUDA keeps getting mentioned in the context of GPU training and it finally made sense to note it down. CUDA stands for Compute Unified Device Architecture — it's NVIDIA's platform and API for using GPUs for general-purpose computing (not just graphics). Basically, every time you hear about training on a GPU, CUDA is the underlying thing making that possible. PyTorch, TensorFlow, and most deep learning frameworks rely on CUDA under the hood. As an infra person this is the layer I sit above — but it's good to know what it is when it comes up in errors or performance discussions.

Other Optimisation Techniques

  • Distillation — train a smaller "student" model to mimic a larger "teacher" model
  • Pruning — remove low-importance weights (near-zero values) to reduce model size

Distributed Training

Method | How It Works | When to Use
DDP (Distributed Data Parallel) | Full model replicated on each GPU | Model fits on single GPU
FSDP (Fully Sharded Data Parallel) | Model sharded across GPUs | Model too large for single GPU; reduces memory up to 80%

11. RLHF — Reinforcement Learning from Human Feedback

Aligns the model with human preferences: helpful, harmless, honest (HHH).

1. Collect human feedback on model outputs (rank completions)
2. Train a Reward Model on human rankings
3. Use PPO (Proximal Policy Optimisation) to fine-tune the LLM
   → Maximise reward score while staying close to original model
4. Apply a KL divergence penalty that constrains how far the updated policy
   can deviate from the reference model (guards against reward hacking and excessive drift)
  • PPO (Proximal Policy Optimisation) — the "proximal" refers to a constraint that limits the distance between old and new policy, preventing large, destabilising weight updates.
  • KL Divergence — measures difference between two probability distributions. Used in PPO to ensure the fine-tuned LLM doesn't drift too far from the original.
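
A small sketch of the KL idea with made-up next-token distributions. `kl_divergence` is the standard discrete formula; `penalised_reward` and its `beta` weight are hypothetical names of mine to show how a KL penalty could be subtracted from the reward model's score, not the exact objective used in any particular PPO implementation.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def penalised_reward(reward, policy_probs, reference_probs, beta=0.1):
    """Reward model score minus a KL penalty; beta is a hypothetical weight."""
    return reward - beta * kl_divergence(policy_probs, reference_probs)


reference = [0.5, 0.3, 0.2]     # original (frozen) model's distribution
close = [0.45, 0.35, 0.2]       # updated policy that stayed near the original
drifted = [0.98, 0.01, 0.01]    # updated policy that moved a long way
print(round(kl_divergence(close, reference), 4), round(kl_divergence(drifted, reference), 4))
print(penalised_reward(1.0, close, reference) > penalised_reward(1.0, drifted, reference))
```

For the same raw reward, the policy that drifted further from the reference ends up with the lower penalised score, which is the pressure that keeps the fine-tuned model close to the original.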

Constitutional AI (RLAIF)

AI-generated feedback instead of human feedback. Reduces cost of RLHF.

  1. Red teaming — prompt model to generate harmful responses
  2. Critique — model critiques its own harmful output using constitutional principles
  3. Revision — model revises output to be safe
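  4. Fine-tune the model on the revised outputs, then have it rank pairs of its own responses against the constitution; those AI-generated preferences train the reward model (replacing human feedback)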

Constitutional principles = a short text file with rules the model should follow (analogous to a constitution for the AI's behaviour).

12. Advanced Techniques — Overcoming LLM Limitations

RAG — Retrieval-Augmented Generation

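Mitigates hallucination by retrieving relevant context from an external knowledge base and providing it to the model as additional input. It reduces the problem rather than eliminating it, since the model can still misread or ignore the retrieved text.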

User query → Retriever → External knowledge base → Retrieved context
                                                          ↓
                                              LLM generates answer
                                              grounded in retrieved facts
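
The flow in the diagram can be sketched in a few lines. This is a toy: the retriever ranks documents by shared words, whereas real systems typically use embeddings and a vector store, and the final LLM call is left out entirely. `retrieve`, `build_rag_prompt` and the two-document knowledge base are all my own stand-ins.

```python
def retrieve(query, documents, top_n=1):
    """Toy retriever: rank documents by how many query words they share."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_n]


def build_rag_prompt(query, documents):
    """Place the retrieved context ahead of the question for the LLM to use."""
    context = "\n".join(retrieve(query, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}\nAnswer:"


knowledge_base = [
    "FSDP shards model parameters across GPUs.",
    "ROUGE compares generated summaries to reference summaries.",
]
prompt = build_rag_prompt("What does FSDP do across GPUs?", knowledge_base)
print(prompt)
```

The resulting prompt is what would be sent to the model, so the answer is grounded in the retrieved sentence rather than only in what the model memorised during training.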

ReAct — Reasoning + Acting

Framework combining reasoning traces and action steps. The model reasons about which action to take, takes the action (e.g. search, calculate), observes the result, reasons again. Interleaves reasoning and acting to solve multi-step problems that require external tools.

Chain of Thought Prompting

Include step-by-step reasoning in the prompt to guide the model through complex reasoning tasks. Improves performance on maths, logic, and multi-step problems.

PAL — Program-Aided Language Models

Model generates code to solve problems rather than reasoning in natural language. The code is executed and the result returned.

LangChain

Framework for building LLM-powered applications. Chains together prompts, models, memory, tools, and agents. Memory in LangChain stores conversation history to maintain context across turns.

13. Responsible AI

Issue | Description
Hallucination | Model generates false or fabricated information confidently
Toxicity | Harmful, offensive, or dangerous outputs
Bias | Reflects and amplifies biases present in training data
Fairness | Inconsistent performance across demographic groups
IP/Copyright | Model may reproduce copyrighted training data
Plausibility vs Accuracy: A model can generate text that sounds plausible but is factually wrong. This is the core challenge of LLM evaluation — outputs can be fluent and confident while being incorrect.

Quick Reference — Key Formulas

ROUGE-1 Recall    = unigram matches / reference unigrams
ROUGE-1 Precision = unigram matches / generated unigrams
ROUGE-1 F1        = 2 × (Recall × Precision) / (Recall + Precision)

ROUGE-L           = uses LCS length in place of unigram matches

BLEU              = brevity penalty × geometric mean of n-gram precisions for n = 1 to 4

LoRA param reduction example (rank=8, W=512×64):
  Original W: 32,768 params
  LoRA A+B:    4,608 params  →  86% reduction

Quantisation memory (7B parameter model):
  FP32:  7B × 4 bytes = 28 GB
  FP16:  7B × 2 bytes = 14 GB
  Int8:  7B × 1 byte  =  7 GB
  Int4:  7B × 0.5 bytes =  3.5 GB

Key Distinctions to Remember

Comparison | Left | Right
Fine-tuning vs Prompt Tuning | Updates model weights using labelled data | Trains soft prompt vectors only; model weights stay frozen
ROUGE vs BLEU | Recall-based. How much of the reference is captured. Use for summarisation. | Precision-based. How much of the generated text appears in the reference. Use for translation.
LoRA vs Full Fine-tuning | 86% fewer trainable parameters, same inference latency, task adapters swappable | All weights updated, full model copies per task, higher memory cost
Training params vs Inference params | Learned weights and biases; not changed at inference | Temperature, top-p, top-k; set at runtime, not learned
RLHF vs Constitutional AI | Uses human feedback to train reward model | Uses AI-generated feedback guided by written principles; cheaper and more scalable

The Core Lesson

"Full fine-tuning is rarely the right first move. Start with prompt engineering: no training cost and immediate feedback. If the model needs task-specific improvement, LoRA trains only a small fraction of the parameters (86% fewer in the worked example above) with minimal performance loss and no added inference latency. The goal is the smallest intervention that achieves the required behaviour, because smaller interventions are cheaper, faster to iterate, and preserve what the model already knows."

Tags: GenAI LLM Fine-tuning LoRA PEFT ROUGE BLEU AWS DeepLearning.AI

Ankush Panday

Specializing in highly scalable AWS infrastructure and automated quality engineering.
