Crash Course

Technical AI Literacy

A 15-module deep dive into how AI systems actually work —
written for professionals who want genuine technical understanding.

15 modules~140–180 min readNo prior technical knowledge required

Each module stands on its own — you can read end-to-end or jump directly to the topics most relevant to your work. The course assumes no prior technical background: concepts are introduced from first principles and built up progressively. Modules 1–7 cover how AI systems are built and used — from language models and retrieval to prompting, context engineering, and agents; Modules 8–12 cover the infrastructure, economics, privacy, safety, and governance surrounding them; Modules 13–15 cover evaluation, fine-tuning, and LLMops.

Module 01·~30 min·All levels

What is an LLM?

Tokens, training, and how language models generate text

Tokens, not words

LLMs don't process text the way humans read it. They operate on tokens — fragments of text that can be whole words, parts of words, punctuation, or spaces. The word tokenisation splits into roughly two tokens: 'token' and 'isation'. A typical page of prose contains 500–700 tokens. This matters because LLMs have a fixed context window — the amount of text they can process at once — and that limit is measured in tokens, not words or pages.

The standard tokenisation method is Byte-Pair Encoding (BPE): starting from individual bytes or characters, the algorithm repeatedly merges the most frequently co-occurring adjacent pairs until it reaches a target vocabulary size — typically 32,000 to 128,000 tokens. Common English words like 'the' or 'and' become single tokens; rarer words and technical terms fragment into multiple subword units. This is why LLMs perform better on common English than on unusual proper nouns, non-Latin scripts, or domain-specific jargon that appeared rarely in training data — those strings consume more tokens per unit of meaning.

Tokenisation · byte-pair encoding

"The quick brown fox"

The|·quick|·brown|·fox

"tokenisation"

token|isation

"ChatGPT-4o"

Chat|G|PT|-|4|o

Byte-pair encoding iteratively merges the most frequent character pairs to build a subword vocabulary. The interpunct ( · ) marks a preceding space; the rightmost column counts tokens.

Training: learning from vast text

Before an LLM can do anything useful, it undergoes pre-training: processing hundreds of billions of tokens of text — books, websites, code, scientific papers — and learning to predict what comes next. This isn't memorisation; it's pattern recognition at extreme scale. The model adjusts billions of internal numerical parameters (called weights) until it becomes highly reliable at predicting the next token given all previous ones. This process requires enormous compute and typically takes weeks to months of continuous GPU time.

After pre-training, models are usually fine-tuned — trained further on curated, higher-quality examples to follow instructions, be helpful, and avoid harmful outputs. This second stage is where much of what makes a model feel aligned with human expectations comes from.

The Transformer architecture

Almost every modern LLM is built on the Transformer architecture, introduced in a landmark 2017 Google paper ('Attention Is All You Need'). The key innovation is the self-attention mechanism: rather than processing text sequentially left-to-right, every token can attend to every other token simultaneously, regardless of distance.

Attention works by computing three representations for each token: a Query (what this token is looking for), a Key (what this token offers to others), and a Value (what this token contributes when attended to). Attention scores are computed by taking the dot product of each token's Query against every other token's Key, then normalising with softmax so scores sum to 1. Each token's output representation is then a weighted sum of all Values — effectively asking: given what I am looking for, how much should I attend to each other token? This happens in parallel across all tokens simultaneously, which is why Transformers train far faster than the sequential RNN architectures they replaced.

Modern LLMs stack many Transformer layers (32–128 for frontier models), with each layer running multiple attention patterns in parallel (multi-head attention — typically 32 to 128 heads). Earlier layers tend to capture syntactic relationships; later layers encode higher-level semantic concepts. Between attention sub-layers sit feed-forward networks that apply learned non-linear transformations, giving the model its representational depth.

Self-attention · "The cat sat on the mat"

query / key	The	cat	sat	mat
The
cat
sat
mat

Each row shows how strongly a token attends to every other token; darker shading indicates higher weight. Rows sum to one — the model distributes a fixed budget of attention. In practice this runs across thousands of tokens through many parallel heads.

Positional encoding

The attention mechanism has a subtle limitation: it is position-agnostic. When a token computes its Query-Key dot products, it attends based on semantic similarity — not on where in the sequence the other token appears. Left uncorrected, 'The dog bit the man' and 'The man bit the dog' would produce identical attention patterns — same tokens, different meaning.

Positional encodings solve this by adding a position-dependent signal to each token's embedding before it enters the Transformer layers. The original 2017 Transformer used fixed sinusoidal encodings — mathematical functions of position that produce a unique pattern for each sequence slot. Modern LLMs almost universally use RoPE (Rotary Position Embedding), which encodes position by rotating the Query and Key vectors before the dot product. This makes attention scores naturally decay with distance, and crucially enables context window extension: by adjusting the rotation frequency at inference time, models can generalise to longer contexts than they were trained on — one mechanism behind how providers extend 8K base models to 128K or beyond.

Mixture of Experts

Most frontier models today are not dense Transformers — they use a Mixture of Experts (MoE) architecture. In a dense Transformer, every parameter activates for every token. MoE replaces the feed-forward sub-layer in each Transformer block with a set of specialised 'expert' networks — typically 8 or 16 — and a learned router. For each token, the router selects a small subset of experts (usually 2) to process it; the remaining experts do not activate and incur no compute cost.

This decouples total parameter count from active parameter count. Mixtral 8x7B has 46.7 billion total parameters but only 12.9 billion active per token — delivering quality comparable to a ~46B dense model at the inference cost of a ~13B one. GPT-4 is widely understood to use MoE, as do Gemini and most other frontier models. The trade-off: MoE models require more total VRAM to hold all experts in memory simultaneously, and training requires careful load-balancing to prevent certain experts from being underused. For inference throughput, the active-parameter advantage is substantial.

Inference: generating one token at a time

When you send a message to an LLM, it doesn't retrieve a pre-written answer. It generates a response one token at a time, each token being a probabilistic selection based on everything that came before. This is why outputs vary between identical prompts, why models can be confidently wrong (the next token is probable, not necessarily true), and why longer responses take longer to generate — the model produces each token sequentially.

Mechanically: the model passes the full context through all its layers and produces a logit (a raw score) for every token in its vocabulary — typically 50,000 to 128,000 entries. These logits are converted to probabilities via a softmax function. One token is sampled, appended to the context, and the entire process repeats. At temperature 0, the model greedily selects the highest-probability token every time, producing deterministic output. At higher temperatures, logits are divided by T before softmax, flattening the distribution so lower-ranked tokens get a meaningful share of the probability mass. Top-p sampling (nucleus sampling) further trims this to the smallest set of tokens whose cumulative probability exceeds a threshold p — cutting off the long tail of improbable vocabulary entirely.

Next-token distribution · "After a long run, she felt ___"

31%probability the next token is happy

050%100%

happy

31%

glad

19%

pleased

14%

great

11%

good

(all others)

16%

The model scores every token in its vocabulary (~50k–128k tokens). At temperature 0 the highest-probability token always wins; higher temperature flattens the distribution so lower-ranked tokens are more likely to be sampled.

Context windows

The context window is the total amount of text an LLM can see at once — your conversation history, any documents provided, system instructions, and the model's own previous responses. Modern frontier models support windows from 128,000 to over one million tokens. When content exceeds this window, earlier context is dropped. This is a real engineering constraint in enterprise applications that need to reason over long documents or extended conversations.

Parameters: what the numbers actually mean

When a model is described as having '7 billion parameters' or '405 billion parameters', those numbers refer to its weights — individual numerical values (typically stored as 16-bit floats) that encode everything the model learned during training. A 7B model stores roughly 14 GB of data at 2 bytes per parameter; a 70B model requires ~140 GB; a 405B model exceeds 800 GB. This is why running large models locally requires substantial GPU memory: the full weight set must fit in VRAM before any inference can begin.

Scaling laws (established by DeepMind's Chinchilla paper, 2022) showed that optimal model performance depends jointly on parameter count and training token volume. Doubling parameters without also scaling training data yields diminishing returns. This insight drove the shift toward training smaller models on vastly more tokens — producing models like Llama 3 8B that significantly outperform earlier 30B models trained on less data. Parameter count is a proxy for capacity, not a direct measure of capability.

Decoding strategies

Generation parameters shape the probability distribution at each step; a separate choice — the decoding strategy — decides how to turn that sequence of distributions into actual text. The same model with the same parameters can produce very different output depending on how the strategy searches the token space.

Greedy decoding picks the single highest-probability token at every step. It is fast and deterministic but myopic: by optimising each token locally rather than the sequence as a whole, it slides into repetitive, sometimes degenerate text. Sampling instead draws from the distribution — modulated by temperature, top-p, and top-k — trading determinism for diversity. Beam search keeps the top-k partial sequences (the 'beams') alive at each step, expands all of them, and retains the highest-probability prefixes; this approximates maximising the probability of the whole sequence rather than just the next token, which is why it dominates tasks like machine translation where a single near-correct output matters more than creativity. It is rarely used for open-ended chat, where it tends to produce bland, repetitive text at higher compute cost.

Contrastive search is a newer method that penalises candidate tokens too similar to what has already been generated, balancing the model's confidence against a degeneration penalty. It avoids the repetition loops long generations fall into while keeping coherence high. Related refinements adjust the cutoff dynamically — min-p sampling tightens the candidate pool when the model is confident about the top token and loosens it when it is not — or aggregate signals across layers rather than reading only the final layer's logits, which can improve factual grounding at no extra training cost.

Decoding strategies · searching the token space

Greedy

Pick the single highest-probability token at every step

Deterministic, fast — but myopic and prone to repetition

Sampling

Sample from the distribution, shaped by temperature / top-p / top-k

Diverse and creative; the parameters in Module 5 control this

Beam search

Keep the top-k partial sequences alive, expand all, retain the best prefixes

Approximates whole-sequence probability; standard in translation

Contrastive search

Penalise candidates too similar to text already generated

Avoids degeneration loops on long outputs while staying coherent

Generation parameters shape the probability distribution; the decoding strategy decides how to traverse it. The same distribution produces very different text depending on the strategy.

Key takeaways

LLMs predict the next token — they don't understand text the way humans do
Decoding strategy is distinct from generation parameters — greedy, sampling, beam search, and contrastive search traverse the same distribution very differently
Pre-training is expensive and happens once; inference is cheaper and happens with every request
The Transformer's attention mechanism enables coherent reasoning across long texts
Context windows define how much the model can consider at once — content beyond the limit is not seen
Outputs are probabilistic — the model is generating likely text, not retrieving verified facts

Further reading

Module 02·~15 min·Practitioner

The AI Model Landscape

Frontier models, open source, the major labs, and how to evaluate capability claims

Frontier models vs. the rest

Frontier models are the most capable models available at any given time — the ones actively pushing the boundary of what AI can do. As of 2025, these include GPT-4o (OpenAI), Claude 3.5 and 3.7 Sonnet (Anthropic), Gemini 2.0 and 2.5 (Google DeepMind), and Llama 3 (Meta). Below the frontier sit smaller, faster, cheaper models — often distilled or fine-tuned variants — that handle most enterprise tasks at a fraction of the cost. Choosing between frontier and sub-frontier is primarily an economics and capability trade-off, not a prestige decision.

Open-source vs. closed and proprietary

Closed models — GPT-4o, Claude, Gemini — are accessed only via API. You cannot inspect or modify the underlying weights. Open models — Llama 3, Mistral, Gemma — release their weights publicly, allowing anyone to run, inspect, and fine-tune them on their own infrastructure. Open models offer data sovereignty and cost control at the expense of setup complexity. Closed models are typically more capable at the frontier but create vendor dependency and offer less visibility into how the model behaves internally.

Running models locally

Open-source models can run entirely on your own hardware — no API calls, no data leaving your machine, no per-token cost after setup. Ollama is the simplest entry point: a command-line tool that downloads quantised models and runs them with a single command. It exposes an OpenAI-compatible API endpoint, making it a drop-in local replacement for cloud APIs during development. LM Studio provides a desktop GUI for browsing, downloading, and running models — suited to users who prefer not to use the command line. llama.cpp is the underlying C++ inference engine powering most local tools; it supports GGUF-quantised models and can fall back to CPU when GPU VRAM is insufficient.

The practical constraint is hardware. A 7B model at INT4 quantisation requires roughly 4–5 GB of VRAM or unified memory; a 13B model requires ~8 GB; a 70B model at INT4 requires ~40 GB. Modern MacBooks with Apple Silicon (M2/M3/M4) run 7B and 13B models well via their unified memory architecture. Running locally makes most sense for: development and testing without API latency or cost, privacy-sensitive workflows where data cannot leave the machine, high-volume inference where annual API spend would exceed hardware costs, and exploring model behaviour with unrestricted access to generation parameters.

The major labs

OpenAI introduced GPT and ChatGPT, which has roughly 500 million users and the strongest brand recognition in the space. Anthropic, founded by former OpenAI researchers, builds the Claude series with a safety-first approach and strong performance on reasoning and long-document tasks. Google DeepMind produces the Gemini series, with particular strength in multimodality and deep integration across Google's product ecosystem. Meta AI releases the Llama series fully open-source, making it the most widely used base for fine-tuning and research. Mistral is a European lab producing highly efficient models with a strong open-source presence and growing enterprise adoption.

Fine-tuned and specialised models

Base pre-trained models are rarely used directly in products. Most AI applications use fine-tuned variants — models further trained on domain-specific data for coding (GitHub Copilot, Cursor), legal analysis (Harvey), medicine (Med-PaLM), or specific enterprise workflows. When a vendor says they use a customised AI model, this is usually what they mean: a general base model adapted for a specific task.

Benchmarks and capability claims

AI benchmarks — MMLU, HumanEval, GPQA, MATH — measure performance on specific tasks: professional exam questions, coding problems, graduate-level reasoning. They are useful reference points but imperfect guides. Models can be specifically optimised to perform on popular benchmarks without being broadly better. When evaluating a model for your organisation's use case, treat benchmarks as a starting filter, then test against your actual tasks and data.

Key takeaways

Frontier models lead capability; smaller models are often sufficient and significantly cheaper
Open-source models offer flexibility and data control; closed models offer ease of access and top-tier capability
Each major lab has distinct strengths — Anthropic for safety and long context, Google for multimodality, Meta for open research
Fine-tuning is how general models become specialised products
Treat benchmark claims as a starting point, not a final verdict — test on your own use case

Further reading

Module 03·~15 min·Technical

AI System Architecture

From foundation model to user-facing product — the full technical stack

The stack: model to user

Most people interact with AI through a product interface — a chatbot, an embedded assistant, a search tool. What they see represents only the top layer. Beneath it is a stack: the foundation model (the LLM itself), an orchestration layer (code that manages conversations, tools, and context), a data layer (documents and knowledge the model can access), and the application layer (the interface users see). Understanding this stack helps you diagnose problems, evaluate vendor products, and make better architectural decisions.

Enterprise AI stack · surface to substrate

User Interface

Chat UI, embedded widget, API client

Application Layer

Auth, routing, session management, logging

Orchestration Layer

Prompt assembly, tool dispatch, RAG retrieval, context management

Foundation Model (LLM)

GPT-4o, Claude, Gemini — accessed via API

Data Layer

Vector DB, document store, structured databases, knowledge base

Users only see the top layer. Problems that present as "the AI is wrong" usually originate in the orchestration or data layers, not the model itself.

Embeddings

Embeddings are numerical representations of text — vectors of hundreds or thousands of numbers that encode semantic meaning. Two sentences with similar meanings will have similar embeddings, even if they use completely different words. This is the foundation of semantic search and AI retrieval systems. When an AI product searches a knowledge base, it is almost certainly comparing embeddings, not doing keyword matching. This is why you can search for employee holiday entitlement and retrieve a document about annual leave policy.

A typical embedding model produces a vector of 1,536 dimensions (OpenAI's text-embedding-3-small) or 3,072 dimensions (text-embedding-3-large). Similarity is measured using cosine similarity: the cosine of the angle between two vectors. Identical meaning → angle near 0° → cosine similarity near 1.0. Unrelated meaning → angle near 90° → cosine similarity near 0. Finding the most similar vectors across millions of stored embeddings uses approximate nearest-neighbour indexing algorithms — HNSW (Hierarchical Navigable Small World graphs) is the most widely deployed — enabling millisecond search across document stores that would otherwise take seconds with brute-force comparison.

Vector databases

Vector databases — Pinecone, Weaviate, pgvector — are purpose-built for storing and querying embeddings at scale. They can find semantically similar content across millions of documents in milliseconds. Any enterprise AI system that needs to search large bodies of knowledge will include a vector database or equivalent. Understanding this layer also explains why AI knowledge bases have a delay when new content is added — documents must be embedded and indexed before they become searchable.

Retrieval-Augmented Generation (RAG)

RAG is the most important architectural pattern in enterprise AI. The problem it solves: you cannot fit all your organisation's knowledge into a model's context window, and fine-tuning the model on all your data is prohibitively expensive and creates a static snapshot. RAG retrieves the most relevant documents at query time and injects them into the prompt. The model then generates a response grounded in that retrieved content. Most enterprise AI assistants — knowledge bases, document Q&A tools, customer support bots — use RAG.

RAG quality depends heavily on engineering decisions beyond the basic pipeline. Chunking strategy — how documents are split before embedding — is critical: chunks too large dilute retrieval precision; chunks too small lose surrounding context. Typical chunk sizes are 256–1,024 tokens with 10–20% overlap between adjacent chunks to avoid splitting relevant content across boundaries. Retrieval precision can be improved with re-ranking: after the initial vector search returns top-k candidates, a cross-encoder model scores each chunk against the specific query and reorders results by true relevance — more accurate than vector similarity alone but slower to compute.

Retrieval-augmented generation · pipeline

01
User query
"What is our refund policy?"
02
Embed query
Numerical representation of meaning
03
Vector search
Cosine similarity over millions of chunks
04
Retrieve top-k
Three to ten most relevant passages
05
Augment prompt
Query and retrieved context injected
06
LLM generates
Response grounded in retrieved content

RAG keeps knowledge external and updatable. Adding a new document only requires embedding and indexing it — no model retraining.

APIs and function calling

Models are almost always accessed via API — a standard interface for sending text in and receiving text or structured data back. Modern LLM APIs support function calling: the model can request that your code execute a specific function and return the result. This is how AI assistants can search the web, query your database, check calendar availability, or look up a customer record mid-conversation. The API layer is where model capability meets your data and systems.

Key takeaways

Enterprise AI is a multi-layer stack, not a single model
Embeddings enable semantic search — finding meaning, not just matching keywords
Vector databases make large-scale semantic retrieval fast and practical
RAG is how AI products access organisational knowledge without fine-tuning
Function calling via API is what allows AI to take actions, not just generate text

Further reading

Module 04·~25 min·Engineer

Advanced RAG

Chunking, retrieval architectures, and the path from naive RAG to agentic memory

Chunking: the first and most consequential decision

Module 3 introduced RAG as retrieve-then-generate. In practice, RAG quality is decided long before generation — at chunking, the step that splits documents into the units you embed and retrieve. Get it wrong and no amount of model quality recovers it: an oversized chunk dilutes retrieval precision and wastes context, while an undersized chunk severs an idea across boundaries so the relevant passage is never retrieved whole.

Five strategies span the trade-off. Fixed-size splits into uniform token windows with overlap — trivial to implement but blind to meaning. Semantic chunking merges adjacent segments while their embeddings stay similar and breaks when similarity drops, preserving complete ideas at the cost of a tunable threshold. Recursive chunking splits on natural separators (sections, paragraphs) and only sub-splits chunks that exceed a size limit. Document-structure chunking uses headings and tables as boundaries, keeping logical integrity where the document is cleanly structured. LLM-based chunking prompts a model to emit semantically isolated chunks — the highest fidelity and the highest cost.

Chunking strategies · the first RAG decision

Fixed-size

Uniform character / token windows with overlap

Simple and batch-friendly; breaks sentences and ideas mid-stream

Semantic

Merge adjacent segments while embedding similarity stays high; break when it drops

Preserves whole ideas; depends on a tunable similarity threshold

Recursive

Split on natural separators, then sub-split any chunk over the size limit

Structure plus size control; more implementation overhead

Document-structure

Use headings, sections, and tables as boundaries

Logical integrity; assumes clean structure, yields uneven sizes

LLM-based

Prompt a model to emit semantically isolated chunks

Highest fidelity; highest cost, bounded by context window

Chunk too large and retrieval precision drops; too small and ideas are severed across boundaries. Semantic chunking is a strong default, but the right choice depends on your content — test it.

Eight retrieval architectures

Naive RAG — embed the query, retrieve by vector similarity, generate once — is only the simplest point in a large design space. The pattern you choose should follow the shape of your queries and data.

Multimodal RAG embeds and retrieves across text, images, and audio. HyDE generates a hypothetical answer and retrieves against its embedding (below). Corrective RAG validates retrieved documents against a trusted source before generating. Graph RAG builds a knowledge graph so the model can reason over entities and relationships, and Hybrid RAG fuses graph retrieval with dense vectors. Adaptive RAG decides per query whether a single retrieval suffices or a multi-step reasoning chain is needed. Agentic RAG (Module 7) puts an agent in charge of planning, routing, and iterating retrieval. These combine freely — an agentic pipeline might use HyDE retrieval with a corrective validation step.

Retrieval architectures · eight patterns

Naive RAG

Retrieve by vector similarity, generate once. Best for simple fact lookup.

Multimodal RAG

Embed and retrieve across text, images, and audio together.

HyDE

Generate a hypothetical answer first, then retrieve against its embedding.

Corrective RAG

Validate retrieved docs against a trusted source (e.g. web) before generating.

Graph RAG

Build a knowledge graph to capture entities and relationships for reasoning.

Hybrid RAG

Combine dense vector retrieval with graph-based retrieval in one pipeline.

Adaptive RAG

Decide per query: direct retrieval vs a multi-step reasoning chain.

Agentic RAG

An agent plans, routes, and iterates retrieval across multiple sources.

These are not mutually exclusive — production systems combine them (e.g. an agentic pipeline using HyDE retrieval with a corrective validation step). Choice is a design decision driven by query type.

HyDE: bridging the question–answer gap

A subtle flaw undermines naive retrieval: a question is rarely semantically similar to its answer. 'What is our refund policy?' and the paragraph that states the policy share little surface vocabulary, so pure query-embedding search often surfaces text that resembles the question rather than the passage that answers it.

HyDE (Hypothetical Document Embeddings) inverts the problem. An LLM first drafts a hypothetical answer to the query — which need not be factually correct — and that answer is embedded instead of the question. Because a plausible fake answer sits much closer in embedding space to real answers than the question does, retrieval improves sharply. A contrastively-trained encoder (a contriever) does the embedding and acts as a near-lossless filter, discarding the hallucinated specifics while keeping the relevant shape. The cost is an extra LLM call and added latency.

Compressing and caching context: REFRAG and CAG

Most of what classic RAG retrieves never helps the model — yet you pay to encode, transmit, and attend over every retrieved token. Two recent methods attack that waste from opposite directions. REFRAG (Meta) compresses each chunk into a single embedding rather than hundreds of token embeddings, uses a lightweight RL-trained policy to keep only the chunks that matter, and expands just those back to full token representations before the model sees them — reporting roughly 30x faster time-to-first-token and around 16x larger effective context with no accuracy loss.

Cache-Augmented Generation (CAG) targets repeated retrieval of stable knowledge. It splits your corpus into 'cold' data that rarely changes (policies, reference guides) and 'hot' data that updates constantly. The cold layer is preloaded once into the model's key-value cache so it is never re-fetched or re-encoded; only the hot layer is retrieved at query time. Prompt caching in the OpenAI and Anthropic APIs is the productised form of this idea. The discipline is selectivity — cache only stable, high-value knowledge, or you exhaust the context budget.

From RAG to agentic RAG to memory

RAG was a milestone, not a destination. Classic RAG (2020–2023) retrieves once and generates once — read-only, with no ability to recognise that the retrieved context was insufficient. Agentic RAG adds judgement: an agent decides whether retrieval is needed at all, which source to query, and whether the result actually answers the question, looping and rewriting until it can answer or concedes it cannot. It is more robust, but still read-only.

The frontier is AI memory: agents that read and write external knowledge, persisting user preferences, facts, and summaries of past conversations for future sessions. This turns a frozen model into a system that accumulates knowledge from every interaction and improves without retraining. Crucially, memory is not a property of the model — it is a system-design problem of deciding what to keep, what to discard, and what to retrieve before each call.

From retrieval to memory · the trajectory

RAG

2020–2023

Read-only · one-shot

Retrieve once, generate once. No decision-making — and it often retrieves irrelevant context.

Agentic RAG

Read-only · via tool calls

An agent decides if retrieval is needed, which source to query, and whether the result is good enough — looping until it can answer.

AI Memory

Read–write · via tool calls

The agent reads and writes external knowledge — persisting preferences, facts, and past conversations — enabling continual learning between sessions.

The shift from read-only retrieval to read–write memory is what turns a static model into an adaptive system. Memory is a system-design problem, not a property of the model itself.

Key takeaways

Chunking is the highest-leverage RAG decision — bad retrieval cannot be rescued by a better model
There is no single RAG architecture: match the pattern (HyDE, corrective, graph, adaptive, agentic) to your query and data shape
HyDE fixes the question–answer similarity gap by retrieving against a generated hypothetical answer
REFRAG and CAG cut the cost of retrieved context — by compressing and filtering it, and by caching stable knowledge in KV memory
The trajectory runs RAG → agentic RAG → memory: from read-only one-shot retrieval to read–write continual learning

Further reading

Module 05·~20 min·All levels

How Prompting Actually Works

The mechanics behind prompt design — beyond tips and tricks

System prompts vs. user prompts

Every LLM application has two types of input: the system prompt and the user prompt. The system prompt is written by the developer, sets the model's persona, defines its task and constraints, and the end user typically never sees it. The user prompt is what the user actually types. The system prompt is extraordinarily powerful — it frames everything that follows. A customer service chatbot that feels tuned for a specific company is mostly a well-crafted system prompt layered on top of a general model. When you use a company's AI assistant, you are interacting with their system prompt as much as with the underlying model.

Temperature and sampling

When the model selects the next token, it does not always choose the most probable option. Temperature is a parameter that controls randomness: at 0, the model always selects the highest-probability token (deterministic, consistent, sometimes repetitive); at 1, it samples more freely across possibilities (creative, varied, sometimes unpredictable). Most production applications set temperature between 0.2 and 0.7. Use lower temperature for factual extraction and structured outputs; higher temperature for creative generation and brainstorming.

Mechanically, temperature T is applied by dividing the model's raw logits by T before the softmax step. When T < 1, logit differences are amplified: the highest-scoring token becomes even more dominant. When T > 1, differences are compressed: probabilities spread across more candidates. Top-p sampling (nucleus sampling) works alongside temperature by restricting selection to the smallest group of tokens whose cumulative probability mass exceeds threshold p — at p = 0.95, the lowest-probability long tail is excluded entirely, preventing very improbable tokens from ever being selected even at high temperatures.

Sampling temperature · same prompt, same model

T = 0.2concentrated, near-deterministic

050%100%

happy

glad

pleased

great

good

T = 1.2flattened, more creative

050%100%

happy

glad

pleased

great

good

Temperature divides logits before softmax. Lower values amplify differences between scores so one token dominates; higher values compress them, spreading probability across more candidates.

Other generation parameters

Temperature and top-p are the most commonly tuned parameters, but production LLM applications regularly use several others. top_k restricts sampling to the k most probable tokens at each step (typically 40–50), providing a simpler alternative to top-p. repetition_penalty applies a multiplicative discount to tokens that have already appeared in the output — reducing the looping and repetitive phrasing that long generations tend to produce. frequency_penalty and presence_penalty (OpenAI's terminology) are similar but distinguish between tokens that appeared frequently (frequency_penalty reduces their probability proportionally to count) versus tokens that appeared at all (presence_penalty applies a fixed one-time penalty).

stop_sequences are strings or tokens that halt generation immediately when produced — essential for structured outputs where generation must terminate at a defined delimiter. max_tokens caps output length and is a critical cost-control parameter: without it, an unusually verbose response can multiply API costs unexpectedly. seed fixes the random state for deterministic outputs — at temperature 0 with a fixed seed, responses should be identical across calls, enabling reliable automated testing. Understanding these parameters collectively gives you precise control over the consistency, style, length, and cost of model outputs.

Context window management in production

Every token counts against the context window: the system prompt, the full conversation history, retrieved documents, the current message, and the model's own response. In long conversations or document-heavy applications, the window fills. When it does, either earlier content is truncated (the model effectively forgets it), or the application must summarise and compress prior context. This is a genuine engineering challenge — most users never notice because well-designed applications handle it invisibly, but it is a core constraint shaping every enterprise AI deployment.

Why prompts fail

Hallucination occurs when the model generates plausible but incorrect information. This is a fundamental property of the architecture: the model selects probable tokens, not verified facts. It cannot reliably distinguish between what it knows and what it infers. This cannot be fully eliminated with better prompting — it can be mitigated with retrieval grounding (RAG), output verification, and human review processes.

Instruction complexity degrades reliability. Models follow one clear instruction better than five ambiguous ones. Complex multi-step prompts produce less consistent outputs than decomposed, sequential ones. Prompt injection is a distinct security concern: malicious instructions embedded in user inputs or retrieved documents can override system prompts and alter model behaviour — a real risk in agentic and document-processing systems.

Few-shot prompting

Zero-shot prompting gives the model a task with no examples. Few-shot prompting includes two to five examples of the desired input-output pattern before the actual task. Few-shot consistently improves performance on structured, domain-specific, or format-sensitive tasks because it demonstrates the expected reasoning style and output format — reducing the model's uncertainty about what a good response looks like in your specific context.

Chain-of-thought and extended thinking

Chain-of-thought (CoT) prompting elicits step-by-step reasoning by providing examples that demonstrate intermediate reasoning steps, or simply appending 'Let's think step by step' to the prompt. This works because of the auto-regressive nature of LLMs: generating intermediate reasoning tokens makes subsequent correct tokens more probable — the model talks itself through the problem before committing to a final answer. On multi-step reasoning benchmarks, CoT prompting improves accuracy by 20–40 percentage points, but only on models above roughly 10 billion parameters — suggesting that explicit reasoning is an emergent capability that appears at scale.

Extended thinking — implemented in models like Claude 3.7 Sonnet (thinking mode) and OpenAI o3 — is a systematic version of this principle. These models generate extended internal reasoning (sometimes thousands of tokens of scratchpad) before producing a visible response. This shifts compute from training time to inference time: harder problems get more tokens of reasoning. The trade-off is latency and cost — more thinking tokens means slower and more expensive responses — but for complex reasoning tasks the accuracy gains are substantial.

Context engineering

Prompt engineering focuses on the instructions you give the model. Context engineering is the broader discipline of deciding what goes into the context window — and in what form. As windows expand toward one million tokens, the question of what to include becomes as important as how to phrase the instruction. Blindly filling the context with everything potentially relevant degrades performance: models exhibit a 'lost in the middle' failure mode where relevant information buried in a long context is less reliably used than information near the beginning or end.

Context engineering decisions include: conversation history management (how many prior turns to include; when to summarise rather than truncate); document injection strategy (full document vs extracted passages vs summaries); structured vs prose context (tables and JSON can be more token-efficient than natural language for certain data types); and context ordering (placing the most critical content at the start or end rather than the middle). In agentic systems, it also covers what intermediate results to retain across steps and what to discard to prevent context overflow in long-running tasks. As context windows grow, context engineering is increasingly the dominant skill in applied LLM work.

Key takeaways

System prompts define model behaviour — they are as important as the model itself
Temperature controls creativity vs. consistency — tune it for your task type
Context windows fill in production — truncation and compression are real engineering challenges
Hallucination is architectural; mitigate it with retrieval grounding and verification, not prompting alone
Few-shot examples consistently improve output quality for structured and domain-specific tasks

Further reading

Module 06·~20 min·Engineer

Context Engineering

Engineering what the model sees — the discipline beyond prompt wording

Beyond prompt wording

The previous module introduced context engineering as a single idea; it has grown into a discipline of its own. Prompt engineering optimises the wording of an instruction. Context engineering optimises what information enters the window in the first place — and in what form. As context windows expand toward a million tokens, this becomes the dominant lever on quality.

The reason is where systems actually fail. Most LLM applications fail not because the model is weak but because it lacks the right context to succeed. A RAG workflow is roughly 80% retrieval and 20% generation: good retrieval can carry a mediocre model, but bad retrieval defeats the best one. The mental model is simple — if the LLM is the CPU, the context window is its RAM, and context engineering is how you program that RAM with the right information, in the right format, at the right time.

The six types of context

A production agent needs far more than an instruction. Six distinct types of context define its reasoning: instructions (who it is, why it is acting, how it should behave), examples (demonstrations of good, bad, and anti-pattern outputs — models learn patterns more reliably than stated rules), knowledge (domain data, business processes, APIs), memory (short-term reasoning and chat history; long-term facts and preferences), tools (capabilities it can invoke, each with typed parameters), and tool results (outputs fed back so it can self-correct).

Treating these as a multi-dimensional design layer, rather than a line in a prompt, is what separates robust agents from brittle ones. A weak model can succeed when all six are well-supplied; a frontier model cannot compensate for context that is incomplete.

The six types of context

Instructions

Who the agent is, why it is acting, and how it should behave

Examples

Demonstrations of good, bad, and anti-pattern outputs — models learn patterns better than rules

Knowledge

Domain data, business processes, APIs, and data models

Memory

Short-term reasoning and chat history; long-term facts and preferences

Tools

Capabilities the agent can invoke, each with typed parameters

Tool results

Outputs fed back into context to enable self-correction

Advanced agent architectures treat context as a multi-dimensional design layer, not a line in a prompt. A weak model can succeed with the right context; a frontier model cannot make up for incomplete context.

Four operations on context

Managing context across a long task reduces to four operations. Writing saves context outside the window — to long-term memory or a state object — so it survives beyond the current turn. Selecting (reading) pulls the right context back in from tools, memory, or a knowledge base when the task needs it. Compressing keeps only the tokens a task requires: multi-turn tool calls accumulate duplicate and redundant content quickly, and summarisation reins in both cost and the 'lost in the middle' failure mode. Isolating splits context across sub-agents, sandboxes, or state objects so that no single window overflows and unrelated work does not interfere.

Four operations on context

Write

Save context outside the window — long-term memory, session state — so it survives and can be recalled

Select

Pull the right context in from tools, memory, or a knowledge base when the task needs it

Compress

Keep only the tokens a task needs — summarise and de-duplicate; multi-turn tool calls bloat context fast

Isolate

Split context across sub-agents, sandboxes, or state objects to prevent overflow and cross-talk

If the LLM is the CPU, the context window is the RAM — you are programming that RAM with the right information, in the right format, at the right time.

Context as a pipeline, not a prompt

Real-world context retrieval is an engineering system, not a weekend RAG script. Consider a query like 'what is blocking the Chicago office project, and when is our next meeting about it?' — answering it means searching a project tracker, a calendar, email, and chat at once. No single vector store handles that.

Production context systems are built in three layers. An ingestion layer connects to many sources, processes heterogeneous data types appropriately (email is not code is not a calendar entry), and detects changes to refresh embeddings incrementally. A retrieval layer expands vague queries, routes them to the right sources, layers semantic, keyword, and graph search, enforces per-user permissions, and weighs recent information more heavily than stale. A generation layer returns citation-backed answers. This is the architecture underneath enterprise products like Google's Vertex AI Search, Microsoft 365 Copilot, and Amazon Q Business.

Claude Skills: packaging procedural context

Skills are Anthropic's mechanism for giving an agent reusable, persistent abilities without overloading its window. The problem they solve: an LLM forgets everything unless its instructions, examples, and edge cases are restated each time. A Skill packages that procedure once — think of it as a standard operating procedure for the agent — so it can be reused indefinitely.

The scalability comes from progressive disclosure across three layers. The main context is always present (project configuration). Skill metadata is a tiny YAML descriptor of a few hundred tokens that the model uses only to decide whether a skill is relevant. The full skill body and its supporting files load on demand, and scripts or templates are fetched only when actually used, consuming zero tokens until then. This lets an agent carry hundreds of skills while keeping its active context lightweight. Skills complement rather than replace the rest of the stack: Projects organise the workspace, MCP connects tools, subagents handle delegated reasoning, and Skills package the reusable expertise all of them draw on.

Key takeaways

Prompt engineering optimises wording; context engineering optimises what enters the window — and it is becoming the dominant applied skill
Most LLM failures are context failures, not model failures — retrieval quality usually matters more than model choice
Six context types define an agent: instructions, examples, knowledge, memory, tools, and tool results
Four operations manage context over a task: write, select, compress, isolate
Production context retrieval is a multi-layer pipeline (ingestion, retrieval, generation), not a single vector store
Claude Skills package procedural knowledge and load it progressively, so an agent can hold hundreds of skills without exhausting context

Further reading

Module 07·~30 min·Technical

Agents & Agentic Systems

From single responses to AI that plans, acts, and iterates

What makes a system agentic

For the first years of the LLM era, most AI interactions were transactional: a user sends a message, the model replies, done. Agentic systems change this fundamentally. An agent is an LLM that can take actions, not just generate text. It operates in a loop: observe input, plan a response or action, execute that action (call a tool, search the web, write code, send a message), observe the result, and repeat — until the task is complete or the model determines it is done. The key distinction is that agentic systems persist and act across multiple steps rather than responding in a single shot.

The agentic loop · ReAct pattern

01
Observe
Receive task, context, previous results
02
Think
Reason about next action (chain-of-thought)
03
Act
Call a tool or produce output
04
Observe result
Read tool output, append to context

continues until the task is complete

Available tools

Web search·Code execution·Database query·Email send·File read/write·API call

The model never directly executes anything — it requests that your application code calls a tool and returns the result, which is appended to the context for the next step.

Five levels of agency

'Agent' covers a wide spectrum of autonomy, and conflating its ends causes most of the confusion about what agents can and cannot do. It is clearer to think in five levels, distinguished by how much control passes from the human to the model. At the first level, a basic responder, the human drives the entire flow and the LLM simply turns input into output. A router lets the model make a basic decision about which predefined path or function to take. Tool calling hands the model a set of tools and lets it decide which to invoke and with what arguments.

The higher levels are qualitatively different. In a multi-agent system, a manager agent coordinates sub-agents and decides the next step iteratively, within a hierarchy the human laid out. At the most advanced level, an autonomous agent generates and executes new code on its own, effectively acting as an independent developer. Each step up the ladder adds capability and removes a human checkpoint — which is exactly why robust systems deliberately operate at the lowest level that solves the problem.

Five levels of agency · who controls the flow

Basic responder

Human controls flow

The LLM receives input and produces output, with little control over the program flow.

Router

Human defines paths

The LLM makes basic decisions about which predefined path or function to take.

Tool calling

LLM decides tool + args

The human defines a tool set; the LLM decides when to call each and with what arguments.

Multi-agent

LLM controls execution

A manager agent coordinates sub-agents and decides next steps within a human-defined hierarchy.

Autonomous

LLM generates + runs code

The model writes and executes new code independently — acting as an autonomous developer.

Higher levels hand more control to the model. Each step up adds capability — and surface area for error — so production systems sit at the lowest level that solves the problem.

Tool use and function calling

Modern LLMs support tool use: you provide the model with a list of available tools — web search, database query, email send, code execution, calendar check — with descriptions of what each does. The model decides which tools to call, when to call them, and how to interpret their outputs. It then incorporates those results into its ongoing reasoning. This is how AI assistants move beyond conversation into genuine task completion.

Function calling works at the protocol level: tool definitions are included in the API request as a JSON schema specifying each tool's name, description, and parameter types. When the model determines it should use a tool, it returns structured JSON — rather than prose — specifying the function name and arguments. Your application code executes the function and returns the result, which is appended to the context for the model's next call. The model never directly executes anything: it only requests that your code does. This separation is important for security — it means you control exactly what actions an agent can take, and you can log, audit, and gate every tool invocation.

Agentic design patterns

Several repeating patterns have emerged in how capable agent systems are structured. ReAct (Reason + Act) is the most common: the model alternates between reasoning about its situation and taking an action, making its thinking visible and auditable in the trace. Reflection adds a second step where the model reviews its own output and iterates — effectively self-editing before returning a result. Planning agents decompose a complex task into a structured sequence of subtasks before executing any of them, reducing compounding errors on long-horizon work.

Parallelisation runs multiple agent calls simultaneously — useful when subtasks are independent (e.g. researching several topics at once) — then merges results. Specialisation splits work across purpose-built agents (a research agent, a writing agent, a code-review agent), each with a focused system prompt and limited tool set. An orchestrator agent routes tasks and synthesises results. These patterns are often combined: a planning orchestrator that fans out parallel specialised agents and then applies reflection before returning a final answer.

Reliability degrades quickly as agent chains grow longer. Each LLM call has some probability of error; errors compound multiplicatively across steps. This means the most robust production agents are either short-chained (two or three hops) or include explicit verification steps — another agent checking the output, a structured self-critique loop, or a deterministic validation function before the result is accepted.

The Model Context Protocol (MCP)

Until 2024, every agent framework had its own proprietary way of defining tools, connecting to external services, and passing results back to the model. This created fragmentation: a tool built for LangChain could not be used in AutoGen, and every new integration required custom glue code. The Model Context Protocol (MCP), introduced by Anthropic in late 2024, is an open standard that defines a uniform interface between AI models and the data sources or tools they connect to.

MCP works as a client–server protocol. An MCP server exposes a set of tools, resources, or prompts through a standardised JSON-RPC interface. An MCP client — which could be Claude Desktop, an IDE plugin, or a custom agent framework — discovers and calls those tools through the same protocol regardless of what the server is built on. This means a single MCP server wrapping, say, a company's CRM can be used by any MCP-compatible model or host without rewriting the integration. The ecosystem is growing rapidly: hundreds of community-built MCP servers now exist for databases, APIs, file systems, browser control, and SaaS platforms.

For developers, MCP is significant because it standardises the interface that has historically been the most painful part of building agents. Rather than writing bespoke function-calling schemas for every capability, you build or install an MCP server once and it becomes available to any compliant model. The protocol also defines resource exposure (streaming file contents, database rows) and prompt templates — making it broader than pure tool-calling. MCP is increasingly treated as infrastructure-level for agentic systems the way HTTP is for web services.

The agent protocol stack: MCP, A2A, AG-UI

MCP standardises one connection — agent to tools. Two further protocols complete the picture, and the three are converging into layers of a single stack rather than competing standards. Agent2Agent (A2A), introduced by Google and now stewarded by the Linux Foundation, standardises agent-to-agent collaboration: agents work together without sharing their internal memory, thoughts, or tools, instead exchanging context, task updates, and results. Each agent publishes a JSON 'Agent Card' describing its capabilities and authentication, so others can discover and delegate to the right agent — even one built on a different framework.

The missing third piece is agent-to-user. AG-UI (Agent-User Interaction Protocol), an open standard from CopilotKit, streams structured events from the backend agent to the frontend over Server-Sent Events — token-by-token output, tool-execution progress, shared-state deltas, and handoffs between agents — so an interface can show a running agent, accept mid-run interruptions, and swap the underlying model or framework without a rewrite. The three compose cleanly: tool outputs (MCP) and multi-agent collaboration (A2A) both flow up to the user through AG-UI. Together they are doing for agents what HTTP and REST did for the web.

The agent protocol stack · three layers, not three competitors

AG-UI

Agent ↔ User

Streams structured events (token output, tool progress, state deltas, agent handoff) to the frontend over SSE — so the UI is not locked to one agent framework.

A2A

Agent ↔ Agent

Lets agents collaborate without sharing internal memory or tools — they exchange context and task updates, and discover each other via published Agent Cards.

MCP

Agent ↔ Tools

The open standard for connecting agents to tools, data, and workflows. Started by Anthropic, now adopted across the ecosystem.

The three protocols compose: tool outputs (MCP) and multi-agent collaboration (A2A) can both flow up to the user interface (AG-UI). Together they are converging the fragmented agent ecosystem onto a shared stack.

Multi-agent systems

Complex tasks can be distributed across multiple specialised agents. A research agent gathers information, a writing agent drafts content, a critique agent reviews it, and an orchestrator agent coordinates the workflow. This pattern allows more reliable completion of long, multi-step tasks by decomposing them — but it introduces coordination complexity, higher cost (each agent makes its own LLM calls), and compounding error risk: a mistake early in the chain can propagate and amplify through subsequent steps.

Multi-agent orchestration patterns

When you do split work across agents, seven orchestration patterns recur. In a parallel pattern, agents tackle independent subtasks at once and their outputs merge — cutting latency in high-throughput pipelines. A sequential pattern passes work down a chain where each agent adds value (generate, then review, then deploy). A loop pattern has agents refine their own output until a quality bar is met. A router pattern puts a controller agent in front, directing each task to the right specialist.

The remaining three handle coordination at scale. An aggregator collects partial results from many agents into a single consensus. A network pattern drops hierarchy entirely — agents talk freely and share context, useful in simulations and collective reasoning. A hierarchical pattern mirrors a manager and team: a planner delegates to workers, tracks progress, and makes the final call. Choosing among them is less about which looks most sophisticated and more about which minimises friction — ensuring no two agents duplicate work, each knows when to act or wait, and the system is collectively smarter than any single part.

Multi-agent orchestration · seven patterns

Parallel

Agents tackle independent subtasks at once; outputs merge. Cuts latency in high-throughput pipelines.

Sequential

Each agent adds value in turn — generate, review, deploy. Common in workflow automation.

Loop

Agents refine their own output until a quality bar is met. Good for proofreading and iteration.

Router

A controller routes each task to the right specialist (finance → FinAgent, legal → LawAgent).

Aggregator

Many agents produce partial results a central agent combines into a consensus.

Network

No hierarchy — agents talk freely and share context. Used in simulations and collective reasoning.

Hierarchical

A planner delegates to workers, tracks progress, and makes the final call — like a manager and team.

The hard part is not spinning up agents — it is designing the communication flow so no two duplicate work, each knows when to act or wait, and the system is collectively smarter than any single agent.

Memory architecture

Short-term memory is the context window — everything the model has seen in the current session. Long-term memory is external storage — facts, preferences, or summaries of past interactions persisted in a database and retrieved when relevant. Episodic memory refers to summaries of previous sessions injected at the start of new ones. Most current agent frameworks have limited and unreliable long-term memory. This remains one of the most actively researched problems in the field, and a major practical limitation for agents that need to learn from experience over time.

Production frameworks refine this further into named types: entity memory tracks facts about specific people, objects, or organisations the agent encounters; contextual memory holds the working state of the current task; and user memory persists an individual's preferences and history across sessions. Whatever the taxonomy, the underlying point is the same — memory is not a property of the model but a system that must explicitly decide what to keep, what to discard, and what to retrieve before each call.

Real-world examples

You have already encountered agentic systems: GitHub Copilot completing a multi-file refactor autonomously, Perplexity chaining multiple web searches to answer a complex question, Claude Projects maintaining context across sessions, or an AI sales assistant that researches a prospect, drafts an outreach email, and schedules a follow-up without human intervention. Enterprise agentic deployments — AI that takes actions inside business systems — are accelerating significantly through 2025 and beyond.

Key takeaways

Agents act in loops: observe, plan, act, repeat — not a single response
Tool use is what separates agentic AI from conversational chatbots
Multi-agent systems increase capability but add coordination complexity and compounding error risk
Long-term memory is still an open problem — most agents do not reliably learn between sessions
Agentic failures compound — errors early in a task propagate forward through subsequent steps

Further reading

Module 08·~15 min·Technical

Hardware & Infrastructure

GPUs, data centres, custom silicon, and the energy question

Why GPUs?

Graphics Processing Units were designed for the parallel computations required to render graphics — performing thousands of simple operations simultaneously rather than a smaller number of complex ones sequentially. Training and running neural networks requires almost identical mathematical operations (matrix multiplications at very large scale). This is why NVIDIA, originally a gaming hardware company, became the most important infrastructure supplier in the AI industry. A modern frontier model training run uses thousands of GPUs operating in parallel for weeks or months.

The core operation in a neural network is a matrix multiplication — thousands of simple numerical operations applied simultaneously across large grids of numbers. CPUs are designed for sequential, complex tasks and have a small number of powerful cores. GPUs have thousands of simpler cores designed to run in parallel, which is exactly what neural network workloads require. This makes a modern GPU not just faster but fundamentally better suited to AI than a CPU — and it is why NVIDIA, a gaming hardware company, became the most important infrastructure supplier in the AI industry. Beyond raw processing speed, the speed at which data can be moved between GPU memory and its processing cores matters enormously — and frontier GPUs are optimised for this as much as for raw computation.

NVIDIA's dominance and the CUDA moat

NVIDIA's H100 and H200 GPUs are the gold standard for AI training and inference. But the hardware alone does not explain NVIDIA's position — the CUDA software ecosystem does. Built over 15 years, CUDA is a parallel computing platform that most AI frameworks, libraries, and tools are optimised for. Switching away from NVIDIA GPUs requires not just different hardware but porting significant amounts of software. This creates a durable moat that competitors from AMD, Intel, and Qualcomm are working to bridge.

Custom silicon

Google's Tensor Processing Units are purpose-built chips for neural network workloads — highly efficient for training and inference at Google's scale, accessible exclusively through Google Cloud. Other major players — Amazon (Trainium, Inferentia), Microsoft (Maia), Meta (MTIA) — have developed custom silicon to reduce dependence on NVIDIA and control their own cost structures. These chips remain largely internal tools, but their existence signals the scale at which the major labs are operating.

Data centres and energy

A single NVIDIA H100 GPU draws approximately 700 watts of power. A cluster of 10,000 GPUs — modest by frontier training standards — consumes roughly 7 megawatts, equivalent to powering a small town. Training a frontier model like GPT-4 is estimated to have consumed 50–100 gigawatt-hours. At scale, energy availability and cost have become the primary constraint on AI development. Microsoft, Google, and Amazon have all signed agreements with nuclear power providers to secure dedicated generation capacity for AI data centres.

Cloud vs. on-premise inference

Most organisations run AI inference in the cloud — AWS, Azure, Google Cloud, or specialist providers like CoreWeave. On-premise inference makes sense in two scenarios: data sovereignty requirements that prohibit cloud processing, or query volumes so high that annual API spend would exceed the capital cost of owning and operating hardware. For most organisations at current scale, cloud inference is the right default. The break-even point for on-premise typically requires substantial usage and a dedicated ML infrastructure team.

Key takeaways

GPUs are the primary compute substrate for AI — NVIDIA dominates, with alternatives growing
The CUDA software ecosystem creates a durable moat beyond hardware alone
Custom silicon from major cloud providers reduces NVIDIA dependency at extreme scale
Energy is the new binding constraint on AI infrastructure growth
Cloud inference is the correct default for most organisations — on-premise is for specific compliance or scale scenarios

Further reading

Module 09·~15 min·Business

Cost & Economics

Training vs. inference, API pricing, make vs. buy, and the cost trajectory

Training vs. inference — two very different cost centres

AI has two distinct cost structures. Training is a largely one-time expense — the compute required to create the model. Training GPT-4 is estimated to have cost $50–100 million. Llama 3 70B cost Meta roughly $10–20 million. These numbers are falling with algorithmic improvements, but frontier training remains a capital expenditure accessible only to well-funded labs. Inference is the ongoing cost of running the model to serve requests — what you pay every time someone uses an AI product. Inference costs have fallen dramatically and are the relevant budget line for almost every organisation.

API pricing in practice

Most organisations access AI via API, paying per token. Pricing varies substantially by model tier. Frontier models such as GPT-4o and Claude 3.5 Sonnet typically cost $3–15 per million input tokens. Mid-tier models such as GPT-4o mini and Claude Haiku cost $0.15–1 per million tokens. Open-source models hosted via third-party providers often cost under $0.10 per million tokens.

For context: one million tokens is approximately 750,000 words — a substantial volume of text. Most enterprise applications consuming AI at scale spend thousands to tens of thousands of dollars per month on API calls. Understanding token consumption is essential for accurate AI budget planning, since usage-based pricing requires different financial controls than per-seat SaaS.

Make vs. buy decisions

Should your organisation fine-tune its own model or use a commercial API? For the overwhelming majority of organisations, the answer is: buy. Fine-tuning requires ML engineering expertise, data curation, training infrastructure, and ongoing maintenance. It makes economic sense in a narrow set of circumstances: data privacy requirements that prevent external API use, query volumes so high that annual API spend exceeds the cost of owning and operating a model, or capability gaps that commercial models genuinely cannot fill. Start with a commercial API. Revisit when scale and requirements justify the investment.

The cost trajectory

AI inference costs have followed a consistent pattern: roughly 10x cheaper every 12–18 months. This trajectory has significant strategic implications. Use cases that are economically unviable at today's pricing — processing millions of customer records, running real-time analysis across all support interactions — may be trivially affordable in 18 to 24 months. Evaluate AI investments against the expected cost trajectory, not just today's pricing. The business case for automation that looks marginal now may be compelling within two years.

Key takeaways

Training is a one-time capital cost; inference is the ongoing operational cost — focus budget planning on inference
Inference costs have fallen roughly 10x per year — today's pricing is not the floor
For almost all organisations, buying via API beats training your own model
Model tier selection has major cost implications — frontier capability is rarely needed for most tasks
Evaluate AI investments against the cost trajectory, not just current pricing

Further reading

Module 10·~15 min·All levels

Data, Privacy & Enterprise Risk

Training data, input handling, enterprise agreements, and prompt injection

Training data and copyright

Frontier models are trained on vast datasets scraped from the internet, including content that may be under copyright. This has led to significant litigation — most notably The New York Times versus OpenAI — with outcomes still being determined in courts. For enterprise users deploying AI products, the practical exposure is limited. You are not redistributing training data; you are using the model's generated outputs. The legal risk sits primarily with the model providers, not their API customers. Monitor for developments, but this should not be a reason to avoid AI deployment.

What happens to your inputs

This is the question every enterprise procurement and legal team asks. The answer depends on contract type. By default, OpenAI, Anthropic, and Google do not use API inputs to train future models — the API channel is treated differently from consumer products. Enterprise agreements — Azure OpenAI Service, Anthropic's enterprise tier, Google Vertex AI — typically include contractual guarantees that your data is not used for model training.

Consumer products (ChatGPT free, Gemini free) operate under different terms where inputs may be used for training unless opted out. Do not extrapolate consumer product terms to enterprise API agreements — they are not the same document, and the distinction matters significantly for compliance.

Fine-tuning on proprietary data

Fine-tuning a model on your organisation's internal data — documents, emails, customer records — raises specific questions: Who holds the resulting model weights? What data was exposed during the training process, and to whom? Is the fine-tuned model itself a security risk if accessed without authorisation? These are solvable problems but require deliberate data governance decisions before beginning. Establish data classification, access controls, and retention policies before engaging in any proprietary fine-tuning project.

Data residency and sovereignty

Regulatory frameworks in the EU, financial services, healthcare, and government require that certain categories of data remain within specific geographic boundaries. Major cloud providers offer region-locked deployments — Azure OpenAI in EU regions, AWS Bedrock with data residency guarantees, Google Cloud's regional endpoints — that satisfy most regulatory requirements. On-premise deployment of open-source models is the most restrictive-compliant option but carries operational overhead. Data residency is now a procurement decision as much as a technical one.

Prompt injection

Prompt injection is a security vulnerability specific to AI systems: malicious instructions embedded in user inputs or retrieved content can override system prompts and cause the model to behave in unintended ways. An AI assistant that processes incoming emails could be manipulated by a carefully crafted email body designed to exfiltrate information, override access controls, or produce harmful outputs. Mitigations include input validation, output monitoring for anomalous behaviour, limiting the actions agents can take autonomously, and applying the principle of least privilege to AI system permissions.

Key takeaways

Enterprise API agreements typically prohibit training on your data — consumer product terms do not offer the same guarantee
Read the data processing agreement; marketing claims are not contractual commitments
Fine-tuning on proprietary data requires data governance decisions before you begin
Data residency requirements are solvable via major cloud providers' regional deployments
Prompt injection is a real and underappreciated security risk in agentic and document-processing applications

Further reading

Module 11·~10 min·Policy & Leadership

AI Safety & Guardrails

Alignment, RLHF, Constitutional AI, red-teaming, and defence-in-depth

Alignment: the core challenge

Alignment refers to the challenge of ensuring AI systems do what humans actually want — not merely what they have been literally instructed to do. An LLM trained purely to predict text can produce harmful, false, or manipulative outputs without any intent to do so. Alignment research attempts to close the gap between following the literal instruction and doing what was genuinely intended, in context, including situations the instruction did not anticipate.

This problem scales with capability. A more capable but misaligned system can cause more harm. This is the foundational concern driving AI safety research, and it becomes increasingly important as systems gain greater autonomy.

RLHF: teaching models human preferences

Reinforcement Learning from Human Feedback is the primary alignment technique used by frontier model developers. Human raters are shown pairs of model outputs and asked to indicate which is better. This preference data trains a reward model — a separate neural network that learns to assign a scalar score to any (prompt, response) pair, predicting what human raters would prefer. The reward model then drives a reinforcement learning loop: the LLM is fine-tuned using Proximal Policy Optimisation (PPO) to generate responses that earn higher scores. A KL-divergence penalty prevents the policy from drifting too far from the base pre-trained distribution — without it, the model would 'collapse' into a narrow set of reward-hacking responses that score well without being genuinely better.

RLHF is effective but imperfect. It encodes the biases and blind spots of its human raters, can lead to reward hacking (the model learns to produce outputs that score highly on the reward model without being genuinely better), and can make models overly cautious or deferential in ways that reduce usefulness. The rater pool — who they are, what they are instructed to optimise for, and how they are paid — has substantial influence on the resulting model's behaviour, making it one of the least transparent aspects of frontier model development.

RLHF training pipeline

Phase 1

Collect human preference data

Human raters compare pairs of model responses to the same prompt and indicate which is better. Thousands of labelled comparisons build a preference dataset.

Phase 2

Train a reward model

A separate neural network is trained on the preference dataset. It learns to assign a scalar reward score to any (prompt, response) pair — predicting what human raters would prefer.

Phase 3

Fine-tune LLM with reinforcement learning (PPO)

The LLM is updated using Proximal Policy Optimisation. Responses that earn high reward model scores are reinforced; low-scoring responses are suppressed. A KL-divergence penalty prevents the model drifting too far from the base pre-trained distribution.

The human rater pool — their demographics, instructions, and incentives — significantly shapes the resulting model. This is one of the least transparent aspects of frontier model development.

Constitutional AI

Anthropic's Constitutional AI approach provides the model with a set of principles and uses AI feedback rather than human feedback to evaluate outputs against those principles. The model critiques and revises its own outputs based on this constitution before producing a final response. This scales more efficiently than human feedback, produces models that can articulate why they declined a request, and is the primary alignment technique behind the Claude model series.

Red-teaming and jailbreaks

Red-teaming is the deliberate, systematic attempt to find failure modes in AI systems before deployment — adversarially prompting the model to produce harmful outputs, bypass safety measures, or behave unexpectedly. Major labs conduct extensive internal red-teaming and commission external red teams before major model releases. Jailbreaks are successful circumventions of safety guardrails. Despite extensive red-teaming, jailbreaks continue to emerge. Safety measures should be understood as a significant reduction in risk, not an elimination of it.

Output filtering and defence-in-depth

Model-level alignment is one layer of safety. Most production deployments add additional filtering layers: classifiers that detect harmful or policy-violating content in model outputs, input monitoring for known attack patterns, and human review pipelines for high-stakes automated decisions. The right approach is defence-in-depth — multiple independent safety layers, so the failure of any single layer does not produce an unacceptable outcome.

Key takeaways

Alignment is the problem of ensuring AI does what we actually want, not just what we literally instructed
RLHF is the primary alignment technique — effective but imperfect, encoding rater biases
Constitutional AI scales alignment feedback more efficiently and enables explainable refusals
Red-teaming finds failure modes before deployment — assume safety measures are meaningful but not absolute
Defence-in-depth (model-level combined with application-level filtering) is the correct production architecture

Further reading

Module 12·~10 min·All levels

Governance & Regulation

EU AI Act, US frameworks, enterprise governance, and what professionals need to act on now

The EU AI Act

The EU AI Act — the world's first comprehensive AI regulation — establishes a tiered risk framework. Unacceptable-risk applications are prohibited: AI-based social scoring, real-time biometric surveillance in public spaces, subliminal manipulation of vulnerable groups. High-risk applications face substantial obligations: conformity assessments, mandatory human oversight, transparency requirements, logging and audit trails, and registration in an EU database. High-risk categories include AI used in hiring and employment decisions, credit scoring, healthcare diagnosis, educational assessment, law enforcement, and critical infrastructure.

Limited-risk applications face disclosure obligations: chatbots must identify themselves as AI systems. Minimal-risk applications — most AI tools in common use — face no specific obligations. Any organisation deploying AI that affects EU citizens or operates within the EU needs to assess where their applications fall. Fines for non-compliance reach up to €35 million or 7% of global annual turnover for the most serious violations.

US frameworks

The US approach has been more fragmented: executive orders, voluntary commitments from major AI developers, and sector-specific guidance from financial, healthcare, and other regulators. The NIST AI Risk Management Framework provides a voluntary governance structure — govern, map, measure, manage — that has been widely adopted as a de facto enterprise standard. Sector-specific regulators including the SEC, FDA, CFPB, OCC, and EEOC have issued or are actively developing AI-specific guidance for their domains. Comprehensive federal AI legislation remains pending, but the regulatory direction is clear: disclosure, accountability, and human oversight requirements will grow.

Enterprise AI governance

Regardless of regulatory requirements, robust AI governance is a risk management priority. An enterprise AI governance programme typically covers five areas: an inventory of all AI systems deployed, by whom, for what purpose, and on what data; risk assessment covering what decisions each system influences and the impact of errors; accountability with clear ownership and shutdown authority; audit trails logging inputs, outputs, and decisions; and human oversight defining which decisions require human review rather than full automation, built into system design from the start.

What professionals need to act on now

If your organisation operates in or sells to EU markets, audit your AI deployments for high-risk classification under the AI Act. Adopt the NIST AI RMF as your internal governance baseline if you do not already have one. Review vendor contracts for AI-specific data processing terms, audit rights, and liability clauses — standard SaaS agreements rarely cover AI adequately. Establish an AI system inventory now, even if it is just a spreadsheet. The governance infrastructure that takes a week to build today will take months to reconstruct under regulatory scrutiny.

Key takeaways

The EU AI Act creates binding obligations for high-risk AI applications — non-compliance carries significant financial penalties
High-risk classifications include hiring, credit, healthcare, education, and law enforcement applications
The US relies on voluntary frameworks and sector-specific guidance — NIST AI RMF is the de facto enterprise standard
Enterprise AI governance requires inventory, risk assessment, accountability, audit trails, and human oversight mechanisms
Build your AI inventory and review vendor contracts now — regulatory expectations will only increase

Further reading

Module 13·~20 min·Engineer

LLM Evaluation

Why evaluating AI is hard, LLM-as-a-Judge, component-level RAG evals, and building a production evaluation framework

Why LLM evaluation is fundamentally different

Traditional ML models produce discrete, verifiable outputs: a classifier predicts a label; a regression model predicts a number. Correctness is a binary, computable property. LLMs produce open-ended text where quality is multidimensional — accuracy, helpfulness, fluency, tone, format compliance, and safety matter simultaneously. There is rarely a single correct answer. Two responses can both be correct while differing substantially in quality.

Standard metrics inherited from NLP — BLEU and ROUGE — measure surface-level token overlap between a generated response and a reference string. They fail badly for modern LLM outputs: a response using entirely different words from the reference but semantically superior will score poorly; a response that paraphrases the reference closely but is factually wrong can score highly. BLEU correlates weakly with human judgement in conversational AI contexts and should not be used as a primary signal for evaluating LLM systems.

LLM-as-a-Judge

The dominant approach to reference-free LLM evaluation is LLM-as-a-Judge: using a capable frontier model (typically GPT-4o or Claude Opus) to evaluate the outputs of another model. The judge receives the original prompt, the response being evaluated, and a structured rubric specifying criteria — accuracy, helpfulness, faithfulness, format compliance. It produces a numeric score and a written rationale. G-eval (2023) formalised this approach, demonstrating that LLM judges correlate with human judgements at rates comparable to inter-annotator agreement between humans.

Two modes are standard: pointwise scoring (evaluate one response on a 1–5 scale against a rubric) and pairwise preference (show two responses, ask which is better and why). Pairwise preference is generally more reliable — it is easier to rank two responses than to assign an absolute score. Known failure modes include position bias (preference for whichever response is shown first), verbosity bias (preference for longer responses regardless of quality), and self-preference (a model subtly favours outputs stylistically similar to its own). Mitigate these by randomising response order across evaluation calls and averaging scores over multiple runs.

LLM-as-a-judge · evaluation pattern

Inputs

Prompt

"Explain transformer attention in simple terms"

Response A

Generated output from the model under evaluation

Response B

Reference answer or competing model output (optional)

Judge model

GPT-4o, Claude Opus, or similar frontier model

Evaluates against a structured rubric — accuracy, helpfulness, faithfulness, format, safety — and produces a numeric score and a written rationale.

Pointwise

4.2 / 5

with written rationale

Pairwise

A > B

response A preferred

Known biases: position bias (prefers whichever response appears first), verbosity bias (prefers longer responses), self-preference (a model favours outputs from similar models). Mitigate by randomising response order and averaging across multiple judge runs.

Component-level evaluation for RAG

A RAG pipeline has two stages that fail independently: retrieval and generation. Evaluating only end-to-end output quality obscures which stage is causing problems — a poor answer could result from retrieving wrong documents (retrieval failure) or from the model misrepresenting the retrieved documents (generation failure). These require different fixes. Evaluating components separately is what makes RAG systems debuggable.

The RAGAS framework defines four metrics that decompose RAG evaluation. Context precision measures whether retrieved chunks are relevant to the query. Context recall measures whether all relevant documents are being retrieved. Faithfulness is the RAG-specific hallucination metric: does every claim in the generated answer have supporting evidence in the retrieved context? A model that generates accurate-sounding statements not found in the retrieved documents is hallucinating, even if those statements happen to be factually correct in the world. Answer relevance measures whether the answer addresses the question asked. All four metrics can be computed automatically using an LLM judge, making them practical at scale.

Component-level RAG evaluation · RAGAS framework

Stage

Retrieval

What happens

Query is embedded and searched against the vector DB. Top-k chunks are returned.

Metrics

Context precision · are retrieved chunks actually relevant to the query?
Context recall · are all relevant documents being found?

Stage

Generation

What happens

LLM generates an answer using the query and retrieved chunks as context.

Metrics

Faithfulness · is every claim in the answer supported by retrieved context?
Answer relevance · does the answer actually address the question?

Low faithfulness points to generation hallucination; low context recall points to retrieval missing relevant documents. Fixing the wrong component wastes engineering time — component-level metrics tell you exactly where the pipeline is failing.

Multi-turn and task-completion evaluation

Single-turn evals measure one response in isolation. Most production applications are multi-turn or agentic — they span conversations or multi-step workflows. Multi-turn evals assess whether the model maintains coherent context across a conversation, resolves ambiguities gracefully, and handles contradictions without losing track of prior context.

For agentic systems, task-completion rate is the most important metric: given a defined goal, does the agent reach it? This requires constructing test suites with verifiable end states. Coding agents can be evaluated against test suites; research agents against verifiable factual claims; customer support agents against resolution criteria. Trajectory evaluation goes further — assessing not just whether the agent succeeded but whether it did so efficiently, without unnecessary tool calls or redundant reasoning steps. A correct answer reached via 30 LLM calls when 5 would suffice is an engineering problem.

Building a production evaluation framework

Evaluation in production has two distinct purposes. Offline evaluation runs test suites before deployment to catch regressions and validate that prompt changes or model upgrades actually improve quality. Online evaluation samples live traffic to detect distributional shift and real-world failure modes that test suites did not anticipate. Both are necessary; neither alone is sufficient.

A practical eval stack for a production LLM application typically combines: a curated golden dataset of representative examples that must not regress; automated LLM-as-a-Judge scoring on a sample of live traffic; component-level metrics for RAG systems; and human review queues for low-confidence or flagged outputs. This turns evaluation from a one-time pre-launch check into a continuous engineering practice. Frameworks that operationalise this include RAGAS (RAG-specific metrics), DeepEval (unit-test-style assertions for LLM outputs with CI integration), and Promptfoo (prompt regression testing in CI/CD pipelines).

Key takeaways

BLEU and ROUGE measure token overlap — they do not reflect output quality for open-ended LLM responses and should not be primary signals
LLM-as-a-Judge correlates with human judgement at human-level rates — it is the standard approach for scalable reference-free evaluation
Pairwise preference is more reliable than pointwise scoring — randomise response order to mitigate position bias
RAG systems require component-level evaluation: measure retrieval (precision, recall) and generation (faithfulness, answer relevance) separately
Faithfulness is the key RAG metric — a model that generates claims not supported by its retrieved context is hallucinating, regardless of factual accuracy
Production eval requires both offline test suites (regression detection) and online sampling (real-world failure detection) — neither alone is sufficient

Further reading

Module 14·~25 min·Engineer

Fine-tuning & Adaptation

LoRA, SFT, DPO, GRPO, and the decision framework for when to adapt a model vs. prompt or retrieve

The full training pipeline

The model you interact with via API is the product of up to four distinct training stages, each building on the last. Stage 1 (pre-training) trains on trillions of tokens of raw text, producing a base model that can continue any piece of text but is not conversational. Stage 2 (supervised fine-tuning, SFT) trains on curated instruction-response pairs, making the model helpful and conversational. Stage 3 (preference fine-tuning, RLHF or DPO) uses human preference comparisons to push the model toward helpful, harmless, calibrated outputs. Stage 4 (reasoning fine-tuning, GRPO or similar RL methods) trains on verifiable reasoning tasks — this is what distinguishes models like o3, DeepSeek-R1, and Claude's extended thinking mode from standard instruction-tuned models.

When a vendor describes a model as 'fine-tuned', they almost always mean Stages 2 and 3. When your organisation fine-tunes a model, you are performing an additional Stage 2 (or 3) on top of a base that has already completed the full pipeline — you are adapting a highly capable foundation, not training from scratch.

Four-stage LLM training pipeline

Stage 1

Pre-training

Input

Trillions of tokens of raw text (internet, books, code)

Output

Base model — predicts next token, not conversational

Stage 2

Supervised Fine-tuning (SFT)

Input

Curated instruction-response pairs (~10k–100k examples)

Output

Instruction-following model — conversational and helpful

Stage 3

Preference Fine-tuning (RLHF / DPO)

Input

Human preference comparisons between response pairs

Output

Aligned model — helpful, harmless, calibrated

Stage 4

Reasoning Fine-tuning (GRPO / RL)

Input

Verifiable reasoning tasks with definitive correct answers

Output

Reasoning model — extended step-by-step thinking

Most frontier models pass through all four stages. When your organisation fine-tunes a model, you are running an additional Stage 2 or 3 on top of a model that has already completed the full pipeline.

Why full fine-tuning is impractical for most organisations

Full fine-tuning means updating every single weight in the model — all 7 billion of them in a 7B model, or 70 billion in a 70B model. Doing this requires storing not just the model itself but additional data tracking how each weight should change, which multiplies the memory requirement several times over. A single full fine-tuning run on a 7B model requires roughly 120 GB of GPU memory; a 70B model requires over a terabyte. Very few organisations have hardware at that scale, and renting it is expensive. This is what makes full fine-tuning impractical for most teams.

Beyond hardware, full fine-tuning risks catastrophic forgetting: overwriting the general capabilities the base model acquired during pre-training in order to specialise for a narrow task. The result can be a model that performs well on your specific use case but has degraded on everything else.

LoRA: parameter-efficient fine-tuning

Low-Rank Adaptation (LoRA) solves the compute problem with a clever shortcut: rather than updating the original model weights, it freezes them entirely and adds a small set of new trainable parameters alongside them. These additions are tiny — think of them as a thin layer of adjustments on top of a frozen foundation. Only the adjustments are trained; the original model is untouched. The result is that instead of updating billions of parameters, you are updating a fraction of a percent of that number while achieving comparable task performance.

This reduction is dramatic in practice. A full fine-tuning run on a 7B model might update billions of parameters; LoRA on the same model might update a few million — roughly a 99% reduction. After training, the adjustments can be merged back into the base model with no slowdown at inference time. QLoRA extends this further by also compressing the frozen base model to use less memory, making it possible to fine-tune a 7B model on a single consumer GPU rather than a specialised cluster. This is why fine-tuning has become accessible to individual practitioners and small teams in a way it was not even two years ago.

LoRA vs full fine-tuning · trainable parameter share

050%100% of model weights

Full fine-tuning

100% trainable

All weights updated. Billions of parameters in motion — requires a large GPU cluster.

LoRA

~1% trainable

frozen base model, low-rank adapters added alongside

The base model is frozen; tiny rank-decomposed matrices are trained alongside and merged back at inference — no runtime overhead.

LoRA trains a small fraction of the parameters needed for full fine-tuning, achieving comparable results at a fraction of the compute cost. QLoRA extends this further by also quantising the frozen base model, enabling fine-tuning on a single consumer GPU.

SFT vs preference fine-tuning vs reasoning fine-tuning

Supervised fine-tuning (SFT) trains the model on (prompt, ideal response) pairs using standard cross-entropy loss — the same objective as pre-training, just on your curated data. The model learns to imitate the provided responses. SFT is effective for teaching output format, domain tone, task-specific vocabulary, and consistent response structure. Its ceiling is the quality of your training data: the model cannot produce outputs better than its examples.

Preference fine-tuning (RLHF or DPO) moves beyond imitation. DPO (Direct Preference Optimisation) presents the model with pairs of responses to the same prompt and updates it to prefer the better one — without a separate reward model or RL training loop, making it significantly simpler than RLHF while achieving comparable results. Reasoning fine-tuning (GRPO) takes a different approach: rather than showing correct outputs, it uses verifiable tasks (maths, code, logic puzzles) and rewards the model for reaching correct answers, allowing it to develop its own reasoning strategies through trial and error. This is the technique behind DeepSeek-R1 and is now widely used to produce models with strong chain-of-thought reasoning.

Knowledge distillation: training a model with another model

Fine-tuning adapts a model to your data; distillation transfers capability from one model into another. A large, capable 'teacher' supervises a smaller 'student', so the student approaches the teacher's quality at a fraction of the parameters and inference cost. DistilBERT, distilled from BERT, is about 40% smaller and 60% faster while retaining roughly 97% of its language-understanding performance — and distillation is much of why today's small models are so capable.

There are three common variants. Soft-label distillation trains the student to match the teacher's full probability distribution over the vocabulary; those relative confidences carry rich 'dark knowledge', but it requires access to the teacher's weights and the storage cost is enormous. Hard-label distillation matches only the teacher's top output token — far cheaper, and how DeepSeek distilled R1's reasoning into Qwen and Llama 3.1. Co-distillation trains teacher and student together from scratch, the student matching the teacher's evolving distribution plus ground-truth labels (early teacher signals are noisy); Llama 4 trained Scout and Maverick from Behemoth this way. Distillation is mostly a model-provider technique, but it is increasingly accessible: you can distil a frontier model's behaviour on your task into a cheaper open model by generating synthetic training data from the teacher.

Knowledge distillation · teacher to student

Soft-label

Match the full distribution

The student is trained to reproduce the teacher's softmax probabilities across the whole vocabulary. The relative confidences carry rich 'dark knowledge', but you need the teacher's weights and the storage cost is enormous.

Hard-label

Match the top token only

The student matches just the teacher's final output token. Cheaper, no need to store full distributions. DeepSeek distilled R1's reasoning into Qwen and Llama 3.1 this way.

Co-distillation

Train teacher and student together

Both train from scratch; the student matches the teacher's evolving distribution plus ground-truth labels (early teacher signals are noisy). Llama 4 trained Scout and Maverick from Behemoth like this.

DistilBERT, distilled from BERT, is ~40% smaller and ~60% faster while retaining ~97% of its language-understanding performance. Distillation is much of why small models are now so capable.

When to fine-tune: the decision framework

Prompting, RAG, and fine-tuning address different failure modes and should be attempted in that order. Prompting costs nothing and enables instant iteration — exhaust it first. RAG solves knowledge gaps cheaply and keeps your knowledge base current without retraining. Fine-tuning changes how the model behaves at a fundamental level — its style, format, reasoning approach, and implicit assumptions.

Fine-tune when: the model consistently produces incorrect format or tone despite well-crafted system prompts; the task requires domain-specific reasoning patterns that prompting cannot reliably elicit; or latency constraints prevent a retrieval step. Do not fine-tune to add factual knowledge — the model will memorise your facts imperfectly and that knowledge becomes stale the moment training ends. For knowledge, RAG is always the better answer.

Prompting · RAG · fine-tuning · decision matrix

Prompting only

Use when

Task is within the model's capability; you need fast iteration

Not a fit when

Model lacks domain knowledge; output format is highly specific

Cost: Near zeroKnowledge: Static (training cutoff)

RAG

Use when

Model needs current, proprietary, or large-volume knowledge

Not a fit when

You need to change reasoning style or output behaviour

Cost: Low–mediumKnowledge: Dynamic — update without retraining

Fine-tuning (SFT)

Use when

Consistent format, domain tone, or behaviour the model gets wrong despite good prompting and RAG

Not a fit when

You just need the model to know more facts — use RAG instead

Cost: Medium (LoRA) to high (full)Knowledge: Static — baked into weights at training time

Try in order: prompting, then RAG, then fine-tuning. Each step adds cost and complexity — only escalate when the previous approach genuinely cannot meet the requirement.

Key takeaways

Models go through up to 4 training stages: pre-training → SFT → preference fine-tuning → reasoning fine-tuning
Full fine-tuning requires 120 GB+ GPU VRAM for a 7B model — impractical without significant ML infrastructure
LoRA reduces trainable parameters by ~99% by training small rank-decomposed matrices alongside frozen weights
QLoRA combines LoRA with 4-bit quantisation — enabling 7B model fine-tuning on a single consumer GPU
DPO is now the preferred alternative to RLHF for preference alignment — simpler, no separate reward model needed
Distillation transfers a teacher's capability into a smaller student (soft-label, hard-label, or co-distillation) — a key reason today's small models are so strong
Correct sequence: prompting first, RAG for knowledge gaps, fine-tuning only for persistent behaviour and format problems

Further reading

Module 15·~25 min·Engineer

LLMops

Serving, optimisation, and observability — running LLMs reliably and cost-effectively in production

Why LLM deployment is different

Deploying an LLM is not like deploying a conventional API. Standard web services are largely stateless: a request arrives, the server processes it in milliseconds, a response is returned. LLMs are autoregressive: they generate one token at a time, sequentially, with each token depending on all previous ones. A 500-token response requires 500 forward passes through a multi-billion-parameter model. Response latency is measured in seconds, not milliseconds, and scales with output length — a simple question with a long answer takes longer than a complex question with a short one.

This creates engineering constraints that differ fundamentally from conventional software. GPU memory is the binding resource rather than CPU or RAM. Throughput and latency are in direct tension: serving more concurrent requests increases throughput but may increase individual response latency. Cost scales directly with token volume, creating financial exposure that per-seat SaaS pricing does not. These constraints demand a specialised operational discipline — LLMops — that conventional DevOps does not fully cover.

KV caching

When a model generates text, it processes every token in the context at each generation step — not just the new token it is about to produce. Without caching, this means re-processing the entire preceding conversation or document on every single step. For a long prompt generating a long response, the redundant computation adds up quickly.

KV caching solves this by storing the model's internal representation of all previously processed tokens in GPU memory. On each new generation step, only the newest token needs to be computed fresh; everything prior is retrieved from the cache. This makes generation significantly faster and cheaper — the longer the context, the greater the benefit. The trade-off is memory: keeping a large cache for many concurrent users consumes significant GPU memory, which is why managing it efficiently is one of the core challenges of LLM serving. This is the problem that vLLM's PagedAttention was specifically designed to solve.

KV cache · computed vs reused per generation step

Without cache

System promptcomputed · ~500 tokens

Conversation historycomputed · ~800 tokens

Retrieved context (RAG)computed · ~1,200 tokens

New token generatedcomputed · 1 token

All ~2,501 tokens recomputed every step.

With cache

System promptreused · ~500 tokens

Conversation historyreused · ~800 tokens

Retrieved context (RAG)reused · ~1,200 tokens

New token generatedcomputed · 1 token

Only one token computed per step; the rest is reused from cache.

KV cache stores the Key and Value attention matrices for all prior tokens in GPU VRAM. The trade-off: longer contexts consume more VRAM per concurrent request, limiting how many requests can be served simultaneously.

Quantisation

Model weights are typically stored as 16-bit floats (FP16 or BF16) during and after training. At inference, weights can be reduced to lower precision — 8-bit integers (INT8) or 4-bit integers (INT4) — with minimal quality degradation for most tasks. A 7B model in FP16 requires 14 GB of VRAM; the same model at INT8 requires 7 GB; at INT4, approximately 3.5 GB. This is why quantised 7B models run on consumer GPUs with 8 GB VRAM, and why 70B models can be served on a single high-end GPU at INT4 rather than requiring a multi-GPU cluster.

Modern quantisation methods — GPTQ, AWQ, and GGUF — minimise quality loss by identifying weight-sensitive layers and preserving them at higher precision while aggressively quantising less sensitive layers. The practical trade-off: INT8 has negligible quality loss for most production tasks; INT4 shows measurable degradation on complex multi-step reasoning but remains usable for retrieval, summarisation, classification, and general Q&A. For organisations running open-source models, quantisation is the single highest-leverage lever for reducing inference hardware costs.

Quantisation · VRAM by precision

050%100% of FP32 footprint

FP3232-bit

7B 28 GB70B 280 GB

Full precision — used in training. Rarely needed for inference.

FP16 / BF1616-bit

7B 14 GB70B 140 GB

Standard inference precision — negligible quality loss vs FP32.

INT88-bit

7B 7 GB70B 70 GB

Minimal quality loss on most tasks. A 7B model fits in an 8 GB GPU.

INT4 / NF44-bit

7B 3.5 GB70B 35 GB

Some quality loss on complex reasoning. A 70B model fits in a 48 GB GPU.

Modern quantisation methods (GPTQ, AWQ, GGUF) minimise quality loss by identifying weight-sensitive layers and preserving them at higher precision. QLoRA uses NF4 (4-bit) quantisation for the frozen base during fine-tuning.

Model compression beyond quantisation

Quantisation is one of four complementary compression levers, all aimed at making a model smaller, cheaper, and faster to serve while preserving most of its quality. Pruning removes parameters that contribute little: weight pruning zeroes out individual connections, producing sparse matrices that save memory, while neuron pruning removes whole units, shrinking the matrices themselves and speeding inference. Low-rank factorization approximates a layer's weight matrix as the product of two smaller matrices (via SVD or a similar decomposition), cutting parameter count and per-layer compute — the same low-rank insight LoRA applies to the weight update during fine-tuning.

Knowledge distillation (Module 14) is the fourth lever: train a small student to mimic a large teacher. The levers compose — a production model might be distilled, then pruned, then quantised, and finally served with the optimisations below. The recurring trade-off is precision versus footprint: each technique exchanges a measurable, usually small, drop in quality for substantial gains in memory, latency, and cost.

Model compression · four levers

Knowledge distillation

Train a smaller student to mimic a larger teacher

Fewer parameters, comparable quality

Pruning

Remove low-contribution weights (sparse matrices) or whole neurons (smaller matrices)

Lower memory; neuron pruning also speeds inference

Low-rank factorization

Approximate weight matrices as products of smaller matrices (SVD)

Fewer parameters and less compute per layer

Quantisation

Store weights at lower precision (FP16 → INT8 → INT4)

50–75% less VRAM with minimal quality loss

The four levers are complementary — production systems often distil, then quantise, then serve. LoRA (Module 14) applies the same low-rank idea, but to the weight update rather than the weights.

Inference engines and continuous batching

Naive LLM serving processes one request at a time or groups requests into fixed batches. This is GPU-inefficient: utilisation collapses between requests, and within a fixed batch some sequences finish generating before others, leaving their GPU allocations idle while the batch completes. vLLM addressed both problems. PagedAttention manages KV cache memory using non-contiguous memory pages — similar to how operating systems handle virtual memory — rather than requiring a contiguous block reserved per request upfront. This dramatically increases the number of concurrent requests that can be served from the same hardware.

Continuous batching (iteration-level scheduling) adds new requests to the running batch as soon as existing ones finish generating, keeping GPU utilisation near 100% rather than waiting for an entire batch to complete. Together, PagedAttention and continuous batching give vLLM 10–20x higher throughput than naive serving on the same hardware. For teams self-hosting open-source models, vLLM is the standard inference engine. For cloud API users these optimisations are handled by the provider — but understanding them explains why throughput-optimised API tiers (e.g. batch APIs) are significantly cheaper than real-time endpoints.

Serving at scale: disaggregation and routing

Beyond batching, three techniques separate serious LLM serving from naive replication. LLM inference has two phases with opposite resource profiles: prefill processes the whole prompt at once and is compute-bound, while decode generates one token at a time and is latency-bound. Running both on the same GPUs lets heavy prefills stall latency-sensitive decodes, so prefill–decode disaggregation assigns each phase its own pool of GPUs, letting them scale independently with tailored parallelism.

Because KV caching makes requests stateful, load balancing also changes. A new request that shares a system-prompt prefix already cached on one replica should be sent there — routing it to a less busy replica instead forces the entire prefix's KV cache to be recomputed. Prefix-aware routing tracks which prefixes are cached where and routes accordingly. Finally, Mixture-of-Experts models cannot be served by simple replication: expert parallelism shards the experts across GPUs (replicating only the attention layers), and the gating network dynamically routes each token to whichever GPU holds its experts — a routing problem that demands a purpose-built inference engine.

Observability: what to instrument

Observability for LLM systems means being able to answer: which calls failed and why, what is the cost per user or feature, where is latency coming from, and are outputs degrading over time? Standard application observability — request logs, error rates, p99 latency — captures part of this but misses the LLM-specific signals that actually matter for debugging and cost control.

Every LLM call should log: the full prompt (system and user), the full response, input and output token counts, latency broken down into time-to-first-token and total generation time, cost, model name and version, and any relevant user or session identifiers. For agentic and RAG systems, traces should capture the entire execution chain — retrieval steps, tool calls, intermediate model outputs — not just the final response. Without this, diagnosing whether a poor output came from a bad retrieval result, a malformed prompt, or model behaviour is guesswork. LangFuse (open-source, full trace support), Helicone (lightweight API proxy requiring no code changes), and Arize (enterprise-grade with drift detection) are the main tools that operationalise this.

LLMops stack · replaceable layers

Observability & Monitoring

Trace every LLM call — prompt, response, latency, token count, cost, model version

LangFuse·Helicone·Arize·W&B Weave

Evaluation Pipeline

Offline test suites and online LLM-as-judge sampling on live traffic

RAGAS·DeepEval·Promptfoo

Orchestration Layer

Prompt management, RAG pipelines, tool dispatch, agent loops

LangChain·LlamaIndex·custom code

Inference Engine

Optimised serving — continuous batching, KV cache management, quantisation

vLLM·TGI·LitServe·cloud provider APIs

Model & Weights

Foundation model — quantised or full precision, self-hosted or via API

Llama 3·Mistral·GPT-4o API·Claude API

Each layer is independently replaceable. Many teams start with a cloud provider API (bypassing the inference engine layer entirely) and add observability and evaluation as the system matures.

Speculative decoding

Speculative decoding is a latency optimisation that exploits a fundamental asymmetry in autoregressive generation: verifying that a token is correct is faster than generating it from scratch. A small, fast draft model generates several candidate tokens ahead; the large target model then verifies all of them in a single parallel forward pass. Accepted tokens are kept; rejected tokens trigger regeneration from the point of mismatch. The final output is guaranteed to be identical to what the target model would have produced alone — there is no quality trade-off.

When the draft model's predictions are accurate — which is common for routine or predictable text — speculative decoding achieves 2–3x throughput improvement with no quality loss. It is most effective for tasks with predictable output patterns: code completion, template-following, structured data extraction. Applied alongside KV caching, quantisation, and continuous batching, it forms part of the complete optimisation stack that makes large-model inference commercially viable at scale.

Key takeaways

LLM inference is autoregressive and sequential — latency scales with output length, not just input complexity
KV caching reuses computed attention matrices for prior tokens, reducing per-step computation by ~99% for long contexts
Quantisation cuts VRAM requirements by 50–75% with minimal quality loss — INT8 is standard; INT4 enables 70B models on a single GPU
Compression has four complementary levers — distillation, pruning, low-rank factorization, and quantisation — that compose
vLLM's PagedAttention and continuous batching achieve 10–20x higher throughput than naive serving on the same hardware
At scale, serving disaggregates compute-bound prefill from latency-bound decode, routes by cached prefix, and shards MoE experts across GPUs
Every LLM call should log the full prompt, response, token counts, latency breakdown, cost, and model version — traces should capture the full execution chain
Speculative decoding uses a draft model to propose tokens the target model verifies in parallel — 2–3x speedup with guaranteed-identical outputs

Further reading

Glossary

Key terms used throughout this course, defined concisely.

Token: The unit of text an LLM processes. Roughly ¾ of a word on average; a 1,000-word article is ~1,300 tokens.
BPE (Byte-Pair Encoding): The most common tokenisation algorithm. Builds a subword vocabulary by iteratively merging frequent character pairs.
Foundation model: A large model trained on broad data at scale, intended as a general-purpose base for adaptation. GPT-4, Claude, and Gemini are all foundation models.
Multimodal: Capable of processing and generating across multiple input types — text, images, audio, or video — within a single model.
Benchmark: A standardised test used to measure model capability on a specific task (e.g. MMLU for knowledge, HumanEval for coding). Results can be gamed by training on benchmark data.
Context window: The maximum number of tokens an LLM can process in a single call — including both the prompt and the generated response.
System prompt: Instructions prepended to a conversation that define the model's role, persona, constraints, and output format. Invisible to end users in most deployed products.
Inference: Running a trained model to generate outputs. Distinct from training: no weights are updated. Measured in latency (time per response) and throughput (requests per second).
Parameters / weights: The billions of learnable numerical values inside a model. Training adjusts these values; inference uses them frozen.
Pre-training: The initial, large-scale training phase where a model learns from vast text corpora via next-token prediction.
Fine-tuning: Additional training on a smaller, curated dataset to adapt a pre-trained model to a specific task or style.
PEFT (Parameter-Efficient Fine-Tuning): Techniques that adapt a pre-trained model by training only a small fraction of its parameters, avoiding the cost of full fine-tuning. LoRA is the most widely used example.
LoRA (Low-Rank Adaptation): A parameter-efficient fine-tuning method that trains small adapter matrices rather than the full model weights.
QLoRA: Quantised LoRA. Fine-tunes a quantised (typically 4-bit) base model using LoRA adapters, enabling fine-tuning on consumer-grade GPUs.
Knowledge distillation: Training a smaller 'student' model to mimic a larger 'teacher', transferring capability at a fraction of the size and inference cost.
Pruning: Compressing a model by removing low-contribution weights (sparse matrices) or whole neurons (smaller matrices) with minimal accuracy loss.
Low-rank factorization: Approximating a weight matrix as the product of two smaller matrices (e.g. via SVD) to reduce parameter count and compute.
Catastrophic forgetting: The tendency for a neural network to lose previously learned capabilities when trained on new data. A key risk in fine-tuning.
SFT (Supervised Fine-Tuning): Fine-tuning on labelled input–output pairs. The foundational step before alignment training.
RLHF: Reinforcement Learning from Human Feedback. Training a reward model on human preference rankings and using it to optimise the LLM via PPO.
DPO (Direct Preference Optimisation): An alignment method that skips the separate reward model and optimises preferences directly in the LLM.
Temperature: A sampling parameter (0–2) that controls output randomness. Lower = more deterministic; higher = more varied.
Top-p (nucleus sampling): Restricts token sampling to the smallest set of tokens whose cumulative probability exceeds p. Improves diversity over pure top-k.
Decoding strategy: The method used to turn the model's per-step probability distribution into a token sequence — greedy, sampling, beam search, or contrastive search.
Beam search: A decoding strategy that keeps the top-k partial sequences alive at each step to approximate maximising whole-sequence probability. Standard in machine translation.
Contrastive search: A decoding strategy that penalises candidate tokens too similar to already-generated text, reducing repetition while keeping coherence.
Embedding: A dense numerical vector representing the meaning of a token, sentence, or document. Semantically similar content has nearby vectors.
RAG (Retrieval-Augmented Generation): An architecture that retrieves relevant documents from an external store and injects them into the prompt before generation.
Chunking: Splitting documents into smaller units before embedding for retrieval. The strategy (fixed-size, semantic, recursive, structure-based, LLM-based) strongly affects RAG quality.
Reranking: A second retrieval stage where a cross-encoder rescores the top-k candidates against the query for true relevance — more accurate than vector similarity alone.
HyDE: Hypothetical Document Embeddings. Generate a hypothetical answer with an LLM and retrieve against its embedding, closing the semantic gap between questions and answers.
Agentic RAG: RAG in which an agent decides whether to retrieve, which source to query, and whether the result suffices — looping until it can answer.
CAG (Cache-Augmented Generation): Preloading stable knowledge into the model's KV cache so it is never re-retrieved; only dynamic data is fetched at query time.
Grounding: Connecting a model's outputs to verifiable external facts or data sources to reduce hallucination. RAG is the most common grounding technique.
Vector database: A database optimised for storing and querying embeddings by similarity (e.g. nearest-neighbour search).
HNSW: Hierarchical Navigable Small World. The most common approximate nearest-neighbour index algorithm used in vector databases.
Attention / self-attention: The mechanism by which a transformer relates each token to every other token in the context to compute contextualised representations.
Q, K, V (Query, Key, Value): The three projections computed in attention. The query token attends over keys; values are the information retrieved.
MoE (Mixture of Experts): An architecture where only a subset of the model's parameters (experts) are activated per token, enabling larger models at lower inference cost.
KV cache: A cache of the key and value tensors for previously processed tokens, avoiding redundant recomputation during autoregressive generation.
Prefill vs decode: The two phases of LLM inference: prefill processes the whole prompt at once (compute-bound); decode generates tokens one at a time (latency-bound).
Expert parallelism: A serving strategy for Mixture-of-Experts models that shards experts across GPUs and routes each token to the GPU holding its experts.
GPU / VRAM: Graphics Processing Units are the dominant hardware for AI training and inference. VRAM (video RAM) is the on-chip memory that determines how large a model can be loaded.
FLOPs: Floating-point operations. The standard unit for measuring AI compute. Training a large model requires trillions of FLOPs; often expressed in petaFLOP-days.
Scaling laws: Empirical relationships showing that model performance improves predictably as compute, data, and parameter count increase — enabling cost and capability forecasting.
Quantisation: Reducing model weight precision (e.g. FP32 → INT4) to shrink memory footprint and speed up inference, with a small accuracy trade-off.
Speculative decoding: Using a small draft model to propose several tokens at once, then verifying them in parallel with the large model — speeding up generation.
vLLM / PagedAttention: vLLM is a high-throughput inference server. PagedAttention is its technique for managing KV cache memory in non-contiguous pages.
Latency: The time elapsed between sending a request and receiving its first token (time-to-first-token) or complete response. The primary user-facing performance metric.
Throughput: The number of requests or tokens a system can process per unit of time. The key metric for production serving capacity and cost efficiency.
Prompt engineering: The practice of crafting input text to elicit better model outputs — including instruction phrasing, few-shot examples, and role setting.
Zero-shot / few-shot: Zero-shot: asking a model to perform a task with no examples in the prompt. Few-shot: providing a small number of input–output examples to guide behaviour without any weight updates.
Chain-of-thought (CoT): Prompting the model to reason step-by-step before answering, improving accuracy on complex tasks.
Context engineering: The broader discipline of designing what information goes into the context window — system prompts, retrieved documents, conversation history, tool outputs — to maximise model performance.
Prompt injection: An attack where malicious content in the environment (a web page, document, or user message) hijacks the model's instructions to take unintended actions.
Agent: An LLM configured to operate in a loop: observe, plan, act (via tools), observe results, repeat — until a task is complete.
ReAct: A prompting pattern that interleaves reasoning traces and tool-use actions, making agent behaviour interpretable and auditable.
MCP (Model Context Protocol): An open standard by Anthropic for connecting AI models to tools and data sources via a uniform client–server interface.
A2A (Agent2Agent): An open protocol for agent-to-agent collaboration — agents exchange context and tasks and discover each other via published Agent Cards, without sharing internal state.
AG-UI: Agent-User Interaction Protocol. Streams structured agent events (tokens, tool progress, state deltas, handoffs) to a frontend over Server-Sent Events.
Function calling: A protocol feature where the model returns structured JSON requesting a specific tool invocation, rather than prose. The application executes the function and returns results.
LLM-as-a-Judge: Using a separate LLM to score or rank outputs — either comparatively or against a rubric — as an automated evaluation method.
Evals: Short for evaluations: the test suites and scoring methods used to measure LLM performance on specific tasks or safety criteria. The foundation of disciplined model development.
RAGAS: A framework for evaluating RAG pipelines, measuring faithfulness, context precision, context recall, and answer relevance.
Faithfulness: In RAG evaluation: whether the generated answer is factually grounded in the retrieved context (no hallucinations).
Hallucination: When a model generates confident-sounding content that is factually incorrect or unsupported by its context.
LLMops: The operational discipline of deploying, monitoring, versioning, and iterating on LLM-based systems in production.
Observability: The ability to inspect and understand what an LLM system is doing in production — via traces, logs, latency metrics, and cost tracking.
Alignment: The problem of ensuring a model's behaviour reliably matches human intentions and values — covering helpfulness, honesty, and harmlessness.
Guardrail: A safety layer applied to model inputs or outputs — such as a classifier detecting harmful content — independent of the model's own alignment training.
Jailbreak: A prompt or technique designed to bypass a model's safety guardrails and elicit restricted or harmful content.
Red teaming: Structured adversarial testing of an AI system to discover failure modes, harmful outputs, or exploitable behaviours before deployment.
Constitutional AI: An Anthropic alignment technique where the model critiques and revises its own outputs according to a set of principles, reducing harmful responses.
EU AI Act: The European Union's risk-tiered regulatory framework for AI systems — the first binding AI law globally. Classifies AI by risk level, with corresponding obligations for providers and deployers.

Last updated June 2026. Updated as the AI landscape evolves.

← Back to resources