Crash Course

Technical AI Literacy

A 13-module deep dive into how AI systems actually work — written for professionals who want genuine technical understanding.

13 modules · ~100–130 min read · No prior technical knowledge required

Each module stands on its own — you can read end-to-end or jump directly to the topics most relevant to your work. The course assumes no prior technical background: concepts are introduced from first principles and built up progressively. Modules 1–5 cover how AI systems are built and used; Modules 6–10 cover the infrastructure, economics, and governance surrounding them; Modules 11–13 cover fine-tuning, evaluation, and LLMops.

Module 01·~25 min·All levels

What is an LLM?

Tokens, training, and how language models generate text

Tokens, not words

LLMs don't process text the way humans read it. They operate on tokens — fragments of text that can be whole words, parts of words, punctuation, or spaces. The word tokenisation splits into roughly two tokens: 'token' and 'isation'. A typical page of prose contains 500–700 tokens. This matters because LLMs have a fixed context window — the amount of text they can process at once — and that limit is measured in tokens, not words or pages.

The standard tokenisation method is Byte-Pair Encoding (BPE): starting from individual bytes or characters, the algorithm repeatedly merges the most frequently co-occurring adjacent pairs until it reaches a target vocabulary size — typically 32,000 to 128,000 tokens. Common English words like 'the' or 'and' become single tokens; rarer words and technical terms fragment into multiple subword units. This is why LLMs perform better on common English than on unusual proper nouns, non-Latin scripts, or domain-specific jargon that appeared rarely in training data — those strings consume more tokens per unit of meaning.
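The merge loop described above can be sketched in a few lines. This is a toy illustration on a five-word corpus, not a production tokeniser: the function names (`most_frequent_pair`, `merge_pair`, `train_bpe`) are made up for this example, it works on characters rather than bytes, and ties between equally frequent pairs are broken by first occurrence.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus; return the commonest one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Rewrite every word so each occurrence of `pair` becomes a single symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Start from single characters and apply `num_merges` greedy merges."""
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words

merges, segmented = train_bpe("low low low lower lowest", num_merges=2)
print(merges)     # learned merge rules, in order
print(segmented)  # each word as a tuple of subword symbols, with its count
```

After two merges the frequent stem "low" is a single symbol, while the rarer endings in "lower" and "lowest" remain split into characters. A real vocabulary is built the same way, only with tens of thousands of merges over a far larger corpus.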

Tokenisation  ·  byte-pair encoding
"The quick brown fox"  →  The|·quick|·brown|·fox  (4 tokens)
"tokenisation"  →  token|isation  (2 tokens)
"ChatGPT-4o"  →  Chat|G|PT|-|4|o  (6 tokens)

Byte-pair encoding iteratively merges the most frequent character pairs to build a subword vocabulary. The interpunct ( · ) marks a preceding space; the rightmost column counts tokens.

Training: learning from vast text

Before an LLM can do anything useful, it undergoes pre-training: processing hundreds of billions to trillions of tokens of text (books, websites, code, scientific papers) and learning to predict what comes next. This is less about memorisation than pattern recognition at extreme scale. The model adjusts billions of internal numerical parameters (called weights) until it becomes highly reliable at predicting the next token given all previous ones. This process requires enormous compute and typically takes weeks to months of continuous GPU time.

After pre-training, models are usually fine-tuned — trained further on curated, higher-quality examples to follow instructions, be helpful, and avoid harmful outputs. This second stage is where much of what makes a model feel aligned with human expectations comes from.

The Transformer architecture

Almost every modern LLM is built on the Transformer architecture, introduced in a landmark 2017 Google paper ('Attention Is All You Need'). The key innovation is the self-attention mechanism: rather than processing text sequentially left-to-right, every token can attend to every other token simultaneously, regardless of distance. In generative LLMs a causal mask restricts each token to itself and the tokens before it, but all positions are still computed in parallel during training.

Attention works by computing three representations for each token: a Query (what this token is looking for), a Key (what this token offers to others), and a Value (what this token contributes when attended to). Attention scores are computed by taking the dot product of each token's Query against every other token's Key, then normalising with softmax so scores sum to 1. Each token's output representation is then a weighted sum of all Values — effectively asking: given what I am looking for, how much should I attend to each other token? This happens in parallel across all tokens simultaneously, which is why Transformers train far faster than the sequential RNN architectures they replaced.
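The Query, Key, and Value computation above can be written out directly. The sketch below is a single attention head in plain Python with hand-picked vectors; in a real model the Queries, Keys, and Values come from learned projections of the token embeddings. It also divides each score by the square root of the key dimension, a scaling step from the original Transformer paper that the description above leaves out.

```python
import math

def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Score this Query against every Key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # The output is the weighted sum of all Value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights

# Two tokens with 2-dimensional vectors (illustrative numbers only).
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
outputs, weights = attention(queries, keys, values)
print(weights)  # each row sums to 1
```

Each row of `weights` is one token's attention distribution, which is what the heatmap-style visualisations of attention display.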

Modern LLMs stack many Transformer layers (32–128 for frontier models), with each layer running multiple attention patterns in parallel (multi-head attention — typically 32 to 128 heads). Earlier layers tend to capture syntactic relationships; later layers encode higher-level semantic concepts. Between attention sub-layers sit feed-forward networks that apply learned non-linear transformations, giving the model its representational depth.

Self-attention  ·  "The cat sat on the mat"
[Figure: attention-weight heatmap. Rows are query tokens and columns are key tokens: The, cat, sat, mat.]

Each row shows how strongly a token attends to every other token; darker shading indicates higher weight. Rows sum to one — the model distributes a fixed budget of attention. In practice this runs across thousands of tokens through many parallel heads.

Positional encoding

The attention mechanism has a subtle limitation: it is position-agnostic. When a token computes its Query-Key dot products, it attends based on semantic similarity — not on where in the sequence the other token appears. Left uncorrected, 'The dog bit the man' and 'The man bit the dog' would produce identical attention patterns — same tokens, different meaning.

Positional encodings solve this by adding a position-dependent signal to each token's embedding before it enters the Transformer layers. The original 2017 Transformer used fixed sinusoidal encodings — mathematical functions of position that produce a unique pattern for each sequence slot. Modern LLMs almost universally use RoPE (Rotary Position Embedding), which encodes position by rotating the Query and Key vectors before the dot product. This makes attention scores naturally decay with distance, and crucially enables context window extension: by adjusting the rotation frequency at inference time, models can generalise to longer contexts than they were trained on — one mechanism behind how providers extend 8K base models to 128K or beyond.
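For the fixed sinusoidal scheme, each position maps to a vector of sines and cosines at geometrically spaced frequencies. The sketch below follows the formula from the 2017 paper (base 10000, sine on even indices, cosine on odd ones); the function name is illustrative, and RoPE is not shown because it acts on the Query and Key vectors rather than on the embedding.

```python
import math

def sinusoidal_encoding(position, d_model):
    """Return the fixed positional vector for one sequence slot."""
    encoding = []
    for i in range(0, d_model, 2):
        # Lower dimensions oscillate quickly, higher dimensions slowly.
        angle = position / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))
        encoding.append(math.cos(angle))
    return encoding[:d_model]

for pos in range(3):
    print(pos, [round(x, 3) for x in sinusoidal_encoding(pos, 4)])
```

Adding this vector to a token's embedding means the same word at two different positions enters the Transformer layers with two different representations.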

Mixture of Experts

Many frontier models today are not dense Transformers: they use a Mixture of Experts (MoE) architecture. In a dense Transformer, every parameter activates for every token. MoE replaces the feed-forward sub-layer in each Transformer block with a set of specialised 'expert' networks (typically 8 or 16) and a learned router. For each token, the router selects a small subset of experts (usually 2) to process it; the remaining experts do not activate and incur no compute cost for that token.

This decouples total parameter count from active parameter count. Mixtral 8x7B has 46.7 billion total parameters but only 12.9 billion active per token — delivering quality comparable to a ~46B dense model at the inference cost of a ~13B one. GPT-4 is widely understood to use MoE, as do Gemini and most other frontier models. The trade-off: MoE models require more total VRAM to hold all experts in memory simultaneously, and training requires careful load-balancing to prevent certain experts from being underused. For inference throughput, the active-parameter advantage is substantial.
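A minimal sketch of top-2 routing, under simplifying assumptions: the experts are toy functions that scale their input, the router logits are passed in directly (in a real model they come from a learned linear layer applied to the token), and the gate weights are a softmax over the selected experts' logits, which is one common choice rather than the only one.

```python
import math

def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts and normalise their gate weights."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    chosen = order[:k]
    peak = max(router_logits[i] for i in chosen)
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_layer(token, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    gates = route_top_k(router_logits, k)
    output = [0.0] * len(token)
    for index, gate in gates.items():
        expert_out = experts[index](token)  # unselected experts are never called
        output = [o + gate * e for o, e in zip(output, expert_out)]
    return output, sorted(gates)

def make_expert(scale):
    """Toy expert: multiplies every component of the token by `scale`."""
    return lambda token: [scale * x for x in token]

experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
output, used = moe_layer([1.0, 2.0], experts, router_logits=[0.1, 2.0, -1.0, 2.0])
print(used, output)
```

All four experts must be held in memory, but only two are evaluated for this token, which is the split between total and active parameters described above.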

Inference: generating one token at a time

When you send a message to an LLM, it doesn't retrieve a pre-written answer. It generates a response one token at a time, each token being a probabilistic selection based on everything that came before. This is why outputs vary between identical prompts, why models can be confidently wrong (the next token is probable, not necessarily true), and why longer responses take longer to generate — the model produces each token sequentially.

Mechanically: the model passes the full context through all its layers and produces a logit (a raw score) for every token in its vocabulary, typically 50,000 to 128,000 entries. These logits are converted to probabilities via a softmax function. One token is sampled, appended to the context, and the entire process repeats. At temperature 0, the model greedily selects the highest-probability token every time, producing deterministic output. For any temperature T above 0, logits are divided by T before softmax: values of T above 1 flatten the distribution so lower-ranked tokens get a meaningful share of the probability mass, while values below 1 sharpen it. Top-p sampling (nucleus sampling) further trims this to the smallest set of tokens whose cumulative probability exceeds a threshold p, cutting off the long tail of improbable vocabulary entirely.
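The sampling step can be sketched as follows. The three logits stand in for a full vocabulary, the function names are illustrative, and temperature 0 is treated as a special case that skips sampling, since dividing by zero is undefined.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Divide logits by T, then normalise into probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}  # renormalise the survivors

def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Return the index of the next token."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])  # greedy
    nucleus = top_p_filter(softmax_with_temperature(logits, temperature), top_p)
    indices = list(nucleus)
    return rng.choices(indices, weights=[nucleus[i] for i in indices], k=1)[0]

logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, 0.5))  # sharper than T = 1
print(softmax_with_temperature(logits, 2.0))  # flatter than T = 1
print(sample_token(logits, temperature=0.8, top_p=0.9, rng=random.Random(0)))
```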

Next-token distribution  ·  "After a long run, she felt ___"
  • happy: 31%
  • glad: 19%
  • pleased: 14%
  • great: 11%
  • good: 9%
  • (all others): 16%

The model scores every token in its vocabulary (~50k–128k tokens). At temperature 0 the highest-probability token always wins; higher temperature flattens the distribution so lower-ranked tokens are more likely to be sampled.

Context windows

The context window is the total amount of text an LLM can see at once: your conversation history, any documents provided, system instructions, and the model's own previous responses. Modern frontier models support windows from 128,000 to over one million tokens. When content exceeds this window, the application must drop or summarise earlier context, because the model cannot see anything beyond the limit. This is a real engineering constraint in enterprise applications that need to reason over long documents or extended conversations.

Parameters: what the numbers actually mean

When a model is described as having '7 billion parameters' or '405 billion parameters', those numbers refer to its weights — individual numerical values (typically stored as 16-bit floats) that encode everything the model learned during training. A 7B model stores roughly 14 GB of data at 2 bytes per parameter; a 70B model requires ~140 GB; a 405B model exceeds 800 GB. This is why running large models locally requires substantial GPU memory: the full weight set must fit in VRAM before any inference can begin.
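The memory figures above are simple arithmetic: parameters times bytes per parameter. The helper below is an illustrative sketch that counts weights only, using decimal gigabytes (10^9 bytes); activations, the KV cache, and runtime overhead add to the total in practice.

```python
def weight_memory_gb(num_parameters, bits_per_parameter=16):
    """Storage for the weights alone, in decimal gigabytes."""
    return num_parameters * bits_per_parameter / 8 / 1e9

for name, params in [("7B", 7e9), ("70B", 70e9), ("405B", 405e9)]:
    print(name, weight_memory_gb(params), "GB at 16-bit")

# The same arithmetic at 4 bits per parameter (INT4 quantisation).
print("7B", weight_memory_gb(7e9, bits_per_parameter=4), "GB at 4-bit")
```

At 16 bits this reproduces the 14 GB, 140 GB, and 810 GB figures; at 4 bits a 7B model's weights shrink to 3.5 GB before overhead.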

Scaling-law research, most notably DeepMind's Chinchilla paper (2022), showed that optimal model performance depends jointly on parameter count and training token volume. Doubling parameters without also scaling training data yields diminishing returns. This insight drove the shift toward training smaller models on vastly more tokens, producing models like Llama 3 8B that significantly outperform earlier 30B models trained on less data. Parameter count is a proxy for capacity, not a direct measure of capability.

Key takeaways

  • LLMs predict the next token — they don't understand text the way humans do
  • Pre-training is expensive and happens once; inference is cheaper and happens with every request
  • The Transformer's attention mechanism enables coherent reasoning across long texts
  • Context windows define how much the model can consider at once — content beyond the limit is not seen
  • Outputs are probabilistic — the model is generating likely text, not retrieving verified facts
Module 02·~15 min·Practitioner

The AI Model Landscape

Frontier models, open source, the major labs, and how to evaluate capability claims

Frontier models vs. the rest

Frontier models are the most capable models available at any given time — the ones actively pushing the boundary of what AI can do. As of 2025, these include GPT-4o (OpenAI), Claude 3.5 and 3.7 Sonnet (Anthropic), Gemini 2.0 and 2.5 (Google DeepMind), and Llama 3 (Meta). Below the frontier sit smaller, faster, cheaper models — often distilled or fine-tuned variants — that handle most enterprise tasks at a fraction of the cost. Choosing between frontier and sub-frontier is primarily an economics and capability trade-off, not a prestige decision.

Open-source vs. closed and proprietary

Closed models — GPT-4o, Claude, Gemini — are accessed only via API. You cannot inspect or modify the underlying weights. Open models — Llama 3, Mistral, Gemma — release their weights publicly, allowing anyone to run, inspect, and fine-tune them on their own infrastructure. Open models offer data sovereignty and cost control at the expense of setup complexity. Closed models are typically more capable at the frontier but create vendor dependency and offer less visibility into how the model behaves internally.

Running models locally

Open-source models can run entirely on your own hardware — no API calls, no data leaving your machine, no per-token cost after setup. Ollama is the simplest entry point: a command-line tool that downloads quantised models and runs them with a single command. It exposes an OpenAI-compatible API endpoint, making it a drop-in local replacement for cloud APIs during development. LM Studio provides a desktop GUI for browsing, downloading, and running models — suited to users who prefer not to use the command line. llama.cpp is the underlying C++ inference engine powering most local tools; it supports GGUF-quantised models and can fall back to CPU when GPU VRAM is insufficient.

The practical constraint is hardware. A 7B model at INT4 quantisation requires roughly 4–5 GB of VRAM or unified memory; a 13B model requires ~8 GB; a 70B model at INT4 requires ~40 GB. Modern MacBooks with Apple Silicon (M2/M3/M4) run 7B and 13B models well via their unified memory architecture. Running locally makes most sense for: development and testing without API latency or cost, privacy-sensitive workflows where data cannot leave the machine, high-volume inference where annual API spend would exceed hardware costs, and exploring model behaviour with unrestricted access to generation parameters.

The major labs

OpenAI introduced GPT and ChatGPT, which has roughly 500 million users and the strongest brand recognition in the space. Anthropic, founded by former OpenAI researchers, builds the Claude series with a safety-first approach and strong performance on reasoning and long-document tasks. Google DeepMind produces the Gemini series, with particular strength in multimodality and deep integration across Google's product ecosystem. Meta AI releases the Llama series as open-weight models under its own community licence, making it one of the most widely used bases for fine-tuning and research. Mistral is a European lab producing highly efficient models with a strong open-source presence and growing enterprise adoption.

Fine-tuned and specialised models

Base pre-trained models are rarely used directly in products. Most AI applications use fine-tuned variants — models further trained on domain-specific data for coding (GitHub Copilot, Cursor), legal analysis (Harvey), medicine (Med-PaLM), or specific enterprise workflows. When a vendor says they use a customised AI model, this is usually what they mean: a general base model adapted for a specific task.

Benchmarks and capability claims

AI benchmarks — MMLU, HumanEval, GPQA, MATH — measure performance on specific tasks: professional exam questions, coding problems, graduate-level reasoning. They are useful reference points but imperfect guides. Models can be specifically optimised to perform on popular benchmarks without being broadly better. When evaluating a model for your organisation's use case, treat benchmarks as a starting filter, then test against your actual tasks and data.

Key takeaways

  • Frontier models lead capability; smaller models are often sufficient and significantly cheaper
  • Open-source models offer flexibility and data control; closed models offer ease of access and top-tier capability
  • Each major lab has distinct strengths — Anthropic for safety and long context, Google for multimodality, Meta for open research
  • Fine-tuning is how general models become specialised products
  • Treat benchmark claims as a starting point, not a final verdict — test on your own use case
Module 03·~15 min·Technical

AI System Architecture

From foundation model to user-facing product — the full technical stack

The stack: model to user

Most people interact with AI through a product interface — a chatbot, an embedded assistant, a search tool. What they see represents only the top layer. Beneath it is a stack: the foundation model (the LLM itself), an orchestration layer (code that manages conversations, tools, and context), a data layer (documents and knowledge the model can access), and the application layer (the interface users see). Understanding this stack helps you diagnose problems, evaluate vendor products, and make better architectural decisions.

Enterprise AI stack  ·  surface to substrate
  1. User Interface: Chat UI, embedded widget, API client
  2. Application Layer: Auth, routing, session management, logging
  3. Orchestration Layer: Prompt assembly, tool dispatch, RAG retrieval, context management
  4. Foundation Model (LLM): GPT-4o, Claude, Gemini, accessed via API
  5. Data Layer: Vector DB, document store, structured databases, knowledge base

Users only see the top layer. Problems that present as "the AI is wrong" usually originate in the orchestration or data layers, not the model itself.

Embeddings

Embeddings are numerical representations of text — vectors of hundreds or thousands of numbers that encode semantic meaning. Two sentences with similar meanings will have similar embeddings, even if they use completely different words. This is the foundation of semantic search and AI retrieval systems. When an AI product searches a knowledge base, it is almost certainly comparing embeddings, not doing keyword matching. This is why you can search for employee holiday entitlement and retrieve a document about annual leave policy.

A typical embedding model produces a vector of 1,536 dimensions (OpenAI's text-embedding-3-small) or 3,072 dimensions (text-embedding-3-large). Similarity is measured using cosine similarity: the cosine of the angle between two vectors. Identical meaning → angle near 0° → cosine similarity near 1.0. Unrelated meaning → angle near 90° → cosine similarity near 0. Finding the most similar vectors across millions of stored embeddings uses approximate nearest-neighbour indexing algorithms — HNSW (Hierarchical Navigable Small World graphs) is the most widely deployed — enabling millisecond search across document stores that would otherwise take seconds with brute-force comparison.
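Cosine similarity itself is a short calculation. The sketch below uses made-up 3-dimensional vectors in place of real 1,536-dimensional embeddings, and its `nearest` helper is the brute-force comparison that indexes such as HNSW approximate at scale.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, stored):
    """Brute-force search: compare the query against every stored vector."""
    return max(stored, key=lambda name: cosine_similarity(query, stored[name]))

# Illustrative vectors only; real embeddings come from an embedding model.
stored = {
    "annual leave policy": [0.9, 0.1, 0.0],
    "expense claims": [0.0, 0.2, 0.9],
}
print(nearest([0.8, 0.2, 0.1], stored))
```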

Vector databases

Vector databases and extensions such as Pinecone, Weaviate, and pgvector (an add-on for PostgreSQL) are built for storing and querying embeddings at scale. They can find semantically similar content across millions of documents in milliseconds. Any enterprise AI system that needs to search large bodies of knowledge will include a vector database or equivalent. Understanding this layer also explains why AI knowledge bases have a delay when new content is added: documents must be embedded and indexed before they become searchable.

Retrieval-Augmented Generation (RAG)

RAG is the most important architectural pattern in enterprise AI. The problem it solves: you cannot fit all your organisation's knowledge into a model's context window, and fine-tuning the model on all your data is prohibitively expensive and creates a static snapshot. RAG retrieves the most relevant documents at query time and injects them into the prompt. The model then generates a response grounded in that retrieved content. Most enterprise AI assistants — knowledge bases, document Q&A tools, customer support bots — use RAG.

RAG quality depends heavily on engineering decisions beyond the basic pipeline. Chunking strategy — how documents are split before embedding — is critical: chunks too large dilute retrieval precision; chunks too small lose surrounding context. Typical chunk sizes are 256–1,024 tokens with 10–20% overlap between adjacent chunks to avoid splitting relevant content across boundaries. Retrieval precision can be improved with re-ranking: after the initial vector search returns top-k candidates, a cross-encoder model scores each chunk against the specific query and reorders results by true relevance — more accurate than vector similarity alone but slower to compute.
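Fixed-size chunking with overlap is easy to sketch. The function below assumes the document has already been tokenised into a list; the name `chunk_tokens` and the default sizes are illustrative, and production pipelines often split on sentence or section boundaries instead of raw token counts.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into windows that share `overlap` tokens with their neighbour."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks

# Ten stand-in tokens, windows of four, one token of overlap.
print(chunk_tokens(list(range(10)), chunk_size=4, overlap=1))
```

Each chunk repeats the final token of the one before it, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.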

Retrieval-augmented generation  ·  pipeline
  1. User query: "What is our refund policy?"
  2. Embed query: Numerical representation of meaning
  3. Vector search: Cosine similarity over millions of chunks
  4. Retrieve top-k: Three to ten most relevant passages
  5. Augment prompt: Query and retrieved context injected
  6. LLM generates: Response grounded in retrieved content

RAG keeps knowledge external and updatable. Adding a new document only requires embedding and indexing it — no model retraining.
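The pipeline steps can be traced end to end in a toy form. Everything here is a stand-in: `embed` is a bag-of-words counter rather than a learned embedding model (so it matches shared words, not meaning), the "index" is a plain list, and the final LLM call is omitted, leaving only the assembled prompt.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=2):
    """Steps 2 to 4: embed the query, score every chunk, keep the top k."""
    query_vector = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vector, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query, passages):
    """Step 5: inject the retrieved passages alongside the question."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

chunks = [
    "Refund policy: purchases can be returned within 30 days.",
    "The office is closed on public holidays.",
    "Shipping takes three to five working days.",
]
query = "What is our refund policy?"
prompt = build_prompt(query, retrieve(query, chunks, k=1))
print(prompt)
```

Adding knowledge means appending to `chunks` (and, in a real system, embedding and indexing the new text); nothing about the model changes.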

APIs and function calling

Models are almost always accessed via API — a standard interface for sending text in and receiving text or structured data back. Modern LLM APIs support function calling: the model can request that your code execute a specific function and return the result. This is how AI assistants can search the web, query your database, check calendar availability, or look up a customer record mid-conversation. The API layer is where model capability meets your data and systems.

Key takeaways

  • Enterprise AI is a multi-layer stack, not a single model
  • Embeddings enable semantic search — finding meaning, not just matching keywords
  • Vector databases make large-scale semantic retrieval fast and practical
  • RAG is how AI products access organisational knowledge without fine-tuning
  • Function calling via API is what allows AI to take actions, not just generate text
Module 04·~20 min·All levels

How Prompting Actually Works

The mechanics behind prompt design — beyond tips and tricks

System prompts vs. user prompts

Every LLM application has two types of input: the system prompt and the user prompt. The system prompt is written by the developer; it sets the model's persona and defines its task and constraints, and the end user typically never sees it. The user prompt is what the user actually types. The system prompt is extraordinarily powerful because it frames everything that follows. A customer service chatbot that feels tuned for a specific company is mostly a well-crafted system prompt layered on top of a general model. When you use a company's AI assistant, you are interacting with their system prompt as much as with the underlying model.

Temperature and sampling

When the model selects the next token, it does not always choose the most probable option. Temperature is a parameter that controls randomness: at 0, the model always selects the highest-probability token (deterministic, consistent, sometimes repetitive); at 1, it samples more freely across possibilities (creative, varied, sometimes unpredictable). Most production applications set temperature between 0.2 and 0.7. Use lower temperature for factual extraction and structured outputs; higher temperature for creative generation and brainstorming.

Mechanically, temperature T is applied by dividing the model's raw logits by T before the softmax step. When T < 1, logit differences are amplified: the highest-scoring token becomes even more dominant. When T > 1, differences are compressed: probabilities spread across more candidates. Top-p sampling (nucleus sampling) works alongside temperature by restricting selection to the smallest group of tokens whose cumulative probability mass exceeds threshold p — at p = 0.95, the lowest-probability long tail is excluded entirely, preventing very improbable tokens from ever being selected even at high temperatures.

Sampling temperature  ·  same prompt, same model
T = 0.2 (concentrated, near-deterministic)
  • happy: 72%
  • glad: 14%
  • pleased: 7%
  • great: 4%
  • good: 3%

T = 1.2 (flattened, more creative)
  • happy: 31%
  • glad: 24%
  • pleased: 19%
  • great: 14%
  • good: 12%

Temperature divides logits before softmax. Lower values amplify differences between scores so one token dominates; higher values compress them, spreading probability across more candidates.

Other generation parameters

Temperature and top-p are the most commonly tuned parameters, but production LLM applications regularly use several others. top_k restricts sampling to the k most probable tokens at each step (typically 40–50), providing a simpler alternative to top-p. repetition_penalty applies a multiplicative discount to tokens that have already appeared in the output — reducing the looping and repetitive phrasing that long generations tend to produce. frequency_penalty and presence_penalty (OpenAI's terminology) are similar but distinguish between tokens that appeared frequently (frequency_penalty reduces their probability proportionally to count) versus tokens that appeared at all (presence_penalty applies a fixed one-time penalty).
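These adjustments all act on the logits before sampling. The sketch below uses the additive form that OpenAI documents for its penalties (subtract the frequency penalty once per prior occurrence, and the presence penalty once if the token has appeared at all); other providers may implement them differently, and the function names are illustrative.

```python
from collections import Counter

def apply_penalties(logits, generated_ids, frequency_penalty=0.0, presence_penalty=0.0):
    """Lower the logits of tokens that already appear in the output."""
    counts = Counter(generated_ids)
    adjusted = list(logits)
    for token_id, count in counts.items():
        # Frequency scales with the count; presence is a flat one-time charge.
        adjusted[token_id] -= count * frequency_penalty + presence_penalty
    return adjusted

def top_k_filter(logits, k):
    """Keep only the k highest-scoring token ids."""
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    return {i: logits[i] for i in keep}

logits = [1.0, 1.0, 1.0]
# Token 0 was generated twice, token 2 once, token 1 never.
print(apply_penalties(logits, [0, 0, 2], frequency_penalty=0.5, presence_penalty=0.1))
print(top_k_filter([0.1, 3.0, 2.0, -1.0], k=2))
```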

stop_sequences are strings or tokens that halt generation immediately when produced, which is essential for structured outputs where generation must terminate at a defined delimiter. max_tokens caps output length and is a critical cost-control parameter: without it, an unusually verbose response can multiply API costs unexpectedly. seed fixes the random state for reproducible sampling: at temperature 0 with a fixed seed, responses should be near-identical across calls, which supports automated testing, although hosted APIs generally offer this only on a best-effort basis. Understanding these parameters collectively gives you precise control over the consistency, style, length, and cost of model outputs.

Context window management in production

Every token counts against the context window: the system prompt, the full conversation history, retrieved documents, the current message, and the model's own response. In long conversations or document-heavy applications, the window fills. When it does, either earlier content is truncated (the model effectively forgets it), or the application must summarise and compress prior context. This is a genuine engineering challenge — most users never notice because well-designed applications handle it invisibly, but it is a core constraint shaping every enterprise AI deployment.
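The truncation strategy can be sketched as a budget calculation. This is a simplified illustration: `count_tokens` is a crude word-count stand-in for a real tokeniser, the function names are hypothetical, and a production system might summarise the dropped turns instead of discarding them.

```python
def count_tokens(text):
    """Rough stand-in for a tokeniser: one token per whitespace-separated word."""
    return len(text.split())

def fit_history(system_prompt, history, budget, reserve_for_reply):
    """Keep the most recent messages that fit alongside the system prompt."""
    available = budget - reserve_for_reply - count_tokens(system_prompt)
    kept = []
    for message in reversed(history):  # walk from newest to oldest
        cost = count_tokens(message)
        if cost > available:
            break  # everything older than this is dropped
        kept.append(message)
        available -= cost
    return list(reversed(kept))  # restore chronological order

history = ["a b c d", "e f", "g h i"]
print(fit_history("You are helpful", history, budget=12, reserve_for_reply=4))
```

With a 12-token budget, 4 tokens reserved for the reply, and a 3-token system prompt, only the two most recent messages fit; the oldest is what the model "forgets".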

Why prompts fail

Hallucination occurs when the model generates plausible but incorrect information. This is a fundamental property of the architecture: the model selects probable tokens, not verified facts. It cannot reliably distinguish between what it knows and what it infers. This cannot be fully eliminated with better prompting — it can be mitigated with retrieval grounding (RAG), output verification, and human review processes.

Instruction complexity degrades reliability. Models follow one clear instruction better than five ambiguous ones. Complex multi-step prompts produce less consistent outputs than decomposed, sequential ones. Prompt injection is a distinct security concern: malicious instructions embedded in user inputs or retrieved documents can override system prompts and alter model behaviour — a real risk in agentic and document-processing systems.

Few-shot prompting

Zero-shot prompting gives the model a task with no examples. Few-shot prompting includes two to five examples of the desired input-output pattern before the actual task. Few-shot consistently improves performance on structured, domain-specific, or format-sensitive tasks because it demonstrates the expected reasoning style and output format — reducing the model's uncertainty about what a good response looks like in your specific context.
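A few-shot prompt is ordinary string assembly. The sketch below uses an "Input:/Output:" layout, which is one common convention rather than a required format, and the sentiment examples are invented for illustration.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Place worked examples before the real input, ending on an open 'Output:'."""
    parts = [instruction, ""]
    for source, target in examples:
        parts += [f"Input: {source}", f"Output: {target}", ""]
    parts += [f"Input: {query}", "Output:"]
    return "\n".join(parts)

examples = [
    ("The delivery arrived two weeks late.", "negative"),
    ("Setup took five minutes and it just works.", "positive"),
]
prompt = build_few_shot_prompt(
    "Classify the sentiment of each review as positive or negative.",
    examples,
    "Support never answered my emails.",
)
print(prompt)
```

Ending the prompt on an open "Output:" line nudges the model to continue the demonstrated pattern with a single label rather than a free-form reply.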

Chain-of-thought and extended thinking

Chain-of-thought (CoT) prompting elicits step-by-step reasoning by providing examples that demonstrate intermediate reasoning steps, or simply appending 'Let's think step by step' to the prompt. This works because of the auto-regressive nature of LLMs: generating intermediate reasoning tokens makes subsequent correct tokens more probable, so the model talks itself through the problem before committing to a final answer. On multi-step reasoning benchmarks, CoT prompting has been reported to improve accuracy by 20–40 percentage points, with the gains concentrated in larger models (above roughly 10 billion parameters), which suggests that explicit reasoning is a capability that emerges with scale.

Extended thinking — implemented in models like Claude 3.7 Sonnet (thinking mode) and OpenAI o3 — is a systematic version of this principle. These models generate extended internal reasoning (sometimes thousands of tokens of scratchpad) before producing a visible response. This shifts compute from training time to inference time: harder problems get more tokens of reasoning. The trade-off is latency and cost — more thinking tokens means slower and more expensive responses — but for complex reasoning tasks the accuracy gains are substantial.

Context engineering

Prompt engineering focuses on the instructions you give the model. Context engineering is the broader discipline of deciding what goes into the context window — and in what form. As windows expand toward one million tokens, the question of what to include becomes as important as how to phrase the instruction. Blindly filling the context with everything potentially relevant degrades performance: models exhibit a 'lost in the middle' failure mode where relevant information buried in a long context is less reliably used than information near the beginning or end.

Context engineering decisions include: conversation history management (how many prior turns to include; when to summarise rather than truncate); document injection strategy (full document vs extracted passages vs summaries); structured vs prose context (tables and JSON can be more token-efficient than natural language for certain data types); and context ordering (placing the most critical content at the start or end rather than the middle). In agentic systems, it also covers what intermediate results to retain across steps and what to discard to prevent context overflow in long-running tasks. As context windows grow, context engineering is increasingly the dominant skill in applied LLM work.

Key takeaways

  • System prompts define model behaviour — they are as important as the model itself
  • Temperature controls creativity vs. consistency — tune it for your task type
  • Context windows fill in production — truncation and compression are real engineering challenges
  • Hallucination is architectural; mitigate it with retrieval grounding and verification, not prompting alone
  • Few-shot examples consistently improve output quality for structured and domain-specific tasks
Module 05·~25 min·Technical

Agents & Agentic Systems

From single responses to AI that plans, acts, and iterates

What makes a system agentic

For the first years of the LLM era, most AI interactions were transactional: a user sends a message, the model replies, done. Agentic systems change this fundamentally. An agent is an LLM that can take actions, not just generate text. It operates in a loop: observe input, plan a response or action, execute that action (call a tool, search the web, write code, send a message), observe the result, and repeat until the model judges the task complete or the application stops it (for example, at a step limit). The key distinction is that agentic systems persist and act across multiple steps rather than responding in a single shot.

The agentic loop  ·  ReAct pattern
  1. Observe: Receive task, context, previous results
  2. Think: Reason about next action (chain-of-thought)
  3. Act: Call a tool or produce output
  4. Observe result: Read tool output, append to context

The loop continues until the task is complete.

Available tools

Web search·Code execution·Database query·Email send·File read/write·API call

The model never directly executes anything — it requests that your application code calls a tool and returns the result, which is appended to the context for the next step.
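The loop can be sketched with the model replaced by a scripted stand-in. Nothing here calls a real LLM: `scripted_model` is a hypothetical function that requests one tool call and then finishes, `multiply` is a toy tool, and the `max_steps` cap is an added safeguard against a loop that never terminates.

```python
def run_agent(task, model, tools, max_steps=5):
    """Observe, think, act, observe the result; repeat until done or out of steps."""
    context = [f"Task: {task}"]
    for _ in range(max_steps):
        decision = model(context)               # Think: the model picks the next action
        if decision["action"] == "finish":
            return decision["answer"], context
        tool = tools[decision["action"]]        # Act: application code runs the tool
        observation = tool(decision["input"])
        context.append(f"Observation: {observation}")  # Observe: result joins the context
    return None, context                        # step limit reached without an answer

def multiply(expression):
    """Toy tool: evaluates 'a * b' for two integers."""
    left, right = expression.split("*")
    return int(left) * int(right)

def scripted_model(context):
    """Hypothetical stand-in for an LLM: one tool call, then finish."""
    if not any(line.startswith("Observation:") for line in context):
        return {"action": "multiply", "input": "6 * 7"}
    return {"action": "finish", "answer": context[-1][len("Observation: "):]}

answer, trace = run_agent("What is 6 times 7?", scripted_model, {"multiply": multiply})
print(answer)
print(trace)
```

The `trace` list is the auditable record of the run: the task followed by every tool result the model saw.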

Tool use and function calling

Modern LLMs support tool use: you provide the model with a list of available tools — web search, database query, email send, code execution, calendar check — with descriptions of what each does. The model decides which tools to call, when to call them, and how to interpret their outputs. It then incorporates those results into its ongoing reasoning. This is how AI assistants move beyond conversation into genuine task completion.

Function calling works at the protocol level: tool definitions are included in the API request as a JSON schema specifying each tool's name, description, and parameter types. When the model determines it should use a tool, it returns structured JSON — rather than prose — specifying the function name and arguments. Your application code executes the function and returns the result, which is appended to the context for the model's next call. The model never directly executes anything: it only requests that your code does. This separation is important for security — it means you control exactly what actions an agent can take, and you can log, audit, and gate every tool invocation.
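On the application side, handling a tool call means parsing the model's JSON, checking it against the declared schema, logging it, and only then running the function. The sketch below assumes a simplified message shape with `name` and `arguments` fields (each provider's actual wire format differs), and `get_weather` is a hypothetical tool.

```python
import json

def get_weather(city):
    """Hypothetical tool handler; a real one would call a weather service."""
    return f"Sunny in {city}"

TOOLS = {
    "get_weather": {
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
        "handler": get_weather,
    },
}

def dispatch(tool_call_json, tools, audit_log):
    """Validate and execute one tool request returned by the model."""
    call = json.loads(tool_call_json)
    name = call["name"]
    arguments = call.get("arguments", {})
    audit_log.append({"name": name, "arguments": arguments})  # log every request
    if name not in tools:
        return {"error": f"unknown tool: {name}"}  # the allowlist is the gate
    schema = tools[name]["parameters"]
    missing = [p for p in schema["required"] if p not in arguments]
    unexpected = [a for a in arguments if a not in schema["properties"]]
    if missing or unexpected:
        return {"error": f"bad arguments: missing={missing}, unexpected={unexpected}"}
    return {"result": tools[name]["handler"](**arguments)}

audit_log = []
reply = dispatch('{"name": "get_weather", "arguments": {"city": "Oslo"}}', TOOLS, audit_log)
print(reply)
```

Because the model only ever produces the JSON string, a request for a tool that is not in `TOOLS` goes no further than an error message and an audit entry.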

Agentic design patterns

Several repeating patterns have emerged in how capable agent systems are structured. ReAct (Reason + Act) is the most common: the model alternates between reasoning about its situation and taking an action, making its thinking visible and auditable in the trace. Reflection adds a second step where the model reviews its own output and iterates — effectively self-editing before returning a result. Planning agents decompose a complex task into a structured sequence of subtasks before executing any of them, reducing compounding errors on long-horizon work.

Parallelisation runs multiple agent calls simultaneously — useful when subtasks are independent (e.g. researching several topics at once) — then merges results. Specialisation splits work across purpose-built agents (a research agent, a writing agent, a code-review agent), each with a focused system prompt and limited tool set. An orchestrator agent routes tasks and synthesises results. These patterns are often combined: a planning orchestrator that fans out parallel specialised agents and then applies reflection before returning a final answer.
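A minimal sketch of the fan-out and merge pattern, using Python's `asyncio`. The `research_agent` coroutine is a stand-in for an LLM call (the short sleep simulates latency), and the topic names are invented for illustration.

```python
import asyncio

async def research_agent(topic: str) -> str:
    # Stand-in for an LLM call; the sleep simulates network latency.
    await asyncio.sleep(0.01)
    return f"summary of {topic}"

async def orchestrate(topics: list[str]) -> str:
    # Fan out: independent subtasks run concurrently.
    summaries = await asyncio.gather(*(research_agent(t) for t in topics))
    # Merge: a real orchestrator might make another LLM call to synthesise.
    return "; ".join(summaries)

report = asyncio.run(orchestrate(["pricing", "competitors", "regulation"]))
```

`asyncio.gather` returns results in the order the tasks were submitted, so the merge step does not need to re-sort them.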

Reliability degrades quickly as agent chains grow longer. Each LLM call has some probability of error; errors compound multiplicatively across steps. This means the most robust production agents are either short-chained (two or three hops) or include explicit verification steps — another agent checking the output, a structured self-critique loop, or a deterministic validation function before the result is accepted.
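The compounding effect is easy to quantify under a simplifying assumption: if every step succeeds independently with the same probability, the chance that the whole chain succeeds is that probability raised to the number of steps. The 95% per-step figure below is illustrative, not a measured value.

```python
def chain_success_probability(per_step_success: float, steps: int) -> float:
    """Probability that every step in the chain succeeds, assuming independence."""
    return per_step_success ** steps

# Illustrative per-step reliability of 95%.
for steps in (1, 3, 10, 20):
    probability = chain_success_probability(0.95, steps)
    print(f"{steps:>2} steps: {probability:.1%} chance of an error-free run")
```

At 95% per step, a three-step chain still succeeds about 86% of the time, while a twenty-step chain succeeds only about 36% of the time, which is why short chains and explicit verification steps matter.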

The Model Context Protocol (MCP)

Until 2024, every agent framework had its own proprietary way of defining tools, connecting to external services, and passing results back to the model. This created fragmentation: a tool built for LangChain could not be used in AutoGen, and every new integration required custom glue code. The Model Context Protocol (MCP), introduced by Anthropic in late 2024, is an open standard that defines a uniform interface between AI models and the data sources or tools they connect to.

MCP works as a client–server protocol. An MCP server exposes a set of tools, resources, or prompts through a standardised JSON-RPC interface. An MCP client — which could be Claude Desktop, an IDE plugin, or a custom agent framework — discovers and calls those tools through the same protocol regardless of what the server is built on. This means a single MCP server wrapping, say, a company's CRM can be used by any MCP-compatible model or host without rewriting the integration. The ecosystem is growing rapidly: hundreds of community-built MCP servers now exist for databases, APIs, file systems, browser control, and SaaS platforms.
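To make the JSON-RPC framing concrete, here is a sketch of the kind of messages a client might send to discover and call tools. The `tools/list` and `tools/call` method names follow the MCP specification as I understand it; the `lookup_customer` tool and its arguments are hypothetical, and transport details (stdio or HTTP) and the initial handshake are omitted.

```python
import json
from itertools import count
from typing import Optional

_request_ids = count(1)

def jsonrpc_request(method: str, params: Optional[dict] = None) -> str:
    """Build a JSON-RPC 2.0 request string with an auto-incrementing id."""
    message = {"jsonrpc": "2.0", "id": next(_request_ids), "method": method}
    if params is not None:
        message["params"] = params
    return json.dumps(message)

# Ask the server which tools it exposes.
list_request = jsonrpc_request("tools/list")

# Call one of them. The tool name and arguments are invented for this example.
call_request = jsonrpc_request(
    "tools/call",
    {"name": "lookup_customer", "arguments": {"customer_id": "C-1001"}},
)
```

Because every server speaks the same message format, the client code above does not change when the CRM server is swapped for a database or file-system server.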

For developers, MCP is significant because it standardises the interface that has historically been the most painful part of building agents. Rather than writing bespoke function-calling schemas for every capability, you build or install an MCP server once and it becomes available to any compliant model. The protocol also defines resource exposure (streaming file contents, database rows) and prompt templates — making it broader than pure tool-calling. MCP is increasingly treated as infrastructure-level for agentic systems the way HTTP is for web services.

Multi-agent systems

Complex tasks can be distributed across multiple specialised agents. A research agent gathers information, a writing agent drafts content, a critique agent reviews it, and an orchestrator agent coordinates the workflow. This pattern allows more reliable completion of long, multi-step tasks by decomposing them — but it introduces coordination complexity, higher cost (each agent makes its own LLM calls), and compounding error risk: a mistake early in the chain can propagate and amplify through subsequent steps.

Memory architecture

Short-term memory is the context window — everything the model has seen in the current session. Long-term memory is external storage — facts, preferences, or summaries of past interactions persisted in a database and retrieved when relevant. Episodic memory refers to summaries of previous sessions injected at the start of new ones. Most current agent frameworks have limited and unreliable long-term memory. This remains one of the most actively researched problems in the field, and a major practical limitation for agents that need to learn from experience over time.

Real-world examples

You have already encountered agentic systems: GitHub Copilot completing a multi-file refactor autonomously, Perplexity chaining multiple web searches to answer a complex question, Claude Projects maintaining context across sessions, or an AI sales assistant that researches a prospect, drafts an outreach email, and schedules a follow-up without human intervention. Enterprise agentic deployments — AI that takes actions inside business systems — are accelerating significantly through 2025 and beyond.

Key takeaways

  • Agents act in loops: observe, plan, act, repeat — not a single response
  • Tool use is what separates agentic AI from conversational chatbots
  • Multi-agent systems increase capability but add coordination complexity and compounding error risk
  • Long-term memory is still an open problem — most agents do not reliably learn between sessions
  • Agentic failures compound — errors early in a task propagate forward through subsequent steps
Module 06·~15 min·Technical

Hardware & Infrastructure

GPUs, data centres, custom silicon, and the energy question

Why GPUs?

Graphics Processing Units were designed for the parallel computations required to render graphics — performing thousands of simple operations simultaneously rather than a smaller number of complex ones sequentially. Training and running neural networks requires almost identical mathematical operations (matrix multiplications at very large scale). This is why NVIDIA, originally a gaming hardware company, became the most important infrastructure supplier in the AI industry. A modern frontier model training run uses thousands of GPUs operating in parallel for weeks or months.

The core operation in a neural network is a matrix multiplication: thousands of simple numerical operations applied simultaneously across large grids of numbers. CPUs are designed for sequential, complex tasks and have a small number of powerful cores. GPUs have thousands of simpler cores designed to run in parallel, which is exactly what neural network workloads require. This makes a modern GPU not just faster but fundamentally better suited to AI than a CPU. Beyond raw processing speed, memory bandwidth (the rate at which data moves between GPU memory and the processing cores) matters enormously, and frontier GPUs are optimised for it as much as for raw computation.

NVIDIA's dominance and the CUDA moat

NVIDIA's H100 and H200 GPUs are the gold standard for AI training and inference. But the hardware alone does not explain NVIDIA's position; the CUDA software ecosystem does. Built over more than 15 years, CUDA is a parallel computing platform that most AI frameworks, libraries, and tools are optimised for. Switching away from NVIDIA GPUs requires not just different hardware but porting significant amounts of software. This creates a durable moat that competitors such as AMD, Intel, and Qualcomm are still working to cross.

Custom silicon

Google's Tensor Processing Units are purpose-built chips for neural network workloads: highly efficient for training and inference at Google's scale, and available to outside customers only through Google Cloud. Other major players, including Amazon (Trainium, Inferentia), Microsoft (Maia), and Meta (MTIA), have developed custom silicon to reduce dependence on NVIDIA and control their own cost structures. These chips are not sold on the open market: they run in their makers' own data centres, and some, such as Amazon's, can be rented through the maker's cloud. Their existence signals the scale at which the major players are operating.

Data centres and energy

A single NVIDIA H100 GPU draws approximately 700 watts of power. A cluster of 10,000 GPUs — modest by frontier training standards — consumes roughly 7 megawatts, equivalent to powering a small town. Training a frontier model like GPT-4 is estimated to have consumed 50–100 gigawatt-hours. At scale, energy availability and cost have become the primary constraint on AI development. Microsoft, Google, and Amazon have all signed agreements with nuclear power providers to secure dedicated generation capacity for AI data centres.
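The cluster figure follows directly from the per-GPU number, and the same arithmetic gives a rough energy estimate for a training run. The 90-day duration below is an assumption chosen for illustration, and the calculation counts GPU draw only; cooling, networking, and CPUs would add to it.

```python
GPU_POWER_WATTS = 700      # approximate H100 draw, as stated in the text
CLUSTER_GPUS = 10_000

# Watts -> megawatts
cluster_megawatts = GPU_POWER_WATTS * CLUSTER_GPUS / 1_000_000

def training_energy_gwh(megawatts: float, days: float) -> float:
    """Energy for a run at constant power: MW * hours gives MWh, then MWh -> GWh."""
    return megawatts * 24 * days / 1_000

# Hypothetical 90-day run on the 10,000-GPU cluster.
ninety_day_run_gwh = training_energy_gwh(cluster_megawatts, days=90)
```

Under these assumptions the cluster draws 7 MW and a 90-day run uses roughly 15 GWh, so the larger published estimates imply bigger clusters, longer runs, or substantial overhead beyond the GPUs themselves.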

Cloud vs. on-premise inference

Most organisations run AI inference in the cloud — AWS, Azure, Google Cloud, or specialist providers like CoreWeave. On-premise inference makes sense in two scenarios: data sovereignty requirements that prohibit cloud processing, or query volumes so high that annual API spend would exceed the capital cost of owning and operating hardware. For most organisations at current scale, cloud inference is the right default. The break-even point for on-premise typically requires substantial usage and a dedicated ML infrastructure team.

Key takeaways

  • GPUs are the primary compute substrate for AI — NVIDIA dominates, with alternatives growing
  • The CUDA software ecosystem creates a durable moat beyond hardware alone
  • Custom silicon from major cloud providers reduces NVIDIA dependency at extreme scale
  • Energy is the new binding constraint on AI infrastructure growth
  • Cloud inference is the correct default for most organisations — on-premise is for specific compliance or scale scenarios
Module 07·~15 min·Business

Cost & Economics

Training vs. inference, API pricing, make vs. buy, and the cost trajectory

Training vs. inference — two very different cost centres

AI has two distinct cost structures. Training is a largely one-time expense — the compute required to create the model. Training GPT-4 is estimated to have cost $50–100 million. Llama 3 70B cost Meta roughly $10–20 million. These numbers are falling with algorithmic improvements, but frontier training remains a capital expenditure accessible only to well-funded labs. Inference is the ongoing cost of running the model to serve requests — what you pay every time someone uses an AI product. Inference costs have fallen dramatically and are the relevant budget line for almost every organisation.

API pricing in practice

Most organisations access AI via API, paying per token. Pricing varies substantially by model tier. Frontier models such as GPT-4o and Claude 3.5 Sonnet typically cost $3–15 per million input tokens. Mid-tier models such as GPT-4o mini and Claude Haiku cost $0.15–1 per million tokens. Open-source models hosted via third-party providers often cost under $0.10 per million tokens.

For context: one million tokens is approximately 750,000 words — a substantial volume of text. Most enterprise applications consuming AI at scale spend thousands to tens of thousands of dollars per month on API calls. Understanding token consumption is essential for accurate AI budget planning, since usage-based pricing requires different financial controls than per-seat SaaS.
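A back-of-envelope budget calculation might look like the sketch below. The request volume, token counts, and the $15 output price are illustrative assumptions (the text above gives input-price ranges only, and providers typically price output tokens separately); substitute your own figures.

```python
def monthly_api_cost(
    requests_per_day: int,
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
    days: int = 30,
) -> float:
    """Estimated monthly spend in dollars for a fixed request profile."""
    cost_per_request = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    ) / 1_000_000
    return requests_per_day * days * cost_per_request

# Hypothetical workload: 20,000 requests a day, 1,500 tokens in, 300 tokens out.
cost = monthly_api_cost(
    requests_per_day=20_000,
    input_tokens=1_500,
    output_tokens=300,
    input_price_per_million=3.00,
    output_price_per_million=15.00,   # assumed output price for illustration
)
```

With these numbers each request costs a little under one cent and the monthly bill comes to about $5,400, which is why prompt length and model tier are the two levers that matter most.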

Make vs. buy decisions

Should your organisation fine-tune its own model or use a commercial API? For the overwhelming majority of organisations, the answer is: buy. Fine-tuning requires ML engineering expertise, data curation, training infrastructure, and ongoing maintenance. It makes economic sense in a narrow set of circumstances: data privacy requirements that prevent external API use, query volumes so high that annual API spend exceeds the cost of owning and operating a model, or capability gaps that commercial models genuinely cannot fill. Start with a commercial API. Revisit when scale and requirements justify the investment.

The cost trajectory

AI inference costs have followed a consistent pattern: roughly 10x cheaper every 12–18 months. This trajectory has significant strategic implications. Use cases that are economically unviable at today's pricing — processing millions of customer records, running real-time analysis across all support interactions — may be trivially affordable in 18 to 24 months. Evaluate AI investments against the expected cost trajectory, not just today's pricing. The business case for automation that looks marginal now may be compelling within two years.
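To see what that trajectory implies for a business case, the sketch below projects a cost forward under the stated "10x every 12 to 18 months" pattern. It assumes the historical rate simply continues, which is not guaranteed, and the $100,000 starting figure is made up.

```python
def projected_cost(cost_today: float, months_ahead: float, months_per_10x: float) -> float:
    """Cost after `months_ahead` if prices fall 10x every `months_per_10x` months."""
    return cost_today / (10 ** (months_ahead / months_per_10x))

# Hypothetical workload costing $100,000 per year at today's prices.
optimistic = projected_cost(100_000, months_ahead=24, months_per_10x=12)
conservative = projected_cost(100_000, months_ahead=24, months_per_10x=18)
```

Two years out, the same workload would cost about $1,000 at the faster rate and about $4,600 at the slower one. Either way the result is a small fraction of today's figure.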

Key takeaways

  • Training is a one-time capital cost; inference is the ongoing operational cost — focus budget planning on inference
  • Inference costs have fallen roughly 10x every 12–18 months, so today's pricing is not the floor
  • For almost all organisations, buying via API beats training your own model
  • Model tier selection has major cost implications — frontier capability is rarely needed for most tasks
  • Evaluate AI investments against the cost trajectory, not just current pricing
Module 08·~15 min·All levels

Data, Privacy & Enterprise Risk

Training data, input handling, enterprise agreements, and prompt injection

Training data and copyright

Frontier models are trained on vast datasets scraped from the internet, including content that may be under copyright. This has led to significant litigation — most notably The New York Times versus OpenAI — with outcomes still being determined in courts. For enterprise users deploying AI products, the practical exposure is limited. You are not redistributing training data; you are using the model's generated outputs. The legal risk sits primarily with the model providers, not their API customers. Monitor for developments, but this should not be a reason to avoid AI deployment.

What happens to your inputs

This is the question every enterprise procurement and legal team asks. The answer depends on contract type. By default, OpenAI, Anthropic, and Google do not use API inputs to train future models — the API channel is treated differently from consumer products. Enterprise agreements — Azure OpenAI Service, Anthropic's enterprise tier, Google Vertex AI — typically include contractual guarantees that your data is not used for model training.

Consumer products (ChatGPT free, Gemini free) operate under different terms where inputs may be used for training unless opted out. Do not extrapolate consumer product terms to enterprise API agreements — they are not the same document, and the distinction matters significantly for compliance.

Fine-tuning on proprietary data

Fine-tuning a model on your organisation's internal data — documents, emails, customer records — raises specific questions: Who holds the resulting model weights? What data was exposed during the training process, and to whom? Is the fine-tuned model itself a security risk if accessed without authorisation? These are solvable problems but require deliberate data governance decisions before beginning. Establish data classification, access controls, and retention policies before engaging in any proprietary fine-tuning project.

Data residency and sovereignty

Regulatory frameworks in the EU, financial services, healthcare, and government require that certain categories of data remain within specific geographic boundaries. Major cloud providers offer region-locked deployments (Azure OpenAI in EU regions, AWS Bedrock with data residency guarantees, Google Cloud's regional endpoints) that satisfy most regulatory requirements. On-premise deployment of open-source models satisfies even the strictest residency requirements but carries operational overhead. Data residency is now a procurement decision as much as a technical one.

Prompt injection

Prompt injection is a security vulnerability specific to AI systems: malicious instructions embedded in user inputs or retrieved content can override system prompts and cause the model to behave in unintended ways. An AI assistant that processes incoming emails could be manipulated by a carefully crafted email body designed to exfiltrate information, override access controls, or produce harmful outputs. Mitigations include input validation, output monitoring for anomalous behaviour, limiting the actions agents can take autonomously, and applying the principle of least privilege to AI system permissions.
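Two of those mitigations, limiting autonomous actions and least privilege, can be enforced in application code rather than left to the model. The sketch below is one possible shape for such a gate; the action names and the split between low-risk and high-risk actions are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical action names; which actions count as high-risk is an assumption.
AUTO_ALLOWED = {"summarise_email", "label_email"}
NEEDS_APPROVAL = {"send_email", "forward_email"}

@dataclass
class Decision:
    allowed: bool
    reason: str

def gate_action(action: str, human_approved: bool = False) -> Decision:
    """Decide whether an action requested by the model may run."""
    if action in AUTO_ALLOWED:
        return Decision(True, "low-risk action on the allowlist")
    if action in NEEDS_APPROVAL:
        if human_approved:
            return Decision(True, "high-risk action approved by a human")
        return Decision(False, "high-risk action requires human approval")
    # Default deny: anything not explicitly listed is refused.
    return Decision(False, "action not on any allowlist")
```

Even if a crafted email persuades the model to request `forward_email`, the request stops at the gate until a person approves it, so a successful injection does not automatically become a successful exfiltration.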

Key takeaways

  • Enterprise API agreements typically prohibit training on your data — consumer product terms do not offer the same guarantee
  • Read the data processing agreement; marketing claims are not contractual commitments
  • Fine-tuning on proprietary data requires data governance decisions before you begin
  • Data residency requirements are solvable via major cloud providers' regional deployments
  • Prompt injection is a real and underappreciated security risk in agentic and document-processing applications
Module 09·~10 min·Policy & Leadership

AI Safety & Guardrails

Alignment, RLHF, Constitutional AI, red-teaming, and defence-in-depth

Alignment: the core challenge

Alignment refers to the challenge of ensuring AI systems do what humans actually want — not merely what they have been literally instructed to do. An LLM trained purely to predict text can produce harmful, false, or manipulative outputs without any intent to do so. Alignment research attempts to close the gap between following the literal instruction and doing what was genuinely intended, in context, including situations the instruction did not anticipate.

This problem scales with capability. A more capable but misaligned system can cause more harm. This is the foundational concern driving AI safety research, and it becomes increasingly important as systems gain greater autonomy.

RLHF: teaching models human preferences

Reinforcement Learning from Human Feedback is the primary alignment technique used by frontier model developers. Human raters are shown pairs of model outputs and asked to indicate which is better. This preference data trains a reward model — a separate neural network that learns to assign a scalar score to any (prompt, response) pair, predicting what human raters would prefer. The reward model then drives a reinforcement learning loop: the LLM is fine-tuned using Proximal Policy Optimisation (PPO) to generate responses that earn higher scores. A KL-divergence penalty prevents the policy from drifting too far from the base pre-trained distribution — without it, the model would 'collapse' into a narrow set of reward-hacking responses that score well without being genuinely better.

RLHF is effective but imperfect. It encodes the biases and blind spots of its human raters, can lead to reward hacking (the model learns to produce outputs that score highly on the reward model without being genuinely better), and can make models overly cautious or deferential in ways that reduce usefulness. The rater pool — who they are, what they are instructed to optimise for, and how they are paid — has substantial influence on the resulting model's behaviour, making it one of the least transparent aspects of frontier model development.

RLHF training pipeline

Phase 1

Collect human preference data

Human raters compare pairs of model responses to the same prompt and indicate which is better. Thousands of labelled comparisons build a preference dataset.

Phase 2

Train a reward model

A separate neural network is trained on the preference dataset. It learns to assign a scalar reward score to any (prompt, response) pair — predicting what human raters would prefer.

Phase 3

Fine-tune LLM with reinforcement learning (PPO)

The LLM is updated using Proximal Policy Optimisation. Responses that earn high reward model scores are reinforced; low-scoring responses are suppressed. A KL-divergence penalty prevents the model drifting too far from the base pre-trained distribution.

The human rater pool — their demographics, instructions, and incentives — significantly shapes the resulting model. This is one of the least transparent aspects of frontier model development.
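The two quantities at the heart of phases 2 and 3 can be written down compactly. The sketch below uses a common formulation (a Bradley-Terry style pairwise loss for the reward model and a per-sample log-probability difference as the KL penalty term); the text above does not specify exact formulas, so treat these as one standard choice rather than any particular lab's recipe. The value of `beta` is an arbitrary illustration.

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise reward-model loss: -log(sigmoid(r_chosen - r_rejected)).

    Small when the reward model scores the human-preferred response higher.
    """
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))   # identical to -log(sigmoid(margin))

def kl_penalised_reward(
    reward: float,
    logprob_policy: float,
    logprob_reference: float,
    beta: float = 0.1,
) -> float:
    """Reward used in the RL step: reward-model score minus a drift penalty.

    The log-probability difference is a per-sample estimate of the
    KL divergence between the fine-tuned policy and the base model.
    """
    return reward - beta * (logprob_policy - logprob_reference)
```

For example, a response with reward 1.0 that the policy finds much more likely than the base model does (log-probabilities of -1.0 versus -3.0) ends up with an effective reward of 0.8 at `beta = 0.1`: the further the policy drifts, the more reward it gives back.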

Constitutional AI

Anthropic's Constitutional AI approach provides the model with a set of principles and uses AI feedback rather than human feedback to evaluate outputs against those principles. During training, the model critiques and revises its own outputs against this constitution, and the revised outputs and AI-generated preferences become training data. This scales more efficiently than human feedback, produces models that can articulate why they declined a request, and is a core part of how Anthropic trains the Claude model series, alongside human feedback.

Red-teaming and jailbreaks

Red-teaming is the deliberate, systematic attempt to find failure modes in AI systems before deployment — adversarially prompting the model to produce harmful outputs, bypass safety measures, or behave unexpectedly. Major labs conduct extensive internal red-teaming and commission external red teams before major model releases. Jailbreaks are successful circumventions of safety guardrails. Despite extensive red-teaming, jailbreaks continue to emerge. Safety measures should be understood as a significant reduction in risk, not an elimination of it.

Output filtering and defence-in-depth

Model-level alignment is one layer of safety. Most production deployments add additional filtering layers: classifiers that detect harmful or policy-violating content in model outputs, input monitoring for known attack patterns, and human review pipelines for high-stakes automated decisions. The right approach is defence-in-depth — multiple independent safety layers, so the failure of any single layer does not produce an unacceptable outcome.

Key takeaways

  • Alignment is the problem of ensuring AI does what we actually want, not just what we literally instructed
  • RLHF is the primary alignment technique — effective but imperfect, encoding rater biases
  • Constitutional AI scales alignment feedback more efficiently and enables explainable refusals
  • Red-teaming finds failure modes before deployment — assume safety measures are meaningful but not absolute
  • Defence-in-depth (model-level combined with application-level filtering) is the correct production architecture
Module 10·~10 min·All levels

Governance & Regulation

EU AI Act, US frameworks, enterprise governance, and what professionals need to act on now

The EU AI Act

The EU AI Act, the world's first comprehensive AI regulation, establishes a tiered risk framework. Unacceptable-risk applications are prohibited: AI-based social scoring, real-time remote biometric identification in public spaces by law enforcement (subject to narrow exceptions), and manipulative techniques that exploit people's vulnerabilities. High-risk applications face substantial obligations: conformity assessments, mandatory human oversight, transparency requirements, logging and audit trails, and registration in an EU database. High-risk categories include AI used in hiring and employment decisions, credit scoring, healthcare diagnosis, educational assessment, law enforcement, and critical infrastructure.

Limited-risk applications face disclosure obligations: chatbots must identify themselves as AI systems. Minimal-risk applications — most AI tools in common use — face no specific obligations. Any organisation deploying AI that affects EU citizens or operates within the EU needs to assess where their applications fall. Fines for non-compliance reach up to €35 million or 7% of global annual turnover for the most serious violations.

US frameworks

The US approach has been more fragmented: executive orders, voluntary commitments from major AI developers, and sector-specific guidance from financial, healthcare, and other regulators. The NIST AI Risk Management Framework provides a voluntary governance structure — govern, map, measure, manage — that has been widely adopted as a de facto enterprise standard. Sector-specific regulators including the SEC, FDA, CFPB, OCC, and EEOC have issued or are actively developing AI-specific guidance for their domains. Comprehensive federal AI legislation remains pending, but the regulatory direction is clear: disclosure, accountability, and human oversight requirements will grow.

Enterprise AI governance

Regardless of regulatory requirements, robust AI governance is a risk management priority. An enterprise AI governance programme typically covers five areas: an inventory of all AI systems deployed, by whom, for what purpose, and on what data; risk assessment covering what decisions each system influences and the impact of errors; accountability with clear ownership and shutdown authority; audit trails logging inputs, outputs, and decisions; and human oversight defining which decisions require human review rather than full automation, built into system design from the start.

What professionals need to act on now

If your organisation operates in or sells to EU markets, audit your AI deployments for high-risk classification under the AI Act. Adopt the NIST AI RMF as your internal governance baseline if you do not already have one. Review vendor contracts for AI-specific data processing terms, audit rights, and liability clauses — standard SaaS agreements rarely cover AI adequately. Establish an AI system inventory now, even if it is just a spreadsheet. The governance infrastructure that takes a week to build today will take months to reconstruct under regulatory scrutiny.

Key takeaways

  • The EU AI Act creates binding obligations for high-risk AI applications — non-compliance carries significant financial penalties
  • High-risk classifications include hiring, credit, healthcare, education, and law enforcement applications
  • The US relies on voluntary frameworks and sector-specific guidance — NIST AI RMF is the de facto enterprise standard
  • Enterprise AI governance requires inventory, risk assessment, accountability, audit trails, and human oversight mechanisms
  • Build your AI inventory and review vendor contracts now — regulatory expectations will only increase
Module 11·~20 min·Engineer

LLM Evaluation

Why evaluating AI is hard, LLM-as-a-Judge, component-level RAG evals, and building a production evaluation framework

Why LLM evaluation is fundamentally different

Traditional ML models produce discrete, verifiable outputs: a classifier predicts a label; a regression model predicts a number. Correctness can be computed directly against ground truth, as a match for a label or an error for a number. LLMs produce open-ended text where quality is multidimensional: accuracy, helpfulness, fluency, tone, format compliance, and safety matter simultaneously. There is rarely a single correct answer. Two responses can both be correct while differing substantially in quality.

Standard metrics inherited from NLP — BLEU and ROUGE — measure surface-level token overlap between a generated response and a reference string. They fail badly for modern LLM outputs: a response using entirely different words from the reference but semantically superior will score poorly; a response that paraphrases the reference closely but is factually wrong can score highly. BLEU correlates weakly with human judgement in conversational AI contexts and should not be used as a primary signal for evaluating LLM systems.
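A toy example makes the failure concrete. The function below is a simplified unigram-overlap F1, a stand-in for the idea behind ROUGE-1 rather than the official BLEU or ROUGE implementations, and the three sentences are invented for the demonstration.

```python
from collections import Counter

def unigram_f1(candidate: str, reference: str) -> float:
    """F1 over shared lowercase word tokens (simplified ROUGE-1-style score)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())      # tokens shared by both, with counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "the capital of australia is canberra"
wrong_but_similar = "the capital of australia is sydney"
right_but_reworded = "canberra serves as australia's seat of government"

score_wrong = unigram_f1(wrong_but_similar, reference)
score_right = unigram_f1(right_but_reworded, reference)
```

The factually wrong sentence scores about 0.83 because it shares five of six words with the reference, while the correct paraphrase scores about 0.31. An overlap metric rewards the wrong answer.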

LLM-as-a-Judge

The dominant approach to reference-free LLM evaluation is LLM-as-a-Judge: using a capable frontier model (typically GPT-4o or Claude Opus) to evaluate the outputs of another model. The judge receives the original prompt, the response being evaluated, and a structured rubric specifying criteria such as accuracy, helpfulness, faithfulness, and format compliance. It produces a numeric score and a written rationale. G-Eval (2023) formalised this approach, and research from the same period reported that strong LLM judges can agree with human judgements at rates comparable to inter-annotator agreement between humans.

Two modes are standard: pointwise scoring (evaluate one response on a 1–5 scale against a rubric) and pairwise preference (show two responses, ask which is better and why). Pairwise preference is generally more reliable — it is easier to rank two responses than to assign an absolute score. Known failure modes include position bias (preference for whichever response is shown first), verbosity bias (preference for longer responses regardless of quality), and self-preference (a model subtly favours outputs stylistically similar to its own). Mitigate these by randomising response order across evaluation calls and averaging scores over multiple runs.

LLM-as-a-judge  ·  evaluation pattern

Inputs

Prompt

"Explain transformer attention in simple terms"

Response A

Generated output from the model under evaluation

Response B

Reference answer or competing model output (optional)

Judge model

GPT-4o, Claude Opus, or similar frontier model

Evaluates against a structured rubric — accuracy, helpfulness, faithfulness, format, safety — and produces a numeric score and a written rationale.

Pointwise

4.2 / 5

with written rationale

Pairwise

A > B

response A preferred

Known biases: position bias (prefers whichever response appears first), verbosity bias (prefers longer responses), self-preference (a model favours outputs from similar models). Mitigate by randomising response order and averaging across multiple judge runs.
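The order-randomisation mitigation can be sketched as follows. `stub_judge` is a placeholder for a real judge-model call (here it simply prefers the longer response, mimicking verbosity bias), and the "first"/"second" verdict format is an assumption made for this example.

```python
import random
from collections import Counter

def stub_judge(prompt: str, first: str, second: str) -> str:
    """Stand-in for a judge LLM call; returns 'first' or 'second'.

    This toy judge prefers the longer response, mimicking verbosity bias.
    """
    return "first" if len(first) >= len(second) else "second"

def pairwise_eval(prompt, response_a, response_b, judge=stub_judge, runs=10, seed=0):
    """Run the judge several times with shuffled order and tally wins per label."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(runs):
        labelled = [("A", response_a), ("B", response_b)]
        if rng.random() < 0.5:                 # randomise presentation order
            labelled.reverse()
        verdict = judge(prompt, labelled[0][1], labelled[1][1])
        # Map the positional verdict back to the original label.
        winner = labelled[0][0] if verdict == "first" else labelled[1][0]
        votes[winner] += 1
    return votes

votes = pairwise_eval("Explain attention", "a long detailed answer", "short")
```

With shuffling, a judge that always picks whichever response it sees first ends up splitting its votes roughly evenly between A and B, so pure position bias shows up as a near tie instead of a false preference. Shuffling does nothing for verbosity bias, which needs rubric wording or length controls.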

Component-level evaluation for RAG

A RAG pipeline has two stages that fail independently: retrieval and generation. Evaluating only end-to-end output quality obscures which stage is causing problems — a poor answer could result from retrieving wrong documents (retrieval failure) or from the model misrepresenting the retrieved documents (generation failure). These require different fixes. Evaluating components separately is what makes RAG systems debuggable.

The RAGAS framework defines four metrics that decompose RAG evaluation. Context precision measures whether retrieved chunks are relevant to the query. Context recall measures whether all relevant documents are being retrieved. Faithfulness is the RAG-specific hallucination metric: does every claim in the generated answer have supporting evidence in the retrieved context? A model that generates accurate-sounding statements not found in the retrieved documents is hallucinating, even if those statements happen to be factually correct in the world. Answer relevance measures whether the answer addresses the question asked. All four metrics can be computed automatically using an LLM judge, making them practical at scale.

Component-level RAG evaluation  ·  RAGAS framework

Stage

Retrieval

What happens

Query is embedded and searched against the vector DB. Top-k chunks are returned.

Metrics

  • Context precision  ·  are retrieved chunks actually relevant to the query?
  • Context recall  ·  are all relevant documents being found?

Stage

Generation

What happens

LLM generates an answer using the query and retrieved chunks as context.

Metrics

  • Faithfulness  ·  is every claim in the answer supported by retrieved context?
  • Answer relevance  ·  does the answer actually address the question?

Low faithfulness points to generation hallucination; low context recall points to retrieval missing relevant documents. Fixing the wrong component wastes engineering time — component-level metrics tell you exactly where the pipeline is failing.
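Simplified versions of these metrics can be computed from labelled data, which is useful for understanding what each one measures. The functions below are set-based approximations written for this illustration, not the RAGAS library's implementations (which use LLM judgements and rank-aware scoring); the document ids and claim labels are invented.

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Share of retrieved chunks that are actually relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for chunk in retrieved if chunk in relevant) / len(retrieved)

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Share of relevant chunks that were retrieved."""
    if not relevant:
        return 1.0
    return len(set(retrieved) & relevant) / len(relevant)

def faithfulness(claim_supported: list[bool]) -> float:
    """Share of claims in the answer that the retrieved context supports."""
    if not claim_supported:
        return 1.0
    return sum(claim_supported) / len(claim_supported)

# Hypothetical query: three documents are relevant, four were retrieved.
retrieved = ["doc1", "doc4", "doc7", "doc9"]
relevant = {"doc1", "doc2", "doc7"}

precision = context_precision(retrieved, relevant)
recall = context_recall(retrieved, relevant)
# Hypothetical answer with three claims, one of them unsupported by the context.
faith = faithfulness([True, True, False])
```

Here precision is 0.5 (half the retrieved chunks are noise) and recall is about 0.67 (doc2 was missed), both retrieval problems, while the faithfulness of about 0.67 points separately at the generation stage.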

Multi-turn and task-completion evaluation

Single-turn evals measure one response in isolation. Most production applications are multi-turn or agentic — they span conversations or multi-step workflows. Multi-turn evals assess whether the model maintains coherent context across a conversation, resolves ambiguities gracefully, and handles contradictions without losing track of prior context.

For agentic systems, task-completion rate is the most important metric: given a defined goal, does the agent reach it? This requires constructing test suites with verifiable end states. Coding agents can be evaluated against test suites; research agents against verifiable factual claims; customer support agents against resolution criteria. Trajectory evaluation goes further — assessing not just whether the agent succeeded but whether it did so efficiently, without unnecessary tool calls or redundant reasoning steps. A correct answer reached via 30 LLM calls when 5 would suffice is an engineering problem.
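Both measurements reduce to simple aggregates once each test run records its outcome and cost. The sketch below assumes a hypothetical `Trajectory` record and a per-task `baseline_calls` figure (the number of LLM calls a reference solution needs); neither comes from a specific framework.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    succeeded: bool        # did the agent reach the verifiable end state?
    llm_calls: int         # calls the agent actually made
    baseline_calls: int    # hypothetical: calls a reference solution needs

def task_completion_rate(runs: list[Trajectory]) -> float:
    """Fraction of runs that reached the goal."""
    return sum(run.succeeded for run in runs) / len(runs) if runs else 0.0

def mean_call_overhead(runs: list[Trajectory]) -> float:
    """Average ratio of actual to baseline calls, over successful runs only."""
    successes = [run for run in runs if run.succeeded]
    if not successes:
        return 0.0
    return sum(run.llm_calls / run.baseline_calls for run in successes) / len(successes)

runs = [
    Trajectory(succeeded=True, llm_calls=30, baseline_calls=5),
    Trajectory(succeeded=True, llm_calls=5, baseline_calls=5),
    Trajectory(succeeded=False, llm_calls=12, baseline_calls=5),
    Trajectory(succeeded=True, llm_calls=10, baseline_calls=5),
]

completion = task_completion_rate(runs)
overhead = mean_call_overhead(runs)
```

This suite completes 75% of tasks but uses three times the baseline number of calls on average when it succeeds, so the agent is adequate on outcome and wasteful on trajectory.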

Building a production evaluation framework

Evaluation in production has two distinct purposes. Offline evaluation runs test suites before deployment to catch regressions and validate that prompt changes or model upgrades actually improve quality. Online evaluation samples live traffic to detect distributional shift and real-world failure modes that test suites did not anticipate. Both are necessary; neither alone is sufficient.

A practical eval stack for a production LLM application typically combines: a curated golden dataset of representative examples that must not regress; automated LLM-as-a-Judge scoring on a sample of live traffic; component-level metrics for RAG systems; and human review queues for low-confidence or flagged outputs. This turns evaluation from a one-time pre-launch check into a continuous engineering practice. Frameworks that operationalise this include RAGAS (RAG-specific metrics), DeepEval (unit-test-style assertions for LLM outputs with CI integration), and Promptfoo (prompt regression testing in CI/CD pipelines).
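The "must not regress" rule for a golden dataset can be sketched as a simple gate in a CI job. The example ids, scores, and the `tolerance` parameter below are hypothetical; tools such as DeepEval and Promptfoo provide their own, richer mechanisms for this.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.0,
) -> list[str]:
    """Return golden-set example ids whose score dropped by more than `tolerance`.

    An example missing from the candidate run counts as a score of 0.0.
    """
    return sorted(
        example_id
        for example_id, base_score in baseline.items()
        if candidate.get(example_id, 0.0) < base_score - tolerance
    )


baseline_scores = {"q1": 0.90, "q2": 0.80, "q3": 1.00}   # current production prompt
candidate_scores = {"q1": 0.95, "q2": 0.60, "q3": 1.00}  # proposed prompt change

regressions = find_regressions(baseline_scores, candidate_scores, tolerance=0.05)
print(regressions)  # ['q2']
```

A deployment pipeline would typically fail the build when this list is non-empty, which is what turns the golden dataset into an enforced regression check rather than a report.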

Key takeaways

  • BLEU and ROUGE measure token overlap — they do not reflect output quality for open-ended LLM responses and should not be primary signals
  • LLM-as-a-Judge with a strong judge model agrees with human raters about as often as human raters agree with one another, which makes it the standard approach for scalable reference-free evaluation
  • Pairwise preference is more reliable than pointwise scoring — randomise response order to mitigate position bias
  • RAG systems require component-level evaluation: measure retrieval (precision, recall) and generation (faithfulness, answer relevance) separately
  • Faithfulness is the key RAG metric — a model that generates claims not supported by its retrieved context is hallucinating, regardless of factual accuracy
  • Production eval requires both offline test suites (regression detection) and online sampling (real-world failure detection) — neither alone is sufficient
Module 12·~20 min·Engineer

Fine-tuning & Adaptation

LoRA, SFT, DPO, GRPO, and the decision framework for when to adapt a model vs. prompt or retrieve

The full training pipeline

The model you interact with via API is the product of up to four distinct training stages, each building on the last. Stage 1 (pre-training) trains on trillions of tokens of raw text, producing a base model that can continue any piece of text but is not conversational. Stage 2 (supervised fine-tuning, SFT) trains on curated instruction-response pairs, making the model helpful and conversational. Stage 3 (preference fine-tuning, RLHF or DPO) uses human preference comparisons to push the model toward helpful, harmless, calibrated outputs. Stage 4 (reasoning fine-tuning, GRPO or similar RL methods) trains on verifiable reasoning tasks — this is what distinguishes models like o3, DeepSeek-R1, and Claude's extended thinking mode from standard instruction-tuned models.

When a vendor describes a model as 'fine-tuned', they almost always mean Stages 2 and 3. When your organisation fine-tunes a model, you are performing an additional Stage 2 (or 3) on top of a base that has already completed the full pipeline — you are adapting a highly capable foundation, not training from scratch.

Four-stage LLM training pipeline

Stage 1

Pre-training

Input

Trillions of tokens of raw text (internet, books, code)

Output

Base model — predicts next token, not conversational

Stage 2

Supervised Fine-tuning (SFT)

Input

Curated instruction-response pairs (~10k–100k examples)

Output

Instruction-following model — conversational and helpful

Stage 3

Preference Fine-tuning (RLHF / DPO)

Input

Human preference comparisons between response pairs

Output

Aligned model — helpful, harmless, calibrated

Stage 4

Reasoning Fine-tuning (GRPO / RL)

Input

Verifiable reasoning tasks with definitive correct answers

Output

Reasoning model — extended step-by-step thinking

Most frontier models pass through all four stages. When your organisation fine-tunes a model, you are running an additional Stage 2 or 3 on top of a model that has already completed the full pipeline.

Why full fine-tuning is impractical for most organisations

Full fine-tuning means updating every single weight in the model — all 7 billion of them in a 7B model, or 70 billion in a 70B model. Doing this requires storing not just the model itself but additional data tracking how each weight should change, which multiplies the memory requirement several times over. A single full fine-tuning run on a 7B model requires roughly 120 GB of GPU memory; a 70B model requires over a terabyte. Very few organisations have hardware at that scale, and renting it is expensive. This is what makes full fine-tuning impractical for most teams.
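The "several times over" can be made concrete with a common back-of-envelope accounting, assumed here rather than taken from any particular framework: mixed-precision training with the Adam optimiser keeps roughly 16 bytes of state per parameter.

```python
# Assumed per-parameter memory for mixed-precision training with Adam.
BYTES_PER_PARAM = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam first moment": 4,
    "fp32 Adam second moment": 4,
}


def full_finetune_memory_gb(n_params: float) -> float:
    """Approximate GPU memory for weights, gradients, and optimiser state."""
    return n_params * sum(BYTES_PER_PARAM.values()) / 1e9


for name, n_params in [("7B", 7e9), ("70B", 70e9)]:
    print(f"{name}: ~{full_finetune_memory_gb(n_params):.0f} GB before activations")
```

Activations and framework overhead come on top of these figures, which is how about 112 GB of state for a 7B model becomes roughly 120 GB in practice.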

Beyond hardware, full fine-tuning risks catastrophic forgetting: overwriting the general capabilities the base model acquired during pre-training in order to specialise for a narrow task. The result can be a model that performs well on your specific use case but has degraded on everything else.

LoRA: parameter-efficient fine-tuning

Low-Rank Adaptation (LoRA) solves the compute problem with a clever shortcut: rather than updating the original model weights, it freezes them entirely and adds a small set of new trainable parameters alongside them. These additions are tiny — think of them as a thin layer of adjustments on top of a frozen foundation. Only the adjustments are trained; the original model is untouched. The result is that instead of updating billions of parameters, you are updating a fraction of a percent of that number while achieving comparable task performance.

This reduction is dramatic in practice. A full fine-tuning run on a 7B model updates all 7 billion parameters; LoRA on the same model might update a few million, a reduction of more than 99%. After training, the adjustments can be merged back into the base model with no slowdown at inference time. QLoRA extends this further by also compressing the frozen base model to use less memory, making it possible to fine-tune a 7B model on a single consumer GPU rather than a specialised cluster. This is why fine-tuning has become accessible to individual practitioners and small teams in a way it was not even two years ago.
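A small sketch of where the numbers come from and why merging is free at inference time. The model shape (32 layers, hidden size 4096), the choice of adapting two projection matrices per layer, and rank 8 are illustrative assumptions; the tiny matrices at the end are toy values.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in one adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


# Assumed 7B-class shape; adapters on two projection matrices per layer.
hidden, layers, rank, adapted_per_layer = 4096, 32, 8, 2
trainable = layers * adapted_per_layer * lora_params(hidden, hidden, rank)
print(trainable)                                # 4194304
print(f"{100 * trainable / 7_000_000_000:.3f}% of 7B weights")


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_scaled(m, n, scale):
    return [[p + scale * q for p, q in zip(r1, r2)] for r1, r2 in zip(m, n)]


# Toy 2x2 frozen weight W with a rank-1 adapter (B is 2x1, A is 1x2).
w = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]
a = [[0.5, 0.0]]
scale = 2.0
x = [[1.0], [2.0]]  # input as a column vector

# During training: frozen path plus the scaled adapter path.
separate = add_scaled(matmul(w, x), matmul(b, matmul(a, x)), scale)
# After training: fold the adapter into the weight once, then use it as usual.
merged_w = add_scaled(w, matmul(b, a), scale)
merged = matmul(merged_w, x)
print(separate == merged)  # True
```

Because the merged matrix has the same shape as the original weight, serving the adapted model costs the same as serving the base model.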

LoRA vs full fine-tuning  ·  trainable parameter share
0%  ·  50%  ·  100% of model weights

Full fine-tuning

100% trainable

All weights updated. Billions of parameters in motion — requires a large GPU cluster.

LoRA

~1% trainable

frozen base model, low-rank adapters added alongside

The base model is frozen; tiny rank-decomposed matrices are trained alongside and merged back at inference — no runtime overhead.

LoRA trains a small fraction of the parameters needed for full fine-tuning, achieving comparable results at a fraction of the compute cost. QLoRA extends this further by also quantising the frozen base model, enabling fine-tuning on a single consumer GPU.

SFT vs preference fine-tuning vs reasoning fine-tuning

Supervised fine-tuning (SFT) trains the model on (prompt, ideal response) pairs using standard cross-entropy loss — the same objective as pre-training, just on your curated data. The model learns to imitate the provided responses. SFT is effective for teaching output format, domain tone, task-specific vocabulary, and consistent response structure. Its ceiling is the quality of your training data: the model cannot produce outputs better than its examples.

Preference fine-tuning (RLHF or DPO) moves beyond imitation. DPO (Direct Preference Optimisation) presents the model with pairs of responses to the same prompt and updates it to prefer the better one — without a separate reward model or RL training loop, making it significantly simpler than RLHF while achieving comparable results. Reasoning fine-tuning (GRPO) takes a different approach: rather than showing correct outputs, it uses verifiable tasks (maths, code, logic puzzles) and rewards the model for reaching correct answers, allowing it to develop its own reasoning strategies through trial and error. This is the technique behind DeepSeek-R1 and is now widely used to produce models with strong chain-of-thought reasoning.
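The three training signals can be contrasted in a few lines. This is a simplified sketch on plain numbers rather than a training loop: the probabilities, log-probabilities, and rewards are made-up inputs, `beta` is the usual DPO temperature with an assumed value, and the GRPO function shows only the group-relative advantage step, not the full policy update.

```python
import math
import statistics


def sft_loss(reference_token_probs: list[float]) -> float:
    """Cross-entropy: mean negative log-probability of the reference tokens."""
    return -sum(math.log(p) for p in reference_token_probs) / len(reference_token_probs)


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # How much more the policy favours the chosen response over the rejected
    # one, compared with the frozen reference model.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)), written as log(1 + exp(-beta * margin)).
    return math.log1p(math.exp(-beta * margin))


def grpo_advantages(rewards: list[float]) -> list[float]:
    """Score each sampled answer relative to the other answers in its group."""
    mean = statistics.fmean(rewards)
    spread = statistics.pstdev(rewards)
    if spread == 0:
        return [0.0 for _ in rewards]
    return [(r - mean) / spread for r in rewards]


# SFT: imitating the reference is cheap when the model already agrees with it.
print(f"{sft_loss([0.9, 0.8, 0.95]):.3f} vs {sft_loss([0.2, 0.1, 0.3]):.3f}")

# DPO: loss falls below log(2) once the policy prefers the chosen response
# more strongly than the reference does.
print(f"{dpo_loss(-10.0, -12.0, -11.0, -11.0):.3f} vs {math.log(2):.3f}")

# GRPO: four sampled answers to one maths problem, reward 1 if correct.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # [1.0, -1.0, -1.0, 1.0]
```

Note the difference in what each signal needs: SFT needs an ideal response, DPO needs a better and a worse response, and GRPO needs only a way to check the final answer.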

When to fine-tune: the decision framework

Prompting, RAG, and fine-tuning address different failure modes and should be attempted in that order. Prompting costs nothing and enables instant iteration — exhaust it first. RAG solves knowledge gaps cheaply and keeps your knowledge base current without retraining. Fine-tuning changes how the model behaves at a fundamental level — its style, format, reasoning approach, and implicit assumptions.

Fine-tune when: the model consistently produces incorrect format or tone despite well-crafted system prompts; the task requires domain-specific reasoning patterns that prompting cannot reliably elicit; or latency constraints prevent a retrieval step. Do not fine-tune to add factual knowledge: the model will memorise your facts imperfectly, and that knowledge becomes stale the moment training ends. For knowledge, RAG is almost always the better answer.

Prompting · RAG · fine-tuning  ·  decision matrix

Prompting only

Use when

Task is within the model's capability; you need fast iteration

Not a fit when

Model lacks domain knowledge; output format is highly specific

Cost: Near zero  ·  Knowledge: Static (training cutoff)

RAG

Use when

Model needs current, proprietary, or large-volume knowledge

Not a fit when

You need to change reasoning style or output behaviour

Cost: Low–medium  ·  Knowledge: Dynamic (update without retraining)

Fine-tuning (SFT)

Use when

Consistent format, domain tone, or behaviour the model gets wrong despite good prompting and RAG

Not a fit when

You just need the model to know more facts — use RAG instead

Cost: Medium (LoRA) to high (full)  ·  Knowledge: Static (baked into weights at training time)

Try in order: prompting, then RAG, then fine-tuning. Each step adds cost and complexity — only escalate when the previous approach genuinely cannot meet the requirement.

Key takeaways

  • Models go through up to 4 training stages: pre-training → SFT → preference fine-tuning → reasoning fine-tuning
  • Full fine-tuning requires 120 GB+ GPU VRAM for a 7B model — impractical without significant ML infrastructure
  • LoRA reduces trainable parameters by ~99% by training small rank-decomposed matrices alongside frozen weights
  • QLoRA combines LoRA with 4-bit quantisation — enabling 7B model fine-tuning on a single consumer GPU
  • DPO is now the preferred alternative to RLHF for preference alignment — simpler, no separate reward model needed
  • Correct sequence: prompting first, RAG for knowledge gaps, fine-tuning only for persistent behaviour and format problems
Module 13·~20 min·Engineer

LLMops

Serving, optimisation, and observability — running LLMs reliably and cost-effectively in production

Why LLM deployment is different

Deploying an LLM is not like deploying a conventional API. Standard web services are largely stateless: a request arrives, the server processes it in milliseconds, a response is returned. LLMs are autoregressive: they generate one token at a time, sequentially, with each token depending on all previous ones. A 500-token response requires 500 forward passes through a multi-billion-parameter model. Response latency is measured in seconds, not milliseconds, and scales with output length — a simple question with a long answer takes longer than a complex question with a short one.
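A first-order latency model makes the dependence on output length explicit. The time-to-first-token and per-token figures below are hypothetical placeholders; real values depend on the model, hardware, and load.

```python
def response_latency_s(
    output_tokens: int,
    time_to_first_token_s: float,
    seconds_per_token: float,
) -> float:
    """Rough end-to-end latency: prompt processing, then one step per token."""
    if output_tokens <= 0:
        return 0.0
    # The first token arrives after the prompt is processed; each further
    # token costs one more sequential forward pass.
    return time_to_first_token_s + (output_tokens - 1) * seconds_per_token


TTFT_S = 0.4        # hypothetical time to first token
PER_TOKEN_S = 0.02  # hypothetical decode time per token

short_answer = response_latency_s(50, TTFT_S, PER_TOKEN_S)
long_answer = response_latency_s(500, TTFT_S, PER_TOKEN_S)
print(f"50-token answer:  {short_answer:.2f} s")
print(f"500-token answer: {long_answer:.2f} s")
```

Under these assumptions the question itself barely matters: the tenfold difference in output length drives almost all of the difference in latency.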

This creates engineering constraints that differ fundamentally from conventional software. GPU memory is the binding resource rather than CPU or RAM. Throughput and latency are in direct tension: serving more concurrent requests increases throughput but may increase individual response latency. Cost scales directly with token volume, creating financial exposure that per-seat SaaS pricing does not. These constraints demand a specialised operational discipline — LLMops — that conventional DevOps does not fully cover.

KV caching

When a model generates text, each new token depends on every token already in the context, not just the most recent one. Without caching, the model re-processes the entire preceding conversation or document on every single generation step. For a long prompt generating a long response, the redundant computation adds up quickly.

KV caching solves this by storing the model's internal representation of all previously processed tokens in GPU memory. On each new generation step, only the newest token needs to be computed fresh; everything prior is retrieved from the cache. This makes generation significantly faster and cheaper — the longer the context, the greater the benefit. The trade-off is memory: keeping a large cache for many concurrent users consumes significant GPU memory, which is why managing it efficiently is one of the core challenges of LLM serving. This is the problem that vLLM's PagedAttention was specifically designed to solve.

KV cache  ·  computed vs reused per generation step

Without cache

System prompt  ·  computed  ·  ~500 tokens
Conversation history  ·  computed  ·  ~800 tokens
Retrieved context (RAG)  ·  computed  ·  ~1,200 tokens
New token generated  ·  computed  ·  1 token

All ~2,501 tokens recomputed every step.

With cache

System prompt  ·  reused  ·  ~500 tokens
Conversation history  ·  reused  ·  ~800 tokens
Retrieved context (RAG)  ·  reused  ·  ~1,200 tokens
New token generated  ·  computed  ·  1 token

Only one token computed per step; the rest is reused from cache.

KV cache stores the Key and Value attention matrices for all prior tokens in GPU VRAM. The trade-off: longer contexts consume more VRAM per concurrent request, limiting how many requests can be served simultaneously.
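Both sides of the trade-off can be estimated with simple arithmetic. The sketch below counts how many token positions the model has to process with and without a cache, and how much memory the cache takes. The model shape (32 layers, 32 key/value heads of dimension 128, 16-bit values) is an assumed 7B-class configuration without grouped-query attention.

```python
def tokens_processed(prompt_tokens: int, new_tokens: int, use_cache: bool) -> int:
    """Total token positions run through the model to generate `new_tokens`."""
    if use_cache:
        # The prompt is processed once; each later step handles one new token.
        return prompt_tokens + max(new_tokens - 1, 0)
    # Without a cache, step i re-processes the prompt plus i generated tokens.
    return sum(prompt_tokens + i for i in range(new_tokens))


def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Memory for keys and values (hence the factor 2) across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


without_cache = tokens_processed(2500, 500, use_cache=False)
with_cache = tokens_processed(2500, 500, use_cache=True)
print(without_cache, with_cache)  # 1374750 2999

per_request = kv_cache_bytes(2500, layers=32, kv_heads=32, head_dim=128)
print(f"{per_request / 1e9:.2f} GB of cache for one 2,500-token request")
```

Under these assumptions a single long request holds over a gigabyte of cache, so a few dozen concurrent requests can need more memory than the model weights themselves.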

Quantisation

Model weights are typically stored as 16-bit floats (FP16 or BF16) during and after training. At inference, weights can be reduced to lower precision, 8-bit integers (INT8) or 4-bit integers (INT4), with minimal quality degradation for most tasks. A 7B model in FP16 requires 14 GB of VRAM for its weights alone; the same model at INT8 requires 7 GB; at INT4, approximately 3.5 GB. The KV cache and activations need additional memory on top. This is why quantised 7B models run on consumer GPUs with 8 GB VRAM, and why 70B models can be served on a single high-end GPU at INT4 rather than requiring a multi-GPU cluster.

Modern quantisation methods such as GPTQ and AWQ minimise quality loss by working out which weights matter most to the model's outputs and protecting them from quantisation error, while quantising less sensitive weights aggressively. GGUF, often listed alongside them, is a file format for distributing quantised models (used by llama.cpp) rather than a quantisation method in itself. The practical trade-off: INT8 has negligible quality loss for most production tasks; INT4 shows measurable degradation on complex multi-step reasoning but remains usable for retrieval, summarisation, classification, and general Q&A. For organisations running open-source models, quantisation is the single most effective lever for reducing inference hardware costs.

Quantisation  ·  VRAM by precision
0%  ·  50%  ·  100% of FP32 footprint
FP32  ·  32-bit
7B: 28 GB  ·  70B: 280 GB

Full precision — used in training. Rarely needed for inference.

FP16 / BF16  ·  16-bit
7B: 14 GB  ·  70B: 140 GB

Standard inference precision — negligible quality loss vs FP32.

INT8  ·  8-bit
7B: 7 GB  ·  70B: 70 GB

Minimal quality loss on most tasks. A 7B model fits in an 8 GB GPU.

INT4 / NF4  ·  4-bit
7B: 3.5 GB  ·  70B: 35 GB

Some quality loss on complex reasoning. A 70B model fits in a 48 GB GPU.

Modern quantisation methods (GPTQ, AWQ) minimise quality loss by protecting the weights that matter most to the model's outputs; GGUF is a common file format for distributing quantised models. QLoRA uses NF4 (4-bit) quantisation for the frozen base during fine-tuning.
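The memory figures in the chart follow directly from parameters times bits, and the core idea of integer quantisation fits in a few lines. The second half of this sketch uses plain per-tensor absmax rounding, a deliberately simple scheme chosen for illustration; it is not how GPTQ or AWQ work, and the weight values are toy numbers.

```python
def weight_memory_gb(n_params: float, bits: int) -> float:
    """Memory for the weights alone at a given precision."""
    return n_params * bits / 8 / 1e9


for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: 7B = {weight_memory_gb(7e9, bits):g} GB, "
          f"70B = {weight_memory_gb(70e9, bits):g} GB")


def quantise_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    return [round(w / scale) for w in weights], scale


def dequantise(values: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in values]


weights = [0.5, -1.27, 0.03, 1.0]  # toy weight values
q, scale = quantise_int8(weights)
restored = dequantise(q, scale)
print(q)  # [50, -127, 3, 100]
print(max(abs(a - b) for a, b in zip(weights, restored)) <= scale / 2 + 1e-12)
```

Each stored value now takes one byte instead of two or four, and the rounding error per weight is bounded by half the scale, which is why quality loss at INT8 is usually small.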

Inference engines and continuous batching

Naive LLM serving processes one request at a time or groups requests into fixed batches. This is GPU-inefficient: utilisation collapses between requests, and within a fixed batch some sequences finish generating before others, leaving their GPU allocations idle while the batch completes. vLLM addressed both problems. PagedAttention manages KV cache memory using non-contiguous memory pages — similar to how operating systems handle virtual memory — rather than requiring a contiguous block reserved per request upfront. This dramatically increases the number of concurrent requests that can be served from the same hardware.
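The memory effect of paging can be sketched by counting reserved cache slots. The sequence lengths and maximum length below are made up, and the 16-token block size is an assumption for illustration; this models only the bookkeeping, not vLLM's actual allocator.

```python
import math


def slots_reserved_contiguous(seq_lens: list[int], max_len: int) -> int:
    """Upfront reservation: every request gets room for the maximum length."""
    return len(seq_lens) * max_len


def slots_reserved_paged(seq_lens: list[int], block_size: int = 16) -> int:
    """Paged allocation: each request holds only the blocks it has filled."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


seq_lens = [100, 700, 250, 40]  # hypothetical current lengths of four requests
used = sum(seq_lens)

contiguous = slots_reserved_contiguous(seq_lens, max_len=2048)
paged = slots_reserved_paged(seq_lens)
print(used, contiguous, paged)  # 1090 8192 1120
print(f"utilisation: {used / contiguous:.0%} contiguous vs {used / paged:.0%} paged")
```

With paging the only waste is the partly filled last block of each request, so the memory freed can hold the caches of additional concurrent requests.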

Continuous batching (iteration-level scheduling) adds new requests to the running batch as soon as existing ones finish generating, keeping GPU utilisation high rather than waiting for an entire batch to complete. Together, PagedAttention and continuous batching can give vLLM roughly 10–20x higher throughput than naive serving on the same hardware, depending on the workload. For teams self-hosting open-source models, vLLM is a widely used default inference engine. For cloud API users these optimisations are handled by the provider, but understanding them explains why throughput-optimised API tiers (e.g. batch APIs) are significantly cheaper than real-time endpoints.
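A toy scheduler simulation shows why iteration-level scheduling helps. Each request is reduced to the number of tokens it will generate, one "step" is one forward pass over the whole batch, and the request lengths and batch size are invented for the example.

```python
def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Fixed batches: each batch runs until its longest sequence finishes."""
    return sum(
        max(lengths[i:i + batch_size])
        for i in range(0, len(lengths), batch_size)
    )


def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Iteration-level scheduling: a freed slot is refilled at the next step."""
    queue = list(lengths)
    running: list[int] = []  # tokens still to generate for each active request
    steps = 0
    while queue or running:
        # Admit waiting requests into any free slots before the next step.
        while queue and len(running) < batch_size:
            running.append(queue.pop(0))
        steps += 1
        # Every active request emits one token; finished requests leave.
        running = [left - 1 for left in running if left > 1]
    return steps


lengths = [10, 200, 15, 20, 180, 12, 25, 30]  # hypothetical output lengths
static = static_batching_steps(lengths, batch_size=4)
continuous = continuous_batching_steps(lengths, batch_size=4)
print(static, continuous)  # 380 200

total_tokens = sum(lengths)
print(f"slot utilisation: {total_tokens / (static * 4):.0%} static "
      f"vs {total_tokens / (continuous * 4):.0%} continuous")
```

In the static case the short requests in each batch sit idle behind the longest one; with continuous batching the second long request starts as soon as the first short one finishes, so the two long requests overlap.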

Observability: what to instrument

Observability for LLM systems means being able to answer: which calls failed and why, what is the cost per user or feature, where is latency coming from, and are outputs degrading over time? Standard application observability — request logs, error rates, p99 latency — captures part of this but misses the LLM-specific signals that actually matter for debugging and cost control.

Every LLM call should log: the full prompt (system and user), the full response, input and output token counts, latency broken down into time-to-first-token and total generation time, cost, model name and version, and any relevant user or session identifiers. For agentic and RAG systems, traces should capture the entire execution chain (retrieval steps, tool calls, intermediate model outputs), not just the final response. Without this, diagnosing whether a poor output came from a bad retrieval result, a malformed prompt, or model behaviour is guesswork. LangFuse (open-source, full trace support), Helicone (lightweight API proxy requiring minimal code changes), and Arize (enterprise-grade with drift detection) are among the tools that operationalise this.
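The fields listed above map naturally onto a structured log record. This is a minimal sketch using only the standard library: the `LLMCallRecord` class, the model name, and the per-million-token prices are hypothetical, and a real deployment would more likely send these fields to one of the tracing tools rather than print JSON.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class LLMCallRecord:
    model: str                    # model name and version
    system_prompt: str
    user_prompt: str
    response: str
    input_tokens: int
    output_tokens: int
    time_to_first_token_s: float
    total_latency_s: float
    session_id: str
    usd_per_million_input: float  # hypothetical price, set per model
    usd_per_million_output: float

    @property
    def cost_usd(self) -> float:
        """Cost of this call from token counts and per-token prices."""
        return (
            self.input_tokens * self.usd_per_million_input
            + self.output_tokens * self.usd_per_million_output
        ) / 1_000_000

    def to_json(self) -> str:
        """One JSON line per call, ready for a log pipeline."""
        return json.dumps({**asdict(self), "cost_usd": self.cost_usd})


record = LLMCallRecord(
    model="example-model-2025-01",  # hypothetical identifier
    system_prompt="You are a support assistant.",
    user_prompt="Where is my order?",
    response="Let me check that for you.",
    input_tokens=2500,
    output_tokens=500,
    time_to_first_token_s=0.4,
    total_latency_s=10.4,
    session_id="session-123",
    usd_per_million_input=3.0,
    usd_per_million_output=15.0,
)
print(record.to_json())
```

Storing prices alongside token counts lets cost per user or per feature be recomputed later by grouping on `session_id` or any other identifier added to the record.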

LLMops stack  ·  replaceable layers
01

Observability & Monitoring

Trace every LLM call — prompt, response, latency, token count, cost, model version

LangFuse·Helicone·Arize·W&B Weave

02

Evaluation Pipeline

Offline test suites and online LLM-as-judge sampling on live traffic

RAGAS·DeepEval·Promptfoo

03

Orchestration Layer

Prompt management, RAG pipelines, tool dispatch, agent loops

LangChain·LlamaIndex·custom code

04

Inference Engine

Optimised serving — continuous batching, KV cache management, quantisation

vLLM·TGI·LitServe·cloud provider APIs

05

Model & Weights

Foundation model — quantised or full precision, self-hosted or via API

Llama 3·Mistral·GPT-4o API·Claude API

Each layer is independently replaceable. Many teams start with a cloud provider API (bypassing the inference engine layer entirely) and add observability and evaluation as the system matures.

Speculative decoding

Speculative decoding is a latency optimisation that exploits a fundamental asymmetry in autoregressive generation: the large model can check several proposed tokens in a single parallel forward pass, at roughly the cost of generating one token itself. A small, fast draft model generates several candidate tokens ahead; the large target model then verifies all of them at once. Accepted tokens are kept; the first rejected token is replaced by the target model's own choice and drafting resumes from there. With greedy decoding the final output is identical to what the target model would have produced alone, and with sampling the standard acceptance rule preserves the target model's output distribution, so there is no quality trade-off.

When the draft model's predictions are accurate, which is common for routine or predictable text, speculative decoding typically achieves a 2–3x speedup in generation with no quality loss. It is most effective for tasks with predictable output patterns: code completion, template-following, structured data extraction. Applied alongside KV caching, quantisation, and continuous batching, it forms part of the complete optimisation stack that makes large-model inference commercially viable at scale.
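The draft-and-verify loop can be sketched with toy models. Here both "models" are deterministic functions over integer tokens, decoding is greedy, and the draft is built to disagree with the target after one particular token; all of these are assumptions for illustration. A real engine verifies the proposed positions in one batched forward pass and, with sampling, uses a probabilistic acceptance rule that this sketch omits.

```python
def target_model(tokens: list[int]) -> int:
    """Toy stand-in for the large model: next token from the last one."""
    return (tokens[-1] * 3 + 1) % 10


def draft_model(tokens: list[int]) -> int:
    """Toy draft: agrees with the target except after token 4."""
    return 0 if tokens[-1] == 4 else target_model(tokens)


def greedy_decode(model, prompt: list[int], n_new: int) -> list[int]:
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(target, draft, prompt: list[int], n_new: int, k: int = 4):
    tokens = list(prompt)
    goal = len(prompt) + n_new
    verification_passes = 0
    while len(tokens) < goal:
        # 1. The draft proposes up to k tokens, one cheap step at a time.
        proposal: list[int] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks every proposed position (one pass in practice).
        verification_passes += 1
        for i, guess in enumerate(proposal):
            expected = target(tokens + proposal[:i])
            if guess != expected:
                # 3. Keep the accepted prefix plus the target's own token.
                tokens.extend(proposal[:i] + [expected])
                break
        else:
            tokens.extend(proposal)  # every proposed token was accepted
    return tokens, verification_passes


baseline = greedy_decode(target_model, [1], 12)
output, passes = speculative_decode(target_model, draft_model, [1], 12)
print(output == baseline)  # True
print(passes)              # 4
```

The output matches plain greedy decoding token for token, but the large model ran 4 verification passes instead of 12 sequential steps; the saving shrinks as the draft's acceptance rate falls.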

Key takeaways

  • LLM inference is autoregressive and sequential — latency scales with output length, not just input complexity
  • KV caching stores the key and value tensors for prior tokens, so each generation step computes only the newest token instead of reprocessing the whole context
  • Quantisation cuts weight VRAM by 50–75% relative to FP16: INT8 has negligible quality loss on most tasks; INT4 costs some quality on complex reasoning but enables 70B models on a single GPU
  • vLLM's PagedAttention and continuous batching can achieve roughly 10–20x higher throughput than naive serving on the same hardware, depending on the workload
  • Every LLM call should log the full prompt, response, token counts, latency breakdown, cost, and model version — traces should capture the full execution chain
  • Speculative decoding uses a draft model to propose tokens the target model verifies in parallel, typically giving a 2–3x speedup with no change to the target model's output distribution

Glossary

Key terms used throughout this course, defined concisely.

Token
The unit of text an LLM processes. Roughly ¾ of a word on average; a 1,000-word article is ~1,300 tokens.
BPE (Byte-Pair Encoding)
The most common tokenisation algorithm. Builds a subword vocabulary by iteratively merging frequent character pairs.
Foundation model
A large model trained on broad data at scale, intended as a general-purpose base for adaptation. GPT-4, Claude, and Gemini are all foundation models.
Multimodal
Capable of processing and generating across multiple input types — text, images, audio, or video — within a single model.
Benchmark
A standardised test used to measure model capability on a specific task (e.g. MMLU for knowledge, HumanEval for coding). Results can be gamed by training on benchmark data.
Context window
The maximum number of tokens an LLM can process in a single call — including both the prompt and the generated response.
System prompt
Instructions prepended to a conversation that define the model's role, persona, constraints, and output format. Invisible to end users in most deployed products.
Inference
Running a trained model to generate outputs. Distinct from training: no weights are updated. Measured in latency (time per response) and throughput (requests per second).
Parameters / weights
The billions of learnable numerical values inside a model. Training adjusts these values; inference uses them frozen.
Pre-training
The initial, large-scale training phase where a model learns from vast text corpora via next-token prediction.
Fine-tuning
Additional training on a smaller, curated dataset to adapt a pre-trained model to a specific task or style.
PEFT (Parameter-Efficient Fine-Tuning)
Techniques that adapt a pre-trained model by training only a small fraction of its parameters, avoiding the cost of full fine-tuning. LoRA is the most widely used example.
LoRA (Low-Rank Adaptation)
A parameter-efficient fine-tuning method that trains small adapter matrices rather than the full model weights.
QLoRA
Quantised LoRA. Fine-tunes a quantised (typically 4-bit) base model using LoRA adapters, enabling fine-tuning on consumer-grade GPUs.
Catastrophic forgetting
The tendency for a neural network to lose previously learned capabilities when trained on new data. A key risk in fine-tuning.
SFT (Supervised Fine-Tuning)
Fine-tuning on labelled input–output pairs. The foundational step before alignment training.
RLHF
Reinforcement Learning from Human Feedback. Training a reward model on human preference rankings and using it to optimise the LLM with reinforcement learning, typically PPO.
DPO (Direct Preference Optimisation)
An alignment method that skips the separate reward model and optimises preferences directly in the LLM.
Temperature
A sampling parameter (commonly exposed on a 0–2 scale) that controls output randomness. Lower = more deterministic; higher = more varied.
Top-p (nucleus sampling)
Restricts token sampling to the smallest set of tokens whose cumulative probability reaches p. Unlike top-k, the size of that set adapts to how confident the model is.
Embedding
A dense numerical vector representing the meaning of a token, sentence, or document. Semantically similar content has nearby vectors.
RAG (Retrieval-Augmented Generation)
An architecture that retrieves relevant documents from an external store and injects them into the prompt before generation.
Grounding
Connecting a model's outputs to verifiable external facts or data sources to reduce hallucination. RAG is the most common grounding technique.
Vector database
A database optimised for storing and querying embeddings by similarity (e.g. nearest-neighbour search).
HNSW
Hierarchical Navigable Small World. The most common approximate nearest-neighbour index algorithm used in vector databases.
Attention / self-attention
The mechanism by which a transformer relates each token to every other token in the context to compute contextualised representations.
Q, K, V (Query, Key, Value)
The three projections computed in attention. The query token attends over keys; values are the information retrieved.
MoE (Mixture of Experts)
An architecture where only a subset of the model's parameters (experts) are activated per token, enabling larger models at lower inference cost.
KV cache
A cache of the key and value tensors for previously processed tokens, avoiding redundant recomputation during autoregressive generation.
GPU / VRAM
Graphics Processing Units are the dominant hardware for AI training and inference. VRAM (video RAM) is the GPU's own dedicated memory; its capacity determines how large a model can be loaded.
FLOPs
Floating-point operations. The standard unit for measuring AI compute. Training a large model requires on the order of 10²³ FLOPs or more; often expressed in petaFLOP-days.
Scaling laws
Empirical relationships showing that model performance improves predictably as compute, data, and parameter count increase — enabling cost and capability forecasting.
Quantisation
Reducing model weight precision (e.g. FP32 → INT4) to shrink memory footprint and speed up inference, with a small accuracy trade-off.
Speculative decoding
Using a small draft model to propose several tokens at once, then verifying them in parallel with the large model — speeding up generation.
vLLM / PagedAttention
vLLM is a high-throughput inference server. PagedAttention is its technique for managing KV cache memory in non-contiguous pages.
Latency
The time elapsed between sending a request and receiving its first token (time-to-first-token) or complete response. The primary user-facing performance metric.
Throughput
The number of requests or tokens a system can process per unit of time. The key metric for production serving capacity and cost efficiency.
Prompt engineering
The practice of crafting input text to elicit better model outputs — including instruction phrasing, few-shot examples, and role setting.
Zero-shot / few-shot
Zero-shot: asking a model to perform a task with no examples in the prompt. Few-shot: providing a small number of input–output examples to guide behaviour without any weight updates.
Chain-of-thought (CoT)
Prompting the model to reason step-by-step before answering, improving accuracy on complex tasks.
Context engineering
The broader discipline of designing what information goes into the context window — system prompts, retrieved documents, conversation history, tool outputs — to maximise model performance.
Prompt injection
An attack where malicious content in the environment (a web page, document, or user message) hijacks the model's instructions to take unintended actions.
Agent
An LLM configured to operate in a loop: observe, plan, act (via tools), observe results, repeat — until a task is complete.
ReAct
A prompting pattern that interleaves reasoning traces and tool-use actions, making agent behaviour interpretable and auditable.
MCP (Model Context Protocol)
An open standard by Anthropic for connecting AI models to tools and data sources via a uniform client–server interface.
Function calling
A protocol feature where the model returns structured JSON requesting a specific tool invocation, rather than prose. The application executes the function and returns results.
LLM-as-a-Judge
Using a separate LLM to score or rank outputs — either comparatively or against a rubric — as an automated evaluation method.
Evals
Short for evaluations: the test suites and scoring methods used to measure LLM performance on specific tasks or safety criteria. The foundation of disciplined model development.
RAGAS
A framework for evaluating RAG pipelines, measuring faithfulness, context precision, context recall, and answer relevance.
Faithfulness
In RAG evaluation: whether the generated answer is factually grounded in the retrieved context (no hallucinations).
Hallucination
When a model generates confident-sounding content that is factually incorrect or unsupported by its context.
LLMops
The operational discipline of deploying, monitoring, versioning, and iterating on LLM-based systems in production.
Observability
The ability to inspect and understand what an LLM system is doing in production — via traces, logs, latency metrics, and cost tracking.
Alignment
The problem of ensuring a model's behaviour reliably matches human intentions and values — covering helpfulness, honesty, and harmlessness.
Guardrail
A safety layer applied to model inputs or outputs — such as a classifier detecting harmful content — independent of the model's own alignment training.
Jailbreak
A prompt or technique designed to bypass a model's safety guardrails and elicit restricted or harmful content.
Red teaming
Structured adversarial testing of an AI system to discover failure modes, harmful outputs, or exploitable behaviours before deployment.
Constitutional AI
An Anthropic alignment technique where the model critiques and revises its own outputs according to a set of principles, reducing harmful responses.
EU AI Act
The European Union's risk-tiered regulatory framework for AI systems, widely described as the first comprehensive binding AI law globally. Classifies AI by risk level, with corresponding obligations for providers and deployers.

Last updated May 2026. Updated as the AI landscape evolves.
