How this programme works

This programme is built around AI-assisted coding. You will use Claude Code or Cursor to handle implementation, which frees the live sessions for concepts, architecture, and decision-making — not syntax. Do not worry about memorising API calls. Focus on understanding what you are building, why the architecture is designed that way, and how to evaluate whether it is working. The agent writes the code. You direct the agent.

Pre-course preparation

Complete this before your first session. The live sessions move fast and assume these foundations are in place. You do not need to master everything here — the goal is familiarity, not expertise.

Environment setup

Install your coding agent — Claude Code (recommended) or Cursor
Get your API keys: Anthropic (console.anthropic.com) and OpenAI (platform.openai.com/api-keys)
Set keys as environment variables, not hardcoded in scripts
Python 3.10+ with `pip install anthropic openai python-dotenv`
Confirm you can run a basic API call before Week 1

Pre-course checklist

Claude Code or Cursor installed and working
Anthropic API key set as environment variable
OpenAI API key set as environment variable
Python 3.10+ with anthropic and openai packages installed
Made at least one successful API call
Read at least 2 conceptual foundation resources
Skimmed at least 1 industry report

Conceptual foundations

ArticleWhat is an AI Engineer? (roadmap.sh)15 min read
ArticleContext Engineering: the emerging discipline (Firecrawl)10 min read
ArticleHow to Break Into AI Engineering in 202620 min read
BookAI Engineering by Chip Huyen — Ch. 1 and 2Buy or borrow
BookThe LLM Engineering Handbook — IntroductionBuy or borrow

Industry context

ReportStanford AI Index 2026Key charts only
ReportState of AI Engineering (Datadog)Exec summary
ReportAI Engineering Report 2026 (Faros)10 min
ReportNVIDIA State of AI 2026Key findings

Weekly curriculum

Foundations, Tooling, and Structured Generation

LLMs · Model Selection · APIs · Token Mechanics · Structured Outputs · Multimodal · Production Prompting

Week 1 establishes the engineering mindset. By the end of this week you will understand how language models actually work (at the level an engineer needs, not a researcher), how to make reliable API calls across multiple providers, and how to get structured, validated outputs you can actually build on.

Learning objectives

Explain what an AI engineer does and how the role differs from ML engineer and data scientist

Understand LLM fundamentals: tokens, context windows, temperature, sampling

Select the right model for a given task using cost, latency, and capability tradeoffs

Make structured, validated API calls across Claude and OpenAI

Design production-grade prompts that handle edge cases and fail gracefully

Work with multimodal inputs including documents, images, and PDFs

Use Claude Code or Cursor effectively as a development environment

Session content

Part 1 · What Is an AI Engineer? (20 min)

AI engineer vs ML engineer vs data scientist: where the lines are and why they matter
The production gap: why most demos fail when they hit real users
The coding-agent-first workflow: how to direct an agent, not just prompt it
Tour of the stack you will be using across this programme

Part 2 · How LLMs Work (30 min)

Tokens, not words: why this matters for cost and behaviour
The context window: what goes in, what gets forgotten, what costs money
Temperature, top-p, and sampling: when to turn the dial and when to leave it
System prompts, user prompts, assistant turns: the message structure
Model families and their tradeoffs: Claude, GPT-4o, Gemini, Mistral, open-source
Cut-off dates and knowledge limitations: how to reason about what a model knows

Part 3 · APIs, Structured Outputs, and Validation (40 min)

Multi-provider API patterns: how to write provider-agnostic code
Managing tokens: counting, budgeting, pricing considerations
Structured outputs and data contracts: getting reliable JSON from a model
Output validation: what to do when the model ignores your schema
Multimodal inputs: sending documents, images, and PDFs to the API
Error handling, retries, and fallbacks

Part 4 · Prompt Design for Production (30 min)

Why prompts that work once often fail at scale
System prompt architecture: separating instructions, context, and persona
Few-shot examples: when to include them and how many
Chain-of-thought and when it actually helps
Negative examples: telling the model what NOT to do
Prompt versioning: treating prompts as code

Key concepts

The Token Mindset

LLMs do not read words. They read tokens — chunks of text roughly 3-4 characters each. The word 'unbelievable' might be 3 tokens. A blank line costs a token. This affects cost, context limits, and model behaviour. Always think in tokens, not characters or words.

Context Window vs Memory

The context window is not memory. It is a sliding window of text the model can see right now. Nothing persists between API calls unless you explicitly put it back in the context. This is the most common source of confusion for engineers new to LLMs.

Structured Outputs Are Not Guaranteed

Even with JSON mode or function calling, models can fail to follow a schema. Always validate outputs against your schema before using them downstream. Never trust that the model returned valid JSON without checking.

Provider-Agnostic Thinking

The best AI engineers are not loyal to one provider. They pick the right model for the task, route traffic based on cost and latency, and abstract provider details behind a clean interface. Build that way from day one.

Assignment 1

Multi-Provider Document Pipeline

Accept a PDF or text document as input
Extract structured data using a defined schema (your choice of domain)
Validate the output against that schema and handle validation failures gracefully
Run against both the Claude and OpenAI APIs and compare outputs
Log token usage and estimated cost per call

Deliverable: Working code in your GitHub repo + a short write-up (max 300 words) comparing the outputs and costs across providers. What did you notice?

Capstone — Phase 1

Environment setup complete: Claude Code or Cursor running, API keys configured
First multi-provider API call working
Basic document ingestion pipeline with typed, validated outputs
Project repo created and structured

Core resources

DocsAnthropic API DocumentationEssential reference
DocsOpenAI API ReferenceEssential reference
ArticleBest AI Resources to Learn AI Engineering 2026 (Firecrawl)Start here
ArticlePrompt Engineering Guide (DAIR.AI)Comprehensive
ArticleIBM Prompt Engineering 2026 GuidePractical
NotebooksClaude Cookbooks (Anthropic GitHub)Hands-on notebooks
DocsClaude Code OverviewSetup guide
BookAI Engineering by Chip Huyen — Ch. 1-3Core text

Advanced reading

PaperAttention Is All You Need (Vaswani et al.)Transformer original
ArticleTokenization deep dive (Hugging Face NLP Course)Token internals
ArticleLLM Pricing and Model Selection Framework 2026Cost engineering
BookThe LLM Engineering Handbook — Ch. 1-2Advanced context
ArticleUnderstanding LLM Sampling ParametersTemperature etc.

Embeddings, RAG, Context Engineering, and Tool Use

Embeddings · Vector DBs · RAG · Context Engineering · Function Calling · MCP · Open-Source Models

Week 2 is the retrieval week. You will learn how language models understand semantic similarity, how to build retrieval systems that find the right information at query time, and how to engineer the context window as a first-class engineering concern. You will also connect your first external tool and MCP server.

Learning objectives

Explain what embeddings are and how they encode semantic meaning

Choose the right embedding model for a given use case

Build a complete RAG pipeline from chunking through generation

Implement hybrid retrieval combining dense and sparse methods

Apply context engineering principles: caching, compression, window budgeting

Connect tools via function calling and MCP

Run open-source models locally with Ollama and understand the tradeoffs

Session content

Part 1 · Embeddings and Semantic Search (25 min)

What embeddings are: text as points in vector space
How similarity search works: cosine, dot product, L2 distance
Embedding use cases: semantic search, recommendations, classification, anomaly detection
Choosing an embedding model in 2026: OpenAI text-embedding-3-large, Cohere, BGE-M3 for self-hosting
Matryoshka embeddings: flexible dimensions for storage-cost tradeoffs

Part 2 · Vector Databases and RAG Architecture (35 min)

What vector databases do and why you need one
Comparing options: Qdrant, Pinecone, Chroma, pgvector, Weaviate
Chunking strategies: fixed-size vs semantic vs late chunking
Dense retrieval, sparse retrieval (BM25), and hybrid approaches
Reranking: why first-pass retrieval is often not enough
The full RAG pipeline: ingest, embed, store, retrieve, generate
GraphRAG: when knowledge graphs improve multi-hop reasoning

Part 3 · Context Engineering (30 min)

The context lifecycle: what goes in, when, and in what order
Prompt caching: how to reduce cost on repeated prefixes
Context compression: summarisation, distillation, selective retrieval
Window budgeting: allocating tokens across system, history, retrieved context, and query
The difference between context engineering and prompt engineering

Part 4 · Function Calling, MCP, and Open-Source (30 min)

Function calling: how tools are registered and invoked
Model Context Protocol (MCP): the universal standard for agent-tool integration
A2A (Agent-to-Agent): how agents discover and delegate to each other
Open-source models with Ollama: running Llama, Mistral, and others locally
Open vs closed source tradeoffs: latency, cost, control, capability

Key concepts

Why RAG Is Not Magic

RAG does not make a model smarter. It makes a model better-informed at query time. If your retrieval is bad, your generation will be bad too. The most common failure is not the LLM — it is the retrieval step returning the wrong chunks. Evaluate retrieval separately from generation.

Chunking Is a Design Decision

How you split documents determines what the model can find. Fixed-size chunks are fast but lose sentence context. Semantic chunking preserves meaning but is slower. Late chunking embeds the full document and retrieves sub-sections. There is no universally right answer. Test on your data.

Context Engineering Is the New Prompt Engineering

The field moved from 'how do I write better prompts' to 'how do I engineer the full context window'. What goes in, in what order, how fresh it is, and how much space it takes — these decisions determine whether a production system is reliable. This is now a core AI engineering skill.

MCP Is the Tool Standard

MCP (Model Context Protocol) has become the universal way to connect AI agents to external tools. By early 2026 every major provider supports it. If you are building an agent that needs to call external services, databases, or APIs, MCP is how you do it cleanly without custom glue code for every integration.

Assignment 2

RAG System with Hybrid Retrieval

Ingest a document set of your choice (at least 20 documents or equivalent)
Implement at least two chunking strategies and compare retrieval quality
Set up a vector database (Qdrant recommended)
Implement both dense and sparse retrieval, then combine them
Add at least one MCP-connected tool (e.g. web search, database lookup)
Evaluate: for 10 test queries, did the right chunks come back?

Deliverable: Working code + an evaluation table showing retrieval quality across chunking strategies. Which approach worked best and why?

Capstone — Phase 2

RAG system running over your chosen domain
At least one tool connected via function calling
MCP server configured
Retrieval evaluation baseline established

Core resources

ArticleVector Databases Guide for RAG 2025 (DEV.to)Comprehensive
ArticleBest Embedding Models for RAG 2026 (StackAI)Comparison guide
Article10 Best Vector Databases for RAG (ZenML)Comparison
ArticleWhat is Context Engineering? (IntuitionLabs 2026)Key concept
ArticleContext Engineering for AI Agents (Neo4j)Practical guide
DocsMCP Documentation (Anthropic)Official reference
DocsQdrant QuickstartVector DB setup
DocsOllama DocumentationLocal models
VideoRAG from Scratch (LangChain YouTube)Video series

Advanced reading

ArticleVector Databases for RAG: Complete Applications Guide 2026GraphRAG, late chunking
ArticleGraphRAG: Combining Vectors and Knowledge GraphsAdvanced retrieval
PaperRAGAS: Automated Evaluation of RAG PipelinesEval framework
ArticleHybrid Search Deep Dive (Weaviate)Dense + sparse
BookAI Engineering by Chip Huyen — Ch. 5-6RAG and retrieval
ArticleContext Window Management PatternsAdvanced context

Agentic Systems, Orchestration, and Evals

ReAct · Agent Patterns · LangGraph · Memory · Failure Modes · n8n · Security · Evals · LLM-as-Judge

Week 3 is the agents and evaluation week. You will build systems that can reason across multiple steps, use tools, maintain state, and recover from failures. The evals session teaches you how to measure whether your agent is actually working — because without measurement, you are just guessing.

Learning objectives

Implement ReAct prompting as the foundation of agent reasoning

Design single-agent and multi-agent architectures and choose between them

Orchestrate agents using LangGraph and SDK-native patterns

Implement memory: in-context, external, and episodic

Identify and handle agentic failure modes before they hit production

Connect agents to workflows using n8n

Build an eval harness with golden sets and LLM-as-judge

Defend against prompt injection in agentic systems

Session content

Part 1 · Agent Fundamentals and ReAct (25 min)

What makes a system an agent: the perceive-reason-act loop
ReAct prompting: Reasoning + Acting as the core pattern
Single vs multi-agent architectures: when to add an agent vs extend an existing one
The supervisor pattern: one orchestrator, multiple specialists
Agent cards and capability discovery with A2A

Part 2 · Orchestration with LangGraph (35 min)

State graphs: modelling agent behaviour as a graph
Nodes, edges, and conditional routing
Checkpointing: saving and resuming agent state
Human-in-the-loop: when to pause and ask
MCP in agentic workflows: connecting tools at the orchestration layer
SDK-native agents: building without a framework

Part 3 · Memory, Failure Modes, Security, and n8n (30 min)

In-context memory: what is in the window right now
External memory: vector stores and knowledge bases as agent memory
Episodic memory: remembering across sessions
Agentic failure modes: tool call loops, context pollution, cascading errors, hallucinated tool outputs
Cost engineering in agents: model routing, token budgeting across multi-step workflows
Prompt injection attacks: direct and indirect (via retrieved content)
Guardrails: validating inputs, outputs, and tool call parameters
n8n and workflow automation: connecting your agent to real business workflows

Evals Session · Evaluation-Driven Development (40 min)

Why agents that seem to work often do not: the silent failure problem
Evaluation-driven development: writing your eval before you write your agent
Building a golden dataset: what good looks like, documented
Rule-based evals: fast, cheap, brittle
LLM-as-judge: using a model to evaluate a model
Setting up Langfuse: tracing, logging, and dashboards
Regression testing: catching regressions before they reach users
Failure mode analysis: turning production failures into eval cases

Key concepts

ReAct Is Not Optional

ReAct (Reasoning + Acting) is the pattern that makes agents reliable. Without an explicit reasoning step before each tool call, agents hallucinate tool parameters, call the wrong tool, and loop indefinitely. The thinking step is not cosmetic. It is what makes the action predictable.

Agentic Failure Modes Are Different

Agents fail in ways that single LLM calls do not. A context pollution failure means the agent has accumulated contradictory information and is now reasoning from false premises. A tool call loop means the agent is repeatedly calling the same tool because it did not register the result. These failures are hard to spot without tracing.

Evals Are Not a Post-Launch Activity

Most teams build first and evaluate later. This is backwards. Your eval dataset defines what the system is supposed to do. Write it before you write the agent. Then you can measure progress during development, not just after launch.

Prompt Injection Is a Real Attack Vector

If your agent retrieves content from external sources (web pages, documents, databases), that content can contain instructions designed to hijack the agent. This is called indirect prompt injection. It is not theoretical — it has been demonstrated against production systems. Sanitise retrieved content before it enters the agent context.

Assignment 3

Agent with Eval Harness

Add a ReAct reasoning loop
Implement at least two tools (one should use MCP)
Add memory: in-context history + at least one external memory store
Connect to an n8n workflow (trigger or output)
Build an eval harness: 10 test cases with expected outputs (golden set), one rule-based evaluator, one LLM-as-judge evaluator using Langfuse
Document two failure modes you found and how you addressed them

Deliverable: Working code + eval results table + failure mode write-up.

Capstone — Phase 3

Stateful agent with ReAct loop, tools, and memory
n8n workflow integration
Langfuse tracing enabled
Eval harness running with golden set and LLM-as-judge
Two documented failure modes and fixes

Core resources

ArticleEngineering Agentic Workflows with MCP and LangGraph (DZone 2026)Architecture guide
ArticleHow to Build a Multi-Agent System: LangGraph, MCP, A2A (FreeCodeCamp)Full guide
ArticleLangGraph + MCP Multi-Agent Workflows 2026 (TechBytes)Practical guide
ArticleBuilding Production-Ready Agentic AI: LangGraph and MCP (DEV.to)Case study
DocsLangfuse Observability OverviewTracing setup
DocsLangfuse LLM-as-a-JudgeEval setup
ArticleAutomated Evaluations of LLM Applications (Langfuse)Practical evals
VideoLLM-as-Judge Evaluators Walkthrough (Langfuse)10 min video
CourseLangGraph Masterclass 2026: OpenAI, MCP, A2A (Udemy)Paid course

Advanced reading

PaperReAct: Synergising Reasoning and Acting in Language ModelsOriginal paper
ArticleTop Open Source LLM Observability Tools 2026Tool comparison
ArticleEval FAQ by Hamel HusainAuthoritative guide
DocsLangGraph DocumentationFull reference
BookAI Engineering by Chip Huyen — Ch. 7-8Agents and evals
ArticleIndirect Prompt Injection Attacks (Simon Willison)Security

LLM Ops, Production, and Shipping

Deployment · Prompt Versioning · CI/CD · Observability · Cost Optimisation · Guardrails · Fine-Tuning · System Card

Week 4 is about getting your system from 'it works on my machine' to 'it works reliably for real users'. This week covers the infrastructure, processes, and documentation required to ship a production AI system with confidence.

Learning objectives

Choose a deployment architecture for an LLM application (serverless, container, streaming)

Implement prompt versioning and model versioning as engineering practices

Set up CI/CD for an LLM system including automated evals in the pipeline

Configure full observability with Langfuse: tracing, cost tracking, dashboards

Detect silent degradation in production before users report it

Optimise costs at scale: batching, caching, model routing

Apply the fine-tuning vs prompting vs RAG decision framework

Write a system card documenting your production AI system

Session content

Part 1 · Deployment Architecture (25 min)

Serverless vs container: latency, cold starts, cost tradeoffs
Streaming responses: why and how
Async queues for long-running agent tasks
API gateway patterns for LLM systems
Managing secrets and credentials in production

Part 2 · LLM Ops (30 min)

Prompt versioning: treating prompts as code artifacts with history
Model versioning: handling model updates without breaking production
CI/CD for LLM systems: what a pipeline looks like
Running evals automatically on every pull request
Rollback strategies: how to safely revert a bad prompt or model change

Part 3 · Observability, Cost, and Safety (35 min)

Langfuse deep dive: traces, sessions, users, scores
Silent degradation: the failure mode no one sees coming
Cost optimisation at scale: prompt caching, batching, model routing
Token cost monitoring: per-user, per-session, per-model breakdowns
Safety and guardrails in production: input validation, output filtering, kill switches
Fallbacks and safe defaults: what the system does when it cannot answer

Part 4 · Fine-Tuning, System Cards, and Shipping (30 min)

Fine-tuning vs prompting vs RAG: the decision framework
When fine-tuning is worth it and when it is not
The system card: documenting what your system does, how it fails, and what to watch
Documentation standards for production AI systems
What good AI engineering looks like in 2026: the skills that matter
Capstone showcase: presenting your system to the cohort

Key concepts

Silent Degradation Is the Hardest Production Problem

A system that crashes is easy to notice. A system that gradually gives worse answers is nearly invisible. Models change. Retrieved content changes. User behaviour changes. Without continuous monitoring and automated evals, your system can degrade significantly before anyone realises. This is why observability is not optional.

Prompts Are Code

A prompt is a production artifact. It has a version history, it can introduce regressions, and changes to it should go through the same review process as code changes. If you are not version-controlling your prompts, you are flying blind.

Fine-Tuning Is Often the Wrong Answer

Fine-tuning is expensive, time-consuming, and creates a model you need to maintain. In most production scenarios, better RAG plus better context engineering beats fine-tuning. Reach for fine-tuning when you need consistent tone or style, you have a high-volume narrow task, or prompt engineering has hit a ceiling. Not before.

The System Card Is Trust Infrastructure

A system card tells anyone who reads it: what this system does, what it does not do, how it fails, what it costs, what is monitored, and how to reach the team. It is the thing that makes it safe to hand a production AI system to someone else. Write it before you ship, not after.

Assignment 4

Capstone — Harden, Document, and Ship

Deploy your agent to a production-accessible endpoint
Configure full Langfuse observability: traces, costs, automated evals
Set up at least one automated eval in your CI pipeline
Implement at least one cost optimisation (caching, batching, or routing)
Add production guardrails: input validation, output filtering
Write a system card: what it does/does not do, architecture diagram, failure modes and mitigations, eval results and baseline metrics, cost profile and monitoring setup

Deliverable: Live deployed system + system card + 5-minute showcase presentation. The showcase is recorded and shared with the TAI alumni community.

Capstone — Phase 4

System deployed and accessible
Full Langfuse observability running
Automated evals in CI pipeline
System card written and reviewed
5-minute showcase presentation prepared

Core resources

DocsLangfuse Observability DocumentationFull reference
ArticleTop LLM Observability Tools 2026Tool comparison
ArticleLLM Prompt Versioning Best PracticesPrompt ops
ArticleAI Engineering 2026: Shifting to Accountability (Jellyfish)Industry framing
ArticleState of AI Engineering (Datadog 2026)Framework adoption
DocsAnthropic Model Card TemplateSystem card reference
BookAI Engineering by Chip Huyen — Ch. 9-10Production patterns

Advanced reading

ArticleCI/CD for LLM Systems: A Practical GuideEngineering practice
ArticleFine-Tuning vs RAG vs Prompting: Decision FrameworkWhen to fine-tune
ArticleAI Engineering Report 2026: Acceleration Whiplash (Faros)Quality risks
ArticleRed-Teaming LLM Applications (Anthropic)Security depth
BookThe LLM Engineering Handbook — Ch. 7-8Advanced production
ArticleOpen Source Self-Hosted Model Deployment with vLLMSelf-hosted inference

The capstone

The capstone is one project built progressively across all four weeks. By the time you present in Week 4, you will have built a complete, production-ready AI system from scratch.

Week 1Document pipeline with structured, validated outputs across two providers

Week 2RAG system with hybrid retrieval, tool use, and MCP server

Week 3Stateful agent with ReAct loop, memory, n8n integration, and eval harness

Week 4Production deployment with full observability, CI evals, and system card

Final deliverable: live system + system card + 5-minute showcase presentation. The showcase is recorded and shared with the TAI alumni community.

Certificate requirements

1Complete all four weekly assignments
2Submit final capstone code to your GitHub repo
3Submit your system card
4Present at the Week 4 showcase

Tools used in this programme

Claude Code

Dev Environment

Primary IDE

Cursor

Dev Environment

Alternative IDE

Anthropic Claude

LLM Provider

Primary model

OpenAI

LLM Provider

Secondary model

Qdrant

Vector DB

Recommended

Langfuse

Observability

Tracing and evals

LangGraph

Orchestration

Agent graphs

n8n

Workflow

Automation layer

Ollama

Local Models

Open-source inference

MCP (Anthropic)

Protocols

Tool standard

AI Engineer Certificate

Programme at a Glance

How this programme works

Pre-course preparation

Environment setup

Pre-course checklist

Weekly curriculum

Foundations, Tooling, and Structured Generation

Learning objectives

Session content

Key concepts

Embeddings, RAG, Context Engineering, and Tool Use

Learning objectives

Session content

Key concepts

Agentic Systems, Orchestration, and Evals

Learning objectives

Session content

Key concepts

LLM Ops, Production, and Shipping

Learning objectives

Session content

Key concepts

The capstone

Certificate requirements

Tools used in this programme