How this programme works
This programme is built around AI-assisted coding. You will use Claude Code or Cursor to handle implementation, which frees the live sessions for concepts, architecture, and decision-making — not syntax. Do not worry about memorising API calls. Focus on understanding what you are building, why the architecture is designed that way, and how to evaluate whether it is working. The agent writes the code. You direct the agent.
Pre-course preparation
Complete this before your first session. The live sessions move fast and assume these foundations are in place. You do not need to master everything here — the goal is familiarity, not expertise.
Environment setup
- Install your coding agent — Claude Code (recommended) or Cursor
- Get your API keys: Anthropic (console.anthropic.com) and OpenAI (platform.openai.com/api-keys)
- Set keys as environment variables, not hardcoded in scripts
- Python 3.10+ with `pip install anthropic openai python-dotenv`
- Confirm you can run a basic API call before Week 1
Pre-course checklist
- Claude Code or Cursor installed and working
- Anthropic API key set as environment variable
- OpenAI API key set as environment variable
- Python 3.10+ with anthropic and openai packages installed
- Made at least one successful API call
- Read at least 2 conceptual foundation resources
- Skimmed at least 1 industry report
- ArticleWhat is an AI Engineer? (roadmap.sh)15 min read
- ArticleContext Engineering: the emerging discipline (Firecrawl)10 min read
- ArticleHow to Break Into AI Engineering in 202620 min read
- BookAI Engineering by Chip Huyen — Ch. 1 and 2Buy or borrow
- BookThe LLM Engineering Handbook — IntroductionBuy or borrow
- ReportStanford AI Index 2026Key charts only
- ReportState of AI Engineering (Datadog)Exec summary
- ReportAI Engineering Report 2026 (Faros)10 min
- ReportNVIDIA State of AI 2026Key findings
Weekly curriculum
W1Foundations, Tooling, and Structured Generation
LLMs · Model Selection · APIs · Token Mechanics · Structured Outputs · Multimodal · Production Prompting
Foundations, Tooling, and Structured Generation
LLMs · Model Selection · APIs · Token Mechanics · Structured Outputs · Multimodal · Production Prompting
Week 1 establishes the engineering mindset. By the end of this week you will understand how language models actually work (at the level an engineer needs, not a researcher), how to make reliable API calls across multiple providers, and how to get structured, validated outputs you can actually build on.
Learning objectives
Session content
Part 1 · What Is an AI Engineer? (20 min)
- AI engineer vs ML engineer vs data scientist: where the lines are and why they matter
- The production gap: why most demos fail when they hit real users
- The coding-agent-first workflow: how to direct an agent, not just prompt it
- Tour of the stack you will be using across this programme
Part 2 · How LLMs Work (30 min)
- Tokens, not words: why this matters for cost and behaviour
- The context window: what goes in, what gets forgotten, what costs money
- Temperature, top-p, and sampling: when to turn the dial and when to leave it
- System prompts, user prompts, assistant turns: the message structure
- Model families and their tradeoffs: Claude, GPT-4o, Gemini, Mistral, open-source
- Cut-off dates and knowledge limitations: how to reason about what a model knows
Part 3 · APIs, Structured Outputs, and Validation (40 min)
- Multi-provider API patterns: how to write provider-agnostic code
- Managing tokens: counting, budgeting, pricing considerations
- Structured outputs and data contracts: getting reliable JSON from a model
- Output validation: what to do when the model ignores your schema
- Multimodal inputs: sending documents, images, and PDFs to the API
- Error handling, retries, and fallbacks
Part 4 · Prompt Design for Production (30 min)
- Why prompts that work once often fail at scale
- System prompt architecture: separating instructions, context, and persona
- Few-shot examples: when to include them and how many
- Chain-of-thought and when it actually helps
- Negative examples: telling the model what NOT to do
- Prompt versioning: treating prompts as code
Key concepts
The Token Mindset
LLMs do not read words. They read tokens — chunks of text roughly 3-4 characters each. The word 'unbelievable' might be 3 tokens. A blank line costs a token. This affects cost, context limits, and model behaviour. Always think in tokens, not characters or words.
Context Window vs Memory
The context window is not memory. It is a sliding window of text the model can see right now. Nothing persists between API calls unless you explicitly put it back in the context. This is the most common source of confusion for engineers new to LLMs.
Structured Outputs Are Not Guaranteed
Even with JSON mode or function calling, models can fail to follow a schema. Always validate outputs against your schema before using them downstream. Never trust that the model returned valid JSON without checking.
Provider-Agnostic Thinking
The best AI engineers are not loyal to one provider. They pick the right model for the task, route traffic based on cost and latency, and abstract provider details behind a clean interface. Build that way from day one.
Multi-Provider Document Pipeline
- Accept a PDF or text document as input
- Extract structured data using a defined schema (your choice of domain)
- Validate the output against that schema and handle validation failures gracefully
- Run against both the Claude and OpenAI APIs and compare outputs
- Log token usage and estimated cost per call
Deliverable: Working code in your GitHub repo + a short write-up (max 300 words) comparing the outputs and costs across providers. What did you notice?
Capstone — Phase 1
- Environment setup complete: Claude Code or Cursor running, API keys configured
- First multi-provider API call working
- Basic document ingestion pipeline with typed, validated outputs
- Project repo created and structured
- DocsAnthropic API DocumentationEssential reference
- DocsOpenAI API ReferenceEssential reference
- ArticleBest AI Resources to Learn AI Engineering 2026 (Firecrawl)Start here
- ArticlePrompt Engineering Guide (DAIR.AI)Comprehensive
- ArticleIBM Prompt Engineering 2026 GuidePractical
- NotebooksClaude Cookbooks (Anthropic GitHub)Hands-on notebooks
- DocsClaude Code OverviewSetup guide
- BookAI Engineering by Chip Huyen — Ch. 1-3Core text
- PaperAttention Is All You Need (Vaswani et al.)Transformer original
- ArticleTokenization deep dive (Hugging Face NLP Course)Token internals
- ArticleLLM Pricing and Model Selection Framework 2026Cost engineering
- BookThe LLM Engineering Handbook — Ch. 1-2Advanced context
- ArticleUnderstanding LLM Sampling ParametersTemperature etc.
W2Embeddings, RAG, Context Engineering, and Tool Use
Embeddings · Vector DBs · RAG · Context Engineering · Function Calling · MCP · Open-Source Models
Embeddings, RAG, Context Engineering, and Tool Use
Embeddings · Vector DBs · RAG · Context Engineering · Function Calling · MCP · Open-Source Models
Week 2 is the retrieval week. You will learn how language models understand semantic similarity, how to build retrieval systems that find the right information at query time, and how to engineer the context window as a first-class engineering concern. You will also connect your first external tool and MCP server.
Learning objectives
Session content
Part 1 · Embeddings and Semantic Search (25 min)
- What embeddings are: text as points in vector space
- How similarity search works: cosine, dot product, L2 distance
- Embedding use cases: semantic search, recommendations, classification, anomaly detection
- Choosing an embedding model in 2026: OpenAI text-embedding-3-large, Cohere, BGE-M3 for self-hosting
- Matryoshka embeddings: flexible dimensions for storage-cost tradeoffs
Part 2 · Vector Databases and RAG Architecture (35 min)
- What vector databases do and why you need one
- Comparing options: Qdrant, Pinecone, Chroma, pgvector, Weaviate
- Chunking strategies: fixed-size vs semantic vs late chunking
- Dense retrieval, sparse retrieval (BM25), and hybrid approaches
- Reranking: why first-pass retrieval is often not enough
- The full RAG pipeline: ingest, embed, store, retrieve, generate
- GraphRAG: when knowledge graphs improve multi-hop reasoning
Part 3 · Context Engineering (30 min)
- The context lifecycle: what goes in, when, and in what order
- Prompt caching: how to reduce cost on repeated prefixes
- Context compression: summarisation, distillation, selective retrieval
- Window budgeting: allocating tokens across system, history, retrieved context, and query
- The difference between context engineering and prompt engineering
Part 4 · Function Calling, MCP, and Open-Source (30 min)
- Function calling: how tools are registered and invoked
- Model Context Protocol (MCP): the universal standard for agent-tool integration
- A2A (Agent-to-Agent): how agents discover and delegate to each other
- Open-source models with Ollama: running Llama, Mistral, and others locally
- Open vs closed source tradeoffs: latency, cost, control, capability
Key concepts
Why RAG Is Not Magic
RAG does not make a model smarter. It makes a model better-informed at query time. If your retrieval is bad, your generation will be bad too. The most common failure is not the LLM — it is the retrieval step returning the wrong chunks. Evaluate retrieval separately from generation.
Chunking Is a Design Decision
How you split documents determines what the model can find. Fixed-size chunks are fast but lose sentence context. Semantic chunking preserves meaning but is slower. Late chunking embeds the full document and retrieves sub-sections. There is no universally right answer. Test on your data.
Context Engineering Is the New Prompt Engineering
The field moved from 'how do I write better prompts' to 'how do I engineer the full context window'. What goes in, in what order, how fresh it is, and how much space it takes — these decisions determine whether a production system is reliable. This is now a core AI engineering skill.
MCP Is the Tool Standard
MCP (Model Context Protocol) has become the universal way to connect AI agents to external tools. By early 2026 every major provider supports it. If you are building an agent that needs to call external services, databases, or APIs, MCP is how you do it cleanly without custom glue code for every integration.
RAG System with Hybrid Retrieval
- Ingest a document set of your choice (at least 20 documents or equivalent)
- Implement at least two chunking strategies and compare retrieval quality
- Set up a vector database (Qdrant recommended)
- Implement both dense and sparse retrieval, then combine them
- Add at least one MCP-connected tool (e.g. web search, database lookup)
- Evaluate: for 10 test queries, did the right chunks come back?
Deliverable: Working code + an evaluation table showing retrieval quality across chunking strategies. Which approach worked best and why?
Capstone — Phase 2
- RAG system running over your chosen domain
- At least one tool connected via function calling
- MCP server configured
- Retrieval evaluation baseline established
- ArticleVector Databases Guide for RAG 2025 (DEV.to)Comprehensive
- ArticleBest Embedding Models for RAG 2026 (StackAI)Comparison guide
- Article10 Best Vector Databases for RAG (ZenML)Comparison
- ArticleWhat is Context Engineering? (IntuitionLabs 2026)Key concept
- ArticleContext Engineering for AI Agents (Neo4j)Practical guide
- DocsMCP Documentation (Anthropic)Official reference
- DocsQdrant QuickstartVector DB setup
- DocsOllama DocumentationLocal models
- VideoRAG from Scratch (LangChain YouTube)Video series
- ArticleVector Databases for RAG: Complete Applications Guide 2026GraphRAG, late chunking
- ArticleGraphRAG: Combining Vectors and Knowledge GraphsAdvanced retrieval
- PaperRAGAS: Automated Evaluation of RAG PipelinesEval framework
- ArticleHybrid Search Deep Dive (Weaviate)Dense + sparse
- BookAI Engineering by Chip Huyen — Ch. 5-6RAG and retrieval
- ArticleContext Window Management PatternsAdvanced context
W3Agentic Systems, Orchestration, and Evals
ReAct · Agent Patterns · LangGraph · Memory · Failure Modes · n8n · Security · Evals · LLM-as-Judge
Agentic Systems, Orchestration, and Evals
ReAct · Agent Patterns · LangGraph · Memory · Failure Modes · n8n · Security · Evals · LLM-as-Judge
Week 3 is the agents and evaluation week. You will build systems that can reason across multiple steps, use tools, maintain state, and recover from failures. The evals session teaches you how to measure whether your agent is actually working — because without measurement, you are just guessing.
Learning objectives
Session content
Part 1 · Agent Fundamentals and ReAct (25 min)
- What makes a system an agent: the perceive-reason-act loop
- ReAct prompting: Reasoning + Acting as the core pattern
- Single vs multi-agent architectures: when to add an agent vs extend an existing one
- The supervisor pattern: one orchestrator, multiple specialists
- Agent cards and capability discovery with A2A
Part 2 · Orchestration with LangGraph (35 min)
- State graphs: modelling agent behaviour as a graph
- Nodes, edges, and conditional routing
- Checkpointing: saving and resuming agent state
- Human-in-the-loop: when to pause and ask
- MCP in agentic workflows: connecting tools at the orchestration layer
- SDK-native agents: building without a framework
Part 3 · Memory, Failure Modes, Security, and n8n (30 min)
- In-context memory: what is in the window right now
- External memory: vector stores and knowledge bases as agent memory
- Episodic memory: remembering across sessions
- Agentic failure modes: tool call loops, context pollution, cascading errors, hallucinated tool outputs
- Cost engineering in agents: model routing, token budgeting across multi-step workflows
- Prompt injection attacks: direct and indirect (via retrieved content)
- Guardrails: validating inputs, outputs, and tool call parameters
- n8n and workflow automation: connecting your agent to real business workflows
Evals Session · Evaluation-Driven Development (40 min)
- Why agents that seem to work often do not: the silent failure problem
- Evaluation-driven development: writing your eval before you write your agent
- Building a golden dataset: what good looks like, documented
- Rule-based evals: fast, cheap, brittle
- LLM-as-judge: using a model to evaluate a model
- Setting up Langfuse: tracing, logging, and dashboards
- Regression testing: catching regressions before they reach users
- Failure mode analysis: turning production failures into eval cases
Key concepts
ReAct Is Not Optional
ReAct (Reasoning + Acting) is the pattern that makes agents reliable. Without an explicit reasoning step before each tool call, agents hallucinate tool parameters, call the wrong tool, and loop indefinitely. The thinking step is not cosmetic. It is what makes the action predictable.
Agentic Failure Modes Are Different
Agents fail in ways that single LLM calls do not. A context pollution failure means the agent has accumulated contradictory information and is now reasoning from false premises. A tool call loop means the agent is repeatedly calling the same tool because it did not register the result. These failures are hard to spot without tracing.
Evals Are Not a Post-Launch Activity
Most teams build first and evaluate later. This is backwards. Your eval dataset defines what the system is supposed to do. Write it before you write the agent. Then you can measure progress during development, not just after launch.
Prompt Injection Is a Real Attack Vector
If your agent retrieves content from external sources (web pages, documents, databases), that content can contain instructions designed to hijack the agent. This is called indirect prompt injection. It is not theoretical — it has been demonstrated against production systems. Sanitise retrieved content before it enters the agent context.
Agent with Eval Harness
- Add a ReAct reasoning loop
- Implement at least two tools (one should use MCP)
- Add memory: in-context history + at least one external memory store
- Connect to an n8n workflow (trigger or output)
- Build an eval harness: 10 test cases with expected outputs (golden set), one rule-based evaluator, one LLM-as-judge evaluator using Langfuse
- Document two failure modes you found and how you addressed them
Deliverable: Working code + eval results table + failure mode write-up.
Capstone — Phase 3
- Stateful agent with ReAct loop, tools, and memory
- n8n workflow integration
- Langfuse tracing enabled
- Eval harness running with golden set and LLM-as-judge
- Two documented failure modes and fixes
- ArticleEngineering Agentic Workflows with MCP and LangGraph (DZone 2026)Architecture guide
- ArticleHow to Build a Multi-Agent System: LangGraph, MCP, A2A (FreeCodeCamp)Full guide
- ArticleLangGraph + MCP Multi-Agent Workflows 2026 (TechBytes)Practical guide
- ArticleBuilding Production-Ready Agentic AI: LangGraph and MCP (DEV.to)Case study
- DocsLangfuse Observability OverviewTracing setup
- DocsLangfuse LLM-as-a-JudgeEval setup
- ArticleAutomated Evaluations of LLM Applications (Langfuse)Practical evals
- VideoLLM-as-Judge Evaluators Walkthrough (Langfuse)10 min video
- CourseLangGraph Masterclass 2026: OpenAI, MCP, A2A (Udemy)Paid course
- PaperReAct: Synergising Reasoning and Acting in Language ModelsOriginal paper
- ArticleTop Open Source LLM Observability Tools 2026Tool comparison
- ArticleEval FAQ by Hamel HusainAuthoritative guide
- DocsLangGraph DocumentationFull reference
- BookAI Engineering by Chip Huyen — Ch. 7-8Agents and evals
- ArticleIndirect Prompt Injection Attacks (Simon Willison)Security
W4LLM Ops, Production, and Shipping
Deployment · Prompt Versioning · CI/CD · Observability · Cost Optimisation · Guardrails · Fine-Tuning · System Card
LLM Ops, Production, and Shipping
Deployment · Prompt Versioning · CI/CD · Observability · Cost Optimisation · Guardrails · Fine-Tuning · System Card
Week 4 is about getting your system from 'it works on my machine' to 'it works reliably for real users'. This week covers the infrastructure, processes, and documentation required to ship a production AI system with confidence.
Learning objectives
Session content
Part 1 · Deployment Architecture (25 min)
- Serverless vs container: latency, cold starts, cost tradeoffs
- Streaming responses: why and how
- Async queues for long-running agent tasks
- API gateway patterns for LLM systems
- Managing secrets and credentials in production
Part 2 · LLM Ops (30 min)
- Prompt versioning: treating prompts as code artifacts with history
- Model versioning: handling model updates without breaking production
- CI/CD for LLM systems: what a pipeline looks like
- Running evals automatically on every pull request
- Rollback strategies: how to safely revert a bad prompt or model change
Part 3 · Observability, Cost, and Safety (35 min)
- Langfuse deep dive: traces, sessions, users, scores
- Silent degradation: the failure mode no one sees coming
- Cost optimisation at scale: prompt caching, batching, model routing
- Token cost monitoring: per-user, per-session, per-model breakdowns
- Safety and guardrails in production: input validation, output filtering, kill switches
- Fallbacks and safe defaults: what the system does when it cannot answer
Part 4 · Fine-Tuning, System Cards, and Shipping (30 min)
- Fine-tuning vs prompting vs RAG: the decision framework
- When fine-tuning is worth it and when it is not
- The system card: documenting what your system does, how it fails, and what to watch
- Documentation standards for production AI systems
- What good AI engineering looks like in 2026: the skills that matter
- Capstone showcase: presenting your system to the cohort
Key concepts
Silent Degradation Is the Hardest Production Problem
A system that crashes is easy to notice. A system that gradually gives worse answers is nearly invisible. Models change. Retrieved content changes. User behaviour changes. Without continuous monitoring and automated evals, your system can degrade significantly before anyone realises. This is why observability is not optional.
Prompts Are Code
A prompt is a production artifact. It has a version history, it can introduce regressions, and changes to it should go through the same review process as code changes. If you are not version-controlling your prompts, you are flying blind.
Fine-Tuning Is Often the Wrong Answer
Fine-tuning is expensive, time-consuming, and creates a model you need to maintain. In most production scenarios, better RAG plus better context engineering beats fine-tuning. Reach for fine-tuning when you need consistent tone or style, you have a high-volume narrow task, or prompt engineering has hit a ceiling. Not before.
The System Card Is Trust Infrastructure
A system card tells anyone who reads it: what this system does, what it does not do, how it fails, what it costs, what is monitored, and how to reach the team. It is the thing that makes it safe to hand a production AI system to someone else. Write it before you ship, not after.
Capstone — Harden, Document, and Ship
- Deploy your agent to a production-accessible endpoint
- Configure full Langfuse observability: traces, costs, automated evals
- Set up at least one automated eval in your CI pipeline
- Implement at least one cost optimisation (caching, batching, or routing)
- Add production guardrails: input validation, output filtering
- Write a system card: what it does/does not do, architecture diagram, failure modes and mitigations, eval results and baseline metrics, cost profile and monitoring setup
Deliverable: Live deployed system + system card + 5-minute showcase presentation. The showcase is recorded and shared with the TAI alumni community.
Capstone — Phase 4
- System deployed and accessible
- Full Langfuse observability running
- Automated evals in CI pipeline
- System card written and reviewed
- 5-minute showcase presentation prepared
- DocsLangfuse Observability DocumentationFull reference
- ArticleTop LLM Observability Tools 2026Tool comparison
- ArticleLLM Prompt Versioning Best PracticesPrompt ops
- ArticleAI Engineering 2026: Shifting to Accountability (Jellyfish)Industry framing
- ArticleState of AI Engineering (Datadog 2026)Framework adoption
- DocsAnthropic Model Card TemplateSystem card reference
- BookAI Engineering by Chip Huyen — Ch. 9-10Production patterns
- ArticleCI/CD for LLM Systems: A Practical GuideEngineering practice
- ArticleFine-Tuning vs RAG vs Prompting: Decision FrameworkWhen to fine-tune
- ArticleAI Engineering Report 2026: Acceleration Whiplash (Faros)Quality risks
- ArticleRed-Teaming LLM Applications (Anthropic)Security depth
- BookThe LLM Engineering Handbook — Ch. 7-8Advanced production
- ArticleOpen Source Self-Hosted Model Deployment with vLLMSelf-hosted inference
The capstone
The capstone is one project built progressively across all four weeks. By the time you present in Week 4, you will have built a complete, production-ready AI system from scratch.
Certificate requirements
- 1Complete all four weekly assignments
- 2Submit final capstone code to your GitHub repo
- 3Submit your system card
- 4Present at the Week 4 showcase
Tools used in this programme
Claude Code
Dev Environment
Cursor
Dev Environment
Anthropic Claude
LLM Provider
OpenAI
LLM Provider
Qdrant
Vector DB
Langfuse
Observability
LangGraph
Orchestration
n8n
Workflow
Ollama
Local Models
MCP (Anthropic)
Protocols