Module T - Token Optimisation | Agentic AI Engineering Bootcamp

Async

Token economics (Module 1.4) taught you to see cost. This module teaches you to reduce it - because at production scale, token optimisation is often the difference between a system that is profitable and one that is not, and between one that feels fast and one that feels sluggish.

First, measure

You cannot optimise what you do not measure. Track input and output tokens per call, per feature, and per user. Use a tokenizer to see how your prompts actually tokenize - text is not one token per word - and instrument your service so token usage is visible on every call.
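
As a minimal sketch of that instrumentation, the snippet below aggregates token counts per feature and per user. The TokenLedger class and its method names are hypothetical helpers, not part of any library, and the sketch assumes you pass in the input and output token counts that your provider reports with each response.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Usage:
    """Running token totals for one grouping key."""
    calls: int = 0
    input_tokens: int = 0
    output_tokens: int = 0


class TokenLedger:
    """Aggregates token counts per feature and per user (hypothetical helper)."""

    def __init__(self) -> None:
        self.by_feature: dict[str, Usage] = defaultdict(Usage)
        self.by_user: dict[str, Usage] = defaultdict(Usage)

    def record(self, feature: str, user: str, input_tokens: int, output_tokens: int) -> None:
        # Assumption: the counts come from the usage figures returned with each response.
        for bucket in (self.by_feature[feature], self.by_user[user]):
            bucket.calls += 1
            bucket.input_tokens += input_tokens
            bucket.output_tokens += output_tokens

    def top_features(self, n: int = 3) -> list[tuple[str, int]]:
        """Features ranked by total tokens, largest first."""
        totals = {name: u.input_tokens + u.output_tokens for name, u in self.by_feature.items()}
        return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]


ledger = TokenLedger()
ledger.record("search", "alice", input_tokens=1200, output_tokens=150)
ledger.record("search", "bob", input_tokens=900, output_tokens=100)
ledger.record("summarise", "alice", input_tokens=4000, output_tokens=300)
print(ledger.top_features())
```

Ranking features by total tokens points you at the calls that dominate the bill, which is where optimisation effort pays off first.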

Technique 1: Trim the prompt

The cheapest tokens are the ones you never send. Audit your system prompt for redundancy and filler - most system prompts are far longer than they need to be. A leaner prompt costs less on every single call, forever, and often performs the same or better because the signal is not buried in noise.

Technique 2: Economical few-shot

Examples are powerful and expensive, since they ride along on every call. Use the fewest examples that achieve the behaviour, choose ones that cover the hard cases, and consider dynamic selection: retrieve only the examples relevant to the current input rather than always sending all of them.
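
Dynamic selection can be sketched as follows. This is an illustration under stated assumptions: the function names and the example pool are made up, and word overlap (Jaccard similarity) stands in for the embedding-based similarity a real system would more likely use.

```python
def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two strings, from 0.0 to 1.0."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def select_examples(query: str, pool: list[dict[str, str]], k: int = 2) -> list[dict[str, str]]:
    """Return the k pool examples whose input is most similar to the query."""
    ranked = sorted(pool, key=lambda ex: jaccard(query, ex["input"]), reverse=True)
    return ranked[:k]


# Hypothetical example pool for a support-ticket classifier.
EXAMPLES = [
    {"input": "refund for a damaged item", "output": "category: refunds"},
    {"input": "reset my account password", "output": "category: account"},
    {"input": "refund not received after return", "output": "category: refunds"},
    {"input": "change delivery address", "output": "category: shipping"},
]

chosen = select_examples("where is my refund for the return", EXAMPLES, k=2)
for ex in chosen:
    print(ex["input"], "->", ex["output"])
```

Only the two refund examples are sent for this query, so the prompt carries two examples instead of four while still covering the relevant case.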

Technique 3: Control the output

Output tokens cost money and drive latency. Set sensible max-token limits, ask explicitly for concise output, use structured outputs so the model returns compact data instead of padded prose, and use stop sequences to end generation when the useful part is done.

Technique 4: Prompt caching

If many of your calls share a large, unchanging prefix (a long system prompt, fixed instructions, a stable document), prompt caching lets the provider reuse that processed prefix at a large discount. Structure your prompts so the stable part comes first and the variable part last.
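
The ordering rule can be sketched like this. The prompt text and function names are hypothetical, and the sketch only demonstrates layout: how a cacheable prefix is marked, how long it must be, and how long it stays cached all vary by provider, so check your provider's documentation before relying on it.

```python
import hashlib

# Stable content: identical on every call, so it is a candidate for caching.
STABLE_PREFIX = (
    "You are a support assistant for an online shop.\n"
    "Answer in at most three sentences.\n"
)


def build_prompt(user_question: str) -> str:
    """Stable content first, per-request content last."""
    return STABLE_PREFIX + "Question: " + user_question + "\n"


def prefix_fingerprint(prompt: str, prefix_chars: int) -> str:
    """Hash of the leading characters, to check the prefix is identical across calls."""
    return hashlib.sha256(prompt[:prefix_chars].encode("utf-8")).hexdigest()


a = build_prompt("Where is my order?")
b = build_prompt("Can I change my address?")
n = len(STABLE_PREFIX)
same_prefix = prefix_fingerprint(a, n) == prefix_fingerprint(b, n)
print(same_prefix)
```

A fingerprint check like this is a cheap regression test: if someone interpolates a timestamp or user name into the top of the prompt, the fingerprints diverge and the shared prefix is gone.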

Technique 5: Retrieve, do not stuff

Putting an entire knowledge base into every prompt is the most common cost mistake there is. Retrieval (Week 2) is partly a cost technique: you send only the handful of relevant chunks instead of everything - which slashes input tokens and usually improves quality.
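
A minimal sketch of sending only the relevant chunks, assuming the retrieval step has already produced relevance scores. The pack_chunks helper, the sample chunks, and the scores are hypothetical, and rough_count is a crude word-count stand-in for a real tokenizer.

```python
from typing import Callable


def pack_chunks(
    scored_chunks: list[tuple[float, str]],
    budget_tokens: int,
    count_tokens: Callable[[str], int],
) -> list[str]:
    """Take chunks in descending relevance until the token budget is used up."""
    selected: list[str] = []
    used = 0
    for _score, text in sorted(scored_chunks, key=lambda c: c[0], reverse=True):
        cost = count_tokens(text)
        if used + cost > budget_tokens:
            continue  # this chunk does not fit; a smaller, lower-ranked one still might
        selected.append(text)
        used += cost
    return selected


def rough_count(text: str) -> int:
    # Crude stand-in: whitespace-separated words. Real token counts differ.
    return len(text.split())


# Hypothetical (score, text) pairs as a retriever might return them.
chunks = [
    (0.91, "Returns are accepted within 30 days of delivery."),
    (0.40, "Our office is closed on public holidays."),
    (0.85, "Refunds are issued to the original payment method."),
]
context = pack_chunks(chunks, budget_tokens=16, count_tokens=rough_count)
print(context)
```

The explicit budget makes input cost predictable per call: however large the knowledge base grows, the context portion of the prompt stays under the cap.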

Technique 6: Compress history

In multi-turn systems, conversation history grows without bound. Rather than carrying the full transcript forever, summarise older turns into a compact running summary, keep only recent turns verbatim, and drop what is no longer relevant.
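
A minimal sketch of the running-summary idea, covering only the "summarise older turns, keep recent turns verbatim" part. The compact_history function is a hypothetical helper, and the summariser is passed in as a callable; in practice it would probably be a call to a small model, while the placeholder here just counts turns.

```python
from typing import Callable


def compact_history(
    turns: list[str],
    keep_recent: int,
    summarise: Callable[[list[str]], str],
) -> list[str]:
    """Replace all but the last keep_recent turns with a single summary entry."""
    if keep_recent < 1:
        raise ValueError("keep_recent must be at least 1")
    if len(turns) <= keep_recent:
        return list(turns)  # nothing old enough to summarise yet
    older, recent = turns[:-keep_recent], turns[-keep_recent:]
    return ["Summary of earlier conversation: " + summarise(older)] + recent


def placeholder_summary(older: list[str]) -> str:
    # Placeholder only: a real summariser would condense the content of the turns.
    return f"{len(older)} earlier turns omitted"


turns = [
    "user: hi",
    "assistant: hello",
    "user: I need a refund",
    "assistant: what is the order number?",
    "user: 1234",
]
compacted = compact_history(turns, keep_recent=2, summarise=placeholder_summary)
print(compacted)
```

Summarising costs tokens too, so one reasonable design is to re-summarise only when the verbatim window overflows rather than on every turn.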

Technique 7: Route to the right model

Send cheap tasks to cheap models. Routing high-volume simple steps to a small fast model and reserving the frontier model for the genuinely hard step is often the single largest cost saving in a real system.
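
Routing can start as a plain lookup, as in this sketch. The model names and the set of "simple" task types are hypothetical; which tasks a small model handles well enough is something to establish with your own evals.

```python
SMALL_MODEL = "small-fast-model"    # hypothetical model name
FRONTIER_MODEL = "frontier-model"   # hypothetical model name

# Assumption: these task types are simple enough for the small model.
SIMPLE_TASKS = {"classify", "extract", "reformat"}


def choose_model(task_type: str) -> str:
    """Route known-simple task types to the small model, everything else to the frontier model."""
    return SMALL_MODEL if task_type in SIMPLE_TASKS else FRONTIER_MODEL


# A hypothetical pipeline: three cheap steps and one hard one.
pipeline = ["classify", "extract", "plan", "reformat"]
routing = {step: choose_model(step) for step in pipeline}
print(routing)
```

Defaulting unknown task types to the frontier model keeps quality safe: a new step is only moved to the small model once it has been shown to cope.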

Technique 8: Batch what is not real-time

For work that does not need an instant response (overnight processing, bulk generation, offline evals), batch APIs offer a significant discount over real-time calls.

Technique 9: Reliability is a cost technique

Every failed or malformed response you have to retry is tokens paid twice. The structured outputs and guardrails from Modules 1.2 and 1.3 reduce wasted calls - so reliability and cost optimisation are the same work seen from two angles.
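
The cost of retries can be estimated with a little arithmetic. This sketch assumes that failures are independent, that every attempt costs the same number of tokens, and that you retry until a call succeeds; under those assumptions the expected number of attempts per success is 1 / (1 - failure_rate). The function name and the figures are illustrative only.

```python
def expected_tokens_per_success(tokens_per_call: float, failure_rate: float) -> float:
    """Expected tokens spent per usable response when failed calls are retried."""
    if not 0.0 <= failure_rate < 1.0:
        raise ValueError("failure_rate must be in [0, 1)")
    # Expected attempts until the first success is 1 / (1 - failure_rate).
    return tokens_per_call / (1.0 - failure_rate)


# Illustrative figures: a 2,000-token call at two different failure rates.
flaky = expected_tokens_per_success(2000, failure_rate=0.20)
solid = expected_tokens_per_success(2000, failure_rate=0.02)
print(round(flaky), round(solid))
```

With these illustrative numbers, a 20% failure rate means paying for roughly 2,500 tokens per usable response instead of 2,000, so cutting the failure rate lowers the bill without touching the prompt.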

The senior mindset

Token optimisation is not penny-pinching - it is systems design. The engineer who can hold quality flat while cutting cost by half has built a fundamentally better system. Optimise the calls that dominate your bill, measure the effect, and never trade away quality you actually need.

Go deeper (optional)