Week 4 — Final Session

Making Your Agent Production-Ready

Observability, evaluation, and shipping with Langfuse

Dr. Aki Wijesundara

AI Engineering Bootcamp · The AI Internship

You Built an Agent. Now What?

✅ What you have

  • An ADK agent that works locally
  • It answers questions, retrieves docs, uses tools
  • It looks great in a demo

❌ What you don't have

  • Any idea what it's doing under the hood
  • Any way to measure if it's getting better or worse
  • Any way to catch failures before your users do

Langfuse gives you eyes on your agent.

Langfuse — Your Agent's Control Room

Dashboard

Bird's-eye view: traces, costs, scores, latency

Tracing

See every LLM call, tool use, and decision

Sessions

Group traces by user session

Prompt Management

Version control your prompts

Scores

Track quality metrics over time

LLM-as-a-Judge

Automate evaluation with a second LLM

Human Annotation

Manual review queue for quality checks

Datasets

Golden test cases for regression testing

We're going to walk through each of these. Live.
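Before we do, a minimal setup sketch. This assumes the Langfuse Python SDK; the keys are placeholders you get from your Langfuse project settings, and the environment variable names below are the ones the SDK reads by default.

```python
# pip install langfuse
import os

# Placeholder keys — copy the real ones from your Langfuse project settings.
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."
os.environ["LANGFUSE_HOST"] = "https://cloud.langfuse.com"

from langfuse import Langfuse

langfuse = Langfuse()  # reads the environment variables above
```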

Dashboard

The Bird's-Eye View

Traces

Total traces tracked + volume over time

Model Costs

Spend by model, by type, by user

Scores

Moving average per score over time

Latency

Percentiles: trace, generation, span

Are traces flowing? Is cost spiking? Are scores dropping?
This is the first thing you check every morning.

Tracing — See Every Decision Your Agent Makes

User Query
  ├── Trace starts
  │     ├── Span: query analysis
  │     ├── Span: document retrieval
  │     │         (vector search)
  │     ├── Span: LLM generation
  │     └── Span: response validation
  └── Trace ends

Every function decorated with @observe() becomes a span.
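A minimal sketch of how that tree maps to code. The pipeline functions here are hypothetical; the import path shown is the v2-style SDK (newer versions expose `observe` directly from `langfuse`).

```python
from langfuse.decorators import observe  # newer SDKs: from langfuse import observe

@observe()
def retrieve_docs(query: str) -> list[str]:
    # Becomes a child span — imagine a vector search here.
    return ["chunk about refunds", "chunk about shipping"]

@observe()
def generate_answer(query: str, docs: list[str]) -> str:
    # Becomes a child span — imagine the LLM call here.
    return f"Answer based on {len(docs)} retrieved chunks."

@observe()
def handle_query(query: str) -> str:
    # The outermost decorated call becomes the trace;
    # the nested calls show up as spans underneath it.
    docs = retrieve_docs(query)
    return generate_answer(query, docs)
```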

Full Trace Tree

Parent → child span hierarchy

Input / Output

What went in, what came out

Token Count + Cost

Per span, per trace

Latency Breakdown

Where time is spent

Sessions & Users

Grouping Traces by Conversation and User

Sessions

One conversation = one session. Replay an entire user conversation end-to-end.

Users

Traces per user, cost per user. Spot who's hitting edge cases.

Filtering

Drill into specific users, sessions, and behavior patterns.
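A sketch of how a trace picks up its user and session, assuming the v2-style decorator context (method names may differ in newer SDK versions; the IDs are illustrative):

```python
from langfuse.decorators import observe, langfuse_context

@observe()
def handle_turn(query: str, user_id: str, conversation_id: str) -> str:
    # Attach the current trace to a user and a session so Langfuse
    # can group conversations and break down cost per user.
    langfuse_context.update_current_trace(
        user_id=user_id,             # traces and cost per user
        session_id=conversation_id,  # one conversation = one session
    )
    return handle_query(query)       # the traced pipeline from the earlier sketch
```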

Prompt Management

Version Control for Prompts

Create & Version

Manage prompt versions (v1, v2, v3) directly in Langfuse. No more changing prompts in code and hoping for the best.

Link to Traces

Every trace knows which prompt version produced it. Compare performance across versions.

Playground

Test prompts directly in Langfuse before deploying. Check scores to see which version performs better.
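A sketch of fetching a managed prompt at runtime with the Python SDK. The prompt name and template variable are hypothetical; the prompt itself is created and versioned in the Langfuse UI.

```python
from langfuse import Langfuse

langfuse = Langfuse()

# Fetch the currently active version of a prompt managed in Langfuse.
prompt = langfuse.get_prompt("support-agent-system")

# Fill in the template variables defined on the prompt.
system_message = prompt.compile(product_name="Acme Cloud")

print(prompt.version)  # traces produced with this prompt can be tied back to this version
```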

Scores — Measuring What Matters

Numeric, boolean, or categorical — attach scores to any trace and track them over time.

Score           Type                  What It Measures
relevance       Numeric (0–1)         Does the response answer the question?
hallucination   Boolean (pass/fail)   Did it make stuff up?
groundedness    Numeric (0–1)         Are claims backed by retrieved docs?
latency_ok      Boolean (pass/fail)   Response under 3 seconds?

Scores are the heartbeat of your system. If you only do one thing from today, set up scores.
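A sketch of attaching scores to the current trace from inside your code, assuming the v2-style decorator context (you can also post scores by trace ID or add them in the UI; the values here are illustrative):

```python
from langfuse.decorators import observe, langfuse_context

@observe()
def handle_query_scored(query: str) -> str:
    answer = handle_query(query)  # the traced pipeline from the earlier sketch

    # Attach scores to the trace this function is running inside.
    langfuse_context.score_current_trace(
        name="latency_ok",
        value=1,            # pass/fail expressed as 1/0
        comment="under 3s",
    )
    langfuse_context.score_current_trace(
        name="relevance",
        value=0.9,          # numeric 0–1
    )
    return answer
```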

LLM-as-a-Judge — Automated Evaluation

How it works

Your agent produces a response

Langfuse sends (query + response + context) to a Judge LLM

Judge returns a score + reasoning

Score is logged to the trace automatically

Watch out for

  • Verbosity bias: longer ≠ better
  • Position bias: first option preferred
  • Self-preference: GPT-4 rates GPT-4 higher
  • Always validate against human review

LLM judges scale your evaluation.
Spot-check with humans to calibrate.
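Langfuse can run managed LLM-as-a-Judge evaluators configured in the UI; the sketch below is the same loop written by hand, so you can see the moving parts. The judge model, prompt, and score name are assumptions, not a prescribed setup.

```python
from langfuse import Langfuse
from openai import OpenAI

langfuse = Langfuse()
judge = OpenAI()

def judge_relevance(trace_id: str, query: str, response: str) -> None:
    # Ask a second model to grade the agent's response...
    verdict = judge.chat.completions.create(
        model="gpt-4o-mini",  # example judge model
        messages=[{
            "role": "user",
            "content": (
                "Rate from 0 to 1 how well the response answers the question. "
                f"Question: {query}\nResponse: {response}\n"
                "Reply with only the number."
            ),
        }],
    )
    score = float(verdict.choices[0].message.content.strip())

    # ...and log the score against the trace that produced the response.
    langfuse.score(trace_id=trace_id, name="relevance_llm_judge", value=score)
```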

Human Annotation

The Quality Gut Check

Annotation Queue

Read the query, read the response, score it. Built into Langfuse.

Human vs. Automated

Compare human scores against LLM-as-a-Judge scores. If they disagree, fix the judge.

Team Workflows

Set up annotation workflows for your team. Sample 10–20 traces a week.

LLM judges scale. Humans calibrate. Use both.

Datasets

Your Golden Test Cases

Curated Test Cases

Input queries + expected outputs. Your source of truth for quality.

Run Experiments

Run your agent against the dataset. Compare run A vs run B side by side.

Aggregated Scores

See how scores aggregate per run. Did your change make things better or worse?

Regression testing for AI. Change a prompt, run your dataset, know the impact.
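A sketch of that loop, assuming the v2-style SDK (the dataset name, run name, test case, and exact-match score are illustrative, and helper names may differ across SDK versions):

```python
from langfuse import Langfuse

langfuse = Langfuse()

# One-time: create the dataset and add golden test cases.
langfuse.create_dataset(name="agent-golden-set")
langfuse.create_dataset_item(
    dataset_name="agent-golden-set",
    input={"query": "How do I reset my password?"},
    expected_output="Step-by-step reset instructions.",
)

# Every time you change a prompt or model: run the agent over the dataset.
dataset = langfuse.get_dataset("agent-golden-set")
for item in dataset.items:
    with item.observe(run_name="prompt-v3") as trace_id:
        answer = handle_query(item.input["query"])  # your @observe()-traced agent
        # Attach a score so runs can be compared side by side in Langfuse.
        exact_match = float(answer.strip() == item.expected_output)
        langfuse.score(trace_id=trace_id, name="exact_match", value=exact_match)
```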

Your Turn. Ship & Showcase.

Demo format (5 min each)

The Problem — What does it solve?

0–1 min

Live Demo — Real query, real response

1–3 min

Langfuse — Show a trace + scores

3–4 min

Learnings — What surprised you?

4–5 min

What you built in 4 weeks

Week 1: FastAPI + OpenAI

Week 2: RAG + vector search

Week 3: Multi-agent with ADK

Week 4: Langfuse — production-ready

Post it. Right now.

I just shipped my first production-ready
AI system.

4 weeks in @The AI Internship's
AI Engineering Bootcamp:

→ Week 1: FastAPI + OpenAI backend
→ Week 2: RAG with vector search
→ Week 3: Multi-agent workflows with ADK
→ Week 4: Evals, observability & shipping
  with Langfuse

The biggest lesson? Building a demo is
easy. Making it production-ready is a
completely different game.

[Link to your deployed project]

#AIEngineering #BuildInPublic
#TheAIInternship

Tag us. We reshare the best builds. · theaiinternship.com