
Observability, evaluation, and shipping with Langfuse
Dr. Aki Wijesundara
AI Engineering Bootcamp · The AI Internship
Langfuse gives you eyes on your agent.
Bird's-eye view: traces, costs, scores, latency
See every LLM call, tool use, and decision
Group traces by user session
Version control your prompts
Track quality metrics over time
Automate evaluation with a second LLM
Manual review queue for quality checks
Golden test cases for regression testing
We're going to walk through each of these. Live.
The Bird's-Eye View
Total traces tracked + volume over time
Spend by model, by type, by user
Moving average per score over time
Latency percentiles per trace, generation, and span
Are traces flowing? Is cost spiking? Are scores dropping?
This is the first thing you check every morning.
User Query
└── Trace starts
    ├── Span: query analysis
    ├── Span: document retrieval (vector search)
    ├── Span: LLM generation
    └── Span: response validation
└── Trace ends
Every function decorated with @observe() becomes a span.
Parent → child span hierarchy
What went in, what came out
Cost and token usage per span and per trace
Where time is spent
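A minimal sketch of that instrumentation with the Langfuse Python SDK (v2-style decorator import; the function names and the retrieval/generation logic are placeholders, not part of the source):

```python
from langfuse.decorators import observe  # Langfuse Python SDK v2-style import

@observe()  # top-level call becomes the root trace
def answer_query(query: str) -> str:
    docs = retrieve_documents(query)       # child span
    return generate_answer(query, docs)    # child span

@observe()  # nested call -> child span with its own input, output, and latency
def retrieve_documents(query: str) -> list[str]:
    return ["doc snippet 1", "doc snippet 2"]  # placeholder for your vector search

@observe(as_type="generation")  # marked as a generation so model, tokens, and cost are tracked
def generate_answer(query: str, docs: list[str]) -> str:
    return f"Answer to {query!r} based on {len(docs)} documents"  # placeholder for your LLM call
```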
Grouping Traces by Conversation and User
One conversation = one session. Replay an entire user conversation end-to-end.
Traces per user, cost per user. Spot who's hitting edge cases.
Drill into specific users, sessions, and behavior patterns.
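For that grouping to work, each trace needs a session_id and user_id. A minimal sketch, assuming the v2 Python SDK; the identifiers and the agent call are illustrative:

```python
from langfuse.decorators import observe, langfuse_context

@observe()
def handle_message(message: str, user_id: str, conversation_id: str) -> str:
    # Attach identifiers so Langfuse can group traces into sessions and per-user views
    langfuse_context.update_current_trace(
        session_id=conversation_id,  # one conversation = one session
        user_id=user_id,             # enables traces per user, cost per user
    )
    return run_agent(message)  # placeholder for your agent call

def run_agent(message: str) -> str:
    return "stub answer"
```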
Version Control for Prompts
Manage prompt versions (v1, v2, v3) directly in Langfuse. No more changing prompts in code and hoping for the best.
Every trace knows which prompt version produced it. Compare performance across versions.
Test prompts directly in Langfuse before deploying. Check scores to see which version performs better.
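Fetching a managed prompt at runtime might look like this, as a sketch assuming the Python SDK; the prompt name, label, and template variables are made up:

```python
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_* keys from the environment

# Pull the version currently labeled "production" in Langfuse
prompt = langfuse.get_prompt("support-agent-system", label="production")

# Fill in the template variables and use the result in your LLM call
system_message = prompt.compile(product_name="Acme", tone="concise")

# To make each trace record which version produced it, pass the prompt object
# to the traced generation (e.g. langfuse_context.update_current_observation(prompt=prompt)).
```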
Numeric, boolean, or categorical — attach scores to any trace and track them over time.
| Score | Type | What It Measures |
|---|---|---|
| relevance | Numeric (0–1) | Does the response answer the question? |
| hallucination | Boolean (pass/fail) | Did it make stuff up? |
| groundedness | Numeric (0–1) | Are claims backed by retrieved docs? |
| latency_ok | Boolean (pass/fail) | Response under 3 seconds? |
Scores are the heartbeat of your system. If you only do one thing from today, set up scores.
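Attaching a score to the current trace is one call. A sketch assuming the v2 Python SDK; the score values and the agent stub are illustrative:

```python
from langfuse.decorators import observe, langfuse_context

def run_agent(query: str) -> str:
    return "stub answer"  # placeholder for your agent

@observe()
def answer_and_score(query: str) -> str:
    response = run_agent(query)

    # Numeric score (0–1) attached to the current trace
    langfuse_context.score_current_trace(name="relevance", value=0.85)
    # Boolean-style score logged as 0/1
    langfuse_context.score_current_trace(name="latency_ok", value=1)
    return response
```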
Your agent produces a response
Langfuse sends (query + response + context) to a Judge LLM
Judge returns a score + reasoning
Score is logged to the trace automatically
LLM judges scale your evaluation.
Spot-check with humans to calibrate.
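Langfuse's hosted evaluators are configured in the UI; if you roll your own judge instead, the loop looks roughly like this. The judge model, prompt wording, and JSON parsing are assumptions, not the source's setup:

```python
import json
from openai import OpenAI
from langfuse import Langfuse

client = OpenAI()
langfuse = Langfuse()

def judge_and_log(trace_id: str, query: str, response: str, context: str) -> None:
    # Ask a second model to grade the response against the retrieved context
    judgement = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model
        messages=[{
            "role": "user",
            "content": (
                "Score groundedness from 0 to 1 and explain briefly. "
                'Reply as JSON: {"score": <float>, "reasoning": "<text>"}\n\n'
                f"Question: {query}\nContext: {context}\nAnswer: {response}"
            ),
        }],
        response_format={"type": "json_object"},
    )
    verdict = json.loads(judgement.choices[0].message.content)

    # Log the judge's score and reasoning back onto the original trace
    langfuse.score(
        trace_id=trace_id,
        name="groundedness",
        value=float(verdict["score"]),
        comment=verdict["reasoning"],
    )
```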
The Quality Gut Check
Read the query, read the response, score it. Built into Langfuse.
Compare human scores against LLM-as-a-Judge scores. If they disagree, fix the judge.
Set up annotation workflows for your team. Sample 10–20 traces a week.
LLM judges scale. Humans calibrate. Use both.
Your Golden Test Cases
Input queries + expected outputs. Your source of truth for quality.
Run your agent against the dataset. Compare run A vs run B side by side.
See how scores aggregate per run. Did your change make things better or worse?
Regression testing for AI. Change a prompt, run your dataset, know the impact.
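A regression run over a Langfuse dataset might look like this, as a sketch assuming the v2 Python SDK; the dataset name, run name, agent stub, and exact-match scoring are placeholders:

```python
from langfuse import Langfuse
from langfuse.decorators import observe

langfuse = Langfuse()

@observe()
def run_agent(query: str) -> str:
    return "stub answer"  # placeholder for your agent

dataset = langfuse.get_dataset("golden-test-cases")  # assumed dataset name

for item in dataset.items:
    # Links the resulting trace to this dataset item under the given run name
    with item.observe(run_name="prompt-v2") as trace_id:
        output = run_agent(item.input)

        # Simple exact-match score; swap in your own comparison or a judge
        langfuse.score(
            trace_id=trace_id,
            name="exact_match",
            value=1 if output == item.expected_output else 0,
        )

langfuse.flush()  # make sure all events are sent before the script exits
```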
0–1 min: The Problem — What does it solve?
1–3 min: Live Demo — Real query, real response
3–4 min: Langfuse — Show a trace + scores
4–5 min: Learnings — What surprised you?
Week 1: FastAPI + OpenAI
Week 2: RAG + vector search
Week 3: Multi-agent with ADK
Week 4: Langfuse — production-ready
I just shipped my first production-ready
AI system.
4 weeks in @The AI Internship's
AI Engineering Bootcamp:
→ Week 1: FastAPI + OpenAI backend
→ Week 2: RAG with vector search
→ Week 3: Multi-agent workflows with ADK
→ Week 4: Evals, observability & shipping
with Langfuse
The biggest lesson? Building a demo is
easy. Making it production-ready is a
completely different game.
[Link to your deployed project]
#AIEngineering #BuildInPublic
#TheAIInternship
Tag us. We reshare the best builds. · theaiinternship.com