
Observability, evaluation, and shipping with Langfuse
Dr. Aki Wijesundara
AI Engineering Bootcamp · The AI Internship
Langfuse gives you eyes on your agent.
Bird's-eye view: traces, costs, scores, latency
See every LLM call, tool use, and decision
Group traces by user session
Version control your prompts
Track quality metrics over time
Automate evaluation with a second LLM
Manual review queue for quality checks
Golden test cases for regression testing
We're going to walk through each of these. Live.
The Bird's-Eye View
Total traces tracked + volume over time
Spend by model, by type, by user
Moving average per score over time
Latency percentiles per trace, generation, and span
Are traces flowing? Is cost spiking? Are scores dropping?
This is the first thing you check every morning.
User Query
└── Trace starts
    ├── Span: query analysis
    ├── Span: document retrieval (vector search)
    ├── Span: LLM generation
    └── Span: response validation
└── Trace ends
Every function decorated with @observe() becomes a span.
Parent → child span hierarchy
What went in, what came out
Cost and token usage per span and per trace
Where time is spent
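A minimal sketch of that instrumentation with the Langfuse Python SDK (v2-style decorator import; the function names and the retrieval/generation logic are placeholders, not part of the source):

```python
from langfuse.decorators import observe  # Langfuse Python SDK v2-style import

@observe()  # top-level call becomes the root trace
def answer_query(query: str) -> str:
    docs = retrieve_documents(query)       # child span
    return generate_answer(query, docs)    # child span

@observe()  # nested call -> child span with its own input, output, and latency
def retrieve_documents(query: str) -> list[str]:
    return ["doc snippet 1", "doc snippet 2"]  # placeholder for your vector search

@observe(as_type="generation")  # marked as a generation so model, tokens, and cost are tracked
def generate_answer(query: str, docs: list[str]) -> str:
    return f"Answer to {query!r} based on {len(docs)} documents"  # placeholder for your LLM call
```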
Grouping Traces by Conversation and User
One conversation = one session. Replay an entire user conversation end-to-end.
Traces per user, cost per user. Spot who's hitting edge cases.
Drill into specific users, sessions, and behavior patterns.
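For that grouping to work, each trace needs a session_id and user_id. A minimal sketch, assuming the v2 Python SDK; the identifiers and the agent call are illustrative:

```python
from langfuse.decorators import observe, langfuse_context

@observe()
def handle_message(message: str, user_id: str, conversation_id: str) -> str:
    # Attach identifiers so Langfuse can group traces into sessions and per-user views
    langfuse_context.update_current_trace(
        session_id=conversation_id,  # one conversation = one session
        user_id=user_id,             # enables traces per user, cost per user
    )
    return run_agent(message)  # placeholder for your agent call

def run_agent(message: str) -> str:
    return "stub answer"
```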
Version Control for Prompts
Manage prompt versions (v1, v2, v3) directly in Langfuse. No more changing prompts in code and hoping for the best.
Every trace knows which prompt version produced it. Compare performance across versions.
Test prompts directly in Langfuse before deploying. Check scores to see which version performs better.
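Fetching a managed prompt at runtime might look like this, as a sketch assuming the Python SDK; the prompt name, label, and template variables are made up:

```python
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_* keys from the environment

# Pull the version currently labeled "production" in Langfuse
prompt = langfuse.get_prompt("support-agent-system", label="production")

# Fill in the template variables and use the result in your LLM call
system_message = prompt.compile(product_name="Acme", tone="concise")

# To make each trace record which version produced it, pass the prompt object
# to the traced generation (e.g. langfuse_context.update_current_observation(prompt=prompt)).
```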
Numeric, boolean, or categorical — attach scores to any trace and track them over time.
| Score | Type | What It Measures |
|---|---|---|
| relevance | Numeric (0–1) | Does the response answer the question? |
| hallucination | Boolean (pass/fail) | Did it make stuff up? |
| groundedness | Numeric (0–1) | Are claims backed by retrieved docs? |
| latency_ok | Boolean (pass/fail) | Response under 3 seconds? |
Scores are the heartbeat of your system. If you only do one thing from today, set up scores.
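Attaching a score to the current trace is one call. A sketch assuming the v2 Python SDK; the score values and the agent stub are illustrative:

```python
from langfuse.decorators import observe, langfuse_context

def run_agent(query: str) -> str:
    return "stub answer"  # placeholder for your agent

@observe()
def answer_and_score(query: str) -> str:
    response = run_agent(query)

    # Numeric score (0–1) attached to the current trace
    langfuse_context.score_current_trace(name="relevance", value=0.85)
    # Boolean-style score logged as 0/1
    langfuse_context.score_current_trace(name="latency_ok", value=1)
    return response
```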
Your agent produces a response
Langfuse sends (query + response + context) to a Judge LLM
Judge returns a score + reasoning
Score is logged to the trace automatically
LLM judges scale your evaluation.
Spot-check with humans to calibrate.
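Langfuse's hosted evaluators are configured in the UI; if you roll your own judge instead, the loop looks roughly like this. The judge model, prompt wording, and JSON parsing are assumptions, not the source's setup:

```python
import json
from openai import OpenAI
from langfuse import Langfuse

client = OpenAI()
langfuse = Langfuse()

def judge_and_log(trace_id: str, query: str, response: str, context: str) -> None:
    # Ask a second model to grade the response against the retrieved context
    judgement = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model
        messages=[{
            "role": "user",
            "content": (
                "Score groundedness from 0 to 1 and explain briefly. "
                'Reply as JSON: {"score": <float>, "reasoning": "<text>"}\n\n'
                f"Question: {query}\nContext: {context}\nAnswer: {response}"
            ),
        }],
        response_format={"type": "json_object"},
    )
    verdict = json.loads(judgement.choices[0].message.content)

    # Log the judge's score and reasoning back onto the original trace
    langfuse.score(
        trace_id=trace_id,
        name="groundedness",
        value=float(verdict["score"]),
        comment=verdict["reasoning"],
    )
```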
The Quality Gut Check
Read the query, read the response, score it. Built into Langfuse.
Compare human scores against LLM-as-a-Judge scores. If they disagree, fix the judge.
Set up annotation workflows for your team. Sample 10–20 traces a week.
LLM judges scale. Humans calibrate. Use both.
Your Golden Test Cases
Input queries + expected outputs. Your source of truth for quality.
Run your agent against the dataset. Compare run A vs run B side by side.
See how scores aggregate per run. Did your change make things better or worse?
Regression testing for AI. Change a prompt, run your dataset, know the impact.
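A regression run over a Langfuse dataset might look like this, as a sketch assuming the v2 Python SDK; the dataset name, run name, agent stub, and exact-match scoring are placeholders:

```python
from langfuse import Langfuse
from langfuse.decorators import observe

langfuse = Langfuse()

@observe()
def run_agent(query: str) -> str:
    return "stub answer"  # placeholder for your agent

dataset = langfuse.get_dataset("golden-test-cases")  # assumed dataset name

for item in dataset.items:
    # Links the resulting trace to this dataset item under the given run name
    with item.observe(run_name="prompt-v2") as trace_id:
        output = run_agent(item.input)

        # Simple exact-match score; swap in your own comparison or a judge
        langfuse.score(
            trace_id=trace_id,
            name="exact_match",
            value=1 if output == item.expected_output else 0,
        )

langfuse.flush()  # make sure all events are sent before the script exits
```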
0–1 min: The Problem — What does it solve?
1–3 min: Live Demo — Real query, real response
3–4 min: Langfuse — Show a trace + scores
4–5 min: Learnings — What surprised you?
Week 1: FastAPI + OpenAI
Week 2: RAG + vector search
Week 3: Multi-agent with ADK
Week 4: Langfuse — production-ready
I just shipped my first production-ready
AI system.
4 weeks in @The AI Internship's
AI Engineering Bootcamp:
→ Week 1: FastAPI + OpenAI backend
→ Week 2: RAG with vector search
→ Week 3: Multi-agent workflows with ADK
→ Week 4: Evals, observability & shipping
with Langfuse
The biggest lesson? Building a demo is
easy. Making it production-ready is a
completely different game.
[Link to your deployed project]
#AIEngineering #BuildInPublic
#TheAIInternship
Tag us. We reshare the best builds. · theaiinternship.com