Watch Your AI Feature With Langfuse

You cannot fix what you cannot see. A traditional API service gives you logs, latency metrics, and error rates. An AI feature needs all of that plus something more: visibility into what the model received, what it returned, how long each step took, and whether the output was any good. Langfuse provides exactly that, and it takes about 15 minutes to wire into an existing Python app.

What Langfuse Tracks

Langfuse organizes observability data into three layers. A trace represents one end-to-end user interaction with your AI feature, from the moment a request arrives to the moment a response is returned. Within a trace, spans represent individual steps: a retrieval call, an LLM call, a post-processing step. Each span records its start time, end time, inputs, outputs, and metadata. Scores are quality signals attached to a trace: computed automatically by an LLM judge, logged manually during a human review session, or captured from user feedback such as a thumbs-down.

This three-layer model means you can answer questions like: which step is causing latency? On which traces did quality scores drop last Tuesday? What was the exact context sent to the model for this failing response?
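To make the three layers concrete, here is a minimal sketch in plain Python. It does not use the Langfuse SDK; the Span and Trace classes, their field names, and the sample values are hypothetical stand-ins chosen only to show how per-span timings answer the "which step is causing latency?" question.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One step inside a trace (hypothetical model, not the Langfuse SDK)."""
    name: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


@dataclass
class Trace:
    """One end-to-end interaction: its spans plus any attached scores."""
    trace_id: str
    spans: list[Span] = field(default_factory=list)
    scores: dict[str, float] = field(default_factory=dict)

    def slowest_span(self) -> Span:
        # The span with the largest duration is the latency bottleneck
        return max(self.spans, key=lambda s: s.duration_ms)


# Made-up example data for illustration
trace = Trace(
    trace_id="trace-001",
    spans=[
        Span("retrieval", start_ms=0, end_ms=120),
        Span("llm_call", start_ms=120, end_ms=1950),
        Span("post_processing", start_ms=1950, end_ms=1990),
    ],
    scores={"overall_quality": 0.4},
)

slowest = trace.slowest_span()
print(slowest.name, slowest.duration_ms)
```

In this made-up trace the LLM call dominates the total time, which is the kind of answer the span view in the dashboard is meant to give you at a glance.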

Instrumenting a LangChain App

If you are using LangChain, adding Langfuse observability takes a single callback handler. Every invocation that receives the handler generates a full trace, with a span for each step in the chain.

import os

from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage
# Import path and constructor arguments as in v2 of the Langfuse Python SDK;
# check the docs for your installed version, since later releases may differ.
from langfuse.callback import CallbackHandler

# Configure Langfuse
langfuse_handler = CallbackHandler(
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
    secret_key=os.environ["LANGFUSE_SECRET_KEY"],
    host="https://cloud.langfuse.com",
)

# Your existing LangChain setup
llm = ChatAnthropic(model="claude-opus-4-5")

def answer_question(user_input: str, session_id: str) -> str:
    messages = [HumanMessage(content=user_input)]

    # Pass the handler in config to trace this call
    result = llm.invoke(
        messages,
        config={
            "callbacks": [langfuse_handler],
            # The langfuse_ prefix lets the handler group traces into a Langfuse session
            "metadata": {"langfuse_session_id": session_id},
        },
    )
    return result.content

Every call to answer_question now appears as a trace in your Langfuse dashboard with full input/output visibility, token counts, and latency breakdown.

Finding Failures in the Dashboard

The Langfuse dashboard surfaces three types of failure that are hard to catch any other way. First, high-latency traces: sort traces by latency, start with the ones at or above your P95, then inspect which span is the bottleneck. Second, traces with user feedback signals: if you log thumbs-down events as scores, filter to traces with negative scores to read the exact conversation that disappointed a user. Third, traces with unexpected output lengths: a response that is 10x shorter than average is often a sign the model bailed out early or failed silently.
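The same three checks can also be run offline against exported traces. The sketch below is plain Python and assumes a hypothetical export format: a list of dicts with id, latency_ms, output, and an optional user_feedback field where 0 means thumbs-down. The nearest-rank P95 and the 10x threshold are illustrative choices, not Langfuse defaults.

```python
import math


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def find_suspects(traces):
    """Group trace ids by the three failure signals described above."""
    latency_cutoff = p95([t["latency_ms"] for t in traces])
    mean_length = sum(len(t["output"]) for t in traces) / len(traces)
    return {
        # At or above the P95 latency
        "slow": [t["id"] for t in traces if t["latency_ms"] >= latency_cutoff],
        # Thumbs-down logged as a score of 0 (assumed convention)
        "negative_feedback": [t["id"] for t in traces if t.get("user_feedback") == 0],
        # Output at least 10x shorter than the average output
        "short_output": [t["id"] for t in traces if len(t["output"]) * 10 <= mean_length],
    }


# Made-up export for illustration
traces = [
    {"id": "t1", "latency_ms": 800, "output": "x" * 200, "user_feedback": 1},
    {"id": "t2", "latency_ms": 900, "output": "x" * 220},
    {"id": "t3", "latency_ms": 5200, "output": "x" * 210, "user_feedback": 1},
    {"id": "t4", "latency_ms": 850, "output": "x" * 190, "user_feedback": 0},
    {"id": "t5", "latency_ms": 700, "output": "x" * 5},
]

suspects = find_suspects(traces)
print(suspects)
```

A trace can land in more than one bucket, and those overlaps are usually the first ones worth opening in the dashboard.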

Prompt

"You are analyzing a set of AI feature traces exported from Langfuse. Each trace includes the user input, the model output, total latency, and an optional quality score. Identify the top three failure patterns across these traces. For each pattern, describe what it looks like, how many traces it affects, and what change would most likely fix it."

Automated Scores

Langfuse lets you post scores back to any trace via its Python SDK. Connect this to your LLM judge to automatically score every production trace and surface quality trends in the dashboard alongside your latency and error data.

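from langfuse import Langfuse

# Reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST from the environment
langfuse = Langfuse()

def score_trace(trace_id: str, response: str, question: str, context: str):
    # judge_response is your own LLM judge, defined elsewhere; it is assumed to
    # return a dict with "overall_score", "accuracy", and "tone" keys
    scores = judge_response(question, context, response)

    # Attach the judge's verdict to the existing trace
    # (method name as in v2 of the Langfuse Python SDK)
    langfuse.score(
        trace_id=trace_id,
        name="overall_quality",
        value=scores["overall_score"],
        comment=f"accuracy={scores['accuracy']}, tone={scores['tone']}",
    )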
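
The article does not show judge_response, so its internals are left open here. One part that is easy to get wrong is turning the judge model's raw text into the dict that score_trace expects. The sketch below assumes the judge replies with a JSON object and that all three values sit on a 0 to 1 scale; the parse_judge_output name, the key list, and the scale are assumptions for illustration.

```python
import json

# Keys that score_trace reads from the judge's result
REQUIRED_KEYS = ("overall_score", "accuracy", "tone")


def parse_judge_output(raw: str) -> dict:
    """Validate the judge's JSON reply before it is posted as a score."""
    data = json.loads(raw)
    scores = {}
    for key in REQUIRED_KEYS:
        if key not in data:
            raise ValueError(f"judge output is missing {key!r}")
        value = float(data[key])
        # Assumed 0 to 1 scale; reject anything outside it rather than guess
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{key} out of range: {value}")
        scores[key] = value
    return scores


parsed = parse_judge_output('{"overall_score": 0.8, "accuracy": 0.9, "tone": 0.7}')
print(parsed)
```

Rejecting malformed judge output before it reaches Langfuse keeps bad values out of your quality trends, at the cost of leaving some traces unscored.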

With automated scoring in place, your Langfuse dashboard shows both operational metrics and quality metrics in the same view. That combination, latency plus quality, is what separates an AI team that is guessing from one that is operating with evidence.

Aki Wijesundara

Ready to Launch Your AI Career?

Table of Contents

Share Article

Get Weekly AI Career Tips