LLM as Judge: Automatically Grade AI Quality

Your AI feature is in production. Users are sending hundreds of requests a day. How do you know which responses are good? Manual review of even a small sample is expensive and inconsistent. Automated rule checks catch format errors but miss semantic quality. The practical answer is LLM-as-judge: you use a second, carefully prompted LLM call to evaluate the output of the first one.

The Judge Prompt Pattern

A judge prompt has three parts: the evaluation task description, the rubric (what criteria to score on), and the output schema (how to return the scores). The task description should explain exactly what kind of output is being evaluated and what "quality" means for that output. Vague judge prompts produce inconsistent scores.

Prompt

"You are evaluating an AI customer support response. Score it on three dimensions, each from 0 to 1: accuracy (does the response correctly address the user's question based only on the provided context?), tone (is the response professional and empathetic?), and completeness (does it fully resolve the user's request without requiring follow-up?). Return JSON with keys: accuracy, tone, completeness, and overall_score (the mean of the three)."

Here is a minimal implementation that wraps this judge prompt into a reusable function:

import anthropic
import json

client = anthropic.Anthropic()

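# Literal braces in the JSON schema are doubled so str.format() leaves them
# alone; single braces here would raise a KeyError when the prompt is formatted.
JUDGE_PROMPT = """You are evaluating an AI customer support response.
Score from 0 to 1 on: accuracy, tone, completeness.
Return JSON: {{"accuracy": float, "tone": float, "completeness": float, "overall_score": float}}

User question: {question}
Context provided: {context}
AI response: {response}"""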

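def judge_response(question: str, context: str, response: str) -> dict:
    """Score one support response with the judge model and return the parsed JSON."""
    result = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=256,  # the score object is small, so a low cap is enough
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question,
                context=context,
                response=response
            )
        }]
    )
    # Assumes the reply is bare JSON; json.loads raises if the model adds extra text.
    return json.loads(result.content[0].text)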
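
The function above hands the raw reply straight to json.loads, which fails if the judge adds any text around the JSON. A minimal sketch of a more defensive parser follows. The helper name parse_judge_output and the CRITERIA tuple are illustrative assumptions, not part of any library, and recomputing overall_score locally is a design choice rather than something the prompt requires.

```python
import json

# Assumed criteria, taken from the judge prompt in this article.
CRITERIA = ("accuracy", "tone", "completeness")

def parse_judge_output(raw_text: str) -> dict:
    """Extract the JSON object from a judge reply and sanity-check the scores."""
    # Tolerate extra text around the JSON by slicing from the first "{" to the last "}".
    start, end = raw_text.find("{"), raw_text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in judge output")
    data = json.loads(raw_text[start:end + 1])

    scores = {}
    for name in CRITERIA:
        if name not in data:
            raise ValueError(f"missing criterion: {name}")
        value = float(data[name])
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} score {value} is outside [0, 1]")
        scores[name] = value

    # Recompute the mean locally instead of trusting the model's arithmetic.
    scores["overall_score"] = round(sum(scores.values()) / len(CRITERIA), 3)
    return scores
```

Rejecting out-of-range or missing scores early keeps malformed judge replies out of your quality metrics, where they would be much harder to spot.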

Designing a Scoring Rubric

A rubric is a set of named criteria with clear descriptions of what each score level means. Without a rubric, different judge calls will interpret "good" differently. With a rubric, the judge's scores become more consistent and easier to interpret, because every score maps back to a written definition.

For each criterion, define at minimum what a score of 0, 0.5, and 1 looks like. For accuracy: 0 means factually incorrect, 0.5 means partially correct or incomplete, 1 means fully accurate and grounded in context. Write these definitions directly into your judge prompt. Specific language in the rubric reduces variance in scores.
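
One way to keep these definitions in sync with the prompt is to store the rubric as data and render it into text. The sketch below assumes a plain dictionary layout and a hypothetical render_rubric helper. The accuracy levels come from the paragraph above; the tone and completeness levels are illustrative placeholders you would replace with your own.

```python
# Score-level definitions per criterion. Only the accuracy levels come from
# the article; the tone and completeness levels are illustrative assumptions.
RUBRIC = {
    "accuracy": {
        0.0: "factually incorrect",
        0.5: "partially correct or incomplete",
        1.0: "fully accurate and grounded in context",
    },
    "tone": {
        0.0: "rude or dismissive",
        0.5: "polite but impersonal",
        1.0: "professional and empathetic",
    },
    "completeness": {
        0.0: "does not address the request",
        0.5: "addresses the request but needs follow-up",
        1.0: "fully resolves the request without follow-up",
    },
}

def render_rubric(rubric: dict) -> str:
    """Turn the rubric dictionary into plain text for the judge prompt."""
    lines = []
    for criterion, levels in rubric.items():
        lines.append(f"{criterion}:")
        for score in sorted(levels):
            # The "g" format prints 0.0 as "0" and 1.0 as "1".
            lines.append(f"  {score:g} = {levels[score]}")
    return "\n".join(lines)

print(render_rubric(RUBRIC))
```

The rendered text can be placed in the judge prompt ahead of the question, context, and response, so a change to the rubric only has to be made in one place.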

Calibrating the Judge Against Human Labels

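Before you trust a judge, calibrate it. Collect 30 to 50 responses that humans have already scored, using the same 0 to 1 scale as the judge. Run the judge on those same responses. Compute the Pearson correlation between judge scores and human scores, and check the mean absolute error as well, since a judge can correlate well with humans while still scoring consistently higher or lower. If the correlation is 0.8 or above, your judge is a reasonable proxy for human quality. Below 0.6, revise the rubric and retry.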

import numpy as np

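def calibrate_judge(human_scores: list, judge_scores: list) -> dict:
    """Compare judge scores with human scores for the same responses, in the same order."""
    human = np.array(human_scores)
    judge = np.array(judge_scores)
    # Pearson correlation: the off-diagonal entry of the 2x2 correlation matrix.
    correlation = float(np.corrcoef(human, judge)[0, 1])
    # Mean absolute error: the average gap between judge and human scores.
    mae = float(np.mean(np.abs(human - judge)))
    return {
        "correlation": round(correlation, 3),
        "mean_absolute_error": round(mae, 3),
        "is_calibrated": correlation >= 0.8
    }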

# Example calibration run
human_labels = [0.9, 0.4, 1.0, 0.6, 0.8, 0.3, 0.7, 1.0, 0.5, 0.9]
judge_labels = [0.85, 0.5, 0.95, 0.55, 0.75, 0.4, 0.65, 0.9, 0.6, 0.85]
print(calibrate_judge(human_labels, judge_labels))
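
When calibration falls short, the aggregate numbers do not say which part of the rubric to revise. A simple follow-up, sketched here with a hypothetical largest_disagreements helper, is to list the responses where the judge and the human label differ most and read those cases by hand.

```python
def largest_disagreements(human_scores: list, judge_scores: list, top_n: int = 3) -> list:
    """Return (index, human, judge, gap) tuples for the biggest score gaps."""
    if len(human_scores) != len(judge_scores):
        raise ValueError("score lists must have the same length")
    gaps = [
        (index, human, judge, round(abs(human - judge), 3))
        for index, (human, judge) in enumerate(zip(human_scores, judge_scores))
    ]
    # Largest gap first, so the most informative cases come out on top.
    gaps.sort(key=lambda row: row[3], reverse=True)
    return gaps[:top_n]

# Illustrative data, not real evaluation results.
for index, human, judge, gap in largest_disagreements(
    [1.0, 0.0, 0.5, 0.8], [0.9, 0.6, 0.5, 0.3], top_n=2
):
    print(f"response {index}: human={human} judge={judge} gap={gap}")
```

If the worst cases cluster around one criterion or one kind of request, that is usually the rubric definition to tighten first.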

Calibration is not a one-time step. Re-run it after changing your judge prompt, switching models, or expanding into a new use case. A judge that was well-calibrated for support responses may not transfer well to a different task.
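
One lightweight way to notice that a calibration has gone stale is to fingerprint the judge configuration and store that fingerprint next to the calibration result. The sketch below is an assumption about how you might track this, and the function names are hypothetical. It only detects prompt and model changes; a move to a new use case still needs a human decision.

```python
import hashlib
from typing import Optional

def judge_fingerprint(prompt_template: str, model: str) -> str:
    """Short hash that changes whenever the judge prompt or model changes."""
    payload = f"{model}\n{prompt_template}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]

def needs_recalibration(current: str, calibrated: Optional[str]) -> bool:
    """True if the judge was never calibrated or has changed since calibration."""
    return calibrated is None or current != calibrated

# Hypothetical usage: compare the live judge against the last calibrated one.
calibrated_fp = judge_fingerprint("Score the response from 0 to 1.", "model-a")
current_fp = judge_fingerprint("Score the response from 0 to 1 on accuracy.", "model-a")
print(needs_recalibration(current_fp, calibrated_fp))
```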

Aki Wijesundara
