LLM as Judge: Automatically Grade AI Quality
Using a second LLM call to evaluate the first is the most scalable way to measure AI output quality, and building it right takes less than an afternoon.
Your AI feature is in production. Users are sending hundreds of requests a day. How do you know which responses are good? Manual review of even a small sample is expensive and inconsistent. Automated rule checks catch format errors but miss semantic quality. The practical answer is LLM-as-judge: you use a second, carefully prompted LLM call to evaluate the output of the first one.
The Judge Prompt Pattern
A judge prompt has three parts: the evaluation task description, the rubric (what criteria to score on), and the output schema (how to return the scores). The task description should explain exactly what kind of output is being evaluated and what "quality" means for that output. Vague judge prompts produce inconsistent scores.
Prompt
"You are evaluating an AI customer support response. Score it on three dimensions, each from 0 to 1: accuracy (does the response correctly address the user's question based only on the provided context?), tone (is the response professional and empathetic?), and completeness (does it fully resolve the user's request without requiring follow-up?). Return JSON with keys: accuracy, tone, completeness, and overall_score (the mean of the three)."
Here is a minimal implementation that wraps this judge prompt into a reusable function:
import json

import anthropic

# Reads the API key from the ANTHROPIC_API_KEY environment variable.
client = anthropic.Anthropic()

# Literal JSON braces are doubled so str.format() fills only the three placeholders.
JUDGE_PROMPT = """You are evaluating an AI customer support response.
Score from 0 to 1 on: accuracy, tone, completeness.
Return JSON: {{"accuracy": float, "tone": float, "completeness": float, "overall_score": float}}
User question: {question}
Context provided: {context}
AI response: {response}"""


def judge_response(question: str, context: str, response: str) -> dict:
    result = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=256,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question,
                context=context,
                response=response,
            ),
        }],
    )
    # Assumes the reply is bare JSON; json.loads raises if the model adds extra text.
    return json.loads(result.content[0].text)
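The function above parses the reply directly, which fails if the judge wraps its JSON in a sentence or code fence. A minimal sketch of a more defensive parser follows. The names `parse_judge_output` and `CRITERIA` are hypothetical helpers, not part of the Anthropic SDK, and recomputing `overall_score` locally is an assumption about what you want rather than something the original prompt requires.

```python
import json

# Criteria named in the judge prompt above.
CRITERIA = ("accuracy", "tone", "completeness")


def parse_judge_output(raw_text: str) -> dict:
    """Extract and validate the judge's scores from raw model text."""
    # Take the outermost {...} span instead of assuming the text is pure JSON.
    start, end = raw_text.find("{"), raw_text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in judge output")
    data = json.loads(raw_text[start:end + 1])

    scores = {}
    for name in CRITERIA:
        value = float(data[name])  # KeyError if a criterion is missing
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} score {value} is outside [0, 1]")
        scores[name] = value

    # Recompute the mean locally rather than trusting the model's arithmetic.
    scores["overall_score"] = round(sum(scores.values()) / len(CRITERIA), 3)
    return scores
```

To use it, pass `result.content[0].text` to `parse_judge_output` in place of the bare `json.loads` call, and treat a raised exception as a failed judgment to retry or log.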
Designing a Scoring Rubric
A rubric is a set of named criteria with clear descriptions of what each score level means. Without a rubric, different judge calls will interpret "good" differently. With a rubric, the judge's scores are consistent and interpretable.
For each criterion, define at minimum what a score of 0, 0.5, and 1 looks like. For accuracy: 0 means factually incorrect, 0.5 means partially correct or incomplete, 1 means fully accurate and grounded in context. Write these definitions directly into your judge prompt. Specific language in the rubric reduces variance in scores.
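One way to keep these definitions in sync with the prompt is to store the rubric as data and render it into text. The sketch below does this under stated assumptions: the accuracy anchors come from the paragraph above, while the tone and completeness anchors are hypothetical placeholders you should replace with your own, and `render_rubric` is an illustrative helper name.

```python
RUBRIC = {
    # Anchors taken from the accuracy example in the text.
    "accuracy": {
        0: "factually incorrect",
        0.5: "partially correct or incomplete",
        1: "fully accurate and grounded in context",
    },
    # Hypothetical anchors for illustration only; write your own.
    "tone": {
        0: "rude or dismissive",
        0.5: "polite but impersonal",
        1: "professional and empathetic",
    },
    "completeness": {
        0: "does not address the request",
        0.5: "addresses the request but needs follow-up",
        1: "fully resolves the request without follow-up",
    },
}


def render_rubric(rubric: dict) -> str:
    """Turn the rubric dict into plain text for the judge prompt."""
    lines = []
    for criterion, levels in rubric.items():
        lines.append(f"{criterion}:")
        for score in sorted(levels):
            lines.append(f"  {score} = {levels[score]}")
    return "\n".join(lines)


print(render_rubric(RUBRIC))
```

The rendered text can be appended to the judge prompt before the question, context, and response fields, so that a rubric change only has to be made in one place.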
Calibrating the Judge Against Human Labels
Before you trust a judge, calibrate it. Collect 30 to 50 responses that humans have already scored. Run the judge on those same responses. Compute the Pearson correlation between judge scores and human scores. If the correlation is above 0.8, your judge is a good proxy for human quality. Below 0.6, revise the rubric and retry.
import numpy as np


def calibrate_judge(human_scores: list, judge_scores: list) -> dict:
    human = np.array(human_scores)
    judge = np.array(judge_scores)
    correlation = float(np.corrcoef(human, judge)[0, 1])
    mae = float(np.mean(np.abs(human - judge)))
    return {
        "correlation": round(correlation, 3),
        "mean_absolute_error": round(mae, 3),
        "is_calibrated": correlation >= 0.8
    }


# Example calibration run
human_labels = [0.9, 0.4, 1.0, 0.6, 0.8, 0.3, 0.7, 1.0, 0.5, 0.9]
judge_labels = [0.85, 0.5, 0.95, 0.55, 0.75, 0.4, 0.65, 0.9, 0.6, 0.85]
print(calibrate_judge(human_labels, judge_labels))
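The thresholds described above can be turned into an explicit decision, sketched below. The function name and the returned labels are hypothetical. The text does not say what to do between 0.6 and 0.8, so the middle branch is an assumption, as is treating fewer than 30 samples as too few to judge.

```python
def interpret_correlation(correlation: float, n_samples: int) -> str:
    """Map a judge-vs-human correlation to a suggested next step."""
    if n_samples < 30:
        # Assumption: below the suggested 30 to 50 labels, gather more first.
        return "collect_more_labels"
    if correlation >= 0.8:
        return "trust_judge"
    if correlation < 0.6:
        return "revise_rubric"
    # Assumption: the 0.6 to 0.8 band is not defined in the text, so this
    # sketch suggests reading the cases where judge and humans disagree.
    return "review_disagreements"
```

Note that the ten-item example run above would return `collect_more_labels` under this sketch, since it is a demonstration rather than a full calibration set.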
Calibration is not a one-time step. Re-run it after changing your judge prompt, switching models, or expanding into a new use case. A judge that was well-calibrated for support responses may not transfer well to a different task.
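A lightweight way to notice when recalibration is due is to fingerprint the judge configuration and compare it with the fingerprint recorded at the last calibration. This is a sketch under assumptions: `judge_fingerprint` and `needs_recalibration` are hypothetical helpers, and where you store the previous fingerprint (a file, a database row) is left open.

```python
import hashlib
import json
from typing import Optional


def judge_fingerprint(prompt: str, model: str, use_case: str) -> str:
    """Hash the three things whose change should trigger recalibration."""
    payload = json.dumps(
        {"prompt": prompt, "model": model, "use_case": use_case},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def needs_recalibration(current: str, last_calibrated: Optional[str]) -> bool:
    """True if the judge was never calibrated or its config has changed."""
    return current != last_calibrated
```

Calling `needs_recalibration` before each evaluation run gives a cheap reminder that a prompt edit or model switch has invalidated the earlier calibration result.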
Aki Wijesundara