LLM as Judge: Score Output Automatically at Scale

You have a working LLM judge. It produces good scores on individual responses. Now you want to run it on every response your application produces in production. At a hundred responses per day that is straightforward. At ten thousand, it requires real engineering: batching for throughput, caching to avoid redundant calls, cost controls to prevent runaway spend, and a pipeline that stores scores where you can actually use them.

Batching for Throughput

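The simplest way to scale judge calls is to run them in parallel using a thread pool. Each judge call is independent, so there is no reason to run them sequentially, and because each call spends most of its time waiting on the network, threads work well here despite Python's global interpreter lock. With a pool of 10 workers, you can score 10 responses simultaneously and complete a batch of 100 in roughly the same time as scoring 10 sequentially, provided your API rate limit allows that many concurrent requests.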

from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import List

def score_batch(responses: List[dict], max_workers: int = 10) -> List[dict]:
    """Score a batch of responses in parallel.

    Results come back in completion order, not input order; match them by "id".
    """
    results = []

    def score_one(item):
        # judge_response is the single-response judge you already have;
        # it is expected to return a dict of scores.
        score = judge_response(
            question=item["question"],
            context=item["context"],
            response=item["response"]
        )
        return {"id": item["id"], **score}

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its input so failures can be reported by id.
        futures = {executor.submit(score_one, r): r for r in responses}
        for future in as_completed(futures):
            try:
                results.append(future.result())
            except Exception as e:
                # Record the failure instead of aborting the whole batch.
                item = futures[future]
                results.append({"id": item["id"], "error": str(e)})

    return results

In production, trigger this batch job on a schedule or after accumulating a buffer of unscored responses. Avoid scoring synchronously in the request path unless your latency budget allows for it.
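One way to picture the "accumulate a buffer" option is the minimal sketch below. `ScoreBuffer`, `flush_fn`, `max_items`, and `max_age_s` are hypothetical names, not part of any library. The idea is that the request path only appends to the buffer, while a background worker or scheduler periodically calls `flush_if_due()`, so no judge call ever runs inside a user request.

```python
import threading
import time
from typing import Callable, List, Optional


class ScoreBuffer:
    """Hypothetical helper: collect unscored responses, flush them as a batch.

    A flush is due when the buffer holds `max_items` items or when the
    oldest buffered item has waited at least `max_age_s` seconds.
    """

    def __init__(self, flush_fn: Callable[[List[dict]], None],
                 max_items: int = 100, max_age_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.flush_fn = flush_fn
        self.max_items = max_items
        self.max_age_s = max_age_s
        self.clock = clock  # injectable so the timing logic can be tested
        self._items: List[dict] = []
        self._oldest: Optional[float] = None
        self._lock = threading.Lock()

    def add(self, item: dict) -> None:
        """Cheap append, safe to call from the request path."""
        with self._lock:
            if not self._items:
                self._oldest = self.clock()
            self._items.append(item)

    def flush_if_due(self) -> int:
        """Call from a background worker. Returns the number of items flushed."""
        with self._lock:
            if not self._items:
                return 0
            age = self.clock() - self._oldest
            if len(self._items) < self.max_items and age < self.max_age_s:
                return 0
            batch, self._items = self._items, []
            self._oldest = None
        # Run the slow scoring work outside the lock so add() never blocks on it.
        self.flush_fn(batch)
        return len(batch)
```

In a real deployment `flush_fn` would typically wrap `score_batch` plus a write to storage, and the buffer itself would more likely be a durable queue or a database table than process memory, since an in-memory list is lost on restart.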

Cost Management and Caching

LLM judge calls add real cost. At scale, two practices keep costs under control. First, use a smaller, cheaper model for the judge than for the primary generation. Claude Haiku is well-suited for structured scoring tasks and costs a fraction of Opus or Sonnet. Second, cache scores for identical inputs. If your application sometimes surfaces the same response to multiple users, scoring it once and storing the result avoids redundant API calls.

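A hash of the judge's full input (question, context, and response) makes a good cache key. Hashing the response text alone is not enough, because the same response can deserve a different score when it answers a different question. Store scores in a key-value store like Redis with a TTL that matches how long you want to retain quality data. Do not cache indefinitely: as your rubric evolves, old cached scores become stale. Including a rubric or prompt version in the key retires old entries as soon as the rubric changes.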
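The caching idea can be sketched as follows. This is an illustration under assumptions, not a definitive implementation: the key here covers the question, context, response, and a hypothetical `RUBRIC_VERSION` string, so that a response is never matched against a score produced for a different question or an older rubric. `TTLCache` is an in-memory stand-in for a key-value store with expiry (with Redis, a `SET` with an expiry would play the same role), and `cached_judge` and `judge_fn` are illustrative names.

```python
import hashlib
import json
import time
from typing import Callable, Dict, Optional, Tuple

RUBRIC_VERSION = "v1"  # hypothetical: bump whenever the judge prompt or rubric changes


def cache_key(question: str, context: str, response: str,
              rubric_version: str = RUBRIC_VERSION) -> str:
    """Stable SHA-256 key over every input that can change the score."""
    payload = json.dumps(
        {"q": question, "c": context, "r": response, "v": rubric_version},
        sort_keys=True, ensure_ascii=False,
    )
    return "judge:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


class TTLCache:
    """In-memory stand-in for a key-value store with per-key expiry."""

    def __init__(self, ttl_s: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self._data: Dict[str, Tuple[float, dict]] = {}

    def get(self, key: str) -> Optional[dict]:
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._data[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key: str, value: dict) -> None:
        self._data[key] = (self.clock() + self.ttl_s, value)


def cached_judge(item: dict, judge_fn: Callable[..., dict], cache: TTLCache) -> dict:
    """Return a cached score if one exists, otherwise call the judge and store it."""
    key = cache_key(item["question"], item["context"], item["response"])
    hit = cache.get(key)
    if hit is not None:
        return hit
    score = judge_fn(question=item["question"], context=item["context"],
                     response=item["response"])
    cache.set(key, score)
    return score
```

Only successful scores are stored here; if `judge_fn` raises, nothing is cached and the next attempt calls the judge again.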

Building the Full Scoring Pipeline

A production scoring pipeline has four stages: collect responses from your primary application, score them with the judge, store the results with the original request metadata, and surface aggregates in a dashboard or alerting system.

import sqlite3
from datetime import datetime, timezone

def run_scoring_pipeline(db_path: str = "scores.db"):
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("""
            CREATE TABLE IF NOT EXISTS scores (
                id TEXT PRIMARY KEY,
                overall_score REAL,
                accuracy REAL,
                tone REAL,
                completeness REAL,
                scored_at TEXT
            )
        """)

        # Fetch unscored responses from your app database
        unscored = fetch_unscored_responses(limit=500)

        if not unscored:
            print("No unscored responses found.")
            return

        scores = score_batch(unscored)

        # One timezone-aware UTC timestamp per run
        # (datetime.utcnow() is deprecated as of Python 3.12).
        scored_at = datetime.now(timezone.utc).isoformat()

        # Items that failed to score are skipped here and only counted below.
        rows = [
            (s["id"], s.get("overall_score"), s.get("accuracy"),
             s.get("tone"), s.get("completeness"), scored_at)
            for s in scores if "error" not in s
        ]
        conn.executemany(
            "INSERT OR REPLACE INTO scores VALUES (?,?,?,?,?,?)", rows
        )
        conn.commit()
        print(f"Scored {len(rows)} responses. Errors: {len(scores) - len(rows)}")
    finally:
        # Close the connection on every path, including the early return above.
        conn.close()

Using the Scores

Scores stored in a database unlock three workflows. Near-real-time alerting: trigger a Slack message when the rolling average score drops below a threshold; because scoring runs in batches, the alert lags by roughly one scoring interval. Regression detection: compare this week's score distribution to last week's after every deploy. Root cause analysis: filter to low-scoring responses and cluster them by input type or topic to find systemic failure patterns. A scoring pipeline that runs continuously turns quality from a launch checklist into a living metric.
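The first two workflows reduce to simple queries against the `scores` table. The sketch below assumes `scored_at` holds ISO-8601 UTC strings in a consistent format, so that string comparison matches time order. The function names, the 24-hour window, and the 3.5 threshold are hypothetical; the threshold in particular depends on your score scale. Comparing weekly means is only a rough proxy for comparing full distributions.

```python
import sqlite3
from datetime import datetime, timedelta, timezone
from typing import Optional


def mean_score(conn: sqlite3.Connection, start: datetime,
               end: datetime) -> Optional[float]:
    """Average overall_score for start <= scored_at < end, or None if no rows match."""
    row = conn.execute(
        "SELECT AVG(overall_score) FROM scores "
        "WHERE scored_at >= ? AND scored_at < ?",
        (start.isoformat(), end.isoformat()),
    ).fetchone()
    return row[0]  # AVG over zero rows is NULL, which sqlite3 returns as None


def check_alert(conn: sqlite3.Connection, now: datetime,
                window: timedelta = timedelta(hours=24),
                threshold: float = 3.5) -> bool:
    """True when the rolling average over `window` falls below `threshold`."""
    avg = mean_score(conn, now - window, now)
    return avg is not None and avg < threshold


def week_over_week_delta(conn: sqlite3.Connection,
                         now: datetime) -> Optional[float]:
    """This week's mean minus last week's mean; negative means quality dropped."""
    week = timedelta(days=7)
    this_week = mean_score(conn, now - week, now)
    last_week = mean_score(conn, now - 2 * week, now - week)
    if this_week is None or last_week is None:
        return None  # not enough data to compare
    return this_week - last_week
```

A scheduled job could call `check_alert(conn, datetime.now(timezone.utc))` after each scoring run and post to your alerting channel when it returns `True`.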

Aki Wijesundara
