No Vibes, Just Evals: Measure Your AI Before You Ship

You run the prompt a few times. It looks good. You ship. Three days later a user reports something wrong, but you cannot tell whether it started failing before or after the last deploy. This is what "it seems good" buys you: no signal, no accountability, and no way to improve systematically. The fix is evals.

What Vibes-Based Review Actually Measures

When you eyeball AI output without a formal process, you are measuring your own mood, your familiarity with the topic, and the five examples you happened to check. That is not a quality signal. It is a noise signal. You will miss regressions on edge cases you never thought to test. You will approve outputs that look fluent but are factually wrong. And you will have no baseline to compare against after the next model update.

Quantitative evals solve this by replacing your impression with a number. A number can be compared. A number can regress. A number can gate a deploy.

Building a Test Set

A test set is a collection of inputs with known-good expected outputs. Fifty to a hundred examples is enough to start. Curate them from real user queries where possible, adding edge cases and known failure modes you have observed. Store them in a JSON file so they are version-controlled alongside your code.

For each example, record at minimum: the input, the expected output or expected properties of the output, and a label for the type of case (happy path, edge case, adversarial). This lets you slice your eval results by category and see exactly where a change helped or hurt.

Choosing the Right Metrics

Not every task needs the same metric. Pick the one that matches your feature's contract with the user.

Exact match works for classification tasks: sentiment, category assignment, yes/no decisions. Semantic similarity (using sentence embeddings) works for generation tasks where the exact wording can vary but the meaning should be consistent. Format compliance is a binary check that every feature should include: did the model return the expected schema or structure?

Prompt

"You are an evaluator. Given the following expected output and actual output from an AI system, score the actual output from 0 to 1 on faithfulness (does it contain only information present in the expected output?) and completeness (does it cover all key points from the expected output?). Return JSON with keys: faithfulness_score, completeness_score, and reasoning."

Running Evals Before Every Deploy

An eval that runs once is an audit. An eval that runs on every deploy is a guardrail. Wire your eval suite into CI so it runs automatically on every pull request.

import json
import anthropic

client = anthropic.Anthropic()

def run_evals(test_set_path: str, threshold: float = 0.85) -> dict:
    with open(test_set_path) as f:
        tests = json.load(f)

    results = []
    for test in tests:
        response = client.messages.create(
            model="claude-opus-4-5",
            max_tokens=512,
            messages=[{"role": "user", "content": test["input"]}]
        )
        actual = response.content[0].text
        score = score_response(actual, test["expected"])
        results.append({"id": test["id"], "score": score, "passed": score >= threshold})

    pass_rate = sum(r["passed"] for r in results) / len(results)
    return {"pass_rate": pass_rate, "passed": pass_rate >= threshold, "results": results}

if __name__ == "__main__":
    report = run_evals("tests/eval_set.json")
    print(f"Pass rate: {report['pass_rate']:.1%}")
    if not report["passed"]:
        raise SystemExit("Evals failed. Deploy blocked.")

Set a pass rate threshold that matches your risk tolerance. Start at 80 percent and raise it as your test set matures. A failing eval run should block the deploy, just like a failing unit test.

Want to build this live with Aki?

Join a Lightning Lesson and go deeper on this topic. Browse upcoming sessions →

No Vibes, Just Evals: Measure Your AI Before You Ship

Key Takeaways

What Vibes-Based Review Actually Measures

Building a Test Set

Choosing the Right Metrics

Running Evals Before Every Deploy

Want to build this live with Aki?

Aki Wijesundara

Ready to Launch Your AI Career?

Table of Contents

Share Article

Get Weekly AI Career Tips