No Vibes, Just Evals: Measure Your AI Before You Ship
Subjective impressions of AI quality do not catch regressions, so here is how to replace them with quantitative evals that run before every deploy.
Key Takeaways
- Comprehensive strategies proven to work at top companies
- Actionable tips you can implement immediately
- Expert insights from industry professionals
You run the prompt a few times. It looks good. You ship. Three days later a user reports something wrong, but you cannot tell whether it started failing before or after the last deploy. This is what "it seems good" buys you: no signal, no accountability, and no way to improve systematically. The fix is evals.
What Vibes-Based Review Actually Measures
When you eyeball AI output without a formal process, you are measuring your own mood, your familiarity with the topic, and the five examples you happened to check. That is not a quality signal. It is a noise signal. You will miss regressions on edge cases you never thought to test. You will approve outputs that look fluent but are factually wrong. And you will have no baseline to compare against after the next model update.
Quantitative evals solve this by replacing your impression with a number. A number can be compared. A number can regress. A number can gate a deploy.
Building a Test Set
A test set is a collection of inputs with known-good expected outputs. Fifty to a hundred examples is enough to start. Curate them from real user queries where possible, adding edge cases and known failure modes you have observed. Store them in a JSON file so they are version-controlled alongside your code.
For each example, record at minimum: the input, the expected output or expected properties of the output, and a label for the type of case (happy path, edge case, adversarial). This lets you slice your eval results by category and see exactly where a change helped or hurt.
Choosing the Right Metrics
Not every task needs the same metric. Pick the one that matches your feature's contract with the user.
Exact match works for classification tasks: sentiment, category assignment, yes/no decisions. Semantic similarity (using sentence embeddings) works for generation tasks where the exact wording can vary but the meaning should be consistent. Format compliance is a binary check that every feature should include: did the model return the expected schema or structure?
Prompt
"You are an evaluator. Given the following expected output and actual output from an AI system, score the actual output from 0 to 1 on faithfulness (does it contain only information present in the expected output?) and completeness (does it cover all key points from the expected output?). Return JSON with keys: faithfulness_score, completeness_score, and reasoning."
Running Evals Before Every Deploy
An eval that runs once is an audit. An eval that runs on every deploy is a guardrail. Wire your eval suite into CI so it runs automatically on every pull request.
import json
import anthropic
client = anthropic.Anthropic()
def run_evals(test_set_path: str, threshold: float = 0.85) -> dict:
with open(test_set_path) as f:
tests = json.load(f)
results = []
for test in tests:
response = client.messages.create(
model="claude-opus-4-5",
max_tokens=512,
messages=[{"role": "user", "content": test["input"]}]
)
actual = response.content[0].text
score = score_response(actual, test["expected"])
results.append({"id": test["id"], "score": score, "passed": score >= threshold})
pass_rate = sum(r["passed"] for r in results) / len(results)
return {"pass_rate": pass_rate, "passed": pass_rate >= threshold, "results": results}
if __name__ == "__main__":
report = run_evals("tests/eval_set.json")
print(f"Pass rate: {report['pass_rate']:.1%}")
if not report["passed"]:
raise SystemExit("Evals failed. Deploy blocked.")
Set a pass rate threshold that matches your risk tolerance. Start at 80 percent and raise it as your test set matures. A failing eval run should block the deploy, just like a failing unit test.
Want to build this live with Aki?
Join a Lightning Lesson and go deeper on this topic. Browse upcoming sessions →
Aki Wijesundara
Expert team of AI professionals and career advisors with experience at top tech companies. We've helped 500+ students land internships at Google, Meta, OpenAI, and other leading AI companies.
Ready to Launch Your AI Career?
Join our comprehensive program and get personalized guidance from industry experts who've been where you want to go.
Table of Contents
Share Article
Get Weekly AI Career Tips
Join 5,000+ professionals getting actionable career advice in their inbox.
No spam. Unsubscribe anytime.