How to brief an engineer to build the evals you need

At some point the evaluation you want will outgrow what you can maintain by hand, and you will hand the Codify and Enforce work to an engineer. This lesson is about doing that handoff well, because a vague ask produces useless metrics and a good one produces evaluation you can trust.

Handoffs fail because people ask for the wrong thing. They say "give me some quality metrics," and they get generic scores (coherence, fluency, similarity) that sound reasonable and tell you nothing about whether your product is doing its actual job. The fix is to bring the raw material you already have from TRACE. Hand the engineer your traces and your must-pass checklist: the real failures and the specific cases, not a request for abstract metrics. You are giving them the ground truth of what "good" and "bad" mean for your product, which is the one thing they cannot generate themselves.

Ask for two specific things. First, checks that are binary and tied to your real failure modes: pass/fail on the actual problems you found, not scores on generic qualities. Second, if any check uses an AI to judge outputs, ask the engineer to validate that judge against human judgment, that is, to confirm on a sample that the AI judge agrees with what you, the human, would call pass or fail. An unvalidated AI judge is just another vibe check wearing a lab coat, and asking for validation is how you tell a rigorous setup from a sloppy one.
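To make the validation request concrete, here is a minimal sketch of what "check the judge against a human" can look like. It assumes you have labeled a small sample of traces yourself and the engineer has run the AI judge on the same traces. The names `LabeledCase` and `judge_agreement` and the sample data are hypothetical, made up for illustration, not part of any particular tool.

```python
from dataclasses import dataclass


@dataclass
class LabeledCase:
    """One trace labeled twice: once by a human, once by the AI judge."""
    case_id: str
    human_pass: bool  # what the human reviewer decided
    judge_pass: bool  # what the AI judge decided


def judge_agreement(cases):
    """Compare AI-judge verdicts with human verdicts on the same cases."""
    if not cases:
        raise ValueError("need at least one labeled case")
    agree = sum(c.human_pass == c.judge_pass for c in cases)
    # Failures the human caught but the judge waved through.
    missed = [c.case_id for c in cases if not c.human_pass and c.judge_pass]
    # Cases the human passed but the judge flagged.
    false_alarms = [c.case_id for c in cases if c.human_pass and not c.judge_pass]
    return {
        "agreement": agree / len(cases),
        "missed_failures": missed,
        "false_alarms": false_alarms,
    }


# Illustrative sample only; real labels come from your own trace review.
sample = [
    LabeledCase("t1", human_pass=True, judge_pass=True),
    LabeledCase("t2", human_pass=False, judge_pass=False),
    LabeledCase("t3", human_pass=False, judge_pass=True),
    LabeledCase("t4", human_pass=True, judge_pass=True),
]

report = judge_agreement(sample)
print(report)
```

The list of disagreements is usually more useful than the single agreement number: the missed failures are the cases to read through with the engineer before trusting the judge.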

Finally, be clear about what you should get back: not a number in isolation, but the ability to see a specific failure's rate go down after a fix. The proof that evaluation is working is a chart where the thing you were failing at happens less often over time. If you brief an engineer with your real failures and your cases, and ask for binary, validated checks that show improvement, you will get evaluation you can actually run your product on, which is the entire point.
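As a rough illustration of "one failure's rate going down," the sketch below runs a single binary check over the outputs collected before and after a fix. The failure mode (answers that omit a source citation), the `cites_a_source` check, and the sample outputs are all assumptions invented for this example; your own checks would come from your must-pass checklist.

```python
def cites_a_source(output: str) -> bool:
    """Hypothetical binary check: pass only if the answer cites a source."""
    return "[source:" in output


def failure_rate(outputs, check):
    """Fraction of outputs that fail a single pass/fail check."""
    if not outputs:
        raise ValueError("need at least one output")
    failures = sum(1 for o in outputs if not check(o))
    return failures / len(outputs)


# Illustrative outputs only; in practice these come from your traces.
before_fix = [
    "Refunds take 5 days.",
    "Refunds take 5 days. [source: policy.md]",
    "Contact support.",
    "See the FAQ. [source: faq.md]",
]
after_fix = [
    "Refunds take 5 days. [source: policy.md]",
    "Refunds take 5 days. [source: policy.md]",
    "Contact support.",
    "See the FAQ. [source: faq.md]",
]

rate_before = failure_rate(before_fix, cites_a_source)
rate_after = failure_rate(after_fix, cites_a_source)
print(f"missing-citation failure rate: {rate_before:.0%} -> {rate_after:.0%}")
```

Tracking each failure mode separately, rather than folding everything into one overall score, is what lets you see whether a particular fix actually worked.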

Go deeper (optional)

Aligning AI judges with human judgment (validating the validators)