
The vibe-check trap

"It works on the examples I tried" is not evaluation, and for a system with non-deterministic components it is dangerous. You cannot improve what you cannot measure, and you cannot ship changes with confidence if a tweak might silently break something you never re-tested.

Success is iteration speed with confidence, and that requires real evals. Vibe-checking is the trap the whole TRACE method escapes.
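To make "real evals" concrete, here is a minimal sketch of a regression check: a fixed set of cases, re-run in full after every change, with each case tried several times because the system may be non-deterministic. Everything named here (`EvalCase`, `run_evals`, `regressions`, the substring check, and the two stand-in systems) is a hypothetical illustration, not part of the TRACE method itself.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    expected: str  # substring the answer must contain (an assumed, very simple check)


def run_evals(system: Callable[[str], str], cases: list[EvalCase],
              trials: int = 3) -> dict[str, float]:
    """Return the pass rate per prompt over repeated trials."""
    results = {}
    for case in cases:
        passes = sum(
            case.expected.lower() in system(case.prompt).lower()
            for _ in range(trials)
        )
        results[case.prompt] = passes / trials
    return results


def regressions(before: dict[str, float], after: dict[str, float],
                tolerance: float = 0.0) -> list[str]:
    """List prompts whose pass rate dropped by more than `tolerance`."""
    return [p for p in before if after.get(p, 0.0) < before[p] - tolerance]


# Hypothetical stand-ins for the system before and after a tweak.
def baseline_system(prompt: str) -> str:
    answers = {"capital of France?": "Paris", "2 + 2?": "4"}
    return answers.get(prompt, "")


def tweaked_system(prompt: str) -> str:
    # The tweak makes answers wordier and silently breaks the arithmetic case.
    answers = {"capital of France?": "The capital is Paris.", "2 + 2?": "four"}
    return answers.get(prompt, "")


CASES = [EvalCase("capital of France?", "Paris"), EvalCase("2 + 2?", "4")]

before = run_evals(baseline_system, CASES)
after = run_evals(tweaked_system, CASES)
broken = regressions(before, after)
print(broken)
```

A vibe check on the first prompt alone would have approved the tweak. Re-running the whole set is what surfaces the case that broke.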

Go deeper (optional)