The vibe-check trap
"It works on the examples I tried" is not evaluation, and for a system with non-deterministic components it is dangerous. You cannot improve what you cannot measure, and you cannot ship changes with confidence if a tweak might silently break something you never re-tested.
Success means iterating quickly and with confidence, and that requires real evals. The vibe check is the trap the whole TRACE method escapes.
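To make "real evals" concrete, here is a minimal sketch of a regression check, not the TRACE method's own tooling. Every name in it (`CASES`, `pass_rate`, `regressions`, and the two stand-in systems) is a hypothetical chosen for illustration. It assumes each case can be scored by a simple pass/fail check, and it runs each case several times because a non-deterministic system may pass once and fail the next time.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical eval cases: an input plus a pass/fail check on the output.
Case = Tuple[str, Callable[[str], bool]]
CASES: List[Case] = [
    ("2 + 2", lambda out: "4" in out),
    ("capital of France", lambda out: "paris" in out.lower()),
]


def pass_rate(system: Callable[[str], str], cases: List[Case], trials: int = 5) -> Dict[str, float]:
    """Run every case `trials` times and return the pass rate per input."""
    rates = {}
    for prompt, check in cases:
        passed = sum(1 for _ in range(trials) if check(system(prompt)))
        rates[prompt] = passed / trials
    return rates


def regressions(before: Dict[str, float], after: Dict[str, float], tolerance: float = 0.0) -> List[str]:
    """Return the inputs whose pass rate dropped by more than `tolerance`."""
    return [k for k in before if after.get(k, 0.0) < before[k] - tolerance]


# Stand-ins for the system before and after a change (assumptions, not real models).
def baseline(prompt: str) -> str:
    return {"2 + 2": "4", "capital of France": "Paris"}.get(prompt, "")


def tweaked(prompt: str) -> str:
    # A tweak that silently breaks a case nobody re-tested by hand.
    return {"2 + 2": "4", "capital of France": "Lyon"}.get(prompt, "")


before = pass_rate(baseline, CASES)
after = pass_rate(tweaked, CASES)
print(regressions(before, after))  # prints ['capital of France']
```

Re-running the same cases after every change is what turns "it works on the examples I tried" into a measurement: the broken case shows up in the output instead of being discovered later.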
Go deeper (optional)