Trace and Read: error analysis
Trace means capturing a full record of what your system actually did, from the input through every intermediate step to the output. Read means going through those traces by hand and journaling what went wrong in open-ended notes, as in qualitative research.
Even as an engineer, do not skip straight to metrics: reading real traces shows you the failures that are actually happening rather than the ones you imagined. This is where you find the problems worth building evals for.
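As a rough illustration of the trace-then-read loop described above, here is a minimal Python sketch. Everything in it is an assumption made for the example, not a prescribed tool or schema: the `Trace` class, the `log_step`, `render`, and `journal` helpers, the step names, and the sample run are all hypothetical. It only shows the shape of the idea: record the input, each step, and the output, then print the record in a readable form and attach a free-text note to it.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Trace:
    """Hypothetical record of one run: input, every step, and output."""
    trace_id: str
    user_input: str
    steps: list[dict[str, Any]] = field(default_factory=list)
    output: Optional[str] = None

    def log_step(self, name: str, **details: Any) -> None:
        # Append one intermediate step with whatever details were observed.
        self.steps.append({"name": name, **details})


def render(trace: Trace) -> str:
    """Format a trace for reading by hand, one step per line."""
    lines = [f"[{trace.trace_id}] input: {trace.user_input}"]
    for i, step in enumerate(trace.steps, start=1):
        details = {k: v for k, v in step.items() if k != "name"}
        lines.append(f"  step {i} {step['name']}: {json.dumps(details)}")
    lines.append(f"  output: {trace.output}")
    return "\n".join(lines)


def journal(notes: dict[str, str], trace: Trace, note: str) -> None:
    """Attach an open-ended observation to a trace (no fixed categories)."""
    notes[trace.trace_id] = note


# Made-up example run, for illustration only.
trace = Trace(trace_id="t-001", user_input="Cancel my order #1234")
trace.log_step("retrieve", query="cancel order", docs=["refund_policy.md"])
trace.log_step("generate", prompt_tokens=212)
trace.output = "Here is our refund policy."

notes: dict[str, str] = {}
print(render(trace))
journal(notes, trace, "Retrieved refund policy instead of cancellation steps.")
```

The notes are deliberately free text rather than a fixed set of labels, since the point of reading is to discover which failures occur before deciding what to measure.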