Evals for PMs: How to Know If Your AI Feature Works

You shipped an AI feature. The demo went well, the team liked it in review, and users have started using it. But do you actually know if it is working? Not "working" in the sense of not throwing errors. Working in the sense of producing correct, useful, appropriately scoped output for the variety of real inputs it now receives. Most PMs cannot answer that question. Evals are how you change that.

What PMs Get Wrong About AI Quality

The most common mistake is treating AI quality like a launch question rather than an ongoing signal. You run the feature against a handful of examples before launch, it looks good, and you move on. But AI features degrade for reasons that traditional software rarely faces: prompt changes, model updates, new user inputs that fall outside what you tested, and retrieval data that drifts over time. Quality is not a property you verify once. It is a metric you monitor continuously.

The second mistake is delegating quality entirely to engineering. Engineers are essential for building evals, but they cannot define what "good" means for your feature. That is a product decision. A PM who says "write evals for the summarizer" without defining what the summarizer should and should not do has handed engineering an impossible task.

Writing Testable Acceptance Criteria

The key shift is moving from vague criteria to verifiable ones. Instead of "the response should be helpful," write criteria that can be checked by code or by a second LLM call acting as a grader.

Prompt

"Convert the following acceptance criteria for an AI summarizer into specific, measurable eval checks. For each criterion, describe: (1) what to measure, (2) how to measure it automatically, and (3) what threshold indicates passing. Criterion: The summary should always cite a specific section of the source document."
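One way the answer to that prompt could look once it becomes code is sketched below. It rests on several assumptions that are not part of the criterion itself: that citations appear in the text as "Section 2.1", that the source document's section numbers are available as a set, and that 95 percent is the passing threshold. All names here are hypothetical.

```python
import re

# Assumed citation format: "Section 3" or "Section 2.1" (case-insensitive).
SECTION_REF = re.compile(r"\bSection\s+(\d+(?:\.\d+)*)", re.IGNORECASE)

# Assumed threshold; the real number is a product decision.
PASS_THRESHOLD = 0.95


def cites_valid_section(summary: str, source_sections: set[str]) -> bool:
    """Pass if the summary names at least one section that exists in the source."""
    cited = SECTION_REF.findall(summary)
    return any(ref in source_sections for ref in cited)


def pass_rate(summaries: list[str], source_sections: set[str]) -> float:
    """Fraction of summaries that pass the citation check."""
    passed = sum(cites_valid_section(s, source_sections) for s in summaries)
    return passed / len(summaries)


# Made-up example data for illustration only.
sections = {"1", "2", "2.1", "3"}
summaries = [
    "Revenue grew 12 percent (Section 2.1).",  # cites a real section
    "The report is broadly positive.",         # cites nothing
    "Risks are covered in section 7.",         # cites a section that does not exist
]

rate = pass_rate(summaries, sections)
meets_threshold = rate >= PASS_THRESHOLD
```

The third summary is the interesting case: it looks like it cites a section, but the section is not in the source, so checking for the presence of a citation alone would have let it through.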

Good testable criteria follow a pattern: the output must always or never do something, must contain or not contain something, or must stay within a measurable limit. Examples: "The response must not include any HTML tags." "The recommended product must be present in the current catalog." "The response must be fewer than 150 words." These are criteria your engineering team can turn directly into automated checks.
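To show how directly those three example criteria translate, here is a minimal sketch of each as a Python function. The function names, the simple tag pattern, and the idea that the catalog is a set of product names are assumptions for illustration; a real implementation would depend on how your feature returns its output.

```python
import re

# Rough pattern for an opening or closing HTML tag such as <b> or </div>.
HTML_TAG = re.compile(r"</?[a-zA-Z][^>]*>")


def no_html_tags(response: str) -> bool:
    """'The response must not include any HTML tags.'"""
    return HTML_TAG.search(response) is None


def product_in_catalog(recommended_product: str, catalog: set[str]) -> bool:
    """'The recommended product must be present in the current catalog.'"""
    return recommended_product in catalog


def under_word_limit(response: str, limit: int = 150) -> bool:
    """'The response must be fewer than 150 words.'"""
    return len(response.split()) < limit
```

Each function returns a plain pass or fail, which is what makes it possible to count results across many test cases later.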

Working With Engineering to Build a Simple Eval Suite

Once you have testable criteria, bring three things to your engineering partner: a list of 20 to 50 representative inputs (real user queries are best), the testable criteria from above, and clear examples of passing and failing output for each criterion. This is everything an engineer needs to write the eval functions.

Ask for a simple dashboard or weekly report rather than a complex system. The minimum viable eval suite is a script that runs your test cases, checks each criterion, and outputs a pass rate per criterion. A spreadsheet is fine. Complexity can come later.
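As a rough sketch of what that minimum viable script might look like, the Python below runs every test input through the feature, applies each check, and reports a pass rate per criterion. The `fake_summarizer` function is a placeholder standing in for the real feature, and the check names and test inputs are made up for illustration.

```python
from typing import Callable

Check = Callable[[str], bool]


def run_eval_suite(
    test_inputs: list[str],
    generate: Callable[[str], str],
    checks: dict[str, Check],
) -> dict[str, float]:
    """Run every input through the feature and return a pass rate per criterion."""
    passes = {name: 0 for name in checks}
    for user_input in test_inputs:
        output = generate(user_input)
        for name, check in checks.items():
            if check(output):
                passes[name] += 1
    total = len(test_inputs)
    return {name: count / total for name, count in passes.items()}


def fake_summarizer(text: str) -> str:
    # Placeholder for the real AI feature; it simply echoes the start of the input.
    return f"Summary: {text[:40]}"


# Hypothetical criteria, expressed as functions from output text to pass/fail.
checks: dict[str, Check] = {
    "under_150_words": lambda out: len(out.split()) < 150,
    "no_html_tags": lambda out: "<" not in out and ">" not in out,
}

inputs = ["Quarterly report text", "Meeting notes <b>draft</b>"]
report = run_eval_suite(inputs, fake_summarizer, checks)

for name, rate in report.items():
    print(f"{name}: {rate:.0%}")
```

Writing the resulting dictionary to a CSV file each week would be enough to feed the spreadsheet.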

Reading Eval Results in Product Reviews

When eval results come back, look for three things: overall pass rate trend (is quality going up or down over time?), criterion-level breakdown (which specific checks are failing?), and failure clustering (do failures share a pattern, such as a specific input type or topic?). A drop in pass rate on "response cites a source" tells you something specific went wrong and points directly to where to investigate.
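Those three views can all be computed from the same table of results. The sketch below assumes each row records the week, the criterion, an input type label, and whether the check passed. The field names and the data are invented for illustration and are not real results.

```python
from collections import Counter

# Hypothetical result rows: one per (test case, criterion) pair.
results = [
    {"week": 1, "criterion": "cites_source", "input_type": "single_doc", "passed": True},
    {"week": 1, "criterion": "cites_source", "input_type": "multi_doc", "passed": False},
    {"week": 1, "criterion": "under_150_words", "input_type": "single_doc", "passed": True},
    {"week": 1, "criterion": "under_150_words", "input_type": "multi_doc", "passed": True},
    {"week": 2, "criterion": "cites_source", "input_type": "single_doc", "passed": True},
    {"week": 2, "criterion": "cites_source", "input_type": "multi_doc", "passed": False},
    {"week": 2, "criterion": "under_150_words", "input_type": "single_doc", "passed": True},
    {"week": 2, "criterion": "under_150_words", "input_type": "multi_doc", "passed": False},
]


def pass_rate_by(rows: list[dict], key: str) -> dict:
    """Group rows by the given field and return the pass rate for each group."""
    totals, passes = Counter(), Counter()
    for row in rows:
        totals[row[key]] += 1
        passes[row[key]] += row["passed"]  # True counts as 1, False as 0
    return {group: passes[group] / totals[group] for group in totals}


def failure_clusters(rows: list[dict], key: str) -> list[tuple]:
    """Count failures per group, most frequent first."""
    return Counter(row[key] for row in rows if not row["passed"]).most_common()


trend = pass_rate_by(results, "week")            # overall pass rate over time
breakdown = pass_rate_by(results, "criterion")   # which checks are failing
clusters = failure_clusters(results, "input_type")  # where failures concentrate
```

In this invented data set every failure comes from multi-document inputs, which is the kind of pattern that tells you where to investigate first.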

In sprint reviews, replace "the AI seems to be working well" with "our summarizer pass rate is 91 percent this week, up from 87 percent after last week's prompt change, with the remaining failures concentrated on multi-document inputs." That is a product team that knows what it shipped.

Prompt

"Given the following eval results from our AI feature, write a two-paragraph summary for a product review. Highlight: the overall pass rate and whether it improved or regressed since last week, the top failing criterion and what it suggests about user experience, and one recommended action for the next sprint."



Aki Wijesundara
