Evals for PMs: How to Know If Your AI Feature Works

Product managers can define and interpret AI evals without writing a line of code, and doing so turns vague feature reviews into concrete quality decisions.

June 26, 2026
6 min read
Aki Wijesundara
#Evals #Product Management #AI Features

You shipped an AI feature. The demo went well, the team liked it in review, and users have started using it. But do you actually know if it is working? Not "working" in the sense of not throwing errors. Working in the sense of producing correct, useful, appropriately scoped output for the variety of real inputs it now receives. Most PMs cannot answer that question. Evals are how you change that.

What PMs Get Wrong About AI Quality

The most common mistake is treating AI quality as a launch question rather than an ongoing signal. You run the feature against a handful of examples before launch, it looks good, and you move on. But AI features degrade for reasons traditional software rarely faces: prompt changes, model updates, new user inputs that fall outside what you tested, and retrieval data that drifts over time. Quality is not a property you verify once. It is a metric you monitor continuously.

The second mistake is delegating quality entirely to engineering. Engineers are essential for building evals, but they cannot define what "good" means for your feature. That is a product decision. A PM who says "write evals for the summarizer" without defining what the summarizer should and should not do has handed engineering an impossible task.

Writing Testable Acceptance Criteria

The key shift is moving from vague criteria to verifiable ones. Instead of "the response should be helpful," write criteria that can be checked automatically or by a second LLM call.

Prompt

"Convert the following acceptance criteria for an AI summarizer into specific, measurable eval checks. For each criterion, describe: (1) what to measure, (2) how to measure it automatically, and (3) what threshold indicates passing. Criterion: The summary should always cite a specific section of the source document."

Good testable criteria follow a pattern: the output must always or never do something, must contain or not contain something, or must stay within a measurable limit. Examples: "The response must not include any HTML tags." "The recommended product must be present in the current catalog." "The response must be fewer than 150 words." These are criteria your engineering team can turn directly into automated checks.
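For a sense of what "turn directly into automated checks" can look like, here is a minimal Python sketch of the three example criteria. The function names, the sample catalog, and the tag-matching pattern are illustrative assumptions, not a prescribed implementation; a real check would read the live catalog and might use a proper HTML parser.

```python
import re

# Hypothetical catalog for illustration; a real check would query live product data.
CATALOG = {"trail runner 2", "city sneaker", "alpine boot"}

# Rough pattern for anything that looks like an opening or closing HTML tag.
HTML_TAG = re.compile(r"</?[a-zA-Z][^>]*>")


def has_no_html_tags(response: str) -> bool:
    """Pass when the response contains nothing that looks like an HTML tag."""
    return HTML_TAG.search(response) is None


def product_in_catalog(recommended_product: str, catalog: set = CATALOG) -> bool:
    """Pass when the recommended product exists in the catalog (case-insensitive)."""
    return recommended_product.strip().lower() in catalog


def under_word_limit(response: str, limit: int = 150) -> bool:
    """Pass when the response has fewer than `limit` whitespace-separated words."""
    return len(response.split()) < limit
```

Each function returns a plain pass or fail, which is what makes the criterion countable across many outputs.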

Working With Engineering to Build a Simple Eval Suite

Once you have testable criteria, bring three things to your engineering partner: a list of 20 to 50 representative inputs (real user queries are best), the testable criteria from above, and clear examples of passing and failing output for each criterion. This is everything an engineer needs to write the eval functions.
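One way the passing and failing examples get used is to confirm that an automated check agrees with your judgment before anyone trusts its numbers. The sketch below assumes a hypothetical `LabeledExample` record and a word-limit check; both names are made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class LabeledExample:
    output: str        # an output the feature produced (or could produce)
    should_pass: bool  # the PM's judgment for this criterion


def under_150_words(output: str) -> bool:
    """Hypothetical check for the criterion 'fewer than 150 words'."""
    return len(output.split()) < 150


def disagreements(check: Callable[[str], bool],
                  examples: List[LabeledExample]) -> List[LabeledExample]:
    """Return the examples where the automated check contradicts the PM's label."""
    return [ex for ex in examples if check(ex.output) != ex.should_pass]


examples = [
    LabeledExample("A short, focused summary.", should_pass=True),
    LabeledExample("word " * 200, should_pass=False),
]
mismatches = disagreements(under_150_words, examples)
print(f"{len(mismatches)} of {len(examples)} labeled examples disagree with the check")
```

Any mismatch means either the check is wrong or the criterion was ambiguous, and both are worth resolving before the suite runs on real traffic.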

Ask for a simple dashboard or weekly report rather than a complex system. The minimum viable eval suite is a script that runs your test cases, checks each criterion, and outputs a pass rate per criterion. A spreadsheet is fine. Complexity can come later.
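To show how small that minimum can be, here is a sketch of such a script. `fake_summarizer` is a stand-in for the real feature, and the three criteria are deliberately crude placeholder checks; every name here is an assumption for illustration.

```python
from typing import Callable, Dict, List

Check = Callable[[str], bool]


def fake_summarizer(text: str) -> str:
    """Stand-in for the real AI feature; replace with the actual call."""
    return "Summary (Section 2): " + text[:60]


# Placeholder criteria: each maps an output string to pass (True) or fail (False).
CRITERIA: Dict[str, Check] = {
    "cites_a_section": lambda out: "section" in out.lower(),
    "no_html_tags": lambda out: "<" not in out,
    "under_150_words": lambda out: len(out.split()) < 150,
}


def pass_rates(inputs: List[str], feature: Callable[[str], str],
               criteria: Dict[str, Check]) -> Dict[str, float]:
    """Run every input through the feature and report the pass rate per criterion."""
    if not inputs:
        raise ValueError("need at least one test input")
    outputs = [feature(text) for text in inputs]
    return {
        name: sum(check(out) for out in outputs) / len(outputs)
        for name, check in criteria.items()
    }


inputs = [
    "Quarterly revenue rose 4 percent.",
    "Use <b>bold</b> for the refund policy heading.",
]
rates = pass_rates(inputs, fake_summarizer, CRITERIA)
for name, rate in rates.items():
    print(f"{name}: {rate:.0%}")
```

The printed lines can be pasted into a spreadsheet as they are, one row per criterion per run.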

Reading Eval Results in Product Reviews

When eval results come back, look for three things: overall pass rate trend (is quality going up or down over time?), criterion-level breakdown (which specific checks are failing?), and failure clustering (do failures share a pattern, such as a specific input type or topic?). A drop in pass rate on "response cites a source" tells you something specific went wrong and points directly to where to investigate.
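The three views can all be computed from the same table of results. The sketch below uses made-up rows (week, criterion, input type, passed) purely for illustration; the column names and values are assumptions, not real data.

```python
from collections import Counter

# Made-up results for illustration: (week, criterion, input_type, passed)
rows = [
    (1, "cites_source", "single_doc", True),
    (1, "cites_source", "multi_doc", False),
    (1, "under_150_words", "single_doc", True),
    (1, "under_150_words", "multi_doc", False),
    (2, "cites_source", "single_doc", True),
    (2, "cites_source", "multi_doc", False),
    (2, "under_150_words", "single_doc", True),
    (2, "under_150_words", "multi_doc", True),
]


def pass_rate_by(rows, column):
    """Group rows by one column index and return the pass rate for each group."""
    groups = {}
    for row in rows:
        groups.setdefault(row[column], []).append(row[3])
    return {key: sum(passed) / len(passed) for key, passed in groups.items()}


# 1. Overall pass rate trend, week by week.
trend = pass_rate_by(rows, 0)

# 2. Criterion-level breakdown for the most recent week.
latest_week = max(row[0] for row in rows)
latest_rows = [row for row in rows if row[0] == latest_week]
breakdown = pass_rate_by(latest_rows, 1)

# 3. Failure clustering: which input types do the latest failures share?
failure_clusters = Counter(row[2] for row in latest_rows if not row[3])

print("trend:", trend)
print("breakdown:", breakdown)
print("failure clusters:", failure_clusters.most_common())
```

In this toy data, the latest failures all come from multi-document inputs, which is the kind of pattern that tells you where to look next.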

In sprint reviews, replace "the AI seems to be working well" with "our summarizer pass rate is 91 percent this week, up from 87 percent after last week's prompt change, with the remaining failures concentrated on multi-document inputs." That is a product team that knows what it shipped.

Prompt

"Given the following eval results from our AI feature, write a two-paragraph summary for a product review. Highlight: the overall pass rate and whether it improved or regressed since last week, the top failing criterion and what it suggests about user experience, and one recommended action for the next sprint."
