Evaluate Your RAG: The Three Metrics That Matter

The Three Metrics That Define RAG Quality

Most teams building RAG systems judge quality by eye: does the answer look right? That is not a measurement strategy. To systematically improve your pipeline, you need to measure three specific things: context precision, context recall, and answer faithfulness.

Context precision asks: of the chunks your retrieval returned, what fraction were actually relevant to the question? A system that retrieves 10 chunks but only 3 are relevant has low precision. Context recall asks the inverse: of all the relevant information in your corpus, how much did your retrieval surface? A system that consistently misses the most important chunk has low recall. Answer faithfulness asks: does the final answer stick to what the retrieved context says, or does the model add information from its pre-training memory?

Measuring Context Precision and Recall

To measure precision and recall, you ideally need a labeled dataset: a set of questions, the correct answers, and the specific chunks that should be retrieved for each question. You run your pipeline on each question, compare the retrieved chunks to the gold standard, and compute the overlap.

Building this dataset is the hard part. Expect to spend several hours annotating 50 to 100 examples before you have a reliable benchmark. The effort is worth it: a labeled evaluation set lets you measure the impact of every change you make to chunking, embeddings, top-k, or prompts. Without it, you are flying blind and can easily regress quality while thinking you are improving it.

Faithfulness: Does the Answer Stay in the Context

Faithfulness is the metric that catches hallucinations at the generation step. For each claim in the generated answer, a faithfulness scorer checks whether that claim is supported by the retrieved context. A faithful answer only makes claims traceable back to a retrieved chunk. An unfaithful answer adds information the retrieval did not provide.

Low faithfulness is usually a prompt engineering problem. If your system prompt does not explicitly instruct the model to answer only from context, the model blends retrieved knowledge with parametric memory. Fix this before tuning your retrieval: a faithfulness score below 0.8 typically means your prompt needs work, not your index.

Running Automated Evals with RAGAS

RAGAS (Retrieval Augmented Generation Assessment) automates the evaluation process using an LLM as a judge. You provide a dataset of questions, ground-truth answers, retrieved contexts, and generated answers. RAGAS scores all four metrics automatically without requiring manual annotation for every example.

pip install ragas datasets

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_relevancy,
)

# Build your evaluation dataset
data = {
    "question": [
        "What is the cancellation policy?",
        "How do I reset my password?",
    ],
    "answer": [
        "You can cancel within 30 days for a full refund.",
        "Click Forgot Password on the login screen.",
    ],
    "contexts": [
        ["Our policy allows cancellations within 30 days of purchase for a full refund..."],
        ["To reset your password, go to the login page and click Forgot Password..."],
    ],
    "ground_truth": [
        "Cancellations are accepted within 30 days for a full refund.",
        "Use the Forgot Password link on the login screen.",
    ],
}

dataset = Dataset.from_dict(data)
results = evaluate(
    dataset,
    metrics=[context_precision, context_recall, faithfulness, answer_relevancy],
)

df = results.to_pandas()
print(df[["context_precision", "context_recall", "faithfulness", "answer_relevancy"]])
print("\nMean scores:")
print(df[["context_precision", "context_recall", "faithfulness", "answer_relevancy"]].mean())

Run this evaluation before and after every significant change to your pipeline. A change that improves faithfulness but drops recall often means you tightened the prompt too aggressively. Track all four metrics together to catch these trade-offs early. Store your evaluation runs in a spreadsheet or experiment tracker so you can see trends across multiple iterations.

Prompt

"I want to build a lightweight eval harness for my RAG system without using RAGAS. Help me write an LLM-as-judge prompt that scores a retrieved context and a generated answer on faithfulness from 1 to 5, and explain how to aggregate those scores into a single pipeline quality metric I can track over time."

Want to build this live with Aki?

Join a Lightning Lesson and go deeper on this topic. Browse upcoming sessions →

Evaluate Your RAG: The Three Metrics That Matter

Key Takeaways

The Three Metrics That Define RAG Quality

Measuring Context Precision and Recall

Faithfulness: Does the Answer Stay in the Context

Running Automated Evals with RAGAS

Want to build this live with Aki?

Aki Wijesundara

Ready to Launch Your AI Career?

Table of Contents

Share Article

Get Weekly AI Career Tips