Re-Ranking: The RAG Quality Boost You're Missing
First-pass vector retrieval is fast but imprecise. Adding a cross-encoder re-ranker as a second stage dramatically improves the quality of context your LLM receives without touching your index.
Why First-Pass Retrieval Falls Short
Your vector search returns the top-k chunks based on embedding similarity. But embedding similarity is a blunt instrument. It measures whether two pieces of text are roughly in the same neighborhood of semantic space, not whether one actually answers the other. You will often see chunks retrieved that are topically related but do not contain the specific answer the user needs.
This is where re-ranking comes in. After your initial retrieval, you run a second, more expensive model to score each candidate chunk against the query. You then reorder the results by this new score before passing them to the LLM. The result is a significant improvement in the quality of context your model receives, without touching your index or changing your embeddings.
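The two-stage flow described above can be sketched as a single function. This is a minimal illustration, not a reference implementation: `retrieve_then_rerank`, `toy_retrieve`, and `toy_score` are hypothetical names, and the word-overlap scorer only stands in for a real re-ranking model so the example runs without a vector store.

```python
from typing import Callable, List, Sequence, Tuple


def retrieve_then_rerank(
    query: str,
    retrieve: Callable[[str, int], List[str]],
    score_pairs: Callable[[Sequence[Tuple[str, str]]], Sequence[float]],
    first_pass_k: int = 30,
    final_n: int = 5,
) -> List[str]:
    """Fetch first_pass_k candidates, rescore them, return the best final_n."""
    candidates = retrieve(query, first_pass_k)
    if not candidates:
        return []
    scores = score_pairs([(query, chunk) for chunk in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:final_n]]


# Toy stand-ins (assumptions) so the sketch runs without an index or a model.
def toy_retrieve(query: str, k: int) -> List[str]:
    corpus = [
        "OAuth2 is an authorization framework.",
        "Set client_id and client_secret in config.yaml to enable OAuth2.",
        "The API supports pagination via cursors.",
    ]
    return corpus[:k]


def toy_score(pairs: Sequence[Tuple[str, str]]) -> List[float]:
    # Counts shared words; a real second stage would call a cross-encoder here.
    return [
        float(len(set(q.lower().split()) & set(d.lower().split())))
        for q, d in pairs
    ]


top = retrieve_then_rerank(
    "configure OAuth2 in config.yaml", toy_retrieve, toy_score, final_n=1
)
print(top)
```

Passing the retriever and the scorer in as callables keeps the index untouched: swapping the re-ranker changes only the `score_pairs` argument.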
Bi-Encoders vs Cross-Encoders: The Core Trade-off
Your embedding model is a bi-encoder. It encodes the query and each document independently, then compares the resulting vectors. This is fast because you precompute document vectors at index time. The weakness is that the query and document never interact during encoding, so the model cannot attend to fine-grained token-level relationships between them.
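To make the bi-encoder flow concrete, here is a small sketch under loud assumptions: `toy_embed` is a hypothetical stand-in that counts words from a five-term vocabulary, not a real embedding model. The point is only the shape of the computation: documents are encoded once at index time, and the query meets them only as a finished vector.

```python
import math
from typing import List

# Toy vocabulary (assumption); a real model learns dense vectors instead.
VOCAB = ["oauth2", "configure", "token", "pagination", "cursor"]


def toy_embed(text: str) -> List[float]:
    """Hypothetical stand-in for an embedding model: vocabulary word counts."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Index time: every document is encoded once, without seeing any query.
documents = ["configure oauth2 token", "pagination cursor", "oauth2 token"]
doc_vectors = [toy_embed(doc) for doc in documents]


def search(query: str, k: int = 2) -> List[str]:
    # Query time: one encoding of the query, then cheap vector comparisons.
    q = toy_embed(query)
    order = sorted(
        range(len(documents)),
        key=lambda i: cosine(q, doc_vectors[i]),
        reverse=True,
    )
    return [documents[i] for i in order[:k]]


results = search("configure oauth2")
print(results)
```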
A cross-encoder processes the query and document together as a single input, allowing it to attend to every token in both. This typically produces more accurate relevance scores than comparing independently computed vectors. The catch: you cannot precompute anything. Every query requires a fresh forward pass over each candidate document. That is why you use cross-encoders only on a small shortlist from your initial retrieval, typically 20 to 50 candidates, not over your entire corpus.
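A rough back-of-the-envelope sketch shows why the shortlist matters. The batch size of 16 and the one-million-chunk corpus are illustrative assumptions, and `cross_encoder_batches` is a hypothetical helper, not part of any library.

```python
import math


def cross_encoder_batches(num_candidates: int, batch_size: int = 16) -> int:
    """Batched forward passes needed to score one query against its candidates."""
    return math.ceil(num_candidates / batch_size)


# A 50-candidate shortlist versus scoring a hypothetical 1,000,000-chunk corpus.
shortlist_batches = cross_encoder_batches(50)
full_corpus_batches = cross_encoder_batches(1_000_000)
print(shortlist_batches, full_corpus_batches)
```

Because none of this work can be cached across queries, the cost on the right is paid again for every query, which is what rules out running the cross-encoder over the whole corpus.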
Adding a Re-Ranker to Your Pipeline
The two most practical options are the Cohere Rerank API (hosted, no GPU required) and the sentence-transformers cross-encoder models (self-hosted, free). Start with Cohere if you want something running in minutes. Switch to a local cross-encoder if you need lower latency or want to keep data on-premises. The code below shows both patterns side by side.
pip install cohere sentence-transformers
import cohere
from sentence_transformers import CrossEncoder

query = "How do I configure OAuth2 in the API?"
# initial_chunks: list of strings from your first-pass vector retrieval

# Option A: Cohere re-ranker (hosted)
co = cohere.Client("YOUR_COHERE_API_KEY")
cohere_result = co.rerank(
    query=query,
    documents=initial_chunks,
    top_n=5,
    model="rerank-english-v3.0",
)
# Each result carries the index of the chunk in the original list, so look the
# text up there instead of relying on the response echoing documents back.
reranked_cohere = [initial_chunks[r.index] for r in cohere_result.results]

# Option B: Local cross-encoder (sentence-transformers)
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [[query, chunk] for chunk in initial_chunks]
scores = cross_encoder.predict(pairs)

# Sort by score descending (on the score only), keep top 5
scored = sorted(zip(scores, initial_chunks), key=lambda pair: pair[0], reverse=True)
reranked_local = [chunk for _, chunk in scored[:5]]
print("Top re-ranked result: {}".format(reranked_local[0][:300]))
In practice, re-ranking typically adds 50 to 200 milliseconds of latency, depending on whether you use a hosted or local model and how many candidates you score. For most applications that is a worthwhile trade for the improvement in answer quality. Keep in mind that the LLM needs the re-ranked chunks before it can start generating, so the re-ranker sits on the critical path; if your architecture supports streaming, streaming the LLM's output helps offset the added wait.
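Latency figures like these depend heavily on your own setup, so it is worth timing the re-rank step directly. The sketch below is one simple way to do that; `median_latency_ms` and `toy_rerank` are hypothetical names, and the placeholder re-ranker should be replaced with your real Cohere or cross-encoder call.

```python
import time
from statistics import median
from typing import Callable, List, Sequence


def median_latency_ms(
    rerank: Callable[[str, Sequence[str]], List[str]],
    query: str,
    chunks: Sequence[str],
    runs: int = 5,
) -> float:
    """Median wall-clock time of the re-rank call, in milliseconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        rerank(query, chunks)
        timings.append((time.perf_counter() - start) * 1000.0)
    return median(timings)


def toy_rerank(query: str, chunks: Sequence[str]) -> List[str]:
    # Placeholder (assumption): orders chunks by length; swap in a real re-ranker.
    return sorted(chunks, key=len, reverse=True)[:5]


latency = median_latency_ms(
    toy_rerank, "How do I configure OAuth2?", ["chunk a", "a longer chunk b"] * 20
)
print(f"median re-rank latency: {latency:.3f} ms")
```

The median is used rather than the mean so that a single slow first call, such as model warm-up or a cold network connection, does not dominate the number.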
When to Re-Rank (and When to Skip It)
Re-ranking helps most when your queries are specific and your corpus is large. If users ask precise factual questions, a cross-encoder reliably surfaces the right paragraph. If your corpus is small (under a few thousand chunks) or your queries are broad and exploratory, the initial retrieval is often sufficient.
The best way to decide is to measure. Build a small evaluation set of 50 to 100 query-answer pairs and compare retrieval precision with and without re-ranking. Run this experiment before investing in infrastructure: the improvement is usually obvious, but sometimes initial retrieval is already good enough and the added latency is not justified.
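A comparison like this can start very small. The sketch below uses hit rate at k (did the known-relevant chunk appear in the top k?) as a simple stand-in for retrieval precision; the chunk ids, the two-query evaluation set, and the function name `hit_rate_at_k` are all made up for illustration.

```python
from typing import Dict, List


def hit_rate_at_k(
    results: Dict[str, List[str]], relevant: Dict[str, str], k: int = 5
) -> float:
    """Fraction of queries whose known-relevant chunk id appears in the top k."""
    if not results:
        return 0.0
    hits = sum(1 for q, ranked in results.items() if relevant[q] in ranked[:k])
    return hits / len(results)


# Hypothetical evaluation data: query id -> id of the chunk that answers it.
relevant = {"q1": "c7", "q2": "c3"}

# Ranked chunk ids per query, before and after re-ranking (made-up values).
baseline = {"q1": ["c1", "c2", "c7"], "q2": ["c3", "c9", "c4"]}
reranked = {"q1": ["c7", "c1", "c2"], "q2": ["c3", "c4", "c9"]}

baseline_score = hit_rate_at_k(baseline, relevant, k=1)
reranked_score = hit_rate_at_k(reranked, relevant, k=1)
print(f"hit rate@1 baseline={baseline_score:.2f} reranked={reranked_score:.2f}")
```

With a real evaluation set of 50 to 100 queries, the same function can be run at several values of k to see whether re-ranking moves the right chunk into the positions the LLM actually receives.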
Prompt
"My RAG system retrieves topically relevant chunks but misses the specific paragraph that actually answers the user's question. Walk me through how to add a cross-encoder re-ranker as a second stage, what top-k values to use for initial retrieval vs the re-ranker shortlist, and how to measure whether it actually helps."
Aki Wijesundara