Engineering

BM25 + Embeddings: The Hybrid Retrieval Pattern That Works

Pure semantic search misses exact keyword matches. Pure BM25 misses semantic similarity. Qdrant's native sparse-plus-dense hybrid search combines both in a single query, with no application-side merging required.

June 26, 2026
Aki Wijesundara
#BM25 #Hybrid Search #Retrieval

Sparse Vectors vs Dense Vectors

Dense vectors are what most people mean when they talk about embeddings. An embedding model, such as one from sentence-transformers, encodes a piece of text into a fixed-size vector of floating-point numbers, commonly between 384 and 1536 dimensions depending on the model, where most values are non-zero. These vectors capture semantic meaning: documents about similar topics cluster together in the vector space even when they use completely different words.
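To make "cluster together in the vector space" concrete, here is a minimal sketch of cosine similarity, the measure most dense retrieval setups use. The four-dimensional vectors are invented for illustration; real embeddings come from a model and have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (assumed values, not produced by any real model)
doc_cars = [0.9, 0.1, 0.0, 0.2]
doc_autos = [0.8, 0.2, 0.1, 0.3]
doc_cooking = [0.0, 0.9, 0.8, 0.1]

print(cosine_similarity(doc_cars, doc_autos))    # close to 1: similar topic
print(cosine_similarity(doc_cars, doc_cooking))  # close to 0: unrelated topic
```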

Sparse vectors are fundamentally different. In a sparse vector, most values are zero. Each non-zero dimension corresponds to a specific token in a vocabulary, and the value at that dimension represents the token's importance in the document. BM25 (Best Matching 25) is the classic ranking function behind these sparse representations. It scores each token based on its frequency within the document, the document length relative to the corpus average, and the inverse document frequency across the corpus (rare terms score higher than common ones).
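The three ingredients named above can be sketched in a few lines. This is a textbook-style BM25 weight calculation, not the code of any particular library; the function name, the whitespace tokenizer, and the parameter defaults k1=1.2 and b=0.75 are assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_weights(doc_tokens, corpus, k1=1.2, b=0.75):
    """Return {token: BM25 weight} for one document of a tokenized corpus."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    # Document length relative to the corpus average
    length_norm = 1 - b + b * len(doc_tokens) / avg_len
    weights = {}
    for token, freq in Counter(doc_tokens).items():
        # Number of documents containing the token
        df = sum(1 for d in corpus if token in d)
        # Inverse document frequency: rare tokens get a larger value
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # Saturating term frequency, dampened for long documents
        weights[token] = idf * freq * (k1 + 1) / (freq + k1 * length_norm)
    return weights

corpus = [
    "qdrant supports sparse vectors".split(),
    "dense vectors capture meaning".split(),
    "cve-2024-11234 affects sparse index".split(),
]
weights = bm25_weights(corpus[2], corpus)
for token, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{token:16s} {weight:.3f}")
```

In this toy corpus the identifier "cve-2024-11234" appears in one document while "sparse" appears in two, so the identifier receives the higher weight.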

The failure modes of each approach are complementary. Dense vectors excel at semantic matching but fail on exact term recall. A user searching for "CVE-2024-11234" or a specific API method name expects exact keyword recall. A semantic search model that has not seen those exact terms during training will return semantically adjacent but incorrect results. BM25 will not: it matches the exact token regardless of whether it appears in training data. Combining both in a single query gives you semantic breadth from the dense component and lexical precision from the sparse component.

Qdrant's Native Sparse Vector Support

Most hybrid search implementations run dense search and keyword search as separate queries, collect two result lists, and then merge them in application code using reciprocal rank fusion (RRF) or another rank aggregation method. This works but has real costs: two sequential (or parallel) API calls, application-side merge logic, and no integration with Qdrant's payload filtering at the fusion stage.
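For reference, the application-side merge described above can be sketched as follows. The function name and document IDs are hypothetical, and k=60 is the constant commonly used with RRF rather than a value taken from this article.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of IDs; each hit contributes 1 / (k + rank)."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

dense_hits = ["doc_a", "doc_b", "doc_c"]     # result of the dense query
keyword_hits = ["doc_c", "doc_a", "doc_d"]   # result of the keyword query

fused = reciprocal_rank_fusion([dense_hits, keyword_hits])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents that appear in both lists rise to the top, which is the behavior the merge step is after.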

Qdrant supports both sparse and dense vectors natively in the same collection and the same query. You define both vector spaces at collection creation time. Documents carry both a dense embedding and a sparse BM25 representation. Queries use both vector spaces simultaneously, with Qdrant running the merge at the database layer using its built-in RRF or Distribution-Based Score Fusion support.
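Unlike RRF, Distribution-Based Score Fusion works on scores rather than ranks. As we read Qdrant's documentation, it normalizes each result list using the mean plus or minus three standard deviations as bounds and then sums the normalized scores per point. The sketch below illustrates that idea only; the function names, the use of population standard deviation, and the handling of a zero-variance list are assumptions, not Qdrant's implementation.

```python
import statistics

def dbsf_normalize(scores):
    """Rescale {doc_id: score} so that mean - 3*std maps to 0 and mean + 3*std to 1."""
    values = list(scores.values())
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    low, high = mean - 3 * std, mean + 3 * std
    if high == low:  # all scores identical (assumed fallback)
        return {doc_id: 0.5 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}

def dbsf_fuse(result_sets):
    """Sum normalized scores across result sets and sort best-first."""
    fused = {}
    for scores in result_sets:
        for doc_id, s in dbsf_normalize(scores).items():
            fused[doc_id] = fused.get(doc_id, 0.0) + s
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

dense_scores = {"a": 0.9, "b": 0.8, "c": 0.3}     # cosine-like scale
sparse_scores = {"c": 12.0, "a": 7.0, "d": 2.0}   # BM25-like scale

for doc_id, score in dbsf_fuse([dense_scores, sparse_scores]):
    print(doc_id, round(score, 3))
```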

The sparse vector support in Qdrant is designed specifically for BM25-style vectors: high-dimensional (vocabulary-sized), almost entirely zero, stored in a compressed format that only records non-zero index-value pairs. The fastembed library from Qdrant provides a BM25 sparse encoder that produces vectors in exactly this format and integrates directly with Qdrant's sparse vector upsert API.
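The compressed index-value layout is easy to picture with a small sketch. The helper names and the eight-slot "vocabulary" are hypothetical; the point is only that zeros are never stored and that scoring touches just the overlapping indices.

```python
def to_sparse(dense_values):
    """Keep only the non-zero entries as parallel (indices, values) lists."""
    indices = [i for i, v in enumerate(dense_values) if v != 0.0]
    values = [dense_values[i] for i in indices]
    return indices, values

def sparse_dot(a, b):
    """Dot product of two sparse vectors; only shared indices contribute."""
    a_indices, a_values = a
    b_lookup = dict(zip(*b))
    return sum(v * b_lookup.get(i, 0.0) for i, v in zip(a_indices, a_values))

# A toy vocabulary of 8 tokens; real vocabularies are far larger
doc = to_sparse([0.0, 0.0, 1.7, 0.0, 0.0, 0.0, 0.4, 0.0])
query = to_sparse([0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0])

print(doc)                     # ([2, 6], [1.7, 0.4])
print(sparse_dot(doc, query))  # only index 2 is shared
```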

Indexing and Querying with Hybrid Vectors

Creating a hybrid collection requires defining both vector spaces separately at collection creation. Dense vectors go in vectors_config, sparse vectors go in sparse_vectors_config. At upsert time, each point carries both a dense vector and a sparse vector computed by the respective encoders.

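from qdrant_client import QdrantClient
from qdrant_client.models import (
    VectorParams, SparseVectorParams, Distance,
    PointStruct, SparseVector, Modifier
)
from fastembed import SparseTextEmbedding, TextEmbedding
import uuid

client = QdrantClient(host="localhost", port=6333)

client.create_collection(
    collection_name="hybrid_search",
    # Named dense vector space; 384 matches the all-MiniLM-L6-v2 output size
    vectors_config={
        "dense": VectorParams(size=384, distance=Distance.COSINE)
    },
    # Named sparse vector space; the IDF modifier lets Qdrant apply inverse
    # document frequency, which the Qdrant/bm25 encoder expects
    sparse_vectors_config={
        "sparse": SparseVectorParams(modifier=Modifier.IDF)
    }
)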

dense_model = TextEmbedding("sentence-transformers/all-MiniLM-L6-v2")
sparse_model = SparseTextEmbedding("Qdrant/bm25")

def index_documents(documents):
    # Encode every document once with each model
    texts = [doc["content"] for doc in documents]
    dense_embeddings = list(dense_model.embed(texts))
    sparse_embeddings = list(sparse_model.embed(texts))

    points = []
    for doc, dense, sparse in zip(documents, dense_embeddings, sparse_embeddings):
        points.append(PointStruct(
            id=str(uuid.uuid4()),  # random ID; re-indexing creates new points
            vector={
                "dense": dense.tolist(),
                "sparse": SparseVector(
                    indices=sparse.indices.tolist(),
                    values=sparse.values.tolist()
                )
            },
            payload={
                "title": doc["title"],
                "content": doc["content"]
            }
        ))

    client.upsert(collection_name="hybrid_search", points=points)
    print(f"Indexed {len(points)} documents with hybrid vectors")

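The fastembed BM25 encoder does not fit a vocabulary to your corpus. As its documentation describes it, the encoder tokenizes and stems the text, hashes each token to a sparse index, and emits term-frequency weights. The inverse document frequency part is expected to come from Qdrant, which is why the sparse vector space should be created with the IDF modifier (SparseVectorParams(modifier=Modifier.IDF)). For production use, keep the encoder settings (language, stopwords, and length parameters) identical between indexing and serving, and use embed() for documents and query_embed() for queries, so query-time sparse vectors use the same token space as the indexed document vectors. Mismatched settings or a missing IDF modifier will silently degrade recall in ways that are difficult to diagnose without explicit recall measurement.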
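Explicit recall measurement does not need much machinery. A minimal sketch, assuming you have a small set of queries with hand-labeled relevant document IDs (the function names and sample data are hypothetical):

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def mean_recall_at_k(runs, k):
    """Average recall@k over (retrieved_ids, relevant_ids) pairs."""
    return sum(recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)

# Each pair: IDs returned by the search system, IDs a human judged relevant
runs = [
    (["d1", "d7", "d3", "d9"], ["d1", "d3"]),
    (["d4", "d2", "d8", "d5"], ["d5", "d6"]),
]
print(mean_recall_at_k(runs, k=3))  # 0.5
```

Running the same labeled queries before and after a configuration change gives a direct signal on whether retrieval quality moved.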

Querying with both vector spaces uses Qdrant's query_points API with prefetch blocks. Each prefetch block runs against one vector space and returns a candidate list. The fusion step merges the two ranked lists using RRF, which ranks documents based on their position in each list without requiring score normalization between the two systems.

from qdrant_client.models import Prefetch, FusionQuery, Fusion

def hybrid_search(query, limit=10):
    # Encode the query once per vector space
    dense_vec = list(dense_model.query_embed(query))[0].tolist()
    sparse_result = list(sparse_model.query_embed(query))[0]

    results = client.query_points(
        collection_name="hybrid_search",
        # Each prefetch retrieves candidates from one vector space; fetching
        # more than the final limit gives the fusion step more to work with
        prefetch=[
            Prefetch(
                query=dense_vec,
                using="dense",
                limit=20
            ),
            Prefetch(
                query=SparseVector(
                    indices=sparse_result.indices.tolist(),
                    values=sparse_result.values.tolist()
                ),
                using="sparse",
                limit=20
            )
        ],
        # Merge the two candidate lists by rank position
        query=FusionQuery(fusion=Fusion.RRF),
        limit=limit
    )
    return results

results = hybrid_search("BM25 sparse vector indexing performance")
for r in results.points:
    print(f"[{r.score:.4f}] {r.payload['title']}")

When Native Hybrid Search Beats Application-Side RRF

The native Qdrant approach has three advantages over application-side post-processing. First, latency: two sequential API calls add two round-trip latencies, and two parallel calls avoid the sequential overhead but require concurrent request management. Qdrant runs both prefetch queries inside a single request, with no additional network round-trips between them. Second, consistency: application-side merging that combines raw scores has to normalize values from two different systems (cosine similarity from dense search and BM25 scores from keyword search), which are not on the same scale and require empirical calibration. Rank-based fusion avoids this problem, and Qdrant's built-in RRF uses only rank positions, not raw scores, so you get it without writing or maintaining the merge logic yourself.

Third, and most importantly: native hybrid search integrates correctly with Qdrant's payload filtering. When you add a query_filter to a hybrid query, Qdrant applies the filter before running both the dense and sparse prefetch stages, correctly restricting the candidate pool for both vector spaces. Application-side post-processing cannot achieve this without running two separate filtered queries against two separate systems, doubling infrastructure complexity and making the filter boundary harder to enforce consistently.
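Why the position of the filter matters can be shown without any database. The sketch below uses made-up documents and a hypothetical top_k helper to compare filtering after a top-k cut with filtering before it.

```python
def top_k(docs, k, predicate=None):
    """Return the k best-scoring docs, optionally restricted by a predicate first."""
    candidates = [d for d in docs if predicate is None or predicate(d)]
    return sorted(candidates, key=lambda d: d["score"], reverse=True)[:k]

def is_english(doc):
    return doc["lang"] == "en"

docs = [
    {"id": 1, "score": 0.95, "lang": "en"},
    {"id": 2, "score": 0.90, "lang": "de"},
    {"id": 3, "score": 0.85, "lang": "de"},
    {"id": 4, "score": 0.60, "lang": "en"},
    {"id": 5, "score": 0.55, "lang": "en"},
]

# Filter applied after retrieval: matching documents were already cut off
post_filtered = [d["id"] for d in top_k(docs, 3) if is_english(d)]
# Filter applied before retrieval: the candidate pool is restricted first
pre_filtered = [d["id"] for d in top_k(docs, 3, predicate=is_english)]

print(post_filtered)  # [1]
print(pre_filtered)   # [1, 4, 5]
```

With post-filtering, a request for three results returns one; with the filter applied first, all three slots are filled by documents that satisfy it.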

Hybrid search tends to outperform dense-only search on queries containing specific technical terms, product names, version numbers, or other low-frequency identifiers that are unlikely to be well-represented in embedding space. The dense component recovers semantically related content the user may not have phrased precisely. The sparse component ensures exact or near-exact term matches are not lost in semantic smoothing. Together they cover the failure modes that each approach has individually.

Prompt

"I have a dense-only semantic search system in Qdrant and want to add BM25 hybrid search. My corpus has 200,000 technical documents already indexed. Walk me through adding sparse vector support without downtime, including how to re-index existing documents and how to measure whether recall actually improved after the migration."
