Scale Your Vector DB to a Million Documents

What Actually Changes at Scale

When a collection grows from a few thousand vectors to a million, three things break: query latency, memory usage, and ingestion speed. Knowing which one is your bottleneck determines which configuration changes will actually help.

Query latency breaks because brute-force (exact) similarity search scales linearly with the number of vectors. Searching 100,000 vectors takes roughly 10x longer than searching 10,000. At a million vectors, a brute-force scan returns results in hundreds of milliseconds rather than single-digit milliseconds. The solution is approximate nearest neighbor (ANN) indexing, which sacrifices a small percentage of recall in exchange for dramatically faster queries. Most production workloads accept 95 to 99% recall in exchange for 10x to 100x speed improvements.
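
To make the linear cost concrete, here is a minimal sketch of exact search in plain Python. The collection is random toy data and the function names are illustrative, not taken from any library; the point is only that every query has to score every stored vector.

```python
import math
import random

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def brute_force_search(query, vectors, k):
    """Exact top-k: one similarity computation per stored vector, so O(n) per query."""
    scored = [(cosine_similarity(query, v), idx) for idx, v in enumerate(vectors)]
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:k]]

random.seed(0)
dim = 8  # tiny dimension so the sketch runs instantly (assumption for the demo)
vectors = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(1000)]
query = vectors[42]  # reuse a stored vector as the query

top = brute_force_search(query, vectors, k=5)
print(top[0])  # the query is itself in the collection, so it ranks first
```

Doubling the length of `vectors` doubles the number of `cosine_similarity` calls, which is the behavior ANN indexes are designed to avoid.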

Memory usage breaks because dense vectors take real space. A single 1536-dimensional float32 vector uses about 6 KB (1536 components x 4 bytes). One million such vectors require roughly 6 GB of RAM just for the raw vectors, before accounting for the HNSW graph, metadata, and the temporary segments created during optimization (Qdrant's capacity-planning rule of thumb budgets about 50% extra on top). Without memory management strategies like quantization or on-disk indexing, a large collection requires expensive high-memory instances. Vector quantization techniques can reduce memory usage by 4x to 16x while retaining most recall quality.
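
The arithmetic above is easy to reproduce. The sketch below is a back-of-the-envelope estimator, not a sizing tool: the helper names are made up for this example, and the `overhead_fraction` of 0.5 is an illustrative assumption in line with the overhead figure quoted above.

```python
def raw_vector_bytes(dim, bytes_per_component=4):
    """Bytes needed for one dense vector (float32 = 4 bytes per component)."""
    return dim * bytes_per_component

def collection_bytes(num_vectors, dim, bytes_per_component=4, overhead_fraction=0.0):
    """Raw vector storage, optionally inflated by an assumed index/metadata overhead."""
    raw = num_vectors * raw_vector_bytes(dim, bytes_per_component)
    return int(raw * (1 + overhead_fraction))

GB = 10**9  # decimal gigabytes

per_vector = raw_vector_bytes(1536)                      # 6144 bytes, about 6 KB
raw_total = collection_bytes(1_000_000, 1536)            # float32, no overhead
with_overhead = collection_bytes(1_000_000, 1536, overhead_fraction=0.5)
int8_total = collection_bytes(1_000_000, 1536, bytes_per_component=1)

print(per_vector)
print(raw_total / GB, with_overhead / GB, int8_total / GB)
```

Swapping in your own dimension and vector count gives a first estimate of whether the collection fits in RAM before and after int8 quantization.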

Ingestion speed breaks because inserting vectors one at a time, or in small batches, creates excessive API round-trip overhead and triggers index optimization too frequently. At scale, batched upserts with the right batch size and parallel upload workers are necessary to achieve reasonable ingestion throughput. A poorly tuned ingestion pipeline can take hours to load what a well-tuned one loads in minutes.

ANN Algorithms: HNSW and IVF

The two dominant ANN algorithms for vector databases are HNSW (Hierarchical Navigable Small World) and IVF (Inverted File Index). Most production vector databases default to HNSW, and for good reason.

HNSW builds a multi-layer graph over the vector space. Each layer is a small-world graph where nodes are connected to their nearest neighbors. Search starts at the top layer (fewest nodes, long-range connections) and descends to lower layers (more nodes, shorter-range connections), greedy-walking toward the query vector at each layer. This gives HNSW excellent query latency and strong recall at the cost of higher memory usage and slower index build time compared to simpler methods.
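
The descent described above can be sketched in a few lines. This is a deliberately simplified toy, not the full algorithm: the three layers are hand-built rather than constructed by insertion, and the walk tracks a single best candidate, whereas real HNSW implementations keep a candidate list whose size is controlled by ef.

```python
import math

def greedy_walk(graph, points, query, start):
    """Hop to whichever neighbor is closest to the query until no neighbor improves."""
    current = start
    while True:
        best = min(graph[current], key=lambda n: math.dist(points[n], query), default=current)
        if math.dist(points[best], query) < math.dist(points[current], query):
            current = best
        else:
            return current

def layered_search(layers, points, query, entry_point):
    """Descend from the sparsest layer to the densest, reusing each result as the next entry."""
    current = entry_point
    for graph in layers:  # layers[0] is the top (sparsest) layer
        current = greedy_walk(graph, points, query, current)
    return current

# Eight points on a line; each id equals its x-coordinate for readability.
points = {i: (float(i), 0.0) for i in range(8)}

layers = [
    {0: [4], 4: [0]},                                                   # top: long-range links
    {0: [2], 2: [0, 4], 4: [2, 6], 6: [4]},                             # middle
    {i: [j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)},   # bottom: every point
]

nearest = layered_search(layers, points, query=(5.2, 0.0), entry_point=0)
print(nearest)  # 5: the walk goes 0 -> 4 (top), 4 -> 6 (middle), 6 -> 5 (bottom)
```

The upper layers cover most of the distance in one or two hops, so the dense bottom layer only has to refine the answer locally.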

The key HNSW parameters are M (the number of bidirectional connections per node in the graph) and ef_construct (the size of the candidate set during index construction). Higher M improves recall but increases memory and build time. Higher ef_construct improves recall at the cost of slower builds. At query time, the ef parameter controls recall vs. latency: higher ef means more candidate nodes examined, more accurate results, slower query. The sweet spot for most production workloads is M=16, ef_construct=100 to 200, and a query ef tuned to your latency budget.
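
Tuning ef against a latency budget only works if you can measure recall. One common approach is to compare ANN results with exact results for a sample of queries; in Qdrant, an exact baseline can be requested through the `exact` flag in the search parameters. The helper below is a minimal sketch with hypothetical result ids, not output from a real collection.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k neighbors that the approximate search also returned."""
    truth = set(exact_ids[:k])
    found = truth.intersection(approx_ids[:k])
    return len(found) / len(truth)

def mean_recall(approx_results, exact_results, k):
    """Average recall@k over a batch of queries."""
    scores = [recall_at_k(a, e, k) for a, e in zip(approx_results, exact_results)]
    return sum(scores) / len(scores)

# Hypothetical ids for two queries: exact search vs. an ANN search at some ef value.
exact_results = [[1, 2, 3, 4, 5], [10, 11, 12, 13, 14]]
approx_results = [[1, 2, 3, 4, 9], [10, 11, 12, 13, 14]]

recall = mean_recall(approx_results, exact_results, k=5)
print(recall)  # first query finds 4 of 5, second finds 5 of 5
```

Running this for several ef values alongside measured latency gives the recall-versus-latency curve from which to pick a setting.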

IVF partitions the vector space into clusters and during search only examines the clusters closest to the query. IVF uses less memory than HNSW and builds faster, but query recall is more sensitive to configuration. IVF is a better choice when memory is the primary constraint and you can tolerate additional recall tuning. In practice, HNSW is the safer default for most teams.
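
A toy version of IVF shows why its recall is sensitive to configuration. In this sketch the centroids are fixed by hand (real systems typically learn them, for example with k-means), and `nprobe`, the number of clusters examined per query, is the name commonly used for this setting rather than something taken from the text above.

```python
import math

def nearest_centroids(query, centroids, nprobe):
    """Indices of the nprobe centroids closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: math.dist(query, centroids[c]))
    return order[:nprobe]

def build_ivf(vectors, centroids):
    """Assign each vector id to the inverted list of its closest centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for vid, vec in enumerate(vectors):
        lists[nearest_centroids(vec, centroids, 1)[0]].append(vid)
    return lists

def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the vectors stored in the nprobe closest clusters."""
    candidates = [vid for c in nearest_centroids(query, centroids, nprobe) for vid in lists[c]]
    candidates.sort(key=lambda vid: math.dist(query, vectors[vid]))
    return candidates[:k]

centroids = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]  # hand-picked for illustration
vectors = [(0.5, 0.2), (1.0, 1.0), (9.0, 0.5), (10.5, -0.3), (0.2, 9.5), (4.9, 0.0)]
lists = build_ivf(vectors, centroids)

query = (5.2, 0.0)
one_probe = ivf_search(query, vectors, centroids, lists, k=1, nprobe=1)
two_probes = ivf_search(query, vectors, centroids, lists, k=1, nprobe=2)
print(one_probe, two_probes)  # [2] [5]
```

The true nearest neighbor (id 5) sits just across a cluster boundary from the query, so probing a single cluster misses it and probing two finds it. That boundary effect is the recall tuning the paragraph above refers to.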

Quantization and Batched Upserts

Scalar quantization (int8) reduces each float32 component (4 bytes) to an int8 (1 byte), a 4x reduction in vector storage with minimal recall loss. Qdrant can re-score the top candidates returned by the quantized ANN search against the original float32 vectors, recovering most of the recall loss at a small latency cost. For most workloads, int8 quantization with re-scoring is the best trade-off: a 4x memory reduction for a recall drop that is typically under 1%.
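
The idea of quantizing and then re-scoring can be illustrated without any database. The sketch below is a simplified min/max quantizer with a two-stage search; it is not Qdrant's implementation (which, among other differences, scores directly on the compact codes), and every function name and the `oversample` factor are assumptions made for this example.

```python
def fit_range(vectors):
    """Global min and max over all components, used as the quantization range."""
    values = [x for v in vectors for x in v]
    return min(values), max(values)

def quantize(vector, lo, hi):
    """Map each float to an integer in [-128, 127], i.e. one byte per component."""
    scale = (hi - lo) / 255
    return [max(-128, min(127, round((x - lo) / scale) - 128)) for x in vector]

def dequantize(codes, lo, hi):
    """Approximate reconstruction of the original floats from the int8 codes."""
    scale = (hi - lo) / 255
    return [(c + 128) * scale + lo for c in codes]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def search_with_rescoring(query, vectors, codes, lo, hi, k, oversample=2):
    # Stage 1: rank everything using only the compact codes (dequantized here for simplicity).
    approx = sorted(range(len(codes)),
                    key=lambda i: dot(query, dequantize(codes[i], lo, hi)),
                    reverse=True)
    shortlist = approx[: k * oversample]
    # Stage 2: re-rank just the shortlist with the original float vectors.
    shortlist.sort(key=lambda i: dot(query, vectors[i]), reverse=True)
    return shortlist[:k]

vectors = [(0.9, 0.1, 0.0), (0.1, 0.9, 0.0), (0.8, 0.2, 0.1), (0.0, 0.1, 0.9)]
lo, hi = fit_range(vectors)
codes = [quantize(v, lo, hi) for v in vectors]

result = search_with_rescoring((1.0, 0.0, 0.0), vectors, codes, lo, hi, k=2)
print(codes[0], result)
```

Only the shortlist touches the full-precision vectors, which is why the originals can sit in slower storage while the codes stay in RAM.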

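from qdrant_client import QdrantClient
from qdrant_client.models import (
    VectorParams, Distance, HnswConfigDiff,
    OptimizersConfigDiff, ScalarQuantization,
    ScalarQuantizationConfig, ScalarType
)

# Assumes a Qdrant instance is reachable on localhost:6333.
client = QdrantClient(host="localhost", port=6333)

client.create_collection(
    collection_name="large_corpus",
    vectors_config=VectorParams(
        size=1536,                       # embedding dimension
        distance=Distance.COSINE,
        hnsw_config=HnswConfigDiff(
            m=16,                        # graph links per node
            ef_construct=200,            # candidate list size while building the index
            full_scan_threshold=10000,   # in KB; smaller candidate sets are scanned exactly
            on_disk=True                 # keep the HNSW graph on disk instead of in RAM
        ),
        quantization_config=ScalarQuantization(
            scalar=ScalarQuantizationConfig(
                type=ScalarType.INT8,    # 1 byte per component instead of 4
                always_ram=True          # quantized vectors stay in RAM
            )
        )
    ),
    optimizers_config=OptimizersConfigDiff(
        indexing_threshold=20000,        # in KB; larger segments get an HNSW index
        memmap_threshold=50000           # in KB; larger segments are memory-mapped from disk
    )
)

print("Collection created with HNSW + int8 quantization")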

Setting on_disk=True in the HNSW config stores the graph structure on disk, which reduces RAM usage for very large collections. The quantized int8 vectors are kept in RAM (always_ram=True) because they are read on every query. The raw float32 vectors used for re-scoring can live on disk, since they are only read for the small candidate set returned by the quantized ANN search; in this configuration, memmap_threshold causes larger segments to be stored as memory-mapped files.

from qdrant_client.models import PointStruct

BATCH_SIZE = 2000

def upsert_in_batches(client, collection_name, all_points):
    """Upload points in fixed-size batches to limit API round trips."""
    total = len(all_points)
    batches = (total + BATCH_SIZE - 1) // BATCH_SIZE  # ceiling division
    for i in range(0, total, BATCH_SIZE):
        batch = all_points[i:i + BATCH_SIZE]
        client.upsert(
            collection_name=collection_name,
            points=batch,
            wait=False  # return once the request is acknowledged, not once it is applied
        )
        current_batch = i // BATCH_SIZE + 1
        print(f"Uploaded batch {current_batch}/{batches} ({len(batch)} points)")
    print(f"All {total} points submitted")

# `embeddings` (an array of shape [n, 1536]) and `metadata` (a list of dicts
# with "id" and "source" keys) are assumed to have been loaded earlier.
all_points = [
    PointStruct(
        id=i,
        vector=embeddings[i].tolist(),
        payload={
            "doc_id": metadata[i]["id"],
            "source": metadata[i]["source"]
        }
    )
    for i in range(len(embeddings))
]

upsert_in_batches(client, "large_corpus", all_points)
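
The loop above sends batches one after another. The parallel upload workers mentioned earlier can be sketched with a thread pool, shown here with a stand-in upload function so the example runs without a server. The function names are illustrative; in practice `upload_batch` would wrap a call such as client.upsert, and recent qdrant-client releases also appear to provide bulk helpers (for example upload_points with a parallel argument) that are worth checking in the client documentation first.

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def upload_in_parallel(upload_batch, items, batch_size, workers):
    """Send batches through a thread pool; returns the total number of items uploaded."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map re-raises any exception from upload_batch when results are consumed.
        counts = pool.map(upload_batch, chunked(items, batch_size))
        return sum(counts)

def fake_upload_batch(batch):
    """Stand-in for a real network call; reports how many items it 'sent'."""
    return len(batch)

uploaded = upload_in_parallel(fake_upload_batch, list(range(10_500)), batch_size=2000, workers=4)
print(uploaded)  # 10500
```

Threads suit this workload because each worker spends most of its time waiting on the network. The right worker count depends on what the server can absorb, so it is worth increasing gradually while watching ingestion throughput.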

Qdrant vs Pinecone at Scale

Both Qdrant and Pinecone handle million-vector collections well, but they optimize for different priorities. Qdrant gives you full control over HNSW parameters, quantization strategies, on-disk storage, and filtering behavior. You can tune the recall-latency-memory trade-off precisely for your specific workload. The operational cost is that you manage the infrastructure: provisioning, monitoring, backups, and upgrades.

Pinecone abstracts all of this. You specify the vector dimension and distance metric, and Pinecone handles indexing, replication, and scaling automatically. The trade-off is less tuning control and higher per-query cost at very large scales. For teams without dedicated infrastructure engineers, Pinecone's fully managed model often justifies the premium. For teams with engineering capacity and cost sensitivity at high volume, Qdrant's self-hosted option is significantly cheaper per query.

Prompt

"I need to scale my Qdrant collection from 100,000 to 5 million vectors. My current query latency is 50ms and I need to keep it under 100ms. My vectors are 1536-dimensional. Help me audit my current HNSW and quantization configuration and identify the highest-impact changes to make before the scale-up."
