A Practical Guide to Vector Search at Production Scale
Most vector search tutorials stop at 100K embeddings. This is what changes at 100M — and what stays the same.
Vector search at toy scale is a one-liner: load a flat index, brute-force cosine similarity, ship it. At 100M vectors that’s a 4-second query and a node-killing memory bill.
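At toy scale the whole thing really is a few lines. A minimal brute-force sketch in pure Python — the vectors and dimensions are illustrative, not real embeddings — to make concrete what "flat index, brute-force cosine similarity" means:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def flat_search(query, vectors, k=2):
    # Brute force: score every vector, sort, take top-k.
    # O(N * d) per query, which is why this stops being viable near 100M vectors.
    ranked = sorted(range(len(vectors)),
                    key=lambda i: cosine(query, vectors[i]),
                    reverse=True)
    return ranked[:k]

# Toy corpus: four 3-dimensional vectors.
corpus = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(flat_search([1.0, 0.05, 0.0], corpus, k=2))  # → [0, 1]
```

Exact, perfect recall, trivially simple — and every query touches every vector, which is exactly the property that does not survive two more orders of magnitude.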
What actually matters at scale is mundane.
The three knobs
Pick an index family, pick a recall target, pick a memory budget. You don’t get to optimize all three.
- HNSW is fast and high-recall but eats memory linearly with vector count.
- IVF-PQ scales to billions but you trade 5–10 points of recall.
- DiskANN is the right answer when your dataset doesn’t fit in RAM and you can’t afford the recall hit of PQ.
We use HNSW under 50M vectors, DiskANN above. We no longer use IVF-PQ as a primary index — queries that land near cluster boundaries miss neighbors assigned to cells you didn't probe, and that recall cliff bites you.
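That routing rule is simple enough to write down. A sketch with a hypothetical `pick_index` helper — the 1.5x HNSW graph overhead and the 768-dim float32 vectors are assumptions for the estimate, not measurements:

```python
def pick_index(n_vectors, ram_bytes, dim=768, bytes_per_float=4):
    # Policy from the text: HNSW under 50M vectors, DiskANN above.
    # Rough HNSW RAM estimate: raw vectors plus graph links (~1.5x, an assumption).
    hnsw_ram = int(n_vectors * dim * bytes_per_float * 1.5)
    if n_vectors < 50_000_000 and hnsw_ram <= ram_bytes:
        return "hnsw"
    # Too big for RAM at acceptable recall: spill the index to SSD.
    return "diskann"

print(pick_index(10_000_000, ram_bytes=128 * 2**30))   # → hnsw (~46 GB fits)
print(pick_index(100_000_000, ram_bytes=128 * 2**30))  # → diskann
```

The useful part isn't the thresholds — tune those to your hardware — it's that the decision is made by a function you can test, not by whoever happens to be on call.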
The thing nobody tells you
Re-embedding is the silent killer. When you upgrade your model, you have to re-embed every vector. At 100M vectors and $0.0001 per embed, that’s $10K just to ship a model update. Build batched re-embedding into your pipeline from day one, or you’ll dread your own roadmap.
Filtering
Pre-filter (apply the WHERE clause before the vector search) wins below roughly 1% selectivity. Above that, post-filter (search with overshoot, then filter) is usually faster. We benchmark both at deploy time and pick per query.
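Both strategies in miniature, with stand-in scoring and predicate functions — the overshoot factor of 4 is an arbitrary example, not a recommendation:

```python
def prefilter_search(query_score, items, predicate, k):
    # Pre-filter: restrict the candidate set first, then rank only survivors.
    # Cheap when the predicate is highly selective; wasteful when it isn't.
    survivors = [i for i in items if predicate(i)]
    return sorted(survivors, key=query_score, reverse=True)[:k]

def postfilter_search(query_score, items, predicate, k, overshoot=4):
    # Post-filter: rank everything, fetch k * overshoot candidates, then filter.
    # A real system would retry with a larger overshoot if too few survive.
    candidates = sorted(items, key=query_score, reverse=True)[:k * overshoot]
    return [i for i in candidates if predicate(i)][:k]

# Stand-ins: "similarity" is closeness to 5, the filter keeps even ids.
score = lambda i: -abs(i - 5)
even = lambda i: i % 2 == 0
print(prefilter_search(score, list(range(10)), even, k=2))   # → [4, 6]
print(postfilter_search(score, list(range(10)), even, k=2))  # → [4, 6]
```

Same results here, but the cost profiles diverge sharply with selectivity — which is why measuring both on your own data beats any rule of thumb.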