A Practical Guide to Vector Search at Production Scale

Most vector search tutorials stop at 100K embeddings. This is what changes at 100M — and what stays the same.

Vector search at toy scale is a one-liner: load a flat index, brute-force cosine similarity, ship it. At 100M vectors that’s a 4-second query and a node-killing memory bill.
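The toy-scale version really is that short. A minimal sketch with synthetic data (10K vectors here stand in for the 100K-embedding tutorial case; the O(n) scan is exactly what stops scaling):

```python
import numpy as np

# Toy-scale "flat index": brute-force cosine similarity over every vector.
# Synthetic corpus for illustration; at 100M vectors this full scan is the
# 4-second query the post describes.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(10_000, 128)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)  # unit-normalize once

def brute_force_search(query: np.ndarray, k: int = 10) -> np.ndarray:
    """Return indices of the k most cosine-similar vectors."""
    q = query / np.linalg.norm(query)
    scores = corpus @ q                       # cosine == dot product on unit vectors
    return np.argpartition(-scores, k)[:k]    # O(n) scan: fine at 10K, fatal at 100M

hits = brute_force_search(rng.normal(size=128).astype(np.float32))
```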

What actually matters at scale is mundane.

The three knobs

Pick an index family, pick a recall target, pick a memory budget. You don’t get to optimize all three.

  • HNSW is fast and high-recall but eats memory linearly with vector count.
  • IVF-PQ scales to billions but you trade 5–10 points of recall.
  • DiskANN is the right answer when your dataset doesn’t fit in RAM and you can’t afford the recall hit of PQ.

We use HNSW under 50M, DiskANN above. We never use IVF-PQ as a primary index anymore — the recall cliff at the boundary of clusters bites you.
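"Eats memory linearly" is easy to check on the back of an envelope. A rough estimator, assuming a typical HNSW layout of float32 vectors plus roughly 2×M 4-byte neighbor links per node (the base layer dominates; upper layers add a few percent we ignore here — the constants are assumptions, not measurements of any specific library):

```python
def hnsw_memory_gib(n_vectors: int, dim: int, M: int = 32) -> float:
    """Rough HNSW RAM estimate in GiB.

    Assumed layout: 4 bytes per float32 dimension, plus ~2*M 4-byte
    neighbor ids per node for the graph. Illustrative only.
    """
    bytes_per_vector = 4 * dim + 2 * M * 4
    return n_vectors * bytes_per_vector / 2**30

# 100M vectors at 768 dimensions: ~310 GiB of RAM for the index alone,
# which is why DiskANN takes over above our 50M threshold.
ram_needed = hnsw_memory_gib(100_000_000, 768)
```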

The thing nobody tells you

Re-embedding is the silent killer. When you upgrade your model, you have to re-embed every vector. At 100M vectors and $0.0001 per embed, that’s $10K just to ship a model update. Build batched re-embedding into your pipeline from day one, or you’ll dread your own roadmap.
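The arithmetic is trivial but worth wiring into the pipeline rather than a spreadsheet. A sketch of the cost math plus the batched, checkpoint-per-batch shape we mean — `fetch_text`, `embed_batch`, and `upsert` are assumed interfaces standing in for your own store and model client, not a real library:

```python
def reembed_cost(n_vectors: int, price_per_embed: float) -> float:
    """Dollar cost of a full re-embed: 100M vectors at $0.0001 is $10K."""
    return n_vectors * price_per_embed

def reembed(ids, fetch_text, embed_batch, upsert, batch_size=512):
    """Batched re-embedding. Each batch is upserted before the next starts,
    so a model upgrade is a resumable background job, not a big-bang migration.
    fetch_text/embed_batch/upsert are hypothetical pipeline hooks."""
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        vectors = embed_batch([fetch_text(i) for i in batch])
        upsert(batch, vectors)  # checkpoint per batch -> safe to resume

# Demo with in-memory stubs in place of a real store and embedder.
cost = reembed_cost(100_000_000, 0.0001)
texts = {i: f"doc {i}" for i in range(5)}
store = {}
reembed(list(texts), texts.__getitem__,
        lambda batch: [len(t) for t in batch],          # stand-in embedder
        lambda ids, vecs: store.update(zip(ids, vecs)),
        batch_size=2)
```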

Filtering

Pre-filtering (applying the WHERE clause before the vector search) wins below ~1% selectivity. Above that, post-filtering (search, then filter, with overshoot) is consistently faster. We benchmark both at deploy time and pick per query.
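The two strategies in miniature, over a brute-force scan for clarity (a real index replaces the matrix products, but the shape of the trade-off is the same — synthetic data, assumed ~1% selectivity):

```python
import numpy as np

rng = np.random.default_rng(1)
vecs = rng.normal(size=(1_000, 64)).astype(np.float32)
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
allowed = rng.random(1_000) < 0.01        # ~1% selectivity WHERE clause
q = vecs[0]                               # reuse a corpus vector as the query

def pre_filter_search(k: int) -> np.ndarray:
    """Apply the filter first, then search only the matching vectors.
    Cheap when few rows survive the filter."""
    idx = np.flatnonzero(allowed)
    scores = vecs[idx] @ q
    return idx[np.argsort(-scores)[:k]]

def post_filter_search(k: int, overshoot: int = 4) -> np.ndarray:
    """Search for k * overshoot candidates, then drop filtered-out hits.
    Cheap when most rows survive, since the search does the heavy lifting."""
    scores = vecs @ q
    top = np.argsort(-scores)[: k * overshoot]
    kept = [i for i in top if allowed[i]]
    return np.array(kept[:k], dtype=np.int64)

pre_hits = pre_filter_search(5)
post_hits = post_filter_search(5)
```

Note the overshoot factor in post-filtering: if the filter rejects most candidates, k×overshoot may still come back short of k results, which is exactly why it loses at low selectivity.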
