Technical reference

Scaling an AI Knowledge Stack Past a Million Entries

Everything works at 10,000 entries. Most things work at 100,000. By a million, the decisions you avoided at the start become the latency problems you can't ignore.

The Scale Phases

Scaling Vector Storage Thresholds

Vector database performance does not degrade linearly. Most teams hit specific breakpoints where query latency spikes or recall drops, and often conclude, mistakenly, that a migration to a specialized vector store is required when the existing Postgres instance simply needs tuning to handle a million entries.

  • 10k to 100k Entries: Default IVFFlat indexes typically suffice. Performance remains high with minimal configuration, as the dataset fits comfortably in memory.
  • 100k to 1M Entries: Recall begins to fluctuate. Tuning lists and probes becomes necessary for IVFFlat, or a transition to HNSW is required for consistent latency.
  • 1M to 10M Entries: This is the critical inflection point. To scale past one million entries, teams must implement HNSW indexes and introduce Postgres partitioning to prevent index build times from becoming prohibitive.
  • 10M+ Entries: Architectural shifts are required. This phase necessitates read replicas for query distribution and materialized views to pre-calculate common similarity joins.

Failure to recognize these phases leads to over-engineering. Many deployments attempt to implement complex sharding at 500k rows when a simple index change would resolve the bottleneck.
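Before reaching for sharding, it is worth confirming that index parameters really are the bottleneck. A minimal tuning pass might look like the following sketch, which assumes an existing IVFFlat index on a hypothetical documents table:

```sql
-- Raise the number of lists probed per query for this session.
-- Higher values trade latency for recall (the default is 1).
SET ivfflat.probes = 32;

-- Re-run a representative query and compare latency and recall
-- before concluding that a migration or shard is necessary.
SELECT id
FROM documents
ORDER BY embedding <=> '[0.1, 0.2, 0.3]'::vector
LIMIT 10;
```

If recall recovers at an acceptable latency, the "scaling problem" was a one-line configuration change.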

IVFFlat vs HNSW at Scale

Indexing Tradeoffs for High-Volume Datasets

Choosing between IVFFlat and HNSW depends on the balance between memory availability and recall requirements. For knowledge bases at the million-entry mark, HNSW is generally recommended due to superior recall (95%+) and consistent query performance.

Factor      HNSW (<50M)             IVFFlat (50M+)
Build time  Slow (hours for 100M)   Fast (minutes for 100M)
Index size  2-5x larger             Compact
Recall      95%+ by default         70-80% by default
Inserts     Incremental             Requires periodic rebuilds

HNSW is tuned via m (connections per node) and ef_construction. To increase recall at runtime, adjust hnsw.ef_search.

CREATE INDEX CONCURRENTLY idx_vectors_hnsw ON vectors USING hnsw (embedding vector_cosine_ops)
WITH (m = 32, ef_construction = 200);
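At query time, recall can be raised further without touching the index by widening the search candidate list. A hedged example, assuming a vectors table holding the embeddings:

```sql
-- hnsw.ef_search sets the size of the dynamic candidate list during
-- search (default 40). Larger values raise recall at the cost of
-- latency; the value must be at least as large as the LIMIT.
SET hnsw.ef_search = 100;

SELECT id
FROM vectors
ORDER BY embedding <=> '[0.1, 0.2, 0.3]'::vector
LIMIT 10;
```

Because ef_search is a session-level setting, different query paths (e.g., interactive search vs. batch re-ranking) can run at different recall/latency points against the same index.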

IVFFlat relies on Voronoi cells. It requires all data to be loaded before indexing to ensure proper clustering. The usual heuristic for lists is row_count / 1000 up to roughly 1M rows; beyond that, shift to the square root of the row count.

CREATE INDEX CONCURRENTLY idx_vectors_ivfflat ON vectors USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 1000); -- For 1M entries

Switch from IVFFlat to HNSW when recall drops below 90% for primary query patterns, provided the infrastructure can support the increased memory footprint.
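The switch itself does not require downtime: the replacement index can be built alongside the old one. A sketch, assuming an existing IVFFlat index named idx_vectors_ivfflat on a vectors table:

```sql
-- Build the HNSW index in the background; reads and writes continue.
CREATE INDEX CONCURRENTLY idx_vectors_hnsw
ON vectors USING hnsw (embedding vector_cosine_ops)
WITH (m = 32, ef_construction = 200);

-- Once the new index is valid, retire the old one.
DROP INDEX CONCURRENTLY idx_vectors_ivfflat;
```

The planner will pick up the new index automatically; the only cost is temporarily holding both indexes on disk during the build.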

Partitioning Strategies

Implementing Declarative Partitioning

At the million-entry scale, Postgres declarative partitioning parallelizes index builds and prunes unnecessary data during scans. By splitting a massive table into smaller sub-tables, pgvector can maintain indexes on a per-partition basis.

Partitioning Methods

  • Range Partitioning: Ideal for time-series data (e.g., created_at). This allows the system to isolate recent "hot" vectors and drop old partitions without expensive DELETE operations.
  • List Partitioning: Used for scoped retrieval based on a specific attribute, such as source_type or category. Queries filtered by these keys ignore irrelevant partitions entirely.
  • Hash Partitioning: Applied at extreme scales (10M+) to distribute data evenly across physical disks using a stable key like tenant_id.

CREATE TABLE vectors (
  id SERIAL,
  embedding vector(1536),
  tenant_id INT
) PARTITION BY HASH (tenant_id);

CREATE TABLE vectors_p0 PARTITION OF vectors FOR VALUES WITH (MODULUS 4, REMAINDER 0);

This structure ensures that a query including WHERE tenant_id = 123 only scans the specific partition containing that ID's vectors, drastically reducing I/O overhead.
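Range partitioning follows the same pattern for time-series data. A sketch, assuming vectors are bucketed by month on a created_at column (table and partition names hypothetical):

```sql
CREATE TABLE vectors_ts (
  id BIGSERIAL,
  embedding vector(1536),
  created_at TIMESTAMPTZ NOT NULL
) PARTITION BY RANGE (created_at);

-- One partition per month; recent "hot" vectors stay isolated.
CREATE TABLE vectors_2025_01 PARTITION OF vectors_ts
  FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');

-- Aging out old data is a metadata operation, not a bulk DELETE.
ALTER TABLE vectors_ts DETACH PARTITION vectors_2025_01;
DROP TABLE vectors_2025_01;
```

Detach-then-drop avoids the vacuum churn and index bloat that a mass DELETE of old embeddings would cause.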

Caching and Read Replicas

Optimizing Retrieval Latency

Once a system reaches 10M+ rows, index tuning alone cannot eliminate latency. Growing from one million entries into the tens of millions requires architectural caching and read distribution.

Redis is frequently deployed as a Top-K cache. By storing the results of common vector queries with a TTL of several minutes or hours, the system avoids redundant heavy computations on the primary database. This is particularly effective for trending documents in RAG pipelines.

For read-heavy workloads, implementing Postgres read replicas—available through providers like Supabase—allows query traffic to be offloaded from the primary write instance. This prevents long-running vector similarity searches from blocking critical write transactions.
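When routing similarity searches to replicas, it is worth monitoring replication lag so that stale reads stay within tolerance. One way to check, run on the replica itself (acceptable thresholds are application-specific):

```sql
-- Approximate staleness of this replica relative to the primary.
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;
```

For a RAG pipeline this lag usually only matters when freshly ingested documents must be searchable immediately; otherwise a few seconds of lag is invisible to users.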

Often, perceived scaling issues are actually failures in retrieval quality. If latency is low but results are irrelevant, the bottleneck is likely poor chunking strategies or inadequate embedding models rather than database performance.

Combining read replicas with an aggressive caching layer ensures that the system maintains sub-second response times even as the underlying vector store grows toward 100M entries.

Appendix · Questions

Reference: common questions

How do I scale pgvector past 1 million rows?
For datasets between 1M and 50M entries, implement HNSW indexes to maintain high recall (95%+) and consistent query performance. To further optimize, use Postgres declarative partitioning on a stable key like tenant_id or date to parallelize index builds and prune scans during similarity searches.
When should I switch from IVFFlat to HNSW in pgvector?
Switch to HNSW when you require higher recall, incremental data inserts without frequent index rebuilds, and more consistent query latency. IVFFlat is generally reserved for massive datasets (50M+ vectors) where memory constraints are critical and build speed outweighs retrieval precision.
Do I need partitioning for a large AI knowledge base?
Yes, once you hit the 10M+ range. Partitioning by category or tenant_id allows you to isolate hot data and perform parallel index builds across sub-tables, preventing a single massive HNSW index from becoming a maintenance bottleneck.
How do I cache AI knowledge base queries for better performance?
Implement a caching layer using Redis or Valkey to store the results of frequent semantic queries. Since vector searches are computationally expensive, caching the top-k result IDs for common embeddings significantly reduces the load on your Postgres CPU.
Should I use Postgres read replicas for vector search workloads?
Yes. Vector similarity searches are CPU and memory intensive; offloading these queries to read replicas prevents semantic search spikes from locking up your primary write instance, ensuring your knowledge base remains available for updates.
What is the memory cost of HNSW vs IVFFlat?
HNSW has a significantly higher memory footprint because it stores complex neighbor graphs to enable fast traversal. IVFFlat is much more compact as it uses Voronoi cells, making it the better choice for memory-bound environments with 50M+ vectors.
How do I rebuild a pgvector index without downtime?
Build a replacement index with `CREATE INDEX CONCURRENTLY` and drop the old one once the new index is valid, or use `REINDEX CONCURRENTLY` on Postgres 12+. Either approach runs in the background without taking an exclusive lock on the table, so your AI application can continue querying and inserting data during the process.
Can I shard pgvector across multiple Postgres instances?
pgvector is an extension that runs within a single Postgres instance, but you can achieve sharding with tools like Citus or by implementing application-level sharding. This distributes your vector embeddings across multiple nodes to overcome the vertical scaling limits of a single server.
What breaks first when scaling pgvector to 10M rows?
Memory (RAM) is usually the first bottleneck, as HNSW indexes must ideally reside in memory for peak performance. If RAM is exhausted, you will see a sharp increase in disk I/O and query latency, followed by significantly longer index build times.
How do I measure retrieval quality at scale?
Track recall by comparing the results of an approximate nearest neighbor (ANN) search against a brute-force exact scan on a sample subset. For HNSW, tune `hnsw.ef_search` to find the optimal balance between query speed and accuracy.
When is it time to add a dedicated vector database?
Consider moving to a dedicated store like Milvus or Pinecone when your dataset exceeds 50M-100M vectors or you require specialized features like native multi-tenancy and automatic scaling that exceed the capabilities of a self-managed Postgres cluster.