Scaling an AI Knowledge Stack Past a Million Entries
Everything works at 10,000 entries. Most things work at 100,000. By a million, the decisions you avoided at the start become the latency problems you can't ignore.
The Scale Phases
Scaling Vector Storage Thresholds
Vector database performance does not degrade linearly. Most teams hit specific breakpoints where query latency spikes or recall drops, and often conclude, mistakenly, that a migration to a specialized vector store is required when the existing Postgres instance simply needs tuning to scale past a million entries.
- 10k to 100k Entries: Default IVFFlat indexes typically suffice. Performance remains high with minimal configuration, as the dataset fits comfortably in memory.
- 100k to 1M Entries: Recall begins to fluctuate. Tuning `lists` and `probes` becomes necessary for IVFFlat, or a transition to HNSW is required for consistent latency.
- 1M to 10M Entries: This is the critical inflection point. To scale past a million entries, teams must implement HNSW indexes and introduce Postgres partitioning to prevent index build times from becoming prohibitive.
- 10M+ Entries: Architectural shifts are required. This phase necessitates read replicas for query distribution and materialized views to pre-calculate common similarity joins.
Failure to recognize these phases leads to over-engineering. Many deployments attempt to implement complex sharding at 500k rows when a simple index change would resolve the bottleneck.
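Before re-architecting, it is worth confirming that query-time tuning has been exhausted. A minimal sketch, assuming a hypothetical pgvector table named `vectors` with an IVFFlat index on `embedding` (the query vector is a placeholder):

```sql
-- Widen the IVFFlat search at query time. probes defaults to 1;
-- raising it trades latency for recall without touching the index.
SET ivfflat.probes = 32;

-- Confirm the index is used and measure the resulting latency.
EXPLAIN ANALYZE
SELECT id FROM vectors
ORDER BY embedding <=> '[0.1, 0.2, 0.3]'::vector  -- placeholder query vector
LIMIT 10;
```

If recall recovers at an acceptable latency, the 500k-row "sharding problem" was a one-line setting.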
IVFFlat vs HNSW at Scale
Indexing Tradeoffs for High-Volume Datasets
Choosing between IVFFlat and HNSW depends on the balance between memory availability and recall requirements. For teams scaling past a million entries, HNSW is generally recommended due to superior recall (95%+) and consistent query performance.
| Factor | HNSW (best below ~50M rows) | IVFFlat (50M+ rows) |
|---|---|---|
| Build Time | Slow (Hours for 100M) | Fast (Minutes for 100M) |
| Index Size | 2-5x larger | Compact |
| Recall | 95%+ default | 70-80% default |
| Inserts | Incremental | Requires rebuilds |
HNSW is tuned via `m` (connections per node) and `ef_construction` (candidate list size during the build). To increase recall at runtime, raise `hnsw.ef_search`.
```sql
CREATE INDEX CONCURRENTLY idx_vectors_hnsw ON vectors USING hnsw (embedding vector_cosine_ops)
WITH (m = 32, ef_construction = 200);
```
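Runtime recall can then be raised without rebuilding anything. A sketch, again assuming the hypothetical `vectors` table and a placeholder query vector:

```sql
-- hnsw.ef_search defaults to 40; higher values scan more candidates
-- per query, improving recall at the cost of latency.
SET hnsw.ef_search = 100;

SELECT id FROM vectors
ORDER BY embedding <=> '[0.1, 0.2, 0.3]'::vector  -- placeholder query vector
LIMIT 10;
```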
IVFFlat relies on Voronoi cells. It requires all data to be loaded before indexing to ensure proper clustering. The common `lists` heuristic is `rows / 1000` for datasets up to roughly 1M rows; beyond that, it shifts to `sqrt(rows)`.
```sql
CREATE INDEX CONCURRENTLY idx_vectors_ivfflat ON vectors USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 1000); -- rows / 1000 at 1M entries
```
Switch from IVFFlat to HNSW when recall drops below 90% for primary query patterns, provided the infrastructure can support the increased memory footprint.
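That 90% threshold can be measured directly by comparing indexed results against an exact scan. A sketch against the hypothetical `vectors` table: the second query disables index scans so Postgres falls back to an exact sequential scan for ground truth.

```sql
-- Approximate top-10 via the index.
SELECT id FROM vectors
ORDER BY embedding <=> '[0.1, 0.2, 0.3]'::vector LIMIT 10;

-- Exact top-10: force a sequential scan, scoped to this transaction.
BEGIN;
SET LOCAL enable_indexscan = off;
SELECT id FROM vectors
ORDER BY embedding <=> '[0.1, 0.2, 0.3]'::vector LIMIT 10;
COMMIT;
-- Recall = |overlap of the two ID sets| / 10, averaged over sample queries.
```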
Partitioning Strategies
Implementing Declarative Partitioning
To scale past a million entries effectively, Postgres partitioning is used to parallelize index builds and prune unnecessary data during scans. By splitting a massive table into smaller sub-tables, pgvector can maintain indexes on a per-partition basis.
Partitioning Methods
- Range Partitioning: Ideal for time-series data (e.g., `created_at`). This allows the system to isolate recent "hot" vectors and drop old partitions without expensive DELETE operations.
- List Partitioning: Used for scoped retrieval based on a specific attribute, such as `source_type` or `category`. Queries filtered by these keys ignore irrelevant partitions entirely.
- Hash Partitioning: Applied at extreme scales (10M+) to distribute data evenly across physical disks using a stable key like `tenant_id`.
```sql
CREATE TABLE vectors (
    id SERIAL,
    embedding vector(1536),
    tenant_id INT
) PARTITION BY HASH (tenant_id);

CREATE TABLE vectors_p0 PARTITION OF vectors FOR VALUES WITH (MODULUS 4, REMAINDER 0);
-- Repeat for REMAINDER 1 through 3; all four partitions must exist before inserts.
```
This structure ensures that a query including WHERE tenant_id = 123 only scans the specific partition containing that ID's vectors, drastically reducing I/O overhead.
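Pruning can be verified with EXPLAIN: only one of the four hash partitions should appear in the plan. A sketch against the hypothetical `vectors` table above:

```sql
EXPLAIN (COSTS OFF)
SELECT id FROM vectors
WHERE tenant_id = 123
ORDER BY embedding <=> '[0.1, 0.2, 0.3]'::vector  -- placeholder query vector
LIMIT 10;
-- The plan should reference a single partition (e.g., vectors_p3),
-- confirming the others were pruned at plan time.
```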
Caching and Read Replicas
Optimizing Retrieval Latency
Once a system reaches 10M+ rows, index tuning alone cannot eliminate latency. To scale from one million entries into the tens of millions, architectural caching and read distribution are required.
Redis is frequently deployed as a Top-K cache. By storing the results of common vector queries with a TTL of several minutes or hours, the system avoids redundant heavy computations on the primary database. This is particularly effective for trending documents in RAG pipelines.
For read-heavy workloads, implementing Postgres read replicas (available through providers like Supabase) allows query traffic to be offloaded from the primary write instance. This prevents long-running vector similarity searches from blocking critical write transactions.
Often, perceived scaling issues are actually failures in retrieval quality. If latency is low but results are irrelevant, the bottleneck is likely poor chunking strategies or inadequate embedding models rather than database performance.
Combining read replicas with an aggressive caching layer ensures that the system maintains sub-second response times even as the underlying vector store grows toward 100M entries.