Sub-10ms AI Responses Without Calling the LLM
A user asks “How do I request vacation?” Your Q&A assistant retrieves documents, builds a prompt, calls the LLM, streams the response. Two seconds, maybe three. The next user asks “What’s the time-off process?” Same answer. Same two seconds. Same cost.
Exact-match caching doesn’t help — the strings are different. But the intent is identical. I needed a cache that understood meaning, not just characters.
The naive approach: exact-match caching
The first thing everyone tries is hashing the question and looking it up in a key-value store. If the hash matches, return the cached answer. Simple, fast, and almost completely useless.
Real users don’t ask questions the same way twice. “How do I request PTO?”, “Where do I submit a vacation request?”, “Time off — how?” — all the same question, zero cache hits. I measured the hit rate on exact-match caching against a representative test set: under 3%. Not worth the code.
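For concreteness, the exact-match baseline is just a normalize-then-hash lookup. A minimal sketch (the exactKey helper is illustrative, not from the real system) showing why paraphrases never hit:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// exactKey is the whole trick: normalize, hash, look up. Even aggressive
// normalization can't make genuine paraphrases collide.
func exactKey(question string) string {
	norm := strings.ToLower(strings.TrimSpace(question))
	sum := sha256.Sum256([]byte(norm))
	return hex.EncodeToString(sum[:])
}

func main() {
	a := exactKey("How do I request PTO?")
	b := exactKey("Where do I submit a vacation request?")
	fmt.Println(a == b) // same intent, different keys: guaranteed miss
}
```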
Embedding the question
The fix is to compare questions by meaning instead of characters. Every incoming question gets turned into a vector embedding — a list of numbers that captures its semantic content. Then instead of exact string comparison, you do a nearest-neighbor search against all previously cached question-answer pairs.
// LookupSemanticCache checks if a semantically similar question
// has already been answered. Returns the cached answer if the
// similarity exceeds the threshold.
func LookupSemanticCache(
	ctx context.Context,
	db *pgxpool.Pool,
	questionEmbedding []float32,
	tenantID string,
	threshold float64,
) (*CachedAnswer, error) {
	var answer CachedAnswer
	err := db.QueryRow(ctx, `
		SELECT
			answer_text,
			sources_referenced,
			1 - (question_embedding <=> $1::vector) AS similarity
		FROM qa_cache
		WHERE tenant_id = $2
		  AND invalidated_at IS NULL
		ORDER BY question_embedding <=> $1::vector
		LIMIT 1
	`, pgvector.NewVector(questionEmbedding), tenantID).Scan(
		&answer.Text,
		&answer.SourcesReferenced,
		&answer.Similarity,
	)
	if errors.Is(err, pgx.ErrNoRows) {
		return nil, nil // empty cache for this tenant: treat as a miss
	}
	if err != nil {
		return nil, err
	}
	if answer.Similarity < threshold {
		return nil, nil // nearest match isn't close enough
	}
	return &answer, nil
}
The <=> operator is pgvector’s cosine distance. Lower distance means higher similarity. The query finds the closest cached question to the incoming one, and if it’s close enough, returns the cached answer.
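For intuition, here is a toy re-implementation of what <=> computes (in production this happens inside pgvector, not in application code):

```go
package main

import (
	"fmt"
	"math"
)

// cosineDistance mirrors the <=> operator: 1 - (a·b)/(|a||b|).
// Vectors pointing the same direction give 0; orthogonal vectors give 1.
func cosineDistance(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	return 1 - dot/(math.Sqrt(na)*math.Sqrt(nb))
}

func main() {
	q := []float32{0.9, 0.1, 0.4}
	paraphrase := []float32{0.85, 0.15, 0.42}
	d := cosineDistance(q, paraphrase)
	fmt.Printf("distance=%.4f similarity=%.4f\n", d, 1-d)
}
```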
Threshold tuning: the number that matters
I use a cosine similarity threshold of 0.85 (equivalently, a cosine distance of 0.15, since similarity = 1 - distance). This number took some tuning to get right, and it's the single most important parameter in the whole system.
The tradeoff is simple:
- Too low (more permissive) — the cache returns answers for questions that are close but not close enough. “How do I request vacation?” matches “How do I cancel my vacation?” Same topic, opposite intent. Wrong cached answer served with full confidence.
- Too high (more strict) — the cache almost never hits. You’re paying for embeddings on every question but rarely getting the benefit.
I arrived at 0.85 by logging every cache lookup during two weeks of testing — the question, the nearest match, the similarity score, and whether the cached answer was actually correct for the new question. 0.85 gave me a false-positive rate under 2% with a cache hit rate around 35% across the test dataset.
const (
	// SimilarityThreshold is the minimum cosine similarity required
	// for a cache hit. Tuned against test data to balance
	// hit rate (~35%) against false positives (<2%).
	SimilarityThreshold = 0.85
)
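The tuning described above can be replayed as a simple threshold sweep over the logged lookups. A sketch, assuming hand-labeled data (LoggedLookup and evaluate are illustrative names, not from the original code):

```go
package main

import "fmt"

// LoggedLookup is one cache lookup from the tuning period: the similarity
// of the nearest cached question, and whether serving that cached answer
// would actually have been correct (judged by hand).
type LoggedLookup struct {
	Similarity float64
	Correct    bool
}

// evaluate returns the hit rate and false-positive rate a candidate
// similarity threshold would have produced over the logged lookups.
func evaluate(logs []LoggedLookup, threshold float64) (hitRate, falsePositiveRate float64) {
	var hits, wrong int
	for _, l := range logs {
		if l.Similarity >= threshold {
			hits++
			if !l.Correct {
				wrong++
			}
		}
	}
	if hits == 0 {
		return 0, 0
	}
	return float64(hits) / float64(len(logs)), float64(wrong) / float64(hits)
}

func main() {
	logs := []LoggedLookup{
		{0.97, true}, {0.91, true}, {0.88, true},
		{0.83, false}, {0.72, false},
	}
	for _, t := range []float64{0.80, 0.85, 0.90} {
		hr, fpr := evaluate(logs, t)
		fmt.Printf("threshold=%.2f hit=%.0f%% fp=%.0f%%\n", t, hr*100, fpr*100)
	}
}
```

Sweeping candidate thresholds over the same log makes the hit-rate versus false-positive tradeoff visible before you commit to a number.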
The schema
The cache lives in PostgreSQL with the pgvector extension. Each entry stores the question embedding, the generated answer, and metadata about which source documents were used to produce it.
CREATE TABLE qa_cache (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id TEXT NOT NULL,
    question_text TEXT NOT NULL,
    question_embedding vector(1536) NOT NULL,
    answer_text TEXT NOT NULL,
    sources_referenced TEXT[] NOT NULL DEFAULT '{}',
    language TEXT NOT NULL DEFAULT 'en',
    hit_count INTEGER NOT NULL DEFAULT 0,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    last_hit_at TIMESTAMPTZ,
    invalidated_at TIMESTAMPTZ
);

-- HNSW index for fast approximate nearest-neighbor search.
-- ef_construction=128 and m=16 give good recall at this scale.
CREATE INDEX idx_qa_cache_embedding
    ON qa_cache
    USING hnsw (question_embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 128);

-- Tenant-scoped lookups are the common path.
CREATE INDEX idx_qa_cache_tenant
    ON qa_cache (tenant_id)
    WHERE invalidated_at IS NULL;
Two things to note:
- Tenant scoping. Every query filters by tenant_id. Each organization only searches its own cache. This matters because the same question can have different correct answers depending on the organization's documents. "What's the vacation policy?" has a different answer for every company.
- HNSW index. This is what makes the lookup fast. HNSW (Hierarchical Navigable Small World) is an approximate nearest-neighbor algorithm that trades a small amount of accuracy for massive speed gains. At 100K cached entries, a lookup takes single-digit milliseconds.
Soft invalidation
The hard part of any cache is invalidation. When a source document changes, which cached answers are now stale?
The naive approach is to nuke the entire cache whenever any document changes. This works but it’s wasteful — if you update the vacation policy document, you don’t need to invalidate cached answers about the expense report process.
The sources_referenced array makes surgical invalidation possible. When a document changes, you only invalidate cache entries that actually cited that document:
// InvalidateBySource marks cache entries as stale when a source
// document they reference has been updated. Only affects entries
// that actually cited the changed document.
func InvalidateBySource(
	ctx context.Context,
	db *pgxpool.Pool,
	tenantID string,
	sourceID string,
) (int64, error) {
	tag, err := db.Exec(ctx, `
		UPDATE qa_cache
		SET invalidated_at = now()
		WHERE tenant_id = $1
		  AND $2 = ANY(sources_referenced)
		  AND invalidated_at IS NULL
	`, tenantID, sourceID)
	if err != nil {
		return 0, err
	}
	return tag.RowsAffected(), nil
}
When the vacation policy document gets updated, only the cached Q&A pairs that referenced it get invalidated. Everything else stays warm. In practice, a single document update invalidates 5–15% of a tenant’s cache entries, not 100%.
Invalidated entries aren’t deleted — they’re marked with a timestamp. The lookup query filters them out with WHERE invalidated_at IS NULL. This makes invalidation a fast UPDATE rather than a DELETE, and you keep the history for debugging.
The full lookup flow
Putting it all together, the cache sits in front of the full RAG pipeline:
User question
      |
      v
Embed question (one API call, ~50ms)
      |
      v
Search cache (pgvector HNSW, ~5ms)
      |
      +---> Cache HIT (similarity >= 0.85)
      |        |
      |        v
      |     Return cached answer (~0ms)
      |     Total: ~55ms
      |
      +---> Cache MISS
               |
               v
            Full RAG pipeline:
            Retrieve docs + Build prompt + LLM call
            Total: 2000-3000ms
               |
               v
            Store in cache for next time
A cache hit skips the document retrieval, prompt construction, and LLM call entirely. The only cost is the embedding of the incoming question (which you need anyway for the similarity search) plus the vector lookup. That’s ~55ms end-to-end versus 2–3 seconds for the full pipeline.
The “sub-10ms” in the title refers to the cache lookup itself, not the embedding step. If you pre-compute embeddings (which some architectures allow), the entire response is sub-10ms.
The technical names I didn’t know
After building this, I found the established terminology:
- Semantic caching — caching by meaning rather than exact key. The term comes from database research (semantic query caching) but maps directly to what I built for LLM responses.
- Cache coherence — keeping cached data consistent with its source of truth. In multiprocessor hardware this is about CPU cache lines and memory barriers. Here it's about invalidating cached AI answers when their source documents change.
- Soft invalidation — marking entries as stale rather than deleting them. Common in CDN and browser caching. The “stale-while-revalidate” pattern is a cousin of what I’m doing here.
These are classic caching patterns from distributed systems — I just applied them to AI inference instead of web pages or database queries.
What doesn’t work
- Context-dependent questions. “What permissions do I have?” looks the same regardless of who’s asking, but the answer depends on the user’s role. Semantic similarity can’t capture this — the question embeddings are identical. I handle this by including role information in the cache key (effectively partitioning the cache by role), but it reduces the hit rate.
- Evolving answers. Some questions have answers that change frequently. Caching them means serving stale information until invalidation fires. The sources_referenced tracking helps, but only if the answer was actually generated from those sources. If the LLM synthesized something from general knowledge, there's no source to track.
- Multilingual overlap. The same question in different languages produces different embeddings. "How do I request vacation?" and "¿Cómo solicito vacaciones?" are semantically identical but won't match. I store a language field and only match within the same language, which means separate cache pools per language.
- The similarity threshold is a blunt instrument. 0.85 works on average, but some question clusters are tighter (0.90 would be fine) and others are wider (0.80 would be better). A per-topic threshold would be more accurate but dramatically more complex.
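The role and language partitioning mentioned above amounts to widening the scope a lookup is restricted to. A sketch with illustrative names (the real system scopes the SQL query rather than building a string key):

```go
package main

import "fmt"

// cachePartition is the scope a semantic lookup is restricted to.
// Partitioning by role and language trades hit rate for correctness.
type cachePartition struct {
	TenantID string
	Role     string
	Language string
}

// Key flattens the partition for systems that want a single scope string.
func (p cachePartition) Key() string {
	return fmt.Sprintf("%s|%s|%s", p.TenantID, p.Role, p.Language)
}

func main() {
	admin := cachePartition{TenantID: "acme", Role: "admin", Language: "en"}
	viewer := cachePartition{TenantID: "acme", Role: "viewer", Language: "en"}
	fmt.Println(admin.Key() == viewer.Key()) // separate cache pools per role
}
```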
Cost impact
Every cache hit saves:
- One LLM call (the big one — this is where most of the cost and latency lives)
- One document retrieval operation
- One prompt construction step
At the 35% hit rate I saw across the test dataset, that’s roughly a third of LLM costs eliminated. The embedding cost for the cache lookup is a rounding error compared to the LLM call it replaces. And the difference in response time is dramatic — going from a 2-second response to a sub-100ms response for over a third of questions makes the system feel qualitatively different.
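The latency claim is simple expected value: hit rate times hit latency, plus miss rate times miss latency. A quick check with the article's rough figures (55ms hits, 2.5s misses, 35% hit rate):

```go
package main

import "fmt"

// expectedLatencyMs computes the average per-question latency for a
// given cache hit rate. The inputs are rough figures, not benchmarks.
func expectedLatencyMs(hitRate, hitMs, missMs float64) float64 {
	return hitRate*hitMs + (1-hitRate)*missMs
}

func main() {
	lat := expectedLatencyMs(0.35, 55, 2500)
	fmt.Printf("expected latency: %.0fms, LLM calls avoided: %.0f%%\n", lat, 0.35*100)
}
```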
The cache pays for itself almost immediately. The pgvector index, the storage, the embedding calls for lookups — all of it combined costs less than the LLM calls you’re avoiding.