Posts Tagged "llm-optimization"

An AI Wrote 200% Where It Meant 20%. The Bound Caught It.

Confidence is what the model thinks of itself. Bounds are what your system thinks of the model. They are independent signals; use both.

Voice Agents Don't Need to Be Faster — They Need to Feel Faster

Two agents with identical latency can feel completely different. The gap is fixable at the orchestration layer, without touching the model.

Don't Read the PDF. Write the Parser.

I stopped feeding hospital PDFs to a vision model. When the layout changes, the AI fixes the parser instead — and production never sees a token.

Sub-10ms AI Responses Without Calling the LLM

Users ask similar questions in different words. Semantic caching with pgvector turns repeated intent into instant answers — no LLM call, no embedding, no retrieval pipeline.

Your AI Forgot What You Said 30 Messages Ago

February 26, 2026

Context windows fill up fast in long AI conversations. Sliding windows, progressive compression, and token budgeting — the patterns I built before I knew their names.