Posts Tagged "llm-optimization"

Voice Agents Don't Need to Be Faster — They Need to Feel Faster

Two agents with identical latency can feel completely different. The gap is fixable at the orchestration layer, without touching the model.

Don't Read the PDF. Write the Parser.

I stopped feeding hospital PDFs to a vision model. When the layout changes, the AI fixes the parser instead — and production never sees a token.

Sub-10ms AI Responses Without Calling the LLM

Users ask similar questions in different words. Semantic caching with pgvector turns repeated intent into instant answers — no LLM call, no embedding, no retrieval pipeline.

Your AI Forgot What You Said 30 Messages Ago

Context windows fill up fast in long AI conversations. Sliding windows, progressive compression, and token budgeting — the patterns I built before I knew their names.