Posts Tagged "llm-optimization"
Voice Agents Don't Need to Be Faster — They Need to Feel Faster
Two agents with identical latency can feel completely different. The gap is fixable at the orchestration layer, without touching the model.
Don't Read the PDF. Write the Parser.
I stopped feeding hospital PDFs to a vision model. When the layout changes, the AI fixes the parser instead — and production never sees a token.
Sub-10ms AI Responses Without Calling the LLM
Users ask similar questions in different words. Semantic caching with pgvector turns repeated intent into instant answers — no LLM call, no embedding, no retrieval pipeline.
Your AI Forgot What You Said 30 Messages Ago
Context windows fill up fast in long AI conversations. Sliding windows, progressive compression, and token budgeting — the patterns I built before I knew their names.