Posts Tagged "llm-optimization"

Don't Read the PDF. Write the Parser.

I stopped feeding hospital PDFs to a vision model. When the layout changes, the AI fixes the parser instead — and production never sees a token.
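The core of the pattern is a deterministic parser with strict validation: when the layout drifts, parsing fails loudly in tests rather than silently in production, and the fix is a code change an LLM can draft offline. A minimal sketch, with a made-up lab-report line format purely for illustration:

```python
# Sketch of the "write the parser" pattern: deterministic extraction
# with strict validation. The line format below is an assumption for
# illustration, not a real hospital layout.
import re

class LayoutDrift(Exception):
    """Raised when a line no longer matches the expected layout."""

LINE_RE = re.compile(
    r"(?P<test>[A-Z ]+)\s+(?P<value>[\d.]+)\s+(?P<unit>\S+)\s+"
    r"\((?P<low>[\d.]+)-(?P<high>[\d.]+)\)"
)

def parse_lab_line(line):
    m = LINE_RE.match(line)
    if not m:
        # Fail loudly: this surfaces in CI, never as bad data downstream.
        raise LayoutDrift(f"unparseable line: {line!r}")
    d = m.groupdict()
    return {
        "test": d["test"].strip(),
        "value": float(d["value"]),
        "unit": d["unit"],
        "ref_range": (float(d["low"]), float(d["high"])),
    }

print(parse_lab_line("GLUCOSE  98 mg/dL  (70-110)"))
```

When the layout changes, `LayoutDrift` pinpoints the exact line, and regenerating the regex is a reviewable diff instead of a runtime vision-model call.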

Sub-10ms AI Responses Without Calling the LLM

Users ask similar questions in different words. Semantic caching with pgvector turns repeated intent into instant answers — no LLM call, no embedding, no retrieval pipeline.
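The idea in one sketch: cache answers keyed by meaning rather than exact text, and serve a hit when a new query's embedding is close enough to a cached one. The `toy_embed` bag-of-words function is a stand-in for a real embedding model; in the pgvector setup, the vectors and nearest-neighbour search live in Postgres instead of a Python list.

```python
# Minimal semantic-cache sketch. toy_embed() is a stand-in for a real
# embedding model; the linear scan stands in for a pgvector index.
import math
from collections import Counter

def toy_embed(text):
    # Stand-in embedding: bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, cached answer)

    def get(self, query):
        emb = toy_embed(query)
        best = max(self.entries, key=lambda e: cosine(emb, e[0]), default=None)
        if best and cosine(emb, best[0]) >= self.threshold:
            return best[1]  # hit: same intent, no LLM call
        return None        # miss: fall through to the model

    def put(self, query, answer):
        self.entries.append((toy_embed(query), answer))

cache = SemanticCache()
cache.put("how do i reset my password", "Use the reset link on the login page.")
print(cache.get("how do i reset my password please"))  # hit despite new wording
```

The threshold is the whole game: too low and different questions get the same answer, too high and rephrasings miss.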

Your AI Forgot What You Said 30 Messages Ago

Context windows fill up fast in long AI conversations. Sliding windows, progressive compression, and token budgeting — the patterns I built before I knew their names.
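The sliding-window-with-budget part can be sketched in a few lines: always keep the system prompt, then spend the remaining token budget on the newest messages, walking backwards until it runs out. `count_tokens` here is a whitespace stand-in for a real tokenizer.

```python
# Sliding window under a token budget. count_tokens() is a whitespace
# stand-in for a real tokenizer; oldest messages are dropped first.
def count_tokens(text):
    return len(text.split())

def fit_window(system, messages, budget):
    # The system prompt is always kept; the rest of the budget goes to
    # the most recent messages, newest first.
    remaining = budget - count_tokens(system)
    kept = []
    for msg in reversed(messages):
        cost = count_tokens(msg)
        if cost > remaining:
            break
        kept.append(msg)
        remaining -= cost
    return [system] + list(reversed(kept))

history = ["msg one two three", "msg four five", "msg six"]
print(fit_window("system prompt", history, budget=8))
# → ['system prompt', 'msg four five', 'msg six']
```

Progressive compression slots in where this sketch breaks: instead of dropping the oldest messages outright, they get summarized into a short synopsis that stays inside the budget.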