February 26, 2026

context-window-management token-optimization long-context-ai memory-management llm-agents go

Your AI Forgot What You Said 30 Messages Ago

I was 50 messages into a conversation with an AI assistant when it asked me a question I’d already answered — twice. Not a hallucination, not a misunderstanding. It had genuinely lost the information. The earlier messages had fallen out of the context window, and the AI was working with a partial picture of our conversation.

I didn’t know the term “context window management” at the time. I just knew that long conversations broke things, and I needed to fix it.

The naive approach: send everything

The first implementation was obvious — just send the full conversation history with every request. It works until it doesn’t.

The problems show up fast:

Token costs explode. Every message in the history gets billed on every request. A 50-message conversation means you’re paying for all 50 messages each time.
Latency spikes. More input tokens means slower responses. Users notice.
The AI gets confused. Older context competes with newer context. The model gives weight to things that were discussed 40 messages ago and are no longer relevant, or contradicts decisions that were already settled.
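To put rough numbers on the cost problem, here is a minimal sketch. It assumes an average of 500 tokens per message and one request per message; both figures are illustrative assumptions, not measurements from a real session, and the function names are hypothetical.

```go
package main

import "fmt"

// tokensPerMsg is an assumed average message size, for illustration only.
const tokensPerMsg = 500

// requestTokens returns the input tokens sent on the request made when the
// history holds n messages and the full history is resent.
func requestTokens(n int) int {
	return n * tokensPerMsg
}

// cumulativeTokens sums requestTokens over histories of length 1 through n,
// which works out to tokensPerMsg * n * (n + 1) / 2.
func cumulativeTokens(n int) int {
	return tokensPerMsg * n * (n + 1) / 2
}

func main() {
	for _, n := range []int{10, 25, 50} {
		fmt.Printf("history of %d messages: %d tokens this request, %d billed so far\n",
			n, requestTokens(n), cumulativeTokens(n))
	}
}
```

Under these assumptions the per-request cost grows linearly with the conversation, while the cumulative bill grows roughly with the square of its length.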

I needed a way to keep conversations long without keeping the context window bloated.

Sliding window: keep what’s recent

The first real fix was a sliding window. Keep the opening message (system context with instructions and configuration) plus the last N messages. Everything in between gets dropped.

const MaxMessageWindow = 10

// TruncateHistory keeps the system message and the most recent
// messages within the window limit.
func TruncateHistory(messages []Message) []Message {
    if len(messages) <= MaxMessageWindow+1 {
        return messages
    }

    // Always keep the first message (system context)
    result := make([]Message, 0, MaxMessageWindow+1)
    result = append(result, messages[0])

    // Keep the last N messages
    start := len(messages) - MaxMessageWindow
    result = append(result, messages[start:]...)

    return result
}

This is simple and it works surprisingly well for most conversations. The system prompt stays anchored, and the AI always has the most recent exchanges. For short conversations, nothing gets dropped at all.

But there’s a problem: knowledge disappears with the messages.

The window drops messages, not knowledge

In message 5, the user says “I’m working with the payments module.” By message 20, that message has been dropped from the window. Now the AI doesn’t know which module the user is working with. It asks again — or worse, it guesses wrong.
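The loss is easy to reproduce. The sketch below repeats TruncateHistory so it compiles on its own and assumes a minimal Message struct with Role and Content fields (the post does not show the real type). The helpers buildHistory and mentions are hypothetical names used only for this demonstration.

```go
package main

import (
	"fmt"
	"strings"
)

// Message is an assumed minimal shape; the real type is not shown in the post.
type Message struct {
	Role    string
	Content string
}

const MaxMessageWindow = 10

// TruncateHistory is the sliding window from the previous section,
// repeated here so the example is self-contained.
func TruncateHistory(messages []Message) []Message {
	if len(messages) <= MaxMessageWindow+1 {
		return messages
	}
	result := make([]Message, 0, MaxMessageWindow+1)
	result = append(result, messages[0])
	return append(result, messages[len(messages)-MaxMessageWindow:]...)
}

// buildHistory creates a system message followed by n user messages.
// Message number 5 is the one that names the module.
func buildHistory(n int) []Message {
	history := []Message{{Role: "system", Content: "You are a helpful assistant."}}
	for i := 1; i <= n; i++ {
		content := fmt.Sprintf("message %d", i)
		if i == 5 {
			content = "I'm working with the payments module."
		}
		history = append(history, Message{Role: "user", Content: content})
	}
	return history
}

// mentions reports whether any message in the slice contains term.
func mentions(messages []Message, term string) bool {
	for _, m := range messages {
		if strings.Contains(m.Content, term) {
			return true
		}
	}
	return false
}

func main() {
	for _, n := range []int{10, 14, 15, 20} {
		window := TruncateHistory(buildHistory(n))
		fmt.Printf("after message %d: window holds %d messages, mentions payments: %v\n",
			n, len(window), mentions(window, "payments"))
	}
}
```

With a window of 10, message 5 is still visible after message 14 and gone as soon as message 15 arrives, well before the 20-message mark.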

The window is the right approach for managing token count, but you need a separate mechanism for preserving the knowledge that accumulates during a conversation. I ended up building what I called “accumulated notes” — a structured document, organized by topic, that persists across the entire session regardless of which messages are still in the window.

// AccumulatedNotes tracks knowledge extracted from the conversation,
// organized by topic. This persists even as messages leave the window.
type AccumulatedNotes struct {
    Sections []NoteSection
}

type NoteSection struct {
    Topic     string
    Status    string   // "confirmed", "active", "superseded"
    Points    []string
    UpdatedAt int      // message index when last updated
}

// AddOrUpdate merges new information into the appropriate section.
// If the topic exists, it appends new points. If not, it creates
// a new section.
func (n *AccumulatedNotes) AddOrUpdate(topic string, points []string, msgIndex int) {
    for i, section := range n.Sections {
        if section.Topic == topic {
            n.Sections[i].Points = append(n.Sections[i].Points, points...)
            n.Sections[i].Status = "active"
            n.Sections[i].UpdatedAt = msgIndex
            return
        }
    }
    n.Sections = append(n.Sections, NoteSection{
        Topic:     topic,
        Status:    "active",
        Points:    points,
        UpdatedAt: msgIndex,
    })
}

The AI extracts key facts and decisions from each exchange and files them into the notes. When an old message falls out of the sliding window, the knowledge it contained is already captured in the notes document. The notes get included in every prompt — after the system message but before the conversation window.
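The post does not show how the notes are turned into a prompt message, so here is one possible sketch of a notesToMessage helper. The Message struct, the "system" role, and the text layout are all assumptions; some chat APIs accept only a single system message, in which case the rendered notes would need to be appended to the system prompt instead.

```go
package main

import (
	"fmt"
	"strings"
)

// Message is an assumed minimal shape; the real type is not shown in the post.
type Message struct {
	Role    string
	Content string
}

type NoteSection struct {
	Topic     string
	Status    string
	Points    []string
	UpdatedAt int
}

type AccumulatedNotes struct {
	Sections []NoteSection
}

// notesToMessage renders the notes as plain text, one block per topic,
// with the status in brackets and one bullet per point.
func notesToMessage(notes *AccumulatedNotes) Message {
	var b strings.Builder
	b.WriteString("Notes accumulated so far in this conversation:\n")
	if notes != nil {
		for _, s := range notes.Sections {
			fmt.Fprintf(&b, "\n[%s] %s\n", s.Status, s.Topic)
			for _, p := range s.Points {
				fmt.Fprintf(&b, "- %s\n", p)
			}
		}
	}
	// The "system" role is an assumption; adjust to what the target API accepts.
	return Message{Role: "system", Content: b.String()}
}

func main() {
	notes := &AccumulatedNotes{Sections: []NoteSection{
		{Topic: "payments", Status: "active", UpdatedAt: 5,
			Points: []string{"User is working with the payments module"}},
	}}
	fmt.Print(notesToMessage(notes).Content)
}
```

Keeping the status next to each topic gives the model a hint about which facts are settled and which are still in flux.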

Progressive compression: three tiers

The notes themselves grow over time. In a long session with many topics, the accumulated notes can eat up a significant chunk of the token budget. So I added compression.

The idea is simple: not all knowledge needs the same level of detail. A topic that was discussed, decided, and hasn’t come up in 20 messages doesn’t need five bullet points. It needs one sentence.

// CompressNotes reduces the token footprint of accumulated notes
// by compressing confirmed topics to one-line summaries.
func CompressNotes(notes *AccumulatedNotes, currentMsgIndex int) {
    for i, section := range notes.Sections {
        messagesSinceUpdate := currentMsgIndex - section.UpdatedAt

        switch {
        case section.Status == "confirmed" || messagesSinceUpdate > 20:
            // Tier 1: Confirmed or stale topics get compressed
            // to a single summary line. Sections that are already
            // one line are left alone so repeated calls don't
            // re-summarize a summary.
            if len(section.Points) > 1 {
                notes.Sections[i].Points = []string{summarizePoints(section.Points)}
            }
            notes.Sections[i].Status = "confirmed"

        case messagesSinceUpdate > 10:
            // Tier 2: Aging topics keep their most recent points.
            if len(section.Points) > 3 {
                notes.Sections[i].Points = section.Points[len(section.Points)-3:]
            }

        default:
            // Tier 3: Active topics keep full detail.
        }
    }
}

The three tiers:

Tier 1, confirmed/stale topics: compressed to a one-line summary. This covers things like “User is working on the payments module” or “Decided to use PostgreSQL for storage.” These are settled facts. One sentence is enough. Savings: 70–80% of that section’s original tokens.
Tier 2, aging topics: trimmed to the last 3 points. Still relevant, but the full history of the discussion isn’t needed.
Tier 3, active topics: full detail. These are things being discussed right now. Don’t touch them.
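CompressNotes calls a summarizePoints helper that the post does not show. In practice it would probably ask the model for a one-sentence summary. As a deterministic stand-in for local testing, the sketch below just joins the points and caps the length; the 120-character cap and the joining strategy are assumptions, not the author's implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// maxSummaryRunes is a hypothetical cap on the one-line summary length.
const maxSummaryRunes = 120

// summarizePoints is a placeholder that joins the points into one line and
// truncates it. A real version would likely call an LLM instead.
func summarizePoints(points []string) string {
	joined := strings.Join(points, "; ")
	runes := []rune(joined)
	if len(runes) <= maxSummaryRunes {
		return joined
	}
	// Leave room for the trailing ellipsis so the cap is respected.
	return strings.TrimSpace(string(runes[:maxSummaryRunes-3])) + "..."
}

func main() {
	fmt.Println(summarizePoints([]string{
		"Decided to use PostgreSQL for storage",
		"Schema migrations run at deploy time",
	}))
}
```

Truncation like this loses far more meaning than a model-written summary would, so treat it as scaffolding for exercising the compression tiers rather than something to ship.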

Token budgeting: don’t blow the cap

Even with compression, the notes can grow beyond what’s comfortable. I set a hard token budget for the notes section and enforce it with boundary-aware truncation.

The key insight is that you can’t just cut the notes at an arbitrary character count. If you cut in the middle of a section, the AI reads a topic header followed by an incomplete thought — worse than not including the topic at all.

const MaxNotesTokens = 1500

// TruncateNotes keeps whole sections, most recently updated first,
// and skips any section that would exceed the token budget. It
// never cuts mid-section.
func TruncateNotes(notes *AccumulatedNotes, maxTokens int) *AccumulatedNotes {
    totalTokens := 0
    var kept []NoteSection

    // Iterate from newest to oldest (most recently updated first)
    sorted := sortByUpdateDesc(notes.Sections)

    for _, section := range sorted {
        sectionTokens := estimateTokens(section)
        if totalTokens+sectionTokens > maxTokens {
            continue // skip this section entirely
        }
        totalTokens += sectionTokens
        kept = append(kept, section)
    }

    return &AccumulatedNotes{Sections: kept}
}

The truncation iterates from the most recently updated sections to the oldest. If a section doesn’t fit within the remaining budget, it gets skipped entirely, with no partial sections, and the loop moves on, so a smaller, older section can still claim the leftover budget. This means the AI always sees complete, coherent topic summaries, even if some topics are missing altogether.
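TruncateNotes depends on two helpers that are not shown, estimateTokens and sortByUpdateDesc. Here is a minimal sketch of both. The divisor of 4 characters per token is a common rule of thumb and an assumption here; the post only says the estimate is character-based.

```go
package main

import (
	"fmt"
	"sort"
	"unicode/utf8"
)

type NoteSection struct {
	Topic     string
	Status    string
	Points    []string
	UpdatedAt int
}

// charsPerToken is an assumed rule of thumb, not a real tokenizer.
const charsPerToken = 4

// estimateTokens approximates a section's token count from the number of
// characters in its topic and points, rounding up.
func estimateTokens(s NoteSection) int {
	chars := utf8.RuneCountInString(s.Topic)
	for _, p := range s.Points {
		chars += utf8.RuneCountInString(p)
	}
	return (chars + charsPerToken - 1) / charsPerToken
}

// sortByUpdateDesc returns a copy of sections ordered from most recently
// updated to least recently updated. The input slice is left untouched.
func sortByUpdateDesc(sections []NoteSection) []NoteSection {
	sorted := make([]NoteSection, len(sections))
	copy(sorted, sections)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].UpdatedAt > sorted[j].UpdatedAt
	})
	return sorted
}

func main() {
	sections := []NoteSection{
		{Topic: "storage", Points: []string{"Decided to use PostgreSQL"}, UpdatedAt: 3},
		{Topic: "payments", Points: []string{"Refund flow is failing"}, UpdatedAt: 42},
	}
	for _, s := range sortByUpdateDesc(sections) {
		fmt.Printf("%s: about %d tokens (updated at message %d)\n",
			s.Topic, estimateTokens(s), s.UpdatedAt)
	}
}
```

Sorting a copy matters because TruncateNotes should not reorder the caller's notes as a side effect. Swapping estimateTokens for a real tokenizer would tighten the budget at the cost of an extra dependency.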

Putting it together

The full prompt assembly looks like this:

// BuildDynamicPrompt assembles the final prompt from all components,
// respecting token budgets for each section.
func BuildDynamicPrompt(
    systemMsg Message,
    notes *AccumulatedNotes,
    history []Message,
) []Message {
    // 1. Compress notes based on recency and status
    CompressNotes(notes, len(history))

    // 2. Truncate notes to fit token budget
    trimmedNotes := TruncateNotes(notes, MaxNotesTokens)

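    // 3. Sliding window on conversation history. history[0] is
    //    expected to be the system message, as TruncateHistory assumes.
    recentHistory := TruncateHistory(history)

    // 4. Assemble: system + notes + recent messages
    prompt := make([]Message, 0, len(recentHistory)+1)
    prompt = append(prompt, systemMsg)
    prompt = append(prompt, notesToMessage(trimmedNotes))
    if len(recentHistory) > 1 {
        // Skip the system msg (already added). The length check
        // avoids a slice-bounds panic on an empty history.
        prompt = append(prompt, recentHistory[1:]...)
    }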

    return prompt
}

The result is a prompt that stays under roughly 8K tokens for most conversations, regardless of how long the session runs. The system prompt anchors the behavior, the notes preserve accumulated knowledge, and the sliding window provides immediate conversational context.

These patterns have names

I built all of this from constraints — the context window was too small, conversations were too long, and I needed things to work. It was only later that I found the academic and industry terms for what I’d built. If you want to go deeper, these are the keywords to search:

Sliding window attention — keeping a fixed-size window of recent context. Some transformer architectures use this internally at the attention layer. I applied the same idea at the application level, to the conversation history itself.
Progressive summarization — reducing detail for older information while keeping recent information at full fidelity. Tiago Forte popularized the term for personal knowledge management. The three-tier compression I built is the same principle, applied to AI conversation memory.
Token budgeting — allocating a fixed token budget to different prompt sections and enforcing hard caps. You’ll find this in RAG systems, multi-agent architectures, and anywhere prompts are composed from multiple sources.

The patterns emerged from the constraints, not from the literature. That’s not a flex — it just means these problems are common enough that anyone working on long AI conversations will eventually arrive at the same solutions.

What doesn’t work

I want to be honest about the gaps:

Long tangents get lost. If a topic spans many messages but never gets explicitly confirmed, the notes capture fragments but miss the arc. The compression is lossy — that’s the point, but it means nuance gets flattened.
Compressed notes lose tone. When the AI reads back a one-line summary of a 10-point discussion, it sometimes misses context that was implicit in the original exchange. “Decided to use caching” doesn’t capture the three reasons why that decision was made.
The window size is a tradeoff. 10 messages is enough for most interactions but too small for complex multi-step debugging. I’ve experimented with dynamic window sizes, but the complexity wasn’t worth the improvement.
Token estimation is approximate. I use a simple character-based heuristic rather than a real tokenizer. It’s close enough for budgeting but occasionally over- or under-shoots by 10–15%.

The numbers

For a typical long session (50+ messages across multiple topics):

Raw notes: ~3000 tokens (all topics at full detail)
After compression: 1000–1500 tokens (confirmed topics summarized, aging topics trimmed)
Sliding window (10 messages): ~4000–5000 tokens
System prompt: ~1500 tokens
Total prompt size: ~7000–8000 tokens, regardless of conversation length

Without these patterns, the same 50-message conversation would send 25K+ tokens per request — and the quality of the AI’s responses would actually be worse because of the competing context problem.

The constraint forces discipline. A fixed budget means the system has to decide what matters, and it turns out that’s exactly what you want.