The Boring Stuff That Keeps AI Running at 3am

Nobody talks about what happens when the AI API returns a 529 at 3am during your batch run. Or when a streaming response stalls mid-sentence. Or when the same request gets submitted twice because the user’s browser retried on a timeout.

These aren’t AI problems. They’re distributed systems problems wearing an AI hat. And the solutions are the same boring, well-established patterns that have kept web services running for decades. I just had to rediscover them in a new context.

Exponential backoff with jitter

LLM APIs have rate limits and they have bad days. When you get a 429 (rate limited) or a 5xx (server error), the worst thing you can do is retry immediately — you add to the pile of requests that caused the problem in the first place.

The classic fix is exponential backoff: wait 1 second, then 2, then 4, then 8. But if ten clients all get rate-limited at the same time, they all retry at the same intervals — the thundering herd problem. Adding random jitter breaks up the collision.

// Imports used: context, fmt, math/rand, net/http, strconv, time.

// RetryWithBackoff retries a function with exponential backoff and jitter.
// It respects Retry-After headers from the server when available.
// maxRetries is the number of retries after the first attempt.
func RetryWithBackoff(ctx context.Context, maxRetries int, fn func() (*http.Response, error)) (*http.Response, error) {
    var lastErr error
    baseDelay := 1 * time.Second

    for attempt := 0; attempt <= maxRetries; attempt++ {
        resp, err := fn()
        if err == nil && resp.StatusCode < 400 {
            return resp, nil
        }

        // Default wait: exponential backoff (1s, 2s, 4s, 8s...)
        // plus 0-500ms of random jitter.
        delay := baseDelay*time.Duration(1<<uint(attempt)) +
            time.Duration(rand.Int63n(500))*time.Millisecond

        if err != nil {
            // Transport-level failure: no response to inspect.
            lastErr = err
        } else {
            // Don't retry client errors (except rate limits)
            if resp.StatusCode < 500 && resp.StatusCode != 429 {
                return resp, fmt.Errorf("terminal error: %d", resp.StatusCode)
            }
            lastErr = fmt.Errorf("retryable status: %d", resp.StatusCode)

            // Respect Retry-After if present. Only the delay-in-seconds
            // form is handled; an HTTP-date value falls back to backoff.
            if seconds, convErr := strconv.Atoi(resp.Header.Get("Retry-After")); convErr == nil && seconds >= 0 {
                delay = time.Duration(seconds) * time.Second
            }

            // Close the body before retrying so the connection is not leaked.
            resp.Body.Close()
        }

        // Out of attempts: don't wait, just report the last failure.
        if attempt == maxRetries {
            break
        }

        select {
        case <-time.After(delay):
        case <-ctx.Done():
            return nil, ctx.Err()
        }
    }

    return nil, fmt.Errorf("max retries exceeded: %w", lastErr)
}

The key decisions:

400/401/403: don’t retry. These are client errors. Your request is wrong. Retrying the same bad request wastes time and budget.
429/5xx: retry. These are usually transient. The server is either overloaded or having a bad moment. Back off and try again.
Respect Retry-After. If the server tells you when to come back, listen. It knows better than your exponential formula.
Context cancellation. Every wait is interruptible. If the parent context is cancelled (user navigated away, timeout fired), stop immediately.

Dual streaming timeouts

LLM responses are streamed — they arrive as a sequence of chunks over an open connection. This creates a problem that traditional HTTP timeouts don’t handle well: a request can technically be “in progress” while the stream has stalled.

I use two independent timeouts:

const (
    // OverallTimeout is the maximum wall-clock time for an entire
    // streaming response. Prevents hung connections that never close.
    OverallTimeout = 10 * time.Minute

    // IdleChunkTimeout is the maximum time between consecutive stream
    // chunks. Detects stalled streams where the connection is alive
    // but no data is flowing.
    IdleChunkTimeout = 180 * time.Second
)

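// Imports used: bufio, context, fmt, io, time.

// StreamWithTimeouts reads an SSE stream with both overall and
// idle timeout enforcement.
func StreamWithTimeouts(ctx context.Context, body io.Reader, handler func(chunk []byte) error) error {
    ctx, cancel := context.WithTimeout(ctx, OverallTimeout)
    defer cancel()

    // A single reader goroutine feeds lines to the loop below. If we
    // return early, cancel() unblocks a pending send. A Scan that is
    // blocked on the network only returns once the caller closes the
    // response body, so the caller should always do that.
    lines := make(chan []byte)
    scanErr := make(chan error, 1)
    go func() {
        defer close(lines)
        scanner := bufio.NewScanner(body)
        for scanner.Scan() {
            // Copy: the scanner reuses its buffer on the next Scan.
            line := append([]byte(nil), scanner.Bytes()...)
            select {
            case lines <- line:
            case <-ctx.Done():
                return
            }
        }
        scanErr <- scanner.Err()
    }()

    idleTimer := time.NewTimer(IdleChunkTimeout)
    defer idleTimer.Stop()

    for {
        select {
        case line, ok := <-lines:
            if !ok {
                // Stream ended: report the scanner's result (nil on a
                // clean EOF), or the context error if the reader
                // stopped because of cancellation.
                select {
                case err := <-scanErr:
                    return err
                default:
                    return ctx.Err()
                }
            }
            if err := handler(line); err != nil {
                return err
            }
            // Restart the idle clock before waiting for the next chunk.
            if !idleTimer.Stop() {
                select {
                case <-idleTimer.C:
                default:
                }
            }
            idleTimer.Reset(IdleChunkTimeout)
        case <-idleTimer.C:
            return fmt.Errorf("stream idle for %v, assuming stalled", IdleChunkTimeout)
        case <-ctx.Done():
            return ctx.Err()
        }
    }
}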

Overall timeout (10 minutes) catches connections that never close. Some LLM responses are legitimately long — complex prompts with detailed outputs. Ten minutes is generous but finite.
Idle chunk timeout (180 seconds) catches streams that stop producing data. The connection is open, the server hasn’t closed it, but nothing is flowing. This happens more often than you’d expect — network issues, upstream load balancer hiccups, provider-side queuing.

You need both. The overall timeout alone doesn’t catch stalls — a stalled stream can sit there for 10 minutes doing nothing. The idle timeout alone doesn’t prevent extremely long (but active) responses from running forever.

SSE heartbeat keepalive

Server-Sent Events (SSE) connections have a silent enemy: intermediate proxies. Load balancers, reverse proxies, and CDNs all have idle connection timeouts, often somewhere in the 30–60 second range. If no data flows for that long, the proxy kills the connection.

LLM responses can have long pauses. The model is “thinking” — no tokens are being emitted, but the connection needs to stay alive. The fix is a heartbeat:

const HeartbeatInterval = 15 * time.Second

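// StartHeartbeat sends SSE comment lines at regular intervals
// to keep the connection alive through proxies. It blocks until ctx
// is done or a write fails, so run it in its own goroutine. Writes to
// w are not synchronized here: if another goroutine writes tokens to
// the same response, guard both with a shared mutex.
func StartHeartbeat(ctx context.Context, w http.ResponseWriter, flusher http.Flusher) {
    ticker := time.NewTicker(HeartbeatInterval)
    defer ticker.Stop()

    for {
        select {
        case <-ticker.C:
            // SSE comment line: clients ignore these, but proxies
            // see traffic and keep the connection alive.
            if _, err := fmt.Fprint(w, ": keepalive\n\n"); err != nil {
                return // the client is gone, stop sending heartbeats
            }
            flusher.Flush()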
        case <-ctx.Done():
            return
        }
    }
}

A colon-prefixed line in SSE is a comment. Clients are required to ignore it. But to the proxy, it’s traffic — the connection is active. Fifteen seconds is conservative enough to survive most proxy configurations.

Graceful degradation

LLM responses are unpredictable. You ask for JSON, you get JSON with a preamble. You ask for a specific format, you get something close but not quite. The system needs to handle these gracefully instead of crashing.

// ParseStructuredResponse attempts to parse the LLM's response as JSON.
// If parsing fails, it retries with an enforcement prompt. If that also
// fails, it returns a fallback response.
func ParseStructuredResponse(
    ctx context.Context,
    client *Client,
    rawResponse string,
    messages []Message,
) (*StructuredResponse, error) {
    // Attempt 1: parse directly
    var result StructuredResponse
    if err := json.Unmarshal([]byte(rawResponse), &result); err == nil {
        return &result, nil
    }

    // Attempt 2: extract JSON from markdown code blocks
    extracted := extractJSONFromMarkdown(rawResponse)
    if extracted != "" {
        if err := json.Unmarshal([]byte(extracted), &result); err == nil {
            return &result, nil
        }
    }

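    // Attempt 3: retry with enforcement prompt. The failed response is
    // appended first (assuming messages holds only the original request)
    // so the model can see what "your previous response" refers to.
    messages = append(messages,
        Message{Role: "assistant", Content: rawResponse},
        Message{
            Role:    "user",
            Content: "Your previous response was not valid JSON. Please respond with ONLY valid JSON, no markdown, no explanation.",
        },
    )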
    retryResponse, err := client.Complete(ctx, messages)
    if err != nil {
        return nil, fmt.Errorf("enforcement retry failed: %w", err)
    }
    if err := json.Unmarshal([]byte(retryResponse), &result); err == nil {
        return &result, nil
    }

    // Fallback: return what we have with an error flag
    return &StructuredResponse{
        Content: rawResponse,
        Error:   "failed to parse structured response after retries",
    }, nil
}

The chain is: try to parse → try to extract from markdown → retry with enforcement → fall back cleanly. Each step is cheaper than the next. Most responses parse on the first attempt. The enforcement retry costs one more LLM call but usually works. The fallback means the user gets something rather than an error screen.
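The listing calls extractJSONFromMarkdown without showing it. The version below is one possible implementation, not the original: it prefers the body of the first fenced block and otherwise falls back to the outermost pair of braces, which covers the "JSON with a preamble" case. It assumes the expected payload is a JSON object, and it returns an empty string when nothing valid is found, matching how the caller checks the result.

```go
package main

import (
    "encoding/json"
    "strings"
)

// fence is the Markdown code-fence marker, built indirectly so that this
// example can itself be shown inside a fenced block.
var fence = strings.Repeat("`", 3)

// extractJSONFromMarkdown returns the first valid JSON it can find in
// raw, or "" if there is none. (Assumed behaviour, see the note above.)
func extractJSONFromMarkdown(raw string) string {
    // Prefer the body of the first fenced block, if there is one.
    if start := strings.Index(raw, fence); start >= 0 {
        body := raw[start+len(fence):]
        if end := strings.Index(body, fence); end >= 0 {
            body = body[:end]
            // Drop the rest of the opening line, which may hold a
            // language tag such as "json".
            if nl := strings.Index(body, "\n"); nl >= 0 {
                body = body[nl+1:]
            }
            if candidate := strings.TrimSpace(body); json.Valid([]byte(candidate)) {
                return candidate
            }
        }
    }

    // Otherwise try the outermost {...} span, which strips a prose
    // preamble or trailer around a JSON object.
    start, end := strings.Index(raw, "{"), strings.LastIndex(raw, "}")
    if start >= 0 && end > start {
        if candidate := raw[start : end+1]; json.Valid([]byte(candidate)) {
            return candidate
        }
    }
    return ""
}
```

Validating with json.Valid before returning keeps the helper from handing back text that merely looks like JSON, so the caller's Unmarshal only fails on schema mismatches.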

Idempotency and deduplication

Users double-click submit buttons. Browsers retry on timeout. Flaky networks cause duplicate requests. In a traditional CRUD app, this might create a duplicate record. In an AI system, it means two LLM calls for the same prompt — double the cost, and potentially two conflicting responses shown to the user.

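// Imports used: sync, time.

// IdempotencyCache prevents duplicate processing of identical requests
// within a time window using SHA256 message fingerprinting.
type IdempotencyCache struct {
    mu      sync.Mutex
    entries map[string]*CacheEntry
    ttl     time.Duration
}

type CacheEntry struct {
    Response  string
    Done      bool // false while the first request is still in flight
    CreatedAt time.Time
}

// NewIdempotencyCache initializes the map; writing to a nil map panics.
func NewIdempotencyCache(ttl time.Duration) *IdempotencyCache {
    return &IdempotencyCache{entries: make(map[string]*CacheEntry), ttl: ttl}
}

// CheckOrStore reports whether this message fingerprint was seen within
// the TTL window. On a hit it returns a copy of the entry; Done is false
// if the original request has not finished yet. On a miss it stores a
// placeholder and returns false: the caller should proceed and then
// record the result with StoreResponse.
func (c *IdempotencyCache) CheckOrStore(tenantID string, messages []Message) (CacheEntry, bool) {
    fingerprint := c.computeFingerprint(tenantID, messages)

    // One lock around check-and-store, so two concurrent duplicates
    // cannot both miss.
    c.mu.Lock()
    defer c.mu.Unlock()

    now := time.Now()
    // Evict expired entries so the map stays bounded by recent traffic.
    for key, entry := range c.entries {
        if now.Sub(entry.CreatedAt) >= c.ttl {
            delete(c.entries, key)
        }
    }

    if entry, exists := c.entries[fingerprint]; exists {
        return *entry, true
    }
    c.entries[fingerprint] = &CacheEntry{CreatedAt: now}
    return CacheEntry{}, false
}

// StoreResponse fills in the placeholder once the LLM call completes.
func (c *IdempotencyCache) StoreResponse(tenantID string, messages []Message, response string) {
    fingerprint := c.computeFingerprint(tenantID, messages)

    c.mu.Lock()
    defer c.mu.Unlock()
    if entry, exists := c.entries[fingerprint]; exists {
        entry.Response, entry.Done = response, true
    }
}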

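// computeFingerprint hashes the tenant ID and every message.
// Imports used: crypto/sha256, encoding/hex, fmt.
func (c *IdempotencyCache) computeFingerprint(tenantID string, messages []Message) string {
    h := sha256.New()
    // Length-prefix every field so boundaries are unambiguous. Plain
    // concatenation would hash ("ab", "c") and ("a", "bc") identically.
    writeField := func(s string) {
        fmt.Fprintf(h, "%d:", len(s))
        h.Write([]byte(s))
    }
    writeField(tenantID)
    for _, msg := range messages {
        writeField(msg.Role)
        writeField(msg.Content)
    }
    return hex.EncodeToString(h.Sum(nil))
}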

The fingerprint is a SHA256 hash of the tenant ID and message contents. If the same tenant sends the same messages within the TTL window (I use 60 seconds), the second request gets the cached response instead of triggering a new LLM call. The short TTL means an entry is only worth keeping for a minute, so expired entries can be evicted and the cache stays small. It also means genuinely repeated questions (asked minutes apart) still get fresh answers.

The technical names I didn’t know

After building all of this, the terminology became clear:

Resilience engineering: designing systems that handle failure gracefully rather than preventing all failures. Every pattern here assumes things will go wrong.
Exponential backoff with jitter: literally textbook. The AWS Architecture Blog post “Exponential Backoff And Jitter” is the canonical write-up. I just applied it to LLM API calls.
Circuit breaker: I don’t implement a full circuit breaker, which would stop calling a failing dependency after repeated errors and probe it again after a cooldown. The retryable-vs-terminal error classification shares the same instinct, though: don’t keep trying when the failure is permanent.
Idempotency: ensuring that performing the same operation multiple times produces the same result. Standard in payment processing, distributed messaging, and now AI inference.

None of this is new. It’s all from distributed systems literature, just applied to a new kind of API call.

What doesn’t work

Retries don’t fix bad prompts. If the model genuinely can’t handle your prompt — it’s too long, too ambiguous, or asks for something it can’t do — retrying with backoff just burns money. I had to learn to distinguish between transient failures (retry) and fundamental failures (rethink the prompt).
Timeouts are hard to tune. 180 seconds for idle chunk timeout works for conversational responses. But batch processing with complex prompts can legitimately pause for 60+ seconds while the model “thinks.” I ended up making the idle timeout configurable per use case rather than a single global constant.
Heartbeats don’t survive all proxies. Some aggressive WAFs and corporate proxies buffer the entire response before forwarding. No number of heartbeats helps, because the client sees nothing until the response is complete. The only fix is to document minimum proxy requirements upfront.
The enforcement retry is a blunt hammer. Asking the model to “please return valid JSON this time” works maybe 80% of the time. The other 20%, the model has a fundamental misunderstanding of the expected format. For those cases, you need the fallback.

The meta-lesson is that reliability patterns don’t make unreliable systems reliable. They make unreliable systems manageable. The AI still fails. The API still goes down. Streams still stall. But instead of waking you up at 3am, the system handles it, logs it, and moves on.