Voice Agents Don't Know When You're Done Talking
I’ve built AI text agents before, but never added the voice layer. That changed after talking to teams who have — Diga and Clara, both running voice AI in production at scale. Seeing what it actually takes made me want to understand the hard parts from the inside.
I also use Gemini voice daily — walking, thinking through decisions out loud. You notice things as a user that you’d miss as a builder. That combination sent me down a rabbit hole. This is what I found.

The first problem
The first version of every voice agent has the same bug. The agent waits for silence, then responds. Works fine in a quiet room with a patient speaker. Falls apart everywhere else.
Somebody pauses to think mid-sentence and the agent jumps in. Somebody speaks slowly and the agent interrupts them three times. Somebody says “um” and gets cut off. It doesn’t feel like a latency issue. It feels like talking to something that isn’t actually listening.
This is the turn-taking problem. It’s not a tuning parameter. It’s an architectural one.
Why silence detection fails
Voice Activity Detection — detecting when audio energy drops — is the obvious first approach. Wait for N milliseconds of quiet, conclude the user is done, start reasoning. Simple, works in demos.
The problem is that human speech is full of intentional pauses that aren’t turn boundaries. People pause to think. They trail off and continue. They say “so…” and then keep going. A fixed silence threshold can’t distinguish “I’m done” from “I’m thinking.”
Make the threshold shorter and the agent interrupts constantly. Make it longer and the agent feels sluggish. There’s no value that works because the underlying model is wrong — turn completion isn’t a silence event, it’s a conversational one.
Humans don’t use silence detection either. We read prosody, intonation, syntactic completion, breath patterns. By the time silence arrives, we’ve usually already decided whether it’s a turn boundary. The signal isn’t the quiet — it’s everything that led up to it. What you actually need is something that understands intent from context, not just audio energy.
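To make the failure concrete, here is a minimal sketch of fixed-threshold silence detection. The numbers (frame length, energy threshold, silence window) are illustrative stand-ins, not from any real system:

```python
def detect_end_of_turn(frames, energy_threshold=0.01, silence_ms=700, frame_ms=20):
    """Naive VAD: declare end-of-turn after N consecutive quiet frames.

    Returns the frame index where the detector fires, or None.
    """
    needed = silence_ms // frame_ms   # consecutive quiet frames required
    quiet = 0
    for i, energy in enumerate(frames):
        if energy < energy_threshold:
            quiet += 1
            if quiet >= needed:
                return i              # fires here, done or not
        else:
            quiet = 0
    return None

# A speaker who pauses 800 ms mid-sentence to think:
# 500 ms speech, 800 ms pause, 500 ms more speech (20 ms frames).
frames = [0.5] * 25 + [0.001] * 40 + [0.5] * 25
print(detect_end_of_turn(frames))  # fires during the thinking pause, not at the end
```

Tune `silence_ms` down and it fires even earlier in the pause; tune it up past the speaker's longest thinking pause and every genuine turn ending pays that full delay. The constant isn't the bug; the model is.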
Four events instead of one threshold
A production turn-detection model doesn’t emit a binary “done speaking” signal. It emits a stream of probabilistic events as confidence builds. There are four:
StartOfTurn — the user has started speaking. Immediately cancel any active playback. Stop everything the agent was saying and start listening.
EagerEndOfTurn — medium confidence that the turn is ending. The user might be done. Start reasoning speculatively — begin drafting a response, but don’t commit to it yet. This is exactly what humans do. Psycholinguists call it forward modeling: we start constructing a response before the other person finishes speaking. It’s not rudeness — it’s predictive cognition running in the background, ready to be discarded if the speaker continues. The agent isn’t faking attentiveness here. It’s converging on the same strategy.
TurnResumed — the user continued speaking. They weren’t done. Cancel the speculative work immediately and go back to listening.
EndOfTurn — high confidence the turn is complete. Finalize the response and speak.
The key event is EagerEndOfTurn. It lets you start useful work before you’re certain the user is done. When you get EndOfTurn to confirm, you’re already most of the way through reasoning rather than starting from scratch. That gap — the time between when the user stops talking and when the agent starts speaking — is where voice agents feel slow or feel natural. Speculative reasoning closes it without actually being faster.
What this looks like in code
The core pattern is speculative preparation with safe cancellation:
draft = None  # speculative work, shared across handlers

@on_event("EagerEndOfTurn")
def on_eager_end_of_turn(transcript):
    # Medium confidence: start work speculatively
    global draft
    draft = begin_reasoning(transcript, context)

@on_event("TurnResumed")
def on_turn_resumed():
    # User kept talking: discard everything
    if draft:
        draft.cancel()

@on_event("EndOfTurn")
def on_end_of_turn(transcript):
    # High confidence: finalize and speak
    response = draft.finalize(transcript)
    speak(response)

@on_event("StartOfTurn")
def on_start_of_turn():
    # User started speaking: stop immediately
    stop_playback()
    if draft:
        draft.cancel()
    emit_state("Listening")
Notice that StartOfTurn cancels both playback and any pending draft. The agent can be interrupted at any point in the cycle. That cancellability is non-negotiable — without it, users get overlapping speech or stale responses and the whole thing feels broken.
The architectural implication
This is where turn detection stops being a model question and becomes a systems design question.
Everything downstream of EagerEndOfTurn must be cancellable. LLM calls, tool executions, TTS synthesis — all of it. If a user interrupts while the agent is halfway through a reasoning step, that step needs to stop cleanly. Not time out. Not finish and get discarded. Stop.
That means:
- Streaming LLM output so you can cut it off mid-generation rather than waiting for a complete response
- Tracking in-flight tool calls with identifiers so you can cancel them if the user speaks again
- Treating TTS as interruptible, with explicit playback cancellation that resets buffers cleanly
- Never letting a pending action block the listening path
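To sketch what "stop cleanly" means in practice, here is a toy asyncio version, with a fake `reason()` standing in for a streaming LLM call. The names are illustrative; the point is that every `await` is a spot where cancellation can land, which is exactly the property a real pipeline needs:

```python
import asyncio

async def reason(transcript: str) -> str:
    # Stand-in for a streaming LLM call; each await is a cancellation point.
    parts = []
    for token in ["Sure,", "here", "is", "an", "answer."]:
        await asyncio.sleep(0.01)
        parts.append(token)
    return " ".join(parts)

async def main() -> str:
    draft = asyncio.create_task(reason("user said something"))
    await asyncio.sleep(0.025)   # user starts speaking mid-reasoning
    draft.cancel()               # StartOfTurn: stop, don't finish-and-discard
    try:
        await draft
    except asyncio.CancelledError:
        return "draft cancelled cleanly"
    return "draft finished (too late to matter)"

print(asyncio.run(main()))
```

A reasoning step built as a blocking call has no such cancellation points; the task structure is what makes "stop, not time out" possible.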
The rule I keep coming back to: user speech is always higher priority than anything else in the system. The moment StartOfTurn fires, everything else stops. That signal should never have to wait in a queue behind a pending function call or a slow database lookup.
@on_event("StartOfTurn")
def on_start_of_turn():
    # User speech preempts everything else in the system
    cancel_active_tool_calls()
    stop_tts_playback()
    emit_state("Listening")
    # Nothing else matters right now
React to events, don’t poll state
There’s a second thing this model forces on you: you can’t poll for conversational state. The whole architecture has to be event-driven.
Polling — checking every N milliseconds whether the user is still speaking — introduces the same fundamental problem as silence thresholds: you’re sampling a continuous signal and making discrete decisions from snapshots. You miss the gaps between polls. You add latency on every check. And you get the timing wrong at exactly the moments that matter most.
A production voice agent should never ask “is the user speaking right now?” It should react to “the user started speaking” and “the user stopped speaking.” Those are events. They happen once. They trigger immediate state transitions.
That distinction sounds subtle but it changes how you write the entire system. State transitions become explicit. There’s a single source of truth for where the conversation is. Debugging gets easier because you have a log of events rather than a history of sampled states.
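A minimal sketch of that event-driven shape, using the four events from this article. The class and its transition table are illustrative, not a production design:

```python
class TurnStateMachine:
    # Each event maps current_state -> next_state; no polling anywhere.
    TRANSITIONS = {
        "StartOfTurn":    {"Idle": "Listening", "Speaking": "Listening",
                           "Reasoning": "Listening"},
        "EagerEndOfTurn": {"Listening": "Reasoning"},
        "TurnResumed":    {"Reasoning": "Listening"},
        "EndOfTurn":      {"Reasoning": "Speaking", "Listening": "Speaking"},
    }

    def __init__(self):
        self.state = "Idle"
        self.log = []   # event log: the debugging win over sampled states

    def handle(self, event: str):
        next_state = self.TRANSITIONS.get(event, {}).get(self.state)
        if next_state:
            self.log.append((event, self.state, next_state))
            self.state = next_state

sm = TurnStateMachine()
for e in ["StartOfTurn", "EagerEndOfTurn", "TurnResumed",
          "EagerEndOfTurn", "EndOfTurn"]:
    sm.handle(e)
print(sm.state)  # "Speaking"
```

There is exactly one place the conversation's state lives, transitions happen once per event, and `sm.log` replays the whole exchange when something goes wrong.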
The result
When this works, the agent feels like it’s actually listening. It doesn’t jump in during thinking pauses. It doesn’t let silence stretch awkwardly because it started reasoning too late. When you interrupt it, it stops immediately — not after finishing its sentence, not after a brief delay, immediately.
Users don’t think about any of this. They just notice that the conversation feels natural instead of mechanical. That “it actually listens” feeling isn’t personality or voice quality — it’s correct turn management underneath.
The silence threshold approach feels like a latency problem. You tune it, you optimize it, you still get the wrong behavior. The event-driven approach with speculative reasoning isn’t faster — the models aren’t doing less work. It just starts doing the right work at the right time, and stops when it should stop.
That’s the difference between an agent that feels broken and one that feels like it’s actually there.
There’s something worth sitting with in that. The agent doesn’t feel natural because it’s better at mimicking humans — it feels natural because it’s converging on the same solution human psychology arrived at through years of social development. Turn-taking is a genuinely hard coordination problem. We don’t notice how hard because we’ve been solving it unconsciously since we learned to talk.
Thoughts
The models described here treat all users the same. But turn-taking patterns aren’t universal. Some people pause longer, speak faster, trail off more. Silence itself is signal — how long someone typically pauses when thinking versus when they’re done is different for every speaker. A system that learns individual speech rhythms over time would outperform any fixed model. The data is already there in every conversation. The question is whether the system is built to use it.
The other gap is modality. Audio alone is incomplete. A user who pauses and looks up is thinking. A user who pauses and looks at you is done. Humans read that instantly — from eye contact, facial expression, breath. Voice agents on video calls have access to the same signals. The tradeoff is real: processing video frames in real time adds latency and compute on top of an already tight pipeline. But for high-stakes conversations, the accuracy gains are probably worth it.