Voice Agents Don't Know When You're Done Talking
I’ve built AI text agents before, but never added the voice layer. That changed after talking to teams who have — Diga and Clara, both running voice AI in production at scale. Seeing what it actually takes made me want to understand the hard parts from the inside.
I also use Gemini voice daily — walking, thinking through decisions out loud. You notice things as a user that you’d miss as a builder. That combination sent me down a rabbit hole. This is what I found.

The first problem
The first version of every voice agent has the same bug. The agent waits for silence, then responds. Works fine in a quiet room with a patient speaker. Falls apart everywhere else.
Somebody pauses to think mid-sentence and the agent jumps in. Somebody speaks slowly and the agent interrupts them three times. Somebody says “um” and gets cut off. It doesn’t feel like a latency issue. It feels like talking to something that isn’t actually listening.
This is the turn-taking problem. It’s not a tuning parameter. It’s an architectural one.
Why silence detection fails
Voice Activity Detection — detecting when audio energy drops — is the obvious first approach. Wait for N milliseconds of quiet, conclude the user is done, start reasoning. Simple, works in demos.
The problem is that human speech is full of intentional pauses that aren’t turn boundaries. People pause to think. They trail off and continue. They say “so…” and then keep going. A fixed silence threshold can’t distinguish “I’m done” from “I’m thinking.”
Make the threshold shorter and the agent interrupts constantly. Make it longer and the agent feels sluggish. There’s no value that works because the underlying model is wrong — turn completion isn’t a silence event, it’s a conversational one.
Humans don’t use silence detection either. We read prosody, intonation, syntactic completion, breath patterns. By the time silence arrives, we’ve usually already decided whether it’s a turn boundary. The signal isn’t the quiet — it’s everything that led up to it. What you actually need is something that understands intent from context, not just audio energy.
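To make the failure concrete, here is a minimal sketch of fixed-threshold silence detection. The numbers (frame length, energy threshold, silence window) are illustrative stand-ins, not from any real system:

```python
def detect_end_of_turn(frames, energy_threshold=0.01, silence_ms=700, frame_ms=20):
    """Naive VAD: declare end-of-turn after N consecutive quiet frames.

    Returns the frame index where the detector fires, or None.
    """
    needed = silence_ms // frame_ms   # consecutive quiet frames required
    quiet = 0
    for i, energy in enumerate(frames):
        if energy < energy_threshold:
            quiet += 1
            if quiet >= needed:
                return i              # fires here, done or not
        else:
            quiet = 0
    return None

# A speaker who pauses 800 ms mid-sentence to think:
# 500 ms speech, 800 ms pause, 500 ms more speech (20 ms frames).
frames = [0.5] * 25 + [0.001] * 40 + [0.5] * 25
print(detect_end_of_turn(frames))  # fires during the thinking pause, not at the end
```

Tune `silence_ms` down and it fires even earlier in the pause; tune it up past the speaker's longest thinking pause and every genuine turn ending pays that full delay. The constant isn't the bug; the model is.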
Four events instead of one threshold
A production turn-detection model doesn’t emit a binary “done speaking” signal. It emits a stream of probabilistic events as confidence builds. There are four:
StartOfTurn — the user has started speaking. Immediately cancel any active playback. Stop everything the agent was saying and start listening.
EagerEndOfTurn — medium confidence that the turn is ending. The user might be done. Start reasoning speculatively — begin drafting a response, but don’t commit to it yet. This is exactly what humans do. Psycholinguists call it forward modeling: we start constructing a response before the other person finishes speaking. It’s not rudeness — it’s predictive cognition running in the background, ready to be discarded if the speaker continues. The agent isn’t faking attentiveness here. It’s converging on the same strategy.
TurnResumed — the user continued speaking. They weren’t done. Cancel the speculative work immediately and go back to listening.
EndOfTurn — high confidence the turn is complete. Finalize the response and speak.
The key event is EagerEndOfTurn. It lets you start useful work before you’re certain the user is done. When you get EndOfTurn to confirm, you’re already most of the way through reasoning rather than starting from scratch. That gap — the time between when the user stops talking and when the agent starts speaking — is where voice agents feel slow or feel natural. Speculative reasoning closes it without actually being faster.
What this looks like in code
The core pattern is speculative preparation with safe cancellation:
draft = None  # speculative work, shared across handlers

@on_event("EagerEndOfTurn")
def on_eager_end_of_turn(transcript):
    # Medium confidence: start work speculatively
    global draft
    draft = begin_reasoning(transcript, context)

@on_event("TurnResumed")
def on_turn_resumed():
    # User kept talking: discard everything
    if draft:
        draft.cancel()

@on_event("EndOfTurn")
def on_end_of_turn(transcript):
    # High confidence: finalize and speak
    response = draft.finalize(transcript)
    speak(response)

@on_event("StartOfTurn")
def on_start_of_turn():
    # User started speaking: stop immediately
    stop_playback()
    if draft:
        draft.cancel()
    emit_state("Listening")
Notice that StartOfTurn cancels both playback and any pending draft. The agent can be interrupted at any point in the cycle. That cancellability is non-negotiable — without it, users get overlapping speech or stale responses and the whole thing feels broken.
The architectural implication
This is where turn detection stops being a model question and becomes a systems design question.
Everything downstream of EagerEndOfTurn must be cancellable. LLM calls, tool executions, TTS synthesis — all of it. If a user interrupts while the agent is halfway through a reasoning step, that step needs to stop cleanly. Not time out. Not finish and get discarded. Stop.
That means:
- Streaming LLM output so you can cut it off mid-generation rather than waiting for a complete response
- Tracking in-flight tool calls with identifiers so you can cancel them if the user speaks again
- Treating TTS as interruptible, with explicit playback cancellation that resets buffers cleanly
- Never letting a pending action block the listening path
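To sketch what "stop cleanly" means in practice, here is a toy asyncio version, with a fake `reason()` standing in for a streaming LLM call. The names are illustrative; the point is that every `await` is a spot where cancellation can land, which is exactly the property a real pipeline needs:

```python
import asyncio

async def reason(transcript: str) -> str:
    # Stand-in for a streaming LLM call; each await is a cancellation point.
    parts = []
    for token in ["Sure,", "here", "is", "an", "answer."]:
        await asyncio.sleep(0.01)
        parts.append(token)
    return " ".join(parts)

async def main() -> str:
    draft = asyncio.create_task(reason("user said something"))
    await asyncio.sleep(0.025)   # user starts speaking mid-reasoning
    draft.cancel()               # StartOfTurn: stop, don't finish-and-discard
    try:
        await draft
    except asyncio.CancelledError:
        return "draft cancelled cleanly"
    return "draft finished (too late to matter)"

print(asyncio.run(main()))
```

A reasoning step built as a blocking call has no such cancellation points; the task structure is what makes "stop, not time out" possible.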
The rule I keep coming back to: user speech is always higher priority than anything else in the system. The moment StartOfTurn fires, everything else stops. That signal should never have to wait in a queue behind a pending function call or a slow database lookup.
@on_event("StartOfTurn")
def on_start_of_turn():
    # User speech preempts everything else in the system
    cancel_active_tool_calls()
    stop_tts_playback()
    emit_state("Listening")
    # Nothing else matters right now
React to events, don’t poll state
There’s a second thing this model forces on you: you can’t poll for conversational state. The whole architecture has to be event-driven.
Polling — checking every N milliseconds whether the user is still speaking — introduces the same fundamental problem as silence thresholds: you’re sampling a continuous signal and making discrete decisions from snapshots. You miss the gaps between polls. You add latency on every check. And you get the timing wrong at exactly the moments that matter most.
A production voice agent should never ask “is the user speaking right now?” It should react to “the user started speaking” and “the user stopped speaking.” Those are events. They happen once. They trigger immediate state transitions.
That distinction sounds subtle but it changes how you write the entire system. State transitions become explicit. There’s a single source of truth for where the conversation is. Debugging gets easier because you have a log of events rather than a history of sampled states.
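A minimal sketch of that event-driven shape, using the four events from this article. The class and its transition table are illustrative, not a production design:

```python
class TurnStateMachine:
    # Each event maps current_state -> next_state; no polling anywhere.
    TRANSITIONS = {
        "StartOfTurn":    {"Idle": "Listening", "Speaking": "Listening",
                           "Reasoning": "Listening"},
        "EagerEndOfTurn": {"Listening": "Reasoning"},
        "TurnResumed":    {"Reasoning": "Listening"},
        "EndOfTurn":      {"Reasoning": "Speaking", "Listening": "Speaking"},
    }

    def __init__(self):
        self.state = "Idle"
        self.log = []   # event log: the debugging win over sampled states

    def handle(self, event: str):
        next_state = self.TRANSITIONS.get(event, {}).get(self.state)
        if next_state:
            self.log.append((event, self.state, next_state))
            self.state = next_state

sm = TurnStateMachine()
for e in ["StartOfTurn", "EagerEndOfTurn", "TurnResumed",
          "EagerEndOfTurn", "EndOfTurn"]:
    sm.handle(e)
print(sm.state)  # "Speaking"
```

There is exactly one place the conversation's state lives, transitions happen once per event, and `sm.log` replays the whole exchange when something goes wrong.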
The result
When this works, the agent feels like it’s actually listening. It doesn’t jump in during thinking pauses. It doesn’t let silence stretch awkwardly because it started reasoning too late. When you interrupt it, it stops immediately — not after finishing its sentence, not after a brief delay, immediately.
Users don’t think about any of this. They just notice that the conversation feels natural instead of mechanical. That “it actually listens” feeling isn’t personality or voice quality — it’s correct turn management underneath.
The silence threshold approach feels like a latency problem. You tune it, you optimize it, you still get the wrong behavior. The event-driven approach with speculative reasoning isn’t faster — the models aren’t doing less work. It just starts doing the right work at the right time, and stops when it should stop.
That’s the difference between an agent that feels broken and one that feels like it’s actually there.
There’s something worth sitting with in that. The agent doesn’t feel natural because it’s better at mimicking humans — it feels natural because it’s converging on the same solution human psychology arrived at through years of social development. Turn-taking is a genuinely hard coordination problem. We don’t notice how hard because we’ve been solving it unconsciously since we learned to talk.
Thoughts
The models described here treat all users the same. But turn-taking patterns aren’t universal. Some people pause longer, speak faster, trail off more. Silence itself is signal — how long someone typically pauses when thinking versus when they’re done is different for every speaker. A system that learns individual speech rhythms over time would outperform any fixed model. The data is already there in every conversation. The question is whether the system is built to use it.
The other gap is modality. Audio alone is incomplete. A user who pauses and looks up is thinking. A user who pauses and looks at you is done. Humans read that instantly — from eye contact, facial expression, breath. Voice agents on video calls have access to the same signals. The tradeoff is real: processing video frames in real time adds latency and compute on top of an already tight pipeline. But for high-stakes conversations, the accuracy gains are probably worth it.