
Building Production-Ready Voice Agents with LiveKit

Hands-on workshop for real-time voice AI systems

Duration: 1 hour

Semantic Turn Detection

Preventing awkward interruptions

Turn detection decides when the agent should speak and when it should stay quiet. Without it, you get barge-ins: the agent starts answering while the human is still mid-sentence. LiveKit ships a multilingual semantic turn detector that we can layer on top of voice activity detection.

Why semantic turn detection matters

Voice activity detection (VAD) spots speech-patterned audio, but it cannot tell a finished thought from a mid-sentence pause: humans pause, restart sentences, and trail off.

A semantic detector also looks at the meaning of what was said, so a pause after an incomplete sentence is treated differently from a pause after a complete one.

The most important job of turn detection is to reduce unwanted interruptions, but it also improves transcription accuracy for the STT engine, and it adds negligible latency (~20 ms).
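To make the idea concrete, here is a minimal sketch of how a semantic end-of-turn score can be combined with VAD silence to decide when to respond. This is not LiveKit's implementation; the function name, the `eot_probability` score, and the thresholds are all illustrative assumptions.

```python
def is_end_of_turn(silence_ms: float, eot_probability: float) -> bool:
    """Decide whether the speaker has finished their turn.

    Combines two signals:
      - silence_ms: how long VAD has reported silence
      - eot_probability: a semantic model's score (0-1) that the
        transcript so far reads as a complete utterance
    All thresholds here are illustrative, not LiveKit defaults.
    """
    if eot_probability >= 0.9:
        return silence_ms >= 200   # confident ending: respond quickly
    if eot_probability >= 0.5:
        return silence_ms >= 700   # ambiguous: wait a bit longer
    return silence_ms >= 2000      # likely mid-sentence: be patient

# A trailing-off sentence ("I was thinking about...") should get a low
# semantic score, so a 1-second pause does not trigger a response:
print(is_end_of_turn(silence_ms=1000, eot_probability=0.2))   # False
print(is_end_of_turn(silence_ms=300, eot_probability=0.95))   # True
```

The point of the sketch: the same pause length produces different decisions depending on whether the utterance reads as complete, which is exactly what VAD alone cannot do.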

Adding the turn detector

First, add the turn detector extra to your environment:

uv add "livekit-agents[turn-detector]"

This pulls in the multilingual model we will reference in code.

Modify agent.py

  1. Import the model:
   from livekit.plugins.turn_detector.multilingual import MultilingualModel
  2. Inject it into the session:
   session = AgentSession(
      stt="deepgram/nova-3",
      llm="openai/gpt-4.1-mini",
      tts="cartesia/sonic-2",
      vad=silero.VAD.load(),
      turn_detection=MultilingualModel(),
   )

That’s all you need. The session now emits turn events when the semantic model decides a speaker is finished.

Test the change

Run the console agent again and try these scenarios:

  • Pause halfway through a sentence for ~1 second. The agent should not jump in.
  • Speak a phrase in another language (if you know one). Turn detection should still behave naturally.
  • Talk over the agent deliberately to test how it recovers from interruptions.

This single change dramatically improves conversation quality and prepares the agent for more advanced latency optimizations in later lessons.

Best practices

  • Combine with VAD and noise controls: pairing turn detection with noise cancellation and background voice cancellation prevents stray audio from being mistaken for the user's speech.
  • Language variability: different languages have different pause patterns; the multilingual semantic model helps normalize across them.
  • Coordinate with preemptive generation (Lesson 4): start LLM planning early, but hold speech until a clear end-of-turn.
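The preemptive-generation pattern in the last bullet can be sketched with plain asyncio: kick off generation as soon as a turn end looks probable, but gate the reply on a confirmed end-of-turn event. The `generate_reply` stand-in and the event wiring below are hypothetical, not LiveKit APIs.

```python
import asyncio

async def generate_reply(transcript: str) -> str:
    """Stand-in for a streaming LLM call (hypothetical)."""
    await asyncio.sleep(0.05)  # simulated model latency
    return f"Reply to: {transcript!r}"

async def handle_turn(transcript: str, end_of_turn: asyncio.Event) -> str:
    # Start generating as soon as a probable end-of-turn is detected...
    reply_task = asyncio.create_task(generate_reply(transcript))
    # ...but do not speak until the detector confirms the turn is over.
    await end_of_turn.wait()
    return await reply_task

async def main() -> str:
    end_of_turn = asyncio.Event()
    task = asyncio.create_task(handle_turn("What's the weather?", end_of_turn))
    await asyncio.sleep(0.1)   # detector confirms after generation began
    end_of_turn.set()
    return await task

print(asyncio.run(main()))
```

Because generation overlaps with the tail of the user's turn, the reply is often ready the moment the detector fires, which is where the latency win comes from.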

Ideas for agent evals

  • Rapid-fire short sentences vs a single long sentence with pauses.
  • Background voices (TV, nearby chatter) vs the primary speaker.
  • Code-switching between languages mid-turn.
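The scenarios above can be turned into a small offline eval. Here is a toy harness that scores any end-of-turn decision function on labeled cases, counting interruptions (firing while the speaker was not finished) and missed endings. The case format, the `vad_only` baseline, and all numbers are illustrative assumptions.

```python
def evaluate_interruptions(detector, cases):
    """Score a turn detector on labeled cases.

    Each case is (silence_ms, eot_probability, speaker_finished).
    An 'interruption' is the detector firing while the speaker was
    not actually finished; a 'missed' is failing to fire when they were.
    """
    interruptions = missed = 0
    for silence_ms, eot_prob, finished in cases:
        fired = detector(silence_ms, eot_prob)
        if fired and not finished:
            interruptions += 1
        if not fired and finished:
            missed += 1
    return {"interruptions": interruptions, "missed": missed,
            "total": len(cases)}

# Naive VAD-only baseline: fire after any 500 ms pause,
# ignoring the semantic score entirely.
vad_only = lambda silence_ms, eot_prob: silence_ms >= 500

cases = [
    (1000, 0.1, False),  # mid-sentence pause
    (600, 0.95, True),   # clean ending
    (300, 0.2, False),   # restart after a false start
]
print(evaluate_interruptions(vad_only, cases))
# {'interruptions': 1, 'missed': 0, 'total': 3}
```

Swapping in a semantic-aware decision function and re-running the same cases gives a direct, repeatable comparison of interruption rates before wiring anything into a live session.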