Voice Agent Foundations
Understanding the real-time voice AI pipeline
In this lesson, you'll ground yourself in the architecture of real-time voice agents before you touch any code. The goal is to understand which components are required, how they interact, and why latency matters so much.
What we are building
By the end of the workshop, you will ship a production-quality voice assistant that:
- Listens to a caller in real time through LiveKit’s WebRTC infrastructure
- Streams speech-to-text (STT), large language model (LLM), and text-to-speech (TTS) responses
- Handles interruptions gracefully with voice activity detection (VAD) and semantic turn detection
- Escalates conversations, calls external tools, and collects metrics so you can iterate effectively
We will iterate on the agent.py scaffold in the voice-agent-workshop repository. Each lesson introduces a capability and immediately applies it in code.
Anatomy of a voice agent
Every voice agent follows the same high-level pipeline:
- Voice Activity Detection (VAD) determines when the human is speaking. Without it, your agent wastes money on silence and may interrupt the user.
- Speech-to-Text (STT) transcribes audio into text the LLM can consume. Choose models that support your users' languages and accents.
- Large Language Model (LLM) generates the next reply. Prompts, planning, and tool integrations all live here.
- Text-to-Speech (TTS) converts the LLM response into audio that feels natural and on-brand.
Supporting components such as background voice cancellation (BVC), noise suppression, and end-of-turn detectors help the agent behave like a good conversational partner.
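To make the data flow concrete, here is a minimal sketch of a single conversational turn through these four stages. This is illustrative Python, not LiveKit code; `mic`, `speaker`, and the provider objects (`vad`, `stt`, `llm`, `tts`) and their methods are hypothetical stand-ins:

```python
# Illustrative only: one turn through the VAD -> STT -> LLM -> TTS pipeline.
# All objects and methods here are hypothetical stand-ins, not LiveKit APIs.
async def run_turn(vad, stt, llm, tts, mic, speaker):
    frames = []
    async for frame in mic.stream():           # raw audio from the caller
        if vad.is_speech(frame):
            frames.append(frame)               # buffer speech frames
        elif frames:
            break                              # silence after speech = end of turn
    text = await stt.transcribe(frames)        # audio -> text
    reply = await llm.complete(text)           # text -> response text
    async for chunk in tts.synthesize(reply):  # response text -> audio
        await speaker.play(chunk)
```

Real agents stream these stages in parallel rather than running them strictly in sequence, and that difference is exactly what the latency budget below is about.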
Latency expectations
Humans expect conversational latency under ~500ms. Every component adds delay:
| Component | Best Case | Typical |
|---|---|---|
| VAD | 15–20ms | 20–30ms |
| STT | 200–300ms | 400–600ms |
| LLM | 100–200ms | 500–1000ms |
| TTS | 100–150ms | 200–300ms |
| Total | ~415ms | 1.1s–2s |
Low-latency agents keep pipelines streaming and parallel, avoid blocking I/O, and carefully choose providers that balance quality with speed.
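The totals in the table are straight sums of the component rows; a quick back-of-the-envelope check in Python makes the budget explicit (values copied from the table above):

```python
# Per-component latency estimates in milliseconds: (best_case, typical_low, typical_high).
budget = {
    "vad": (15, 20, 30),
    "stt": (200, 400, 600),
    "llm": (100, 500, 1000),
    "tts": (100, 200, 300),
}

best = sum(b for b, _, _ in budget.values())
typical_low = sum(lo for _, lo, _ in budget.values())
typical_high = sum(hi for _, _, hi in budget.values())

print(f"best case: ~{best}ms")                     # ~415ms
print(f"typical:   {typical_low}-{typical_high}ms")  # 1120-1930ms, i.e. 1.1s-2s
```

Note that these sums assume a fully sequential pipeline; streaming stages in parallel (for example, starting TTS on the LLM's first tokens) is how agents beat the typical total.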
Why WebRTC for voice
Real-time voice has different networking needs than text:
- HTTP (TCP): great for request/response text, but suffers head‑of‑line blocking and lacks audio semantics; not ideal for live speech.
- WebSockets (TCP): persistent and bidirectional, yet still subject to head‑of‑line blocking from TCP retransmits on lossy networks.
- WebRTC (UDP): designed for media. Uses Opus audio compression, per‑packet timestamps, jitter buffering, and adaptive bitrate to keep latency low even on flaky networks.
In practice, WebRTC lets your agent deliver first audio faster and stay responsive during loss/jitter. We’ll rely on LiveKit’s WebRTC infrastructure throughout the course.
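To see why per-packet timestamps and jitter buffering matter, here is a toy jitter buffer: it reorders late-arriving packets by timestamp and releases them after a short delay, which is, in simplified spirit, what the WebRTC stack does for you automatically. This is a teaching sketch, not LiveKit or WebRTC API code:

```python
import heapq

class ToyJitterBuffer:
    """Reorders packets by timestamp and holds them for a small playout delay."""

    def __init__(self, delay_ms: int = 40):
        self.delay_ms = delay_ms
        self._heap: list[tuple[int, bytes]] = []

    def push(self, timestamp_ms: int, payload: bytes) -> None:
        # UDP packets may arrive out of order; the heap restores timestamp order.
        heapq.heappush(self._heap, (timestamp_ms, payload))

    def pop_ready(self, now_ms: int) -> list[bytes]:
        # Release packets whose timestamp is at least `delay_ms` old.
        ready = []
        while self._heap and self._heap[0][0] <= now_ms - self.delay_ms:
            ready.append(heapq.heappop(self._heap)[1])
        return ready

buf = ToyJitterBuffer(delay_ms=40)
buf.push(20, b"frame-2")  # arrives first due to network reordering
buf.push(0, b"frame-1")
print(buf.pop_ready(now_ms=60))  # [b'frame-1', b'frame-2'], back in order
```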
Architecture walkthrough
Open agent.py in the voice-agent-workshop repository. The starting point looks like this:
```python
import logging

from dotenv import load_dotenv
from livekit.agents import (
    Agent,
    AgentSession,
    JobContext,
    RoomInputOptions,
    WorkerOptions,
    cli,
)
from livekit.plugins import noise_cancellation, silero

logger = logging.getLogger("voice-agent")

# Load API keys (LiveKit, Deepgram, OpenAI, Cartesia) from a local .env file.
load_dotenv()


class Assistant(Agent):
    def __init__(self) -> None:
        super().__init__(
            instructions="You are a helpful voice AI assistant.",
        )


async def entrypoint(ctx: JobContext):
    # Wire up the pipeline: Deepgram STT, GPT-4.1-mini, Cartesia TTS, Silero VAD.
    session = AgentSession(
        stt="deepgram/nova-3",
        llm="openai/gpt-4.1-mini",
        tts="cartesia/sonic-2",
        vad=silero.VAD.load(),
    )
    await session.start(
        agent=Assistant(),
        room=ctx.room,
        room_input_options=RoomInputOptions(
            # Background voice cancellation keeps other speakers out of the STT feed.
            noise_cancellation=noise_cancellation.BVC(),
        ),
    )
    await ctx.connect()


if __name__ == "__main__":
    # The CLI provides the console, dev, and start run modes.
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```
Run the starter agent in console mode:
```bash
uv run agent.py console
```
Say a few phrases into the console agent and observe the latency and responsiveness. Keep this baseline in mind as we add capabilities throughout the course.
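If you want numbers for that baseline rather than gut feel, LiveKit agents already emit per-component metrics you can log. A minimal sketch, per LiveKit's metrics API; the helper name `attach_metrics_logging` is ours, not LiveKit's, and you would call it inside `entrypoint` right after constructing the session (we go deeper in the metrics lesson):

```python
from livekit.agents import AgentSession, MetricsCollectedEvent, metrics

def attach_metrics_logging(session: AgentSession) -> None:
    # Call inside entrypoint, right after `session = AgentSession(...)`.
    @session.on("metrics_collected")
    def _on_metrics(ev: MetricsCollectedEvent):
        # Logs per-turn component timings (STT, LLM, TTS) as they arrive.
        metrics.log_metrics(ev.metrics)
```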
Workshop roadmap
Here is the roadmap for the rest of the workshop:
- Foundations (this lesson) — architecture, baseline agent
- Turn detection — improve conversational dynamics with semantic turn-taking
- Personality & fallbacks — customize prompts, voices, and provider redundancy
- Metrics & preemptive generation — capture usage data and reduce response delays
- Tools & MCP — integrate external capabilities and control planes
- Consent & handoffs — build workflows for real customers
Each lesson introduces new LiveKit APIs and updates agent.py incrementally.
Testing mindset from day one
Voice agents face unpredictable inputs, ambiguous intents, and context-heavy conversations. Plan your evals early.
We'll revisit testing and metrics later, but try to keep a backlog of scenarios you want to test as features land.
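One lightweight way to keep that backlog is a plain data structure checked into the repo alongside agent.py. The fields below are hypothetical, just enough structure to grow into real evals later:

```python
# A hypothetical scenario backlog; add entries as features land.
SCENARIOS = [
    {"name": "barge-in", "input": "interrupt the agent mid-sentence",
     "expect": "agent stops speaking promptly"},
    {"name": "noisy-caller", "input": "speech with background chatter",
     "expect": "transcript stays accurate"},
    {"name": "escalation", "input": "ask for a human agent",
     "expect": "handoff workflow triggers"},
]
```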