Voice Agent Foundations
Understanding the real-time voice AI pipeline
In this lesson, you'll ground yourself in the architecture of real-time voice agents before you touch any code. The goal is to understand which components are required, how they interact, and why latency matters so much.
What we are building
By the end of the workshop, you will ship a production-quality voice assistant that:
- Listens to a caller in real time through LiveKit’s WebRTC infrastructure
- Streams speech-to-text (STT), large language model (LLM), and text-to-speech (TTS) responses
- Handles interruptions gracefully with voice activity detection (VAD) and semantic turn detection
- Escalates conversations, calls external tools, and collects metrics so you can iterate effectively
We will iterate on the agent.py scaffold in the voice-agent-workshop repository. Each lesson introduces a capability and immediately applies it in code.
Anatomy of a voice agent
Every voice agent follows the same high-level pipeline:
- Voice Activity Detection (VAD) determines when the human is speaking. Without it, your agent wastes money on silence and may interrupt the user.
- Speech-to-Text (STT) transcribes audio into text the LLM can consume. Choose models that support your users' languages and accents.
- Large Language Model (LLM) generates the next reply. Prompts, planning, and tool integrations all live here.
- Text-to-Speech (TTS) converts the LLM response into audio that feels natural and on-brand.
Supporting components such as background voice cancellation (BVC), noise suppression, and end-of-turn detectors help the agent behave like a good conversational partner.
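To make the data flow concrete, here is a minimal sketch of a single conversational turn through these four stages. This is illustrative Python, not LiveKit code; `mic`, `speaker`, and the provider objects (`vad`, `stt`, `llm`, `tts`) and their methods are hypothetical stand-ins:

```python
# Illustrative only: one turn through the VAD -> STT -> LLM -> TTS pipeline.
# All objects and methods here are hypothetical stand-ins, not LiveKit APIs.
async def run_turn(vad, stt, llm, tts, mic, speaker):
    frames = []
    async for frame in mic.stream():           # raw audio from the caller
        if vad.is_speech(frame):
            frames.append(frame)               # buffer speech frames
        elif frames:
            break                              # silence after speech = end of turn
    text = await stt.transcribe(frames)        # audio -> text
    reply = await llm.complete(text)           # text -> response text
    async for chunk in tts.synthesize(reply):  # response text -> audio
        await speaker.play(chunk)
```

Real agents stream these stages in parallel rather than running them strictly in sequence, and that difference is exactly what the latency budget below is about.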
Latency expectations
Humans expect conversational latency under ~500ms. Every component adds delay:
| Component | Best Case | Typical |
|---|---|---|
| VAD | 15–20ms | 20–30ms |
| STT | 200–300ms | 400–600ms |
| LLM | 100–200ms | 500–1000ms |
| TTS | 100–150ms | 200–300ms |
| Total | ~415ms | 1.1s–2s |
Low-latency agents keep pipelines streaming and parallel, avoid blocking I/O, and carefully choose providers that balance quality with speed.
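The totals in the table are straight sums of the component rows; a quick back-of-the-envelope check in Python makes the budget explicit (values copied from the table above):

```python
# Per-component latency estimates in milliseconds: (best_case, typical_low, typical_high).
budget = {
    "vad": (15, 20, 30),
    "stt": (200, 400, 600),
    "llm": (100, 500, 1000),
    "tts": (100, 200, 300),
}

best = sum(b for b, _, _ in budget.values())
typical_low = sum(lo for _, lo, _ in budget.values())
typical_high = sum(hi for _, _, hi in budget.values())

print(f"best case: ~{best}ms")                     # ~415ms
print(f"typical:   {typical_low}-{typical_high}ms")  # 1120-1930ms, i.e. 1.1s-2s
```

Note that these sums assume a fully sequential pipeline; streaming stages in parallel (for example, starting TTS on the LLM's first tokens) is how agents beat the typical total.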
Why WebRTC for voice
Real-time voice has different networking needs than text:
- HTTP (TCP): great for request/response text, but suffers head‑of‑line blocking and lacks audio semantics; not ideal for live speech.
- WebSockets (TCP): persistent and bidirectional, yet still subject to head‑of‑line blocking from TCP retransmits on lossy networks.
- WebRTC (UDP): designed for media. Uses Opus audio compression, per‑packet timestamps, jitter buffering, and adaptive bitrate to keep latency low even on flaky networks.
In practice, WebRTC lets your agent deliver first audio faster and stay responsive during loss/jitter. We’ll rely on LiveKit’s WebRTC infrastructure throughout the course.
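To see why per-packet timestamps and jitter buffering matter, here is a toy jitter buffer: it reorders late-arriving packets by timestamp and releases them after a short delay, which is, in simplified spirit, what the WebRTC stack does for you automatically. This is a teaching sketch, not LiveKit or WebRTC API code:

```python
import heapq

class ToyJitterBuffer:
    """Reorders packets by timestamp and holds them for a small playout delay."""

    def __init__(self, delay_ms: int = 40):
        self.delay_ms = delay_ms
        self._heap: list[tuple[int, bytes]] = []

    def push(self, timestamp_ms: int, payload: bytes) -> None:
        # UDP packets may arrive out of order; the heap restores timestamp order.
        heapq.heappush(self._heap, (timestamp_ms, payload))

    def pop_ready(self, now_ms: int) -> list[bytes]:
        # Release packets whose timestamp is at least `delay_ms` old.
        ready = []
        while self._heap and self._heap[0][0] <= now_ms - self.delay_ms:
            ready.append(heapq.heappop(self._heap)[1])
        return ready

buf = ToyJitterBuffer(delay_ms=40)
buf.push(20, b"frame-2")  # arrives first due to network reordering
buf.push(0, b"frame-1")
print(buf.pop_ready(now_ms=60))  # [b'frame-1', b'frame-2'], back in order
```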
Architecture walkthrough
Open agent.py in the voice-agent-workshop repository. The starting point looks like this:
```python
import logging

from dotenv import load_dotenv
from livekit.agents import (
    Agent,
    AgentSession,
    JobContext,
    RoomInputOptions,
    WorkerOptions,
    cli,
)
from livekit.plugins import noise_cancellation, silero

logger = logging.getLogger("voice-agent")

# Load API keys (LiveKit, Deepgram, OpenAI, Cartesia) from a local .env file.
load_dotenv()


class Assistant(Agent):
    def __init__(self) -> None:
        super().__init__(
            instructions="You are a helpful voice AI assistant.",
        )


async def entrypoint(ctx: JobContext):
    # Wire up the pipeline: Deepgram STT, GPT-4.1-mini, Cartesia TTS, Silero VAD.
    session = AgentSession(
        stt="deepgram/nova-3",
        llm="openai/gpt-4.1-mini",
        tts="cartesia/sonic-2",
        vad=silero.VAD.load(),
    )
    await session.start(
        agent=Assistant(),
        room=ctx.room,
        room_input_options=RoomInputOptions(
            # Background voice cancellation keeps other speakers out of the STT feed.
            noise_cancellation=noise_cancellation.BVC(),
        ),
    )
    await ctx.connect()


if __name__ == "__main__":
    # The CLI provides the console, dev, and start run modes.
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```
Run the starter agent in console mode:
```bash
uv run agent.py console
```
Say a few phrases into the console agent and observe the latency and responsiveness. Keep this baseline in mind as we add capabilities throughout the course.
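If you want numbers for that baseline rather than gut feel, LiveKit agents already emit per-component metrics you can log. A minimal sketch, per LiveKit's metrics API; the helper name `attach_metrics_logging` is ours, not LiveKit's, and you would call it inside `entrypoint` right after constructing the session (we go deeper in the metrics lesson):

```python
from livekit.agents import AgentSession, MetricsCollectedEvent, metrics

def attach_metrics_logging(session: AgentSession) -> None:
    # Call inside entrypoint, right after `session = AgentSession(...)`.
    @session.on("metrics_collected")
    def _on_metrics(ev: MetricsCollectedEvent):
        # Logs per-turn component timings (STT, LLM, TTS) as they arrive.
        metrics.log_metrics(ev.metrics)
```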
Workshop roadmap
Here is the roadmap for the rest of the workshop:
- Foundations (this lesson) — architecture, baseline agent
- Turn detection — improve conversational dynamics with semantic turn-taking
- Personality & fallbacks — customize prompts, voices, and provider redundancy
- Metrics & preemptive generation — capture usage data and reduce response delays
- Tools & MCP — integrate external capabilities and control planes
- Consent & handoffs — build workflows for real customers
Each lesson introduces new LiveKit APIs and updates agent.py incrementally.
Testing mindset from day one
Voice agents face unpredictable inputs, ambiguous intents, and context-heavy conversations. Plan your evals early.
We'll revisit testing and metrics later, but try to keep a backlog of scenarios you want to test as features land.
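One lightweight way to keep that backlog is a plain data structure checked into the repo alongside agent.py. The fields below are hypothetical, just enough structure to grow into real evals later:

```python
# A hypothetical scenario backlog; add entries as features land.
SCENARIOS = [
    {"name": "barge-in", "input": "interrupt the agent mid-sentence",
     "expect": "agent stops speaking promptly"},
    {"name": "noisy-caller", "input": "speech with background chatter",
     "expect": "transcript stays accurate"},
    {"name": "escalation", "input": "ask for a human agent",
     "expect": "handoff workflow triggers"},
]
```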