Voice Agent Foundations

Understanding the real-time voice AI pipeline

In this lesson, you'll build a mental model of how voice agents actually work before writing any code. The goal is to understand which components make up a real-time voice agent, how they interact, and why latency matters so much.

What we are building

By the end of the workshop, you will ship a production-quality voice assistant that:

Listens to a caller in real time through LiveKit's WebRTC infrastructure
Streams audio through speech-to-text (STT), a large language model (LLM), and text-to-speech (TTS) to produce spoken responses
Handles interruptions gracefully with voice activity detection (VAD) and semantic turn detection
Escalates conversations, calls external tools, and collects metrics so you can iterate effectively

Anatomy of a voice agent

Every voice agent follows the same high-level pipeline. There are four core components:

Voice Activity Detection (VAD) determines when the human is speaking. Without it, your agent wastes money on silence and may interrupt the user.
Speech-to-Text (STT) transcribes audio into text the LLM can process. Choose models that support your languages and accents.
Large Language Model (LLM) generates the next reply. Prompts, planning, and tool integrations all live here.
Text-to-Speech (TTS) converts the LLM response into audio that feels natural and on-brand.
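
One way to picture the four stages is as a chain of functions, each consuming the previous stage's output. The sketch below uses stand-in implementations only: the Frame class, every function name, and the fake "audio" (frames that carry a word of text) are hypothetical and are not part of the LiveKit SDK.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    """Hypothetical stand-in for a chunk of audio; `text` fakes what was said."""
    text: str  # empty string means the frame is silence


def vad(frames: list[Frame]) -> list[Frame]:
    """Toy VAD: keep only frames that contain speech (non-empty text)."""
    return [f for f in frames if f.text]


def stt(frames: list[Frame]) -> str:
    """Toy STT: 'transcribe' by joining the words carried by each frame."""
    return " ".join(f.text for f in frames)


def llm(transcript: str) -> str:
    """Toy LLM: produce a canned reply from the transcript."""
    return f"You said: {transcript}"


def tts(reply: str) -> list[Frame]:
    """Toy TTS: 'synthesize' one output frame per word of the reply."""
    return [Frame(word) for word in reply.split()]


def run_pipeline(frames: list[Frame]) -> list[Frame]:
    """Chain the stages in order: VAD -> STT -> LLM -> TTS."""
    return tts(llm(stt(vad(frames))))


mic_input = [Frame(""), Frame("hello"), Frame(""), Frame("agent"), Frame("")]
audio_out = run_pipeline(mic_input)
print([f.text for f in audio_out])  # ['You', 'said:', 'hello', 'agent']
```

In this toy version each stage waits for the previous one to finish. A real agent streams partial results between stages instead, which is where most of the latency savings come from.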

Supporting components such as background voice cancellation (BVC), noise suppression, and end-of-turn detectors help the agent behave like a good conversational partner.
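
To make the idea of speech detection and end-of-turn detection concrete, here is a deliberately simple sketch. It is not how Silero VAD or LiveKit's turn detector work (the former is a neural model and the latter is semantic); it only assumes an energy threshold and a fixed run of silent frames, and the names is_speech and find_turns are hypothetical.

```python
def is_speech(frame: list[float], threshold: float = 0.02) -> bool:
    """Toy VAD: a frame counts as speech if its RMS energy exceeds a threshold."""
    rms = (sum(s * s for s in frame) / len(frame)) ** 0.5
    return rms > threshold


def find_turns(frames: list[list[float]], min_silence_frames: int = 3) -> list[tuple[int, int]]:
    """Group speech frames into turns; a turn ends after enough silent frames in a row.

    Returns (start_index, end_index) pairs, with the end index inclusive.
    """
    turns = []
    start = None        # index where the current turn began, if any
    last_speech = None  # index of the most recent speech frame
    silence_run = 0     # consecutive silent frames since the last speech frame
    for i, frame in enumerate(frames):
        if is_speech(frame):
            if start is None:
                start = i
            last_speech = i
            silence_run = 0
        elif start is not None:
            silence_run += 1
            if silence_run >= min_silence_frames:
                turns.append((start, last_speech))
                start = None
    if start is not None:
        # The audio ended while a turn was still open
        turns.append((start, last_speech))
    return turns


loud = [0.1, -0.1, 0.1, -0.1]
quiet = [0.001, -0.001, 0.001, -0.001]
frames = [quiet, loud, loud, quiet, loud, quiet, quiet, quiet, loud, quiet]
print(find_turns(frames))  # [(1, 4), (8, 8)]
```

Note how the single quiet frame at index 3 does not end the first turn. A short pause is not the same as the speaker being finished, which is the problem end-of-turn detectors exist to solve.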

Latency expectations

Humans expect conversational latency under ~500ms. Every component adds delay:

Component	Best case	Typical
VAD	15–20ms	20–30ms
STT	200–300ms	400–600ms
LLM	100–200ms	500–1000ms
TTS	100–150ms	200–300ms
Total	~415–670ms	1.1s–2s
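
The totals follow from adding up the per-component ranges, assuming the stages run strictly one after another. Summing the low ends of the best-case column gives about 415ms and the high ends about 670ms. The short check below reproduces the arithmetic; the dictionary layout and the total function are just illustrative.

```python
# (low, high) latency in milliseconds, copied from the table above
best_case = {"VAD": (15, 20), "STT": (200, 300), "LLM": (100, 200), "TTS": (100, 150)}
typical = {"VAD": (20, 30), "STT": (400, 600), "LLM": (500, 1000), "TTS": (200, 300)}


def total(budget: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum the low ends and the high ends of every component's range."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


print(total(best_case))  # (415, 670)
print(total(typical))    # (1120, 1930)
```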

Low-latency agents keep pipelines streaming and parallel, avoid blocking I/O, and carefully choose providers that balance quality with speed.
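
As a rough illustration of what "streaming" buys you, the sketch below chains two async generators so that speech synthesis can start on the first complete sentence while the model is still producing later tokens. The functions llm_stream and tts_stream are hypothetical stand-ins, not provider or LiveKit APIs, and the sleep only simulates generation time.

```python
import asyncio

events = []  # records the order in which things happen


async def llm_stream():
    """Stand-in LLM: yields one token at a time, as a streaming API would."""
    for token in ["Sure,", "one", "moment.", "Checking", "your", "order", "now."]:
        await asyncio.sleep(0.01)  # pretend each token takes time to generate
        events.append(f"llm:{token}")
        yield token


async def tts_stream(tokens):
    """Stand-in TTS: emits 'audio' for each sentence as soon as it is complete."""
    sentence = []
    async for token in tokens:
        sentence.append(token)
        if token.endswith("."):
            audio = " ".join(sentence)
            events.append(f"tts:{audio}")
            yield audio
            sentence = []
    if sentence:
        # Flush any trailing words that did not end with a period
        audio = " ".join(sentence)
        events.append(f"tts:{audio}")
        yield audio


async def main():
    return [audio async for audio in tts_stream(llm_stream())]


chunks = asyncio.run(main())
print(chunks)  # ['Sure, one moment.', 'Checking your order now.']
# The first sentence is spoken before the LLM has produced the second one:
print(events.index("tts:Sure, one moment.") < events.index("llm:Checking"))  # True
```

With a non-streaming design, the first audio would only be available after all seven tokens had been generated. Here it is ready after three.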

Why WebRTC for voice

Real-time voice has different networking needs than text:

HTTP (TCP): great for request/response text, but suffers head‑of‑line blocking and lacks audio semantics; not ideal for live speech.
WebSockets (TCP): persistent and bidirectional, yet still blocked by TCP retransmits under poor networks.
WebRTC (UDP): designed for media. Uses Opus audio compression, per‑packet timestamps, jitter buffering, and adaptive bitrate to keep latency low even on flaky networks.

In practice, WebRTC lets your agent deliver first audio faster and stay responsive during loss/jitter. We'll rely on LiveKit's WebRTC infrastructure throughout the course.
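
To give a feel for what jitter buffering means, here is a toy buffer that reorders packets by sequence number and gives up on a missing packet once too many later ones have queued up. It is a simplified sketch and not how WebRTC or LiveKit implement it: real jitter buffers are time-based, adapt their depth, and conceal lost audio. The JitterBuffer class and its depth parameter are hypothetical.

```python
import heapq


class JitterBuffer:
    """Toy jitter buffer: reorders packets and skips ones that never arrive."""

    def __init__(self, depth: int = 2):
        self.depth = depth  # how many packets may wait behind a gap
        self.heap = []      # min-heap of (sequence_number, payload)
        self.next_seq = 0   # next sequence number expected for playback

    def push(self, seq: int, payload: str) -> list[str]:
        """Add a packet and return the payloads that are now ready to play."""
        if seq < self.next_seq:
            return []  # arrived too late; playback has already moved on
        heapq.heappush(self.heap, (seq, payload))
        out = []
        # Release packets while the head is due, or while too many are waiting
        while self.heap and (self.heap[0][0] <= self.next_seq or len(self.heap) > self.depth):
            s, p = heapq.heappop(self.heap)
            if s < self.next_seq:
                continue  # stale duplicate of a packet already played
            out.append(p)
            self.next_seq = s + 1
        return out


buf = JitterBuffer(depth=2)
arrivals = [(0, "a"), (2, "c"), (1, "b"), (4, "e"), (5, "f"), (6, "g"), (7, "h")]
played = []
for seq, payload in arrivals:
    played.extend(buf.push(seq, payload))
print(played)  # ['a', 'b', 'c', 'e', 'f', 'g', 'h']
```

Packets 1 and 2 arrive out of order but are played in order, and packet 3 never arrives, so playback skips it rather than stalling. A TCP-based transport would instead hold everything back until packet 3 was retransmitted.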

Setup

Open your terminal and create a new project using uv:

uv init livekit-voice-agent --bare
cd livekit-voice-agent

The --bare flag creates a minimal project without any boilerplate files.

Now install the dependencies:

uv add \
  "livekit-agents[silero,turn-detector]~=1.3" \
  "livekit-plugins-noise-cancellation~=0.2" \
  "python-dotenv"

This installs the LiveKit Agents SDK with Silero VAD and turn detection, the noise cancellation plugin, and python-dotenv for environment variables.

Next, grab your API credentials from the LiveKit Cloud dashboard. If you haven't created an account yet, head to cloud.livekit.io and sign up. It's free.

You can find your API keys and URL under your project settings in the LiveKit Cloud dashboard.

In your project root, create a .env file and add your credentials:

LIVEKIT_URL=wss://your-project.livekit.cloud
LIVEKIT_API_KEY=your-api-key
LIVEKIT_API_SECRET=your-api-secret

Replace these placeholder values with your actual credentials from the dashboard.
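
If you want to confirm that all three variables are set before starting the agent, a small check like the following can help. It is an optional sketch rather than part of the starter agent, and the missing_credentials helper is a hypothetical name. In a real script you would call load_dotenv() first so that the values from .env are present in the environment.

```python
import os

# The three variable names used in the .env file above
REQUIRED = ("LIVEKIT_URL", "LIVEKIT_API_KEY", "LIVEKIT_API_SECRET")


def missing_credentials(env=os.environ) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]


# Example with a plain dict standing in for the environment
example = {"LIVEKIT_URL": "wss://your-project.livekit.cloud", "LIVEKIT_API_KEY": ""}
print(missing_credentials(example))  # ['LIVEKIT_API_KEY', 'LIVEKIT_API_SECRET']
```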

Architecture walkthrough

Create a new file called agent.py in your project root. This is your starter agent:

import logging

from dotenv import load_dotenv
from livekit import agents
from livekit.agents import Agent, AgentServer, AgentSession, JobContext, room_io
from livekit.plugins import noise_cancellation, silero

load_dotenv()


# Define your agent's behavior by extending the Agent class
class Assistant(Agent):
    def __init__(self) -> None:
        super().__init__(
            instructions="You are a helpful voice AI assistant.",  # System prompt for the LLM
        )


server = AgentServer()


# The entrypoint function runs when a participant joins the room
@server.rtc_session()
async def entrypoint(ctx: JobContext):
    # Configure the voice pipeline with STT, LLM, TTS, and VAD providers
    session = AgentSession(
        stt="assemblyai/universal-streaming:en",  # Speech-to-text provider
        llm="openai/gpt-4.1-mini",                # Language model for responses
        tts="cartesia/sonic-3",                   # Text-to-speech voice
        vad=silero.VAD.load(),                    # Voice activity detection
    )

    # Start the session with noise cancellation enabled
    await session.start(
        agent=Assistant(),
        room=ctx.room,
        room_options=room_io.RoomOptions(
            audio_input=room_io.AudioInputOptions(
                noise_cancellation=noise_cancellation.BVC(),  # Background voice cancellation
            ),
        ),
    )


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    agents.cli.run_app(server)

Running the starter agent

First, download the required model files, such as the Silero VAD model:

uv run agent.py download-files

Then run the starter agent in console mode:

uv run agent.py console

Say a few phrases into the console agent and observe the latency and responsiveness. Keep this baseline in mind as we add capabilities throughout the course.

To stop the agent, press Ctrl+C.

Tip: To switch the input or output audio device, use these flags:

--list-devices: list all available input and output audio devices

--input-device: use the numeric input device ID to set the input device

--output-device: use the numeric output device ID to set the output device

Testing mindset from day one

Voice agents deal with messy, real-world inputs. People mumble, change their minds mid-sentence, and ask things in ways you never anticipated. Starting to think about how you'll test these scenarios now will save you headaches later.

We'll revisit testing and metrics later, but try to keep a backlog of scenarios you want to test as features land. Think about edge cases, different accents, background noise, and interruptions.