Building Production-Ready Voice Agents with LiveKit

Hands-on workshop for real-time voice AI systems

Metrics, Deployment, and Latency Optimization

Measuring performance and deploying to production

Your agent now sounds great and survives outages. But how do you know if it's actually performing well? This lesson covers three connected topics: metrics collection to understand your agent's behavior, deployment to LiveKit Cloud for production observability, and latency optimization techniques including preemptive generation.

Key metrics that matter

Before jumping into code, here's what you're measuring:

  • TTFA (time to first audio): The delay from when the user stops speaking to when they hear the first word of the response. This is the number users feel most directly.
  • TTFT (time to first token): How long the LLM takes to start generating. Tracking it separately helps isolate whether delays come from transcription, the model, or speech synthesis.
  • Token usage: Prompt and completion token counts, used for cost estimates.
  • Interruption rate: How often users cut off the agent.
  • Fallback activations: When primary providers fail over.

These metrics show up in console logs during development and in LiveKit Cloud's dashboard in production. Target TTFA under 1000ms for most responses.
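To make the two timing metrics concrete, here is a minimal sketch that derives them from per-turn timestamps. The TurnTimestamps class and its field names are hypothetical, invented for illustration; they are not part of the LiveKit SDK. It also assumes TTFT is measured from the moment the LLM request is sent, which may differ from how other tools define it.

```python
from dataclasses import dataclass


@dataclass
class TurnTimestamps:
    """Hypothetical per-turn timestamps, in seconds since the turn began."""
    user_stopped_speaking: float
    llm_request_sent: float
    llm_first_token: float
    first_audio_played: float


def ttft_ms(t: TurnTimestamps) -> float:
    # Time the LLM took to produce its first token after being asked
    return (t.llm_first_token - t.llm_request_sent) * 1000


def ttfa_ms(t: TurnTimestamps) -> float:
    # Total wait the user experienced before hearing anything
    return (t.first_audio_played - t.user_stopped_speaking) * 1000


turn = TurnTimestamps(
    user_stopped_speaking=0.0,
    llm_request_sent=0.25,
    llm_first_token=0.60,
    first_audio_played=0.80,
)
print(f"TTFT: {ttft_ms(turn):.0f}ms, TTFA: {ttfa_ms(turn):.0f}ms")
```

In this made-up turn the LLM accounts for less than half of the wait, so the rest must come from transcription and speech synthesis.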

Capturing usage metrics

Add these imports near the top of agent.py:

import logging

from livekit.agents import AgentStateChangedEvent, MetricsCollectedEvent, metrics

logger = logging.getLogger(__name__)

Inside entrypoint, before session.start(...), wire up the collectors:

# Aggregate data across all conversation turns
usage_collector = metrics.UsageCollector()

# Track End of Utterance timing (when turn detector decides user finished speaking)
last_eou_metrics: metrics.EOUMetrics | None = None

@session.on("metrics_collected")
def _on_metrics_collected(ev: MetricsCollectedEvent):
    nonlocal last_eou_metrics
    # Capture EOU metrics for TTFA calculation
    if ev.metrics.type == "eou_metrics":
        last_eou_metrics = ev.metrics

    # Log each metric as it arrives and add to usage collector
    metrics.log_metrics(ev.metrics)
    usage_collector.collect(ev.metrics)


async def log_usage():
    # Print per-session summary (tokens, audio duration, costs)
    summary = usage_collector.get_summary()
    logger.info("Usage summary: %s", summary)


# Fire log_usage when worker shuts down
ctx.add_shutdown_callback(log_usage)

The usage_collector aggregates data across all conversation turns, including token counts for the LLM, audio duration for STT and TTS, and cost estimates when available.
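If you want your own cost estimate from the aggregated counts, a rough calculation might look like the sketch below. The UsageTotals class is a hypothetical stand-in for the summary object, and every price is a placeholder chosen for the example, not a real provider rate.

```python
from dataclasses import dataclass


@dataclass
class UsageTotals:
    """Hypothetical stand-in for a per-session usage summary."""
    llm_prompt_tokens: int
    llm_completion_tokens: int
    stt_audio_seconds: float
    tts_characters: int


# Placeholder prices in USD. Replace with your providers' actual rates.
PRICE_PER_1M_PROMPT_TOKENS = 0.50
PRICE_PER_1M_COMPLETION_TOKENS = 1.50
PRICE_PER_STT_MINUTE = 0.006
PRICE_PER_1M_TTS_CHARACTERS = 15.00


def estimate_cost_usd(u: UsageTotals) -> float:
    """Combine LLM, STT, and TTS usage into one estimated session cost."""
    llm = (
        u.llm_prompt_tokens * PRICE_PER_1M_PROMPT_TOKENS
        + u.llm_completion_tokens * PRICE_PER_1M_COMPLETION_TOKENS
    ) / 1_000_000
    stt = u.stt_audio_seconds / 60 * PRICE_PER_STT_MINUTE
    tts = u.tts_characters * PRICE_PER_1M_TTS_CHARACTERS / 1_000_000
    return llm + stt + tts


session = UsageTotals(
    llm_prompt_tokens=12_000,
    llm_completion_tokens=3_000,
    stt_audio_seconds=120.0,
    tts_characters=4_000,
)
print(f"Estimated session cost: ${estimate_cost_usd(session):.4f}")
```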

Tracking time to first audio

The agent cycles through several states during a conversation: listening, thinking, and speaking. The agent_state_changed event fires each time the agent transitions between these states.

When the agent enters the speaking state, measure how long the human waited:

@session.on("agent_state_changed")
def _on_agent_state_changed(ev: AgentStateChangedEvent):
    if ev.new_state == "speaking":
        if last_eou_metrics:
            # Calculate time since user finished speaking
            elapsed = time.time() - last_eou_metrics.timestamp
            logger.info(f"Time to first audio: {elapsed:.3f}s")

Add the time import at the top of your file:

import time

This gives you a real number to optimize against.
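A single reading can be noisy, so it may help to summarize many turns. The sketch below is one possible approach using only the standard library; the summarize_ttfa helper and the sample values are made up for illustration. It assumes you append each elapsed value to a list instead of only logging it.

```python
import math
import statistics


def summarize_ttfa(samples_s: list[float]) -> dict[str, float]:
    """Summarize TTFA samples (in seconds) as p50 and p95 in milliseconds."""
    if not samples_s:
        raise ValueError("need at least one TTFA sample")
    ms = sorted(round(s * 1000) for s in samples_s)
    p95_rank = math.ceil(0.95 * len(ms))  # nearest-rank percentile
    return {"p50_ms": statistics.median(ms), "p95_ms": ms[p95_rank - 1]}


# Made-up samples: mostly fast turns plus one slow outlier
samples = [0.62, 0.71, 0.58, 0.95, 0.66, 1.40, 0.73, 0.69, 0.64, 0.80]
print(summarize_ttfa(samples))
```

Watching the p95 alongside the median shows whether slow turns are rare outliers or a regular occurrence.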

Enabling preemptive generation

Preemptive generation lets the LLM start forming a response before the user finishes speaking. The agent waits for a clear end-of-turn before actually speaking, but the thinking has already begun.

Enable it with a single flag:

session = AgentSession(
    # ... existing config ...
    preemptive_generation=True,
)

This can shave hundreds of milliseconds off perceived latency, especially for longer user turns.

When to be careful

Preemptive generation trades accuracy for speed. A few scenarios where this tradeoff might not work in your favor:

  • Mid-sentence direction changes. If the user says "Book me a flight to New York... actually, make that Chicago," the LLM may have already committed to New York.
  • Complex multi-part instructions. Long requests with multiple steps can get partially answered before the full context is available.
  • High-accuracy domains. Medical, legal, or financial applications where getting it right matters more than getting it fast.

For many conversational use cases, preemptive generation works well. But if you notice the agent jumping to conclusions or misinterpreting intent, consider disabling it or tuning your turn detection thresholds.
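The tradeoff is easier to see in a toy model. The sketch below is a conceptual illustration of speculative generation in plain asyncio, not LiveKit's implementation: SpeculativeResponder, fake_llm, and the method names are all hypothetical. Work starts on a preliminary transcript and is kept only if the final transcript matches.

```python
import asyncio


async def fake_llm(prompt: str) -> str:
    """Stand-in for an LLM call; sleeps briefly to simulate latency."""
    await asyncio.sleep(0.05)
    return f"reply to: {prompt}"


class SpeculativeResponder:
    """Starts generating early and discards the work if the user changes course."""

    def __init__(self) -> None:
        self._task = None
        self._prompt = None
        self.discarded = 0  # how many speculative responses were thrown away

    def on_preliminary_transcript(self, text: str) -> None:
        # Begin generating before the end of turn is confirmed
        self._cancel()
        self._prompt = text
        self._task = asyncio.create_task(fake_llm(text))

    async def on_end_of_turn(self, final_text: str) -> str:
        if self._task is not None and self._prompt == final_text:
            task, self._task = self._task, None
            return await task  # the head start pays off
        if self._task is not None:
            self.discarded += 1  # transcript changed: speculative work is wasted
        self._cancel()
        return await fake_llm(final_text)

    def _cancel(self) -> None:
        if self._task is not None:
            self._task.cancel()
            self._task = None


async def demo() -> tuple[str, int]:
    responder = SpeculativeResponder()
    responder.on_preliminary_transcript("Book me a flight to New York")
    reply = await responder.on_end_of_turn(
        "Book me a flight to New York, actually make that Chicago"
    )
    return reply, responder.discarded


reply, discarded = asyncio.run(demo())
print(reply, f"(discarded: {discarded})")
```

Here the correction forces a second generation, so the early start saved nothing. When the transcript does not change, the early response is reused and the wait shrinks.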

Testing metrics locally

Run the agent in console mode:

uv run agent.py console

As you interact with the agent, watch the console output. You'll see metrics logged after each component processes, including token counts, audio duration, and timing information.

Each time the agent starts speaking, you'll see the time to first audio log. Here's a reference again for what to expect:

Component   Best case    Typical
VAD         15-20ms      20-30ms
STT         200-300ms    400-600ms
LLM         100-200ms    500-1000ms
TTS         100-150ms    200-300ms
Total       415-670ms    1.1-1.9s

Any TTFA under a second is solid. With preemptive generation enabled, you should see numbers consistently below that mark. Without it, the LLM waits for the end of the user's turn before it starts thinking, which typically adds 100 to 300 milliseconds.
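As a quick check on the arithmetic, the per-component figures in the reference table can be summed directly. This is only a sketch: the numbers are copied from the table and the names are illustrative.

```python
# Latency ranges in milliseconds, copied from the reference table:
# component -> ((best_low, best_high), (typical_low, typical_high))
BUDGET_MS = {
    "VAD": ((15, 20), (20, 30)),
    "STT": ((200, 300), (400, 600)),
    "LLM": ((100, 200), (500, 1000)),
    "TTS": ((100, 150), (200, 300)),
}


def total_range(column: int) -> tuple[int, int]:
    """Sum the low and high ends of one column (0 = best case, 1 = typical)."""
    low = sum(ranges[column][0] for ranges in BUDGET_MS.values())
    high = sum(ranges[column][1] for ranges in BUDGET_MS.values())
    return low, high


best = total_range(0)
typical = total_range(1)
print(f"Best case total: {best[0]}-{best[1]}ms")
print(f"Typical total:   {typical[0]}-{typical[1]}ms")
```

The best-case column sums to 415-670ms and the typical column to roughly 1.1-1.9s, with the LLM as the largest single contributor in the typical case.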

When you stop the agent with Ctrl+C, the shutdown callback fires and prints a usage summary showing total tokens used, audio duration processed, and estimated costs.

Built-in observability with LiveKit Cloud

Console logs and local metrics are enough during development, but production calls for something more robust.

LiveKit Cloud includes built-in observability for agent sessions. You can view transcripts, traces, logs, and audio recordings directly in the dashboard without any additional setup.

The dashboard shows three synchronized views:

  • Transcripts with audio playback: Scrub through the conversation and hear exactly what happened.
  • Traces: Each turn breaks down into spans for STT, LLM, TTS, and tool calls with detailed timestamps and durations.
  • Logs: Info, warning, error, and debug messages from your agent server, the media server, and client connections.

All data is synchronized to a single timeline, so you can correlate what the user heard with what happened under the hood.

LiveKit's observability is OpenTelemetry-compatible, so you can also export traces to another compatible backend, such as Langfuse.

Enabling observability

Observability is enabled at the project level:

  1. Navigate to your project settings in LiveKit Cloud
  2. Open the Data and privacy section
  3. Toggle on Agent observability

Once enabled, all agent sessions automatically record data. The SDK version must be 1.3.0 or higher for Python, or 1.0.18 or higher for Node.js.

Deploying to LiveKit Cloud

To view observability data in the dashboard, deploy your agent to LiveKit Cloud. Deploying gives you:

  • Automatic scaling and load balancing
  • Built-in observability integration
  • Global network infrastructure
  • No server management

Setup and deploy

First, install the LiveKit CLI if you haven't already:

# macOS
brew install livekit-cli

# Linux
curl -sSL https://get.livekit.io/cli | bash

# Windows
winget install LiveKit.LiveKitCLI

Authenticate with LiveKit Cloud:

lk cloud auth

This opens a browser window to link your LiveKit Cloud project to the CLI. If you have multiple projects, list them with lk project list and set a default with lk project set-default.

Deploy your agent:

lk agent create

This command:

  • Registers your agent with LiveKit Cloud and assigns a unique ID
  • Writes configuration to a livekit.toml file
  • Creates a Dockerfile if you don't have one
  • Builds and deploys a container image

You'll see build logs stream to your terminal. Once deployment completes, your agent is live and ready to handle sessions.

Viewing session data

Test your deployed agent by opening the Agents page in the LiveKit Cloud dashboard. Click on your agent, then click Playground to open a voice interface in your browser. Have a conversation to generate session data.

After the session completes, observability data uploads within a few seconds. Open the Sessions page, find your session, and click on it. The Agent insights tab contains all the observability data.

Transcript view

Shows the conversation timeline with audio playback. Inline alerts highlight key events like tool calls. Good for spotting interruptions and understanding conversation flow.

Trace view

Where you'll spend most of your optimization time. Each agent response breaks down into spans for STT, LLM, and TTS. Click any span to see detailed timestamps and duration.

Example trace breakdown:

Span Duration
STT 180ms
LLM (TTFT) 320ms
TTS 150ms
Total TTFA ~650ms

If your LLM span is consistently over 500ms, consider a faster model or enabling preemptive generation. If TTS is the bottleneck, try a different voice or provider.
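If you copy span durations out of the trace view, a small helper can apply the same reasoning automatically. This is a sketch only: analyze_trace is a hypothetical helper, not a LiveKit API. It assumes at least one span is provided, and the 500ms threshold comes from the guidance above.

```python
LLM_SLOW_MS = 500  # threshold from the guidance above


def analyze_trace(spans_ms: dict[str, float]) -> dict[str, object]:
    """Report the total, the slowest span, and whether the LLM is over budget."""
    total = sum(spans_ms.values())
    slowest = max(spans_ms, key=lambda name: spans_ms[name])
    return {
        "total_ms": total,
        "slowest": slowest,
        "slowest_share": spans_ms[slowest] / total,
        "llm_over_budget": spans_ms.get("LLM", 0) > LLM_SLOW_MS,
    }


# Durations from the example trace breakdown
report = analyze_trace({"STT": 180, "LLM": 320, "TTS": 150})
print(report)
```

For the example trace, the LLM is the slowest span at just under half of the total, but it stays within the 500ms threshold.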

Logs view

Shows runtime messages from your agent code in chronological order. Check here for technical details when something fails or behaves unexpectedly.

Disabling recording per session

To disable recording for a specific session, pass record=False to the session start:

await session.start(
    # ... agent, room_options, etc.
    record=False  # Disables upload of audio, transcripts, traces, and logs
)

Use this when handling sensitive data or to reduce storage costs for specific sessions.

Data retention and privacy

All observability data is stored in the US and retained for 30 days. Data older than 30 days is automatically deleted.

You can download audio, transcripts, and logs directly from the session page.

For projects on the free Build plan, some anonymized session data may be retained longer for model improvement purposes. Paid plans (Ship, Scale, Enterprise) have full data deletion after the 30-day window.

Latency budget and pipeline streaming

Low-latency voice requires parallel work across the stack:

  • Stream everything: STT, LLM, and TTS should operate incrementally, not in big batches.
  • Parallelize: Start TTS as soon as the LLM emits the first words. Don't wait for a full sentence.
  • Avoid blocking I/O: Keep tool calls and storage async. Bound your timeouts and retries.
  • Keep prompts tight: Fewer tokens means faster first audio and lower cost.
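As one way to bound timeouts and retries, the sketch below wraps an async call with asyncio.wait_for and a fixed attempt limit. The call_with_budget helper, its defaults, and flaky_lookup are hypothetical examples; tune the limits to your own tools.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def call_with_budget(
    fn: Callable[[], Awaitable[T]],
    *,
    timeout_s: float = 2.0,
    max_attempts: int = 2,
) -> T:
    """Run fn with a per-attempt timeout and a bounded number of retries."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return await asyncio.wait_for(fn(), timeout=timeout_s)
        except (asyncio.TimeoutError, ConnectionError) as exc:
            last_error = exc
            if attempt < max_attempts:
                await asyncio.sleep(0.01 * attempt)  # brief backoff before retrying
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error


attempts = 0


async def flaky_lookup() -> str:
    """Hypothetical tool call that fails once, then succeeds."""
    global attempts
    attempts += 1
    if attempts == 1:
        raise ConnectionError("transient failure")
    return "order shipped"


result = asyncio.run(call_with_budget(flaky_lookup))
print(result, f"(attempts: {attempts})")
```

Because both the timeout and the attempt count are capped, the worst-case delay a tool can add to a turn is known in advance.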

The metrics you collect at the application level only tell part of the story. Network conditions also affect perceived latency. WebRTC handles this with Opus audio compression, jitter buffering, and adaptive bitrate. If you see unexpectedly high TTFA that doesn't correlate with your pipeline metrics, network conditions may be the culprit.

Use Agent Observability to validate and fine-tune performance. The trace view shows exactly where time is spent in each turn, making it easier to identify bottlenecks.

Metrics to track (recap)

Key metrics worth tracking as you continue building:

  • TTFT (time to first token)
  • TTFA (time to first audio)
  • Interruption rate
  • Tool latency
  • Fallback activations

Aim for time to first audio under one second in most cases. Use the trace breakdown to identify which component to optimize when you're over budget.
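Interruption rate has no example earlier in this lesson, so here is a minimal sketch of how it could be computed. The AgentTurn record is hypothetical; it assumes you log, for each agent response, whether the user cut it off.

```python
from dataclasses import dataclass


@dataclass
class AgentTurn:
    """Hypothetical record of one agent response."""
    interrupted: bool


def interruption_rate(turns: list[AgentTurn]) -> float:
    """Fraction of agent responses that the user cut off."""
    if not turns:
        return 0.0
    return sum(t.interrupted for t in turns) / len(turns)


# Eight responses, two of which were interrupted
turns = [AgentTurn(interrupted=(i in (2, 5))) for i in range(8)]
print(f"Interruption rate: {interruption_rate(turns):.0%}")
```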

Wrap-up

You've covered:

  • Key metrics (TTFA, TTFT) and why they matter
  • Usage collection for tokens and audio duration
  • Preemptive generation for lower latency
  • Deploying to LiveKit Cloud
  • Agent Observability for transcripts, traces, and logs

With metrics and observability in place, you can make data-driven decisions as you add tools and workflows in upcoming lessons.