Metrics, Deployment, and Latency Optimization
Measuring performance and deploying to production
Your agent now sounds great and survives outages. But how do you know if it's actually performing well? This lesson covers three connected topics: metrics collection to understand your agent's behavior, deployment to LiveKit Cloud for production observability, and latency optimization techniques including preemptive generation.
Key metrics that matter
Before jumping into code, here's what you're measuring:
- TTFA (time to first audio): The delay from when the user stops speaking to when they hear the first word of the response. This is the number users feel most directly.
- TTFT (time to first token): The delay before the LLM emits its first token. Comparing it with TTFA helps isolate whether delays come from transcription, the model, or speech synthesis.
- Token usage: LLM input and output token counts, used for cost estimates.
- Interruption rate: How often users cut off the agent.
- Fallback activations: When primary providers fail over.
These metrics show up in console logs during development and in LiveKit Cloud's dashboard in production. Target TTFA under 1000ms for most responses.
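To make these definitions concrete, here is a minimal sketch of how per-turn records could be reduced to the headline numbers. The `TurnRecord` type and its fields are hypothetical and exist only for illustration; they are not part of the LiveKit SDK.

```python
from dataclasses import dataclass


@dataclass
class TurnRecord:
    """One agent turn. All fields are hypothetical, for illustration only."""
    ttfa_ms: float        # delay from end of user speech to first audio
    ttft_ms: float        # delay before the LLM emitted its first token
    interrupted: bool     # True if the user cut the agent off
    used_fallback: bool   # True if a primary provider failed over


def summarize(turns: list[TurnRecord], ttfa_target_ms: float = 1000.0) -> dict[str, float]:
    """Reduce a list of turns to interruption rate, fallback count, and TTFA hit rate."""
    if not turns:
        return {"turns": 0, "interruption_rate": 0.0,
                "fallback_activations": 0, "ttfa_within_target": 0.0}
    n = len(turns)
    return {
        "turns": n,
        "interruption_rate": sum(t.interrupted for t in turns) / n,
        "fallback_activations": sum(t.used_fallback for t in turns),
        "ttfa_within_target": sum(t.ttfa_ms < ttfa_target_ms for t in turns) / n,
    }


turns = [
    TurnRecord(ttfa_ms=650, ttft_ms=320, interrupted=False, used_fallback=False),
    TurnRecord(ttfa_ms=1200, ttft_ms=800, interrupted=True, used_fallback=False),
    TurnRecord(ttfa_ms=900, ttft_ms=400, interrupted=False, used_fallback=True),
    TurnRecord(ttfa_ms=700, ttft_ms=350, interrupted=False, used_fallback=False),
]
print(summarize(turns))
```

Tracking the share of turns under the target, rather than an average, keeps a few slow turns from hiding behind many fast ones.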
Capturing usage metrics
Add these imports near the top of agent.py:
import logging
from livekit.agents import AgentStateChangedEvent, MetricsCollectedEvent, metrics
logger = logging.getLogger(__name__)
Inside entrypoint, before session.start(...), wire up the collectors:
# Aggregate data across all conversation turns
usage_collector = metrics.UsageCollector()
# Track End of Utterance timing (when turn detector decides user finished speaking)
last_eou_metrics: metrics.EOUMetrics | None = None
@session.on("metrics_collected")
def _on_metrics_collected(ev: MetricsCollectedEvent):
    nonlocal last_eou_metrics
    # Capture EOU metrics for TTFA calculation
    if ev.metrics.type == "eou_metrics":
        last_eou_metrics = ev.metrics
    # Log each metric as it arrives and add to usage collector
    metrics.log_metrics(ev.metrics)
    usage_collector.collect(ev.metrics)
async def log_usage():
    # Print per-session summary (tokens, audio duration, costs)
    summary = usage_collector.get_summary()
    logger.info("Usage summary: %s", summary)
# Fire log_usage when worker shuts down
ctx.add_shutdown_callback(log_usage)
The usage_collector aggregates data across all conversation turns, including token counts for the LLM, audio duration for STT and TTS, and cost estimates when available.
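As a rough mental model of that aggregation, the sketch below keeps running totals across turns and derives a cost estimate. `UsageTotals` is a hypothetical stand-in, not the SDK's own summary type, and the prices are placeholders rather than real provider rates.

```python
from dataclasses import dataclass


@dataclass
class UsageTotals:
    """Hypothetical running totals; a stand-in, not the SDK's summary type."""
    llm_prompt_tokens: int = 0
    llm_completion_tokens: int = 0
    stt_audio_seconds: float = 0.0
    tts_audio_seconds: float = 0.0

    def add_turn(self, *, prompt_tokens: int, completion_tokens: int,
                 stt_seconds: float, tts_seconds: float) -> None:
        """Fold one conversation turn into the running totals."""
        self.llm_prompt_tokens += prompt_tokens
        self.llm_completion_tokens += completion_tokens
        self.stt_audio_seconds += stt_seconds
        self.tts_audio_seconds += tts_seconds

    def estimated_llm_cost(self, usd_per_1m_prompt: float,
                           usd_per_1m_completion: float) -> float:
        """Estimate LLM spend from token counts and per-million-token prices."""
        return (self.llm_prompt_tokens * usd_per_1m_prompt
                + self.llm_completion_tokens * usd_per_1m_completion) / 1_000_000


totals = UsageTotals()
totals.add_turn(prompt_tokens=500, completion_tokens=120, stt_seconds=3.0, tts_seconds=5.0)
totals.add_turn(prompt_tokens=700, completion_tokens=180, stt_seconds=4.5, tts_seconds=6.5)

# Placeholder prices, chosen only to make the arithmetic easy to follow.
cost = totals.estimated_llm_cost(usd_per_1m_prompt=0.50, usd_per_1m_completion=1.50)
print(totals, f"estimated LLM cost: ${cost:.6f}")
```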
Tracking time to first audio
The agent cycles through several states during a conversation: listening, thinking, and speaking. The agent_state_changed event fires each time the agent transitions between these states.
When the agent enters the speaking state, measure how long the human waited:
@session.on("agent_state_changed")
def _on_agent_state_changed(ev: AgentStateChangedEvent):
    if ev.new_state == "speaking":
        if last_eou_metrics:
            # Calculate time since user finished speaking
            elapsed = time.time() - last_eou_metrics.timestamp
            logger.info(f"Time to first audio: {elapsed:.3f}s")
Add the time import at the top of your file:
import time
This gives you a real number to optimize against.
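A single reading is noisy, so it can help to collect the logged values and look at the distribution. The following is a minimal sketch, assuming you have gathered the elapsed values into a list yourself; `summarize_ttfa` and the sample numbers are hypothetical.

```python
import statistics


def summarize_ttfa(samples_s: list[float]) -> dict[str, float]:
    """Summarize TTFA samples (in seconds) as median, 95th percentile, and max."""
    if len(samples_s) < 2:
        raise ValueError("need at least two samples to compute percentiles")
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    cuts = statistics.quantiles(samples_s, n=100, method="inclusive")
    return {
        "p50": statistics.median(samples_s),
        "p95": cuts[94],
        "max": max(samples_s),
    }


# Made-up samples for illustration; one slow turn among mostly fast ones.
samples = [0.62, 0.71, 0.58, 0.95, 1.40, 0.66, 0.74, 0.69]
summary = summarize_ttfa(samples)
print({name: round(value, 3) for name, value in summary.items()})
```

The median describes the typical turn, while the 95th percentile and max show the slow turns that users tend to remember.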
Enabling preemptive generation
Preemptive generation lets the LLM start forming a response before the user finishes speaking. The agent waits for a clear end-of-turn before actually speaking, but the thinking has already begun.
Enable it with a single flag:
session = AgentSession(
    # ... existing config ...
    preemptive_generation=True,
)
This can shave hundreds of milliseconds off perceived latency, especially for longer user turns.
When to be careful
Preemptive generation trades accuracy for speed. A few scenarios where this tradeoff might not work in your favor:
- Mid-sentence direction changes. If the user says "Book me a flight to New York... actually, make that Chicago," the LLM may have already committed to New York.
- Complex multi-part instructions. Long requests with multiple steps can get partially answered before the full context is available.
- High-accuracy domains. Medical, legal, or financial applications where getting it right matters more than getting it fast.
For many conversational use cases, preemptive generation works well. But if you notice the agent jumping to conclusions or misinterpreting intent, consider disabling it or tuning your turn detection thresholds.
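The tradeoff is easier to see in a toy model. The sketch below is a conceptual illustration only, not how the SDK implements the feature: `fake_llm` and `PreemptiveResponder` are hypothetical. Generation starts on a preliminary transcript, and the result is kept only if the final transcript still matches.

```python
import asyncio
from typing import Optional


async def fake_llm(transcript: str) -> str:
    """Stand-in for an LLM call; a real one would stream tokens."""
    await asyncio.sleep(0.05)
    return f"reply to: {transcript}"


class PreemptiveResponder:
    """Speculates on a preliminary transcript and reuses the work only if it still applies."""

    def __init__(self) -> None:
        self._task: Optional[asyncio.Task] = None
        self._transcript: Optional[str] = None

    def on_preliminary_transcript(self, transcript: str) -> None:
        # Drop any stale speculation and start a fresh one.
        if self._task is not None:
            self._task.cancel()
        self._transcript = transcript
        self._task = asyncio.create_task(fake_llm(transcript))

    async def on_end_of_turn(self, final_transcript: str) -> tuple[str, bool]:
        """Return (reply, reused); reused is True when the speculative work was kept."""
        task, speculated = self._task, self._transcript
        self._task = self._transcript = None
        if task is not None and speculated == final_transcript:
            return await task, True
        if task is not None:
            task.cancel()  # the user changed direction, so discard the early work
        return await fake_llm(final_transcript), False


async def demo() -> tuple[tuple[str, bool], tuple[str, bool]]:
    responder = PreemptiveResponder()
    responder.on_preliminary_transcript("Book me a flight to New York")
    same = await responder.on_end_of_turn("Book me a flight to New York")
    responder.on_preliminary_transcript("Book me a flight to New York")
    changed = await responder.on_end_of_turn("Book me a flight to Chicago")
    return same, changed


same, changed = asyncio.run(demo())
print(same)
print(changed)
```

When the transcript is unchanged, the reply is already in flight by the time the turn ends. When it changes, the early work is wasted and the full generation delay applies.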
Testing metrics locally
Run the agent in console mode:
uv run agent.py console
As you interact with the agent, watch the console output. You'll see metrics logged after each component processes, including token counts, audio duration, and timing information.
Each time the agent starts speaking, you'll see the time to first audio log. For reference, here are the latencies to expect from each component:
| Component | Best case | Typical |
|---|---|---|
| VAD | 15-20ms | 20-30ms |
| STT | 200-300ms | 400-600ms |
| LLM | 100-200ms | 500-1000ms |
| TTS | 100-150ms | 200-300ms |
| Total | ~415ms | 1.1s-2s |
Any TTFA under a second is solid, and with preemptive generation enabled you should see that consistently. Without it, the LLM waits for the full transcription before it starts thinking, which typically adds 100 to 300 milliseconds.
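The totals in the table can be reproduced by summing the per-component ranges. This sketch assumes the stages run strictly one after another, which is a simplification because a streaming pipeline overlaps them; the dictionary and function names are illustrative.

```python
# (low_ms, high_ms) per component, copied from the table above.
BEST_CASE = {"VAD": (15, 20), "STT": (200, 300), "LLM": (100, 200), "TTS": (100, 150)}
TYPICAL = {"VAD": (20, 30), "STT": (400, 600), "LLM": (500, 1000), "TTS": (200, 300)}


def total_range(components: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum the low ends and the high ends of each component's latency range."""
    low = sum(lo for lo, _ in components.values())
    high = sum(hi for _, hi in components.values())
    return low, high


best = total_range(BEST_CASE)
typical = total_range(TYPICAL)
print(f"best case: {best[0]}-{best[1]} ms")
print(f"typical:   {typical[0]}-{typical[1]} ms")
```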
When you stop the agent with Ctrl+C, the shutdown callback fires and prints a usage summary showing total tokens used, audio duration processed, and estimated costs.
Built-in observability with LiveKit Cloud
Local testing and metrics work at first, but you'll want something more robust in production.
LiveKit Cloud includes built-in observability for agent sessions. You can view transcripts, traces, logs, and audio recordings directly in the dashboard without any additional setup.
The dashboard shows three synchronized views:
- Transcripts with audio playback: Scrub through the conversation and hear exactly what happened.
- Traces: Each turn breaks down into spans for STT, LLM, TTS, and tool calls with detailed timestamps and durations.
- Logs: Info, warning, error, and debug messages from your agent server, the media server, and client connections.
All data is synchronized to a single timeline, so you can correlate what the user heard with what happened under the hood.
LiveKit's observability is OpenTelemetry-compatible, so you can also export traces to any OpenTelemetry-compatible backend like Langfuse if you prefer.
Enabling observability
Observability is enabled at the project level:
- Navigate to your project settings in LiveKit Cloud
- Open the Data and privacy section
- Toggle on Agent observability
Once enabled, all agent sessions automatically record data. The SDK version must be 1.3.0 or higher for Python, or 1.0.18 or higher for Node.js.
Deploying to LiveKit Cloud
To view observability data in the dashboard, deploy your agent to LiveKit Cloud. Deploying gives you:
- Automatic scaling and load balancing
- Built-in observability integration
- Global network infrastructure
- No server management
Setup and deploy
First, install the LiveKit CLI if you haven't already:
# macOS
brew install livekit-cli
# Linux
curl -sSL https://get.livekit.io/cli | bash
# Windows
winget install LiveKit.LiveKitCLI
Authenticate with LiveKit Cloud:
lk cloud auth
This opens a browser window to link your LiveKit Cloud project to the CLI. If you have multiple projects, list them with lk project list and set a default with lk project set-default.
Deploy your agent:
lk agent create
This command:
- Registers your agent with LiveKit Cloud and assigns a unique ID
- Writes configuration to a livekit.toml file
- Creates a Dockerfile if you don't have one
- Builds and deploys a container image
You'll see build logs stream to your terminal. Once deployment completes, your agent is live and ready to handle sessions.
Viewing session data
Test your deployed agent by opening the Agents page in the LiveKit Cloud dashboard. Click on your agent, then click Playground to open a voice interface in your browser. Have a conversation to generate session data.
After the session completes, observability data uploads within a few seconds. Open the Sessions page, find your session, and click on it. The Agent insights tab contains all the observability data.
Transcript view
The transcript view shows the conversation timeline with audio playback. Inline alerts highlight key events like tool calls. It is good for spotting interruptions and understanding conversation flow.
Trace view
The trace view is where you'll spend most of your optimization time. Each agent response breaks down into spans for STT, LLM, and TTS. Click any span to see detailed timestamps and duration.
Example trace breakdown:
| Span | Duration |
|---|---|
| STT | 180ms |
| LLM (TTFT) | 320ms |
| TTS | 150ms |
| Total TTFA | ~650ms |
If your LLM span is consistently over 500ms, consider a faster model or enabling preemptive generation. If TTS is the bottleneck, try a different voice or provider.
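The same rules of thumb can be written down as a small check over span durations. This is a sketch with hypothetical helper names; it assumes you have copied the span durations out of the trace view by hand.

```python
def find_bottleneck(spans_ms: dict[str, float]) -> tuple[str, float]:
    """Return the slowest span and its share of the total duration."""
    if not spans_ms:
        raise ValueError("no spans to analyze")
    name = max(spans_ms, key=spans_ms.get)
    return name, spans_ms[name] / sum(spans_ms.values())


def suggest(spans_ms: dict[str, float], llm_threshold_ms: float = 500.0) -> list[str]:
    """Apply the two rules of thumb from the text to one trace."""
    tips = []
    if spans_ms.get("LLM", 0.0) > llm_threshold_ms:
        tips.append("LLM: consider a faster model or preemptive generation")
    if find_bottleneck(spans_ms)[0] == "TTS":
        tips.append("TTS: try a different voice or provider")
    return tips


# Durations from the example trace breakdown above.
trace = {"STT": 180, "LLM": 320, "TTS": 150}
name, share = find_bottleneck(trace)
print(f"slowest span: {name} ({share:.0%} of the turn)")
print(suggest(trace) or "no action suggested")
```

For the example trace, the LLM is the largest span but still under the 500ms threshold, so neither rule fires.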
Logs view
The logs view shows runtime messages from your agent code in chronological order. Check here for technical details when something fails or behaves unexpectedly.
Disabling recording per session
To disable recording for a specific session, pass record=False to the session start:
await session.start(
    # ... agent, room_options, etc.
    record=False,  # Disables upload of audio, transcripts, traces, and logs
)
Use this when handling sensitive data or to reduce storage costs for specific sessions.
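One way to decide this per session is to derive the flag from metadata attached to the job. The sketch below is an assumption-heavy illustration: the `should_record` helper and the `sensitive` metadata key are hypothetical, and the commented usage assumes your dispatcher sets JSON metadata on the job.

```python
import json
from typing import Optional


def should_record(job_metadata: Optional[str]) -> bool:
    """Decide whether to record a session from its (hypothetical) JSON metadata.

    No metadata: record, matching the project default.
    Malformed metadata: do not record, since sensitivity cannot be verified.
    """
    if not job_metadata:
        return True
    try:
        data = json.loads(job_metadata)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    # "sensitive" is a made-up key for this example.
    return not bool(data.get("sensitive", False))


# Hypothetical usage inside entrypoint:
#   await session.start(..., record=should_record(ctx.job.metadata))
print(should_record(None))
print(should_record('{"sensitive": true}'))
print(should_record('{"sensitive": false}'))
```

Failing closed on malformed metadata is a design choice: it trades some lost recordings for never uploading a session that was meant to be private.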
Data retention and privacy
All observability data is stored in the US and retained for 30 days. Data older than 30 days is automatically deleted.
You can download audio, transcripts, and logs directly from the session page.
For projects on the free Build plan, some anonymized session data may be retained longer for model improvement purposes. Paid plans (Ship, Scale, Enterprise) have full data deletion after the 30-day window.
Latency budget and pipeline streaming
Low-latency voice requires parallel work across the stack:
- Stream everything: STT, LLM, and TTS should operate incrementally, not in big batches.
- Parallelize: Start TTS as soon as the LLM emits the first words. Don't wait for a full sentence.
- Avoid blocking I/O: Keep tool calls and storage async. Bound your timeouts and retries.
- Keep prompts tight: Fewer tokens means faster first audio and lower cost.
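As an illustration of bounding timeouts and retries, the sketch below wraps an async tool call with a per-attempt timeout and a fixed attempt limit, then falls back to a default value. The `call_with_timeout` helper and the two example tools are hypothetical; only `asyncio.wait_for` is a standard-library call.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def call_with_timeout(
    make_call: Callable[[], Awaitable[Any]],
    *,
    timeout_s: float = 0.5,
    max_attempts: int = 2,
    fallback: Any = None,
) -> Any:
    """Run an async call with a per-attempt timeout and a bounded number of attempts."""
    for _ in range(max_attempts):
        try:
            return await asyncio.wait_for(make_call(), timeout=timeout_s)
        except asyncio.TimeoutError:
            continue  # attempt timed out; try again if attempts remain
    return fallback


slow_calls = []  # records each attempt made against the slow tool


async def slow_tool() -> str:
    slow_calls.append(1)
    await asyncio.sleep(0.2)  # longer than the timeout used below
    return "late"


async def fast_tool() -> str:
    return "sunny"


async def demo() -> tuple[Any, Any]:
    fast = await call_with_timeout(fast_tool, timeout_s=0.05)
    slow = await call_with_timeout(
        slow_tool, timeout_s=0.05, max_attempts=2, fallback="unavailable"
    )
    return fast, slow


fast, slow = asyncio.run(demo())
print(fast, slow, f"slow attempts: {len(slow_calls)}")
```

With these settings the worst case is bounded at roughly `timeout_s * max_attempts`, so a stalled tool cannot hold up the turn indefinitely.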
The metrics you collect at the application level only tell part of the story. Network conditions also affect perceived latency. WebRTC handles this with Opus audio compression, jitter buffering, and adaptive bitrate. If you see unexpectedly high TTFA that doesn't correlate with your pipeline metrics, network conditions may be the culprit.
Use Agent Observability to validate and fine-tune performance. The trace view shows exactly where time is spent in each turn, making it easier to identify bottlenecks.
Metrics to track (recap)
Key metrics worth tracking as you continue building:
- TTFT (time to first token)
- TTFA (time to first audio)
- Interruption rate
- Tool latency
- Fallback activations
Aim for time to first audio under one second in most cases. Use the trace breakdown to identify which component to optimize when you're over budget.
Wrap-up
You've covered:
- Key metrics (TTFA, TTFT) and why they matter
- Usage collection for tokens and audio duration
- Preemptive generation for lower latency
- Deploying to LiveKit Cloud
- Agent Observability for transcripts, traces, and logs
With metrics and observability in place, you can make data-driven decisions as you add tools and workflows in upcoming lessons.