Metrics and Preemptive Generation

Measuring performance and shaving latency

Your agent now sounds great and survives outages. Next, we need visibility. Metrics help you understand usage, latency, and failure modes.

We'll also add something called preemptive generation. This reduces response delay by streaming the first tokens while the user is still speaking.

Capturing usage metrics

Add these imports near the top of agent.py:

from livekit.agents import AgentStateChangedEvent, MetricsCollectedEvent, metrics

Inside entrypoint, before session.start(...), we'll wire up the collectors:

usage_collector = metrics.UsageCollector()
last_eou_metrics: metrics.EOUMetrics | None = None

@session.on("metrics_collected")
def _on_metrics_collected(ev: MetricsCollectedEvent):
    nonlocal last_eou_metrics
    if ev.metrics.type == "eou_metrics":
        last_eou_metrics = ev.metrics

    metrics.log_metrics(ev.metrics)
    usage_collector.collect(ev.metrics)


async def log_usage():
    summary = usage_collector.get_summary()
    logger.info("Usage summary: %s", summary)


ctx.add_shutdown_callback(log_usage)

This captures per-turn statistics and logs a summary when the worker shuts down.

Tracking time to first audio

When the agent begins speaking, we want to know how long the human waited:

@session.on("agent_state_changed")
def _on_agent_state_changed(ev: AgentStateChangedEvent):
    if (
        ev.new_state == "speaking"
        and last_eou_metrics
        and session.current_speech
        and last_eou_metrics.speech_id == session.current_speech.id
    ):
        delta = ev.created_at - last_eou_metrics.last_speaking_time
        logger.info("Time to first audio frame: %sms", delta.total_seconds() * 1000)

Keep an eye on this metric as you change providers or prompts.

Enabling preemptive generation

Preemptive generation lets the LLM start crafting a reply before the user fully finishes their sentence. Add this flag to AgentSession:

session = AgentSession(
    # ...
    preemptive_generation=True,
)

You can use our new metrics to compare before and after. You should see faster first audio frames, especially for long user turns.

Optional: Langfuse tracing

Langfuse is an open source platform that we can use to display our traces and metrics on an easy-to-grok dashboard.

This is what Langfuse looks like:
Langfuse Dashboard

It doesn't require a credit card, and we'll get an API key right after signing up.

If you want end-to-end tracing in Langfuse:

Add credentials to .env.local:

   LANGFUSE_PUBLIC_KEY=
   LANGFUSE_SECRET_KEY=
   LANGFUSE_HOST=

Import the telemetry helpers:

   from livekit.agents.telemetry import set_tracer_provider
   import base64
   import os

Define a helper:

   def setup_langfuse(host=None, public_key=None, secret_key=None):
       from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
       from opentelemetry.sdk.trace import TracerProvider
       from opentelemetry.sdk.trace.export import BatchSpanProcessor

       public_key = public_key or os.getenv("LANGFUSE_PUBLIC_KEY")
       secret_key = secret_key or os.getenv("LANGFUSE_SECRET_KEY")
       host = host or os.getenv("LANGFUSE_HOST")

       if not public_key or not secret_key or not host:
           raise ValueError("LANGFUSE keys are required")

       auth = base64.b64encode(f"{public_key}:{secret_key}".encode()).decode()
       os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = f"{host.rstrip('/')}/api/public/otel"
       os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = f"Authorization=Basic {auth}"

       provider = TracerProvider()
       provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
       set_tracer_provider(provider)

Call it at the top of entrypoint if you want traces:

   async def entrypoint(ctx: JobContext):
       setup_langfuse()
       # rest of setup…

And that's it! All of your call traces and telemetry will now be sent to Langfuse in addition to being reported in the console by your agent.

You don't need to use Langfuse, but it's helpful to know how you can push telemetry to any opentelemetry compatible endpoint using your Agent.

In this section we added:

Console logs show per-component metrics and time-to-first-audio
Latency improved when you enabled preemptive_generation
Optional tracing emits spans to Langfuse when configured

With metrics in place, you can make data-driven decisions as you add tools and workflows in upcoming lessons.

Latency budget and pipeline streaming

Low-latency voice requires parallel work across the stack:

Stream everything: STT, LLM, and TTS should operate incrementally, not in big batches.
Parallelize: start TTS as soon as the LLM emits the first words; don’t wait for a full sentence.
Avoid blocking I/O: keep tool calls and storage async; bound timeouts and retries.
Keep prompts tight: fewer tokens means faster first audio and lower cost.

Track where time goes: STT first-token, LLM first-token, TTS first-audio, and overall time-to-first-audio per turn, and compare to your target budget (aim for <1000ms in the large majority of cases)

LLMs that give higher quality responses may have a higher time to first token. This may be worth the trade for you, but you may want to trade away latency on another part of the stack to keep response times fast.

This is one of the biggest benefits of having total control over the pipeline. You can optimize for exactly what your business needs.

Why WebRTC for voice

HTTP (TCP) adds head-of-line blocking and lacks audio semantics; good for text, bad for real time.
WebSockets improve persistence but still ride on TCP with the same blocking issues.
WebRTC builds on UDP with audio-first features: Opus compression, per-packet timestamps, and real-time network adaptation for conversational latency.

Metrics that you should track

Time to first LLM token (TTFT)
User interruption/barge-in rate
Tool latency and failure rate; fallback activations
Time to first audio frame (TTFA)
STT accuracy proxies (e.g., correction requests, intent reversals)