feat(stt): back-date START_OF_SPEECH onset via server-provided timestamp#5479

Open
gsharp-aai wants to merge 6 commits into livekit:main from gsharp-aai:gsharp/stt-speech-start-time-fallback

Conversation


@gsharp-aai gsharp-aai commented Apr 18, 2026

Summary

When turn_detection="stt" is used alongside a local VAD plugin (e.g. Silero), the framework records _speech_start_time from whichever event handler — VAD or STT — sets it first.

When local VAD fires for the audio, this works fine — the VAD handler back-dates _speech_start_time via time.time() - speech_duration - inference_duration and the existing None-guard prevents the later STT START_OF_SPEECH event from overwriting it.

But when the local VAD does not fire for that audio (different model version, different acoustic threshold, different preprocessing — common at quiet/borderline volumes), _speech_start_time stays None until the STT START_OF_SPEECH arrives — at which point the framework falls back to time.time() at message arrival. If the STT's speech-onset signal and its first transcript arrive close together, the framework's _speech_start_time and _last_speaking_time end up pinned near the same wall-clock instant. Result: MetricsReport.started_speaking_at ≈ stopped_speaking_at, i.e. speech_duration ≈ 0s for any turn the local VAD missed.

This is unphysical (real audio was transcribed), breaks downstream analytics keyed on speech duration, and creates state inconsistency where a user turn commits without ever entering a meaningful "speaking" window.
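The collapse described above can be illustrated numerically (timestamps below are fabricated for illustration):

```python
# Toy illustration: when SpeechStarted and the first transcript arrive
# in the same network burst, the two time.time() reads land milliseconds
# apart, so the reported speech duration collapses to ~0.
started_speaking_at = 1_700_000_010.002  # time.time() at STT SOS arrival
stopped_speaking_at = 1_700_000_010.004  # time.time() at last transcript
speech_duration = stopped_speaking_at - started_speaking_at
# ~0.002 s, even though ~1.5 s of real audio was transcribed
```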

What this PR changes

Framework

  1. Adds an optional speech_start_time: float | None = None field on SpeechEvent. Plugins that receive a separate speech-onset signal with timing can populate it; when left None the framework's STT SOS handler falls back to time.time() at message arrival (its current behavior).
  2. Updates the STT SOS handler in audio_recognition.py to set _speech_start_time from ev.speech_start_time only when it's still None (i.e. local VAD hasn't fired first). When VAD has already set it, the VAD-back-dated value is preserved.
  3. Extends the RecognitionHooks.on_start_of_speech protocol with a required speech_start_time: float parameter and threads the authoritative onset through it from both SOS handlers. Each handler computes its own authoritative onset locally (VAD back-dates from the VAD event; STT reads _speech_start_time) and passes a concrete value in — no ambiguity about which input wins at call time. All downstream state — DynamicEndpointing._utterance_started_at, _user_speaking_span.start_time, UserStateChangedEvent.created_at — reads from that single value.
  4. Simplifies AgentActivity.on_start_of_speech to drop its internal VAD-event back-dating and time.time() fallback, since the caller now always provides the authoritative onset.

AssemblyAI plugin

  1. Parses the SpeechStarted.timestamp field (stream-relative ms) that the plugin currently discards, converts it to wall-clock via a _stream_wall_start anchor recorded when the first audio frame is sent, and populates SpeechEvent.speech_start_time on the emitted START_OF_SPEECH event.

Why this is a safe fallback

Strictly additive. Every turn where local VAD fires is unaffected — _speech_start_time is already set by the VAD handler (its back-date is more accurate, computed locally with no network delay), the None-guard preserves it, and the same value flows through the hook to every downstream consumer. The STT-provided timestamp is only consulted when _speech_start_time is still None at STT START_OF_SPEECH arrival, i.e. exactly the case where local VAD missed the audio the STT caught.

Provider-side fallback (if you'd prefer not to add the field)

If a SpeechEvent schema change isn't desirable, the same outcome can be achieved without touching the framework: a plugin can pass through the back-dated time on the existing SpeechData.start_time field by attaching a synthetic SpeechData to the START_OF_SPEECH event's alternatives list. Functionally equivalent but semantically off (SpeechData is meant for transcription hypotheses, not event metadata), so the explicit field is preferred. Happy to switch if maintainers prefer to defer the schema change.
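For concreteness, the rejected alternative could look roughly like this (the dataclasses below are simplified stand-ins for the real livekit-agents types, reconstructed from the PR text, not verified against the library):

```python
from dataclasses import dataclass, field


@dataclass
class SpeechData:  # simplified stand-in for the real class
    language: str = ""
    text: str = ""
    start_time: float = 0.0
    end_time: float = 0.0


@dataclass
class SpeechEvent:  # simplified stand-in
    type: str
    alternatives: list[SpeechData] = field(default_factory=list)


def make_sos_event(onset_wall_clock: float) -> SpeechEvent:
    ev = SpeechEvent(type="START_OF_SPEECH")
    # Synthetic, empty-text SpeechData carrying only timing metadata:
    # functionally equivalent but semantically off, per the PR description.
    ev.alternatives.append(SpeechData(start_time=onset_wall_clock))
    return ev
```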

Files changed

  • livekit-agents/livekit/agents/stt/stt.py — new optional field on SpeechEvent
  • livekit-agents/livekit/agents/voice/audio_recognition.py — tighten RecognitionHooks.on_start_of_speech to require speech_start_time: float; set _speech_start_time from ev.speech_start_time under the None-guard; pass self._speech_start_time to the hook at the STT SOS call site; pass the locally-computed back-date at the VAD SOS call site
  • livekit-agents/livekit/agents/voice/agent_activity.py — accept the required speech_start_time kwarg on AgentActivity.on_start_of_speech; internal fallback logic removed
  • livekit-plugins/livekit-plugins-assemblyai/livekit/plugins/assemblyai/stt.py — anchor _stream_wall_start on first frame; parse SpeechStarted.timestamp and populate SpeechEvent.speech_start_time

Test plan

  • `make format lint type-check` pass
  • Unit/integration coverage for the new field — happy to add on request

Adds an optional SpeechEvent.speech_start_time field for STT plugins that
receive a separate speech-onset signal with timing data, and uses it in
audio_recognition.py to back-date _speech_start_time on STT START_OF_SPEECH
events when local VAD has not fired.

Without this, when local VAD does not detect audio that the STT does
(e.g. quiet utterances near the activation threshold), _speech_start_time
gets pinned to message arrival wall-clock. Because providers like
AssemblyAI gate SpeechStarted behind the first partial transcript (so
SpeechStarted and the first transcript arrive in the same network burst),
this collapses _speech_start_time and _last_speaking_time onto the same
timestamp, producing MetricsReport.speech_duration = 0.0s exactly.

The framework's existing None-guard makes this strictly additive: VAD wins
when it fires (its back-date is more accurate, computed locally on the
audio path with no network delay). The STT timestamp is consulted only
when _speech_start_time remains None at STT SOS arrival.

Populates the new field from the AssemblyAI plugin by parsing
SpeechStarted.timestamp (stream-relative ms), anchored to wall-clock via
a new _stream_wall_start recorded when the first audio frame is sent.

Contributing guide says contributors don't need to touch CHANGELOG or
package manifests — maintainers handle versioning. Shortening the
docstring to match local conventions on existing fields.

Previous implementation computed a local stt_speech_start_time and
unconditionally passed it to the on_start_of_speech hook, even when local
VAD had already fired and set _speech_start_time. Downstream consumers of
the hook (e.g. DynamicEndpointing._utterance_started_at) unconditionally
overwrote their own state with that value, causing the STT server's
back-dated onset to shift endpointing statistics by up to ~750ms whenever
the VAD-fires-first path was exercised.

Tighten to a single source of truth: if _speech_start_time is already set
(VAD fired first), preserve it and pass it through to the hook. Only fall
back to the STT's server-provided onset when _speech_start_time is None
(VAD didn't fire). Zero observable change in the common case; corrects
downstream state in the edge case.
@gsharp-aai gsharp-aai marked this pull request as draft April 18, 2026 00:25
Previous revision had two sources of onset time in `on_start_of_speech`:
an optional `speech_start_time` kwarg and a `VADEvent` that could be
back-dated. The "who wins" policy lived partly inside the function and
partly at the STT call site, making the contract harder to read.

Make `speech_start_time` a required parameter and push back-dating to
each call site. `audio_recognition` now computes the authoritative onset
at both SOS handlers (VAD's back-dated time for the VAD handler; VAD's
back-date or the STT server timestamp for the STT handler) and hands
a single value in. `AgentActivity.on_start_of_speech` drops its internal
fallback logic and simply uses what it's given.

No behavior change.
…amp=0

Two fixes from review:

1. _stream_wall_start was set in __init__ and only re-set on the first
   audio frame, so after the base class's _run() retry path reconnects
   the WebSocket, the anchor still pointed at the original connection's
   first frame while the server's timestamps restarted at 0. All
   subsequent SpeechStarted-derived onsets were shifted into the past
   by however long prior connections ran. Reset at the top of _run()
   so the next first-frame send re-anchors it.

2. data.get("timestamp", 0) + truthy check conflated an absent field
   with a legitimate timestamp=0 (onset at stream start). Use
   data.get("timestamp") + `is not None` so a real 0-ms onset
   converts to wall-clock instead of falling back to arrival time.
@gsharp-aai gsharp-aai marked this pull request as ready for review April 18, 2026 00:38