feat(stt): back-date START_OF_SPEECH onset via server-provided timestamp #5479
Open
gsharp-aai wants to merge 6 commits into livekit:main
Conversation
Adds an optional `SpeechEvent.speech_start_time` field for STT plugins that receive a separate speech-onset signal with timing data, and uses it in `audio_recognition.py` to back-date `_speech_start_time` on STT `START_OF_SPEECH` events when local VAD has not fired.

Without this, when local VAD does not detect audio that the STT does (e.g. quiet utterances near the activation threshold), `_speech_start_time` gets pinned to message-arrival wall-clock. Because providers like AssemblyAI gate SpeechStarted behind the first partial transcript (so SpeechStarted and the first transcript arrive in the same network burst), this collapses `_speech_start_time` and `_last_speaking_time` onto the same timestamp, producing `MetricsReport.speech_duration = 0.0s` exactly.

The framework's existing None-guard makes this strictly additive: VAD wins when it fires (its back-date is more accurate, computed locally on the audio path with no network delay). The STT timestamp is consulted only when `_speech_start_time` remains `None` at STT SOS arrival.

Populates the new field from the AssemblyAI plugin by parsing `SpeechStarted.timestamp` (stream-relative ms), anchored to wall-clock via a new `_stream_wall_start` recorded when the first audio frame is sent.
Contributing guide says contributors don't need to touch CHANGELOG or package manifests — maintainers handle versioning. Shortening the docstring to match local conventions on existing fields.
Previous implementation computed a local `stt_speech_start_time` and unconditionally passed it to the `on_start_of_speech` hook, even when local VAD had already fired and set `_speech_start_time`. Downstream consumers of the hook (e.g. `DynamicEndpointing._utterance_started_at`) unconditionally overwrote their own state with that value, causing the STT server's back-dated onset to shift endpointing statistics by up to ~750ms whenever the VAD-fires-first path was exercised. Tighten to a single source of truth: if `_speech_start_time` is already set (VAD fired first), preserve it and pass it through to the hook. Only fall back to the STT's server-provided onset when `_speech_start_time` is `None` (VAD didn't fire). Zero observable change in the common case; corrects downstream state in the edge case.
Previous revision had two sources of onset time in `on_start_of_speech`: an optional `speech_start_time` kwarg and a `VADEvent` that could be back-dated. The "who wins" policy lived partly inside the function and partly at the STT call site, making the contract harder to read. Make `speech_start_time` a required parameter and push back-dating to each call site. `audio_recognition` now computes the authoritative onset at both SOS handlers (VAD's back-dated time for the VAD handler; VAD's back-date or the STT server timestamp for the STT handler) and hands a single value in. `AgentActivity.on_start_of_speech` drops its internal fallback logic and simply uses what it's given. No behavior change.
…amp=0

Two fixes from review:

1. `_stream_wall_start` was set in `__init__` and only re-set on the first audio frame, so after the base class's `_run()` retry path reconnects the WebSocket, the anchor still pointed at the original connection's first frame while the server's timestamps restarted at 0. All subsequent SpeechStarted-derived onsets were shifted into the past by however long prior connections ran. Reset at the top of `_run()` so the next first-frame send re-anchors it.
2. `data.get("timestamp", 0)` + truthy check conflated an absent field with a legitimate timestamp=0 (onset at stream start). Use `data.get("timestamp")` + `is not None` so a real 0-ms onset converts to wall-clock instead of falling back to arrival time.
Summary
When `turn_detection="stt"` is used alongside a local VAD plugin (e.g. Silero), the framework records `_speech_start_time` from whichever event handler — VAD or STT — sets it first.

When local VAD fires for the audio, this works fine — the VAD handler back-dates `_speech_start_time` via `time.time() - speech_duration - inference_duration` and the existing None-guard prevents the later STT `START_OF_SPEECH` event from overwriting it.

But when the local VAD does not fire for that audio (different model version, different acoustic threshold, different preprocessing — common at quiet/borderline volumes), `_speech_start_time` stays `None` until the STT `START_OF_SPEECH` arrives — at which point the framework falls back to `time.time()` at message arrival. If the STT's speech-onset signal and its first transcript arrive close together, the framework's `_speech_start_time` and `_last_speaking_time` end up pinned near the same wall-clock instant. Result: `MetricsReport.started_speaking_at ≈ stopped_speaking_at`, i.e. `speech_duration ≈ 0s` for any turn the local VAD missed.

This is unphysical (real audio was transcribed), breaks downstream analytics keyed on speech duration, and creates state inconsistency where a user turn commits without ever entering a meaningful "speaking" window.
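A toy numeric illustration of the failure mode (all numbers invented): when SpeechStarted and the first transcript arrive in the same network burst, both timestamps collapse onto the arrival instant.

```python
# Invented timestamps illustrating the collapsed-duration bug described above.
arrival = 1_700_000_010.0            # wall clock when the network burst arrives

speech_start_time = arrival          # old fallback: time.time() at STT SOS arrival
last_speaking_time = arrival + 0.002 # first transcript, milliseconds later

speech_duration = last_speaking_time - speech_start_time
# ~0.0 s, even though the user actually spoke for e.g. ~1.5 s before the burst
```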
What this PR changes
Framework
- A new `speech_start_time: float | None = None` field on `SpeechEvent`. Plugins that receive a separate speech-onset signal with timing can populate it; when left `None` the framework's STT SOS handler falls back to `time.time()` at message arrival (its current behavior).
- `audio_recognition.py` now sets `_speech_start_time` from `ev.speech_start_time` only when it's still `None` (i.e. local VAD hasn't fired first). When VAD has already set it, the VAD-back-dated value is preserved.
- The `RecognitionHooks.on_start_of_speech` protocol gains a required `speech_start_time: float` parameter, and the authoritative onset is threaded through it from both SOS handlers. Each handler computes its own authoritative onset locally (VAD back-dates from the VAD event; STT reads `_speech_start_time`) and passes a concrete value in — no ambiguity about which input wins at call time. All downstream state — `DynamicEndpointing._utterance_started_at`, `_user_speaking_span.start_time`, `UserStateChangedEvent.created_at` — reads from that single value.
- `AgentActivity.on_start_of_speech` drops its internal VAD-event back-dating and `time.time()` fallback, since the caller now always provides the authoritative onset.

AssemblyAI plugin
Parses the `SpeechStarted.timestamp` field (stream-relative ms) that the plugin currently discards, converts it to wall-clock via a `_stream_wall_start` anchor recorded when the first audio frame is sent, and populates `SpeechEvent.speech_start_time` on the emitted `START_OF_SPEECH` event.

Why this is a safe fallback
Strictly additive. Every turn where local VAD fires is unaffected — `_speech_start_time` is already set by the VAD handler (its back-date is more accurate, computed locally with no network delay), the None-guard preserves it, and the same value flows through the hook to every downstream consumer. The STT-provided timestamp is only consulted when `_speech_start_time` is still `None` at STT `START_OF_SPEECH` arrival, i.e. exactly the case where local VAD missed the audio the STT caught.

Provider-side fallback (if you'd prefer not to add the field)
If a `SpeechEvent` schema change isn't desirable, the same outcome can be achieved without touching the framework: a plugin can pass through the back-dated time on the existing `SpeechData.start_time` field by attaching a synthetic `SpeechData` to the `START_OF_SPEECH` event's `alternatives` list. Functionally equivalent but semantically off (`SpeechData` is meant for transcription hypotheses, not event metadata), so the explicit field is preferred. Happy to switch if maintainers prefer to defer the schema change.

Files changed
- `livekit-agents/livekit/agents/stt/stt.py` — new optional field on `SpeechEvent`
- `livekit-agents/livekit/agents/voice/audio_recognition.py` — tighten `RecognitionHooks.on_start_of_speech` to require `speech_start_time: float`; set `_speech_start_time` from `ev.speech_start_time` under the None-guard; pass `self._speech_start_time` to the hook at the STT SOS call site; pass the locally-computed back-date at the VAD SOS call site
- `livekit-agents/livekit/agents/voice/agent_activity.py` — accept the required `speech_start_time` kwarg on `AgentActivity.on_start_of_speech`; internal fallback logic removed
- `livekit-plugins/livekit-plugins-assemblyai/livekit/plugins/assemblyai/stt.py` — anchor `_stream_wall_start` on first frame; parse `SpeechStarted.timestamp` and populate `SpeechEvent.speech_start_time`

Test plan

- `make format lint type-check` pass