feat(assemblyai): add opt-in filter_disfluencies to STT plugin #5473
Open
dlange-aai wants to merge 2 commits into livekit:main
Suppresses transcript events whose content is entirely English
backchannel/filler tokens ("mm", "mm-hmm", "uh-huh", "um", ...), and
replaces em-dashes with spaces in transcript/utterance/word text. Off
by default; enable via filter_disfluencies=True on STT() or at runtime
via update_options.
Prevents accidental agent interruption: pure-disfluency interim and
preflight events are dropped, and the FINAL_TRANSCRIPT is emitted with
empty text so AudioRecognition's _final_transcript_received flag still
flips (commit_user_turn remains unblocked) while no preemptive
generation or EOU detection is triggered. START_OF_SPEECH and
END_OF_SPEECH are unaffected.
Covers 16 English disfluency tokens including hyphenated compound
forms (mm-hm, mm-hmm, uh-huh, uh-uh) observed in AssemblyAI
universal-streaming output.
Adds common AssemblyAI spelling variants of "um" and "mmhmm" that were leaking through the filter. Extends the set to 19 tokens.
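The event handling described in the commit message can be illustrated with a small standalone sketch. It is not the plugin's code: `EventType`, `Event`, `route_event`, and `demo_predicate` are hypothetical stand-ins invented for illustration, and only the three transcript event kinds named above are modeled.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Callable, Optional


class EventType(Enum):
    # Hypothetical stand-ins for the transcript event kinds named in the PR.
    INTERIM_TRANSCRIPT = "interim_transcript"
    PREFLIGHT_TRANSCRIPT = "preflight_transcript"
    FINAL_TRANSCRIPT = "final_transcript"


@dataclass(frozen=True)
class Event:
    type: EventType
    text: str


def route_event(
    event: Event, is_pure_disfluency: Callable[[str], bool]
) -> Optional[Event]:
    """Return the event to emit, or None if it should be dropped."""
    if not is_pure_disfluency(event.text):
        # Substantive content: pass the event through unchanged.
        return event
    if event.type is EventType.FINAL_TRANSCRIPT:
        # Emit the final event with empty text instead of skipping it, so a
        # downstream "final transcript received" signal can still fire.
        return replace(event, text="")
    # Interim and preflight events that are pure disfluency are dropped.
    return None


def demo_predicate(text: str) -> bool:
    # Deliberately tiny predicate, for demonstration only.
    return text.strip().lower().rstrip(".") in {"um", "mm-hmm"}


dropped = route_event(Event(EventType.INTERIM_TRANSCRIPT, "Um."), demo_predicate)
blanked = route_event(Event(EventType.FINAL_TRANSCRIPT, "Mm-hmm."), demo_predicate)
passed = route_event(Event(EventType.FINAL_TRANSCRIPT, "Um, what?"), demo_predicate)
```

Here `dropped` is `None`, `blanked` is a final event with empty text, and `passed` is the original event. Returning an empty final rather than nothing is the design choice the commit message argues for: the turn can still be committed, while an empty transcript gives downstream handlers nothing to react to.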
davidzhao (Member) reviewed on Apr 17, 2026 and left a comment:
I understand the problem you are trying to solve, but the plugin isn't the right place to solve it.
This is also a solved problem with adaptive interruption handling, a new model we released in March: https://livekit.com/blog/adaptive-interruption-handling
Summary

Adds an opt-in `filter_disfluencies` flag to the AssemblyAI STT plugin that suppresses transcript events whose content is entirely English backchannel/filler tokens (e.g. `mm`, `mm-hmm`, `uh-huh`, `um`, `oh`). It also replaces em-dashes (U+2014) with spaces in the transcript/utterance/word text when enabled. Off by default; no behavior change for existing users.

Motivation. In production voice sessions, short backchannels like `Mm-hmm.` routinely reach `on_preemptive_generation` and `_run_eou_detection`, causing the agent to start replying, and thereby interrupt its own in-flight TTS, even though the user was only acknowledging. AssemblyAI's `universal-streaming-english` API doesn't currently expose a server-side disfluency filter, so this is handled in the plugin.

How it works. Inside `SpeechStream._process_stream_event` on the `Turn` branch only:

- `INTERIM_TRANSCRIPT` / `PREFLIGHT_TRANSCRIPT` with pure-disfluency text are not emitted.
- `FINAL_TRANSCRIPT` with pure-disfluency text is emitted with `text=""` (not skipped). This preserves the `self._final_transcript_received.set()` signal in `AudioRecognition._on_stt_event` so `commit_user_turn()` isn't blocked, while the handler's existing `if not transcript: return` early-exit suppresses `on_final_transcript`, preemptive generation, and EOU detection. See comment at the emission site.
- `END_OF_SPEECH`, `START_OF_SPEECH`, `Begin`, `Termination`, `RECOGNITION_USAGE` paths are untouched; the state machine still reflects that the user made a sound.

Tokenization. `text.lower().split()` + strip trailing ASCII punctuation per token → check membership in a 19-token English disfluency frozenset. Multi-word disfluency-only utterances (`mm mm`) are filtered; any substantive token (`mm hello`) passes the whole utterance through.

Verified in the wild. Exercised across two live sessions against a hotel-booking agent over SIP. Disfluency-only turns (`Um.`, `Mm-hmm.`) are suppressed at all three event levels and produce no agent reply. Mixed-content utterances (`Um, what?`, `Oh, 10/28.`) correctly pass through. Interim-level suppression released the moment real content arrives (`Um` filtered → `Um, what?` passes).

Usage
Test plan

- `ruff check` / `ruff format --check` clean
- `uv run mypy livekit-plugins/livekit-plugins-assemblyai/livekit/plugins/assemblyai/stt.py` → no issues
- `uv run pytest tests/test_plugin_assemblyai_stt.py` → 23 passed (16 new tests covering config, helpers, and integration with `_process_stream_event`)
- `universal-streaming-english`

Notes
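
The tokenization rule described in the summary can be sketched in a few lines of standalone Python. This is an illustration under stated assumptions, not the code in the PR: the helper names `is_pure_disfluency` and `replace_em_dashes` are hypothetical, the token set is only the subset of the 19 tokens that this page actually names, and treating empty text as "not a disfluency" is an assumption.

```python
import string

# Illustrative subset only: the PR describes a 19-token frozenset, but this
# page names just these seven tokens.
DISFLUENCY_TOKENS = frozenset(
    {"mm", "mm-hm", "mm-hmm", "uh-huh", "uh-uh", "um", "oh"}
)

EM_DASH = "\u2014"


def is_pure_disfluency(text: str) -> bool:
    """True if every whitespace-separated token is a known disfluency."""
    # Lowercase, split on whitespace, strip trailing ASCII punctuation.
    tokens = [tok.rstrip(string.punctuation) for tok in text.lower().split()]
    # Assumption: empty text is not treated as a disfluency-only utterance.
    return bool(tokens) and all(tok in DISFLUENCY_TOKENS for tok in tokens)


def replace_em_dashes(text: str) -> str:
    """Replace each em-dash (U+2014) with a single space."""
    return text.replace(EM_DASH, " ")
```

With this sketch, `Um.`, `Mm-hmm.`, and `mm mm` are classified as disfluency-only, while `mm hello`, `Um, what?`, and `Oh, 10/28.` are not, matching the cases listed in the summary. Only trailing punctuation is stripped, so the internal hyphen in `mm-hmm` survives and the compound forms can be matched as single tokens.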