
feat(assemblyai): add opt-in filter_disfluencies to STT plugin#5473

Open
dlange-aai wants to merge 2 commits into livekit:main from dlange-aai:assemblyai-filter_disfluencies
Contributor

@dlange-aai dlange-aai commented Apr 17, 2026

Summary

Adds an opt-in filter_disfluencies flag to the AssemblyAI STT plugin that suppresses transcript events whose content is entirely English backchannel/filler tokens (e.g. mm, mm-hmm, uh-huh, um, oh). It also replaces em-dashes (U+2014) with spaces in the transcript/utterance/word text when enabled. Off by default; no behavior change for existing users.
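The em-dash replacement mentioned above could be sketched as follows. The helper name `replace_em_dashes` is hypothetical, and whether the PR also collapses the resulting whitespace is not stated here, so this sketch leaves spacing untouched.

```python
# U+2014, written as an escape so the source stays ASCII-only.
EM_DASH = "\u2014"


def replace_em_dashes(text: str) -> str:
    """Replace every em-dash with a single space (hypothetical helper)."""
    return text.replace(EM_DASH, " ")
```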

Motivation. In production voice sessions, short backchannels like Mm-hmm. routinely reach on_preemptive_generation and _run_eou_detection, causing the agent to start replying — and thereby interrupt its own in-flight TTS — even though the user was only acknowledging. AssemblyAI's universal-streaming-english API doesn't currently expose a server-side disfluency filter, so this is handled in the plugin.

How it works. Inside SpeechStream._process_stream_event on the Turn branch only:

  • INTERIM_TRANSCRIPT / PREFLIGHT_TRANSCRIPT with pure-disfluency text are not emitted.
  • FINAL_TRANSCRIPT with pure-disfluency text is emitted with text="" (not skipped). This preserves the self._final_transcript_received.set() signal in AudioRecognition._on_stt_event so commit_user_turn() isn't blocked, while the handler's existing if not transcript: return early-exit suppresses on_final_transcript, preemptive generation, and EOU detection. See comment at the emission site.
  • END_OF_SPEECH, START_OF_SPEECH, Begin, Termination, RECOGNITION_USAGE paths are untouched — the state machine still reflects that the user made a sound.
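A minimal sketch of the routing rule described in the bullets above, not the PR's actual code. `EventType`, `Event`, and `filter_turn_event` are hypothetical stand-ins for the plugin's real event types, and the disfluency predicate is passed in so the sketch stays self-contained.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, Optional


class EventType(Enum):
    # Hypothetical stand-in for the three transcript event levels.
    INTERIM_TRANSCRIPT = auto()
    PREFLIGHT_TRANSCRIPT = auto()
    FINAL_TRANSCRIPT = auto()


@dataclass
class Event:
    type: EventType
    text: str


def filter_turn_event(
    event: Event, is_pure_disfluency: Callable[[str], bool]
) -> Optional[Event]:
    """Return the event to emit, or None if it should be dropped."""
    if not is_pure_disfluency(event.text):
        # Any substantive content passes through unchanged.
        return event
    if event.type is EventType.FINAL_TRANSCRIPT:
        # Emit with empty text so downstream "final received" signalling
        # still fires, while handlers that early-exit on empty text skip it.
        return Event(event.type, "")
    # Interim and preflight disfluency-only events are not emitted.
    return None
```

Other event kinds (start/end of speech, usage) would never reach this function, which matches the statement that those paths are untouched.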

Tokenization. text.lower().split() + strip trailing ASCII punctuation per token → check membership in a 19-token English disfluency frozenset. Multi-word disfluency-only utterances (mm mm) are filtered; any substantive token (mm hello) passes the whole utterance through.
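The tokenization rule can be sketched roughly as below. The token set shown is only the subset named in this description, not the PR's full 19-token frozenset, and `is_pure_disfluency` is a hypothetical name. Treating empty text as "not a disfluency" is an assumption of this sketch.

```python
import string

# Partial, illustrative set: only tokens named in the PR description.
DISFLUENCIES = frozenset({"mm", "mm-hmm", "uh-huh", "um", "oh"})


def is_pure_disfluency(text: str) -> bool:
    """True if every token in `text` is a known disfluency."""
    # Lowercase, split on whitespace, strip trailing ASCII punctuation.
    tokens = [tok.rstrip(string.punctuation) for tok in text.lower().split()]
    # Drop tokens that were punctuation only (assumption of this sketch).
    tokens = [tok for tok in tokens if tok]
    # Empty input is not treated as a disfluency (assumption).
    return bool(tokens) and all(tok in DISFLUENCIES for tok in tokens)
```

Because `rstrip` only removes trailing characters, the internal hyphen in `mm-hmm` survives and the compound form matches as a single token.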

Verified in the wild. Exercised across two live sessions against a hotel-booking agent over SIP. Disfluency-only turns (Um., Mm-hmm.) are suppressed at all three event levels and produce no agent reply. Mixed-content utterances (Um, what?, Oh, 10/28.) correctly pass through. Interim-level suppression is released the moment real content arrives (Um is filtered → Um, what? passes).

Usage

```python
from livekit.plugins import assemblyai

stt = assemblyai.STT(filter_disfluencies=True)

# or toggle at runtime
stt.update_options(filter_disfluencies=True)
```

Test plan

  • ruff check / ruff format --check clean
  • uv run mypy livekit-plugins/livekit-plugins-assemblyai/livekit/plugins/assemblyai/stt.py → no issues
  • uv run pytest tests/test_plugin_assemblyai_stt.py → 23 passed (16 new tests covering config, helpers, and integration with _process_stream_event)
  • Integration-tested in two live LiveKit sessions with SIP telephony + Cartesia TTS + AssemblyAI universal-streaming-english

Notes

  • Feature is fully opt-in; existing users see no change.
  • The disfluency token list is intentionally hardcoded (19 English tokens) — not user-configurable — to keep the contract narrow. Happy to make it injectable if maintainers prefer.
  • I'm aware CONTRIBUTING suggests opening an issue for new features first; happy to split into issue + PR if that's preferred for this contribution.

Suppresses transcript events whose content is entirely English
backchannel/filler tokens ("mm", "mm-hmm", "uh-huh", "um", ...), and
replaces em-dashes with spaces in transcript/utterance/word text. Off
by default; enable via filter_disfluencies=True on STT() or at runtime
via update_options.

Prevents accidental agent interruption: pure-disfluency interim and
preflight events are dropped, and the FINAL_TRANSCRIPT is emitted with
empty text so AudioRecognition's _final_transcript_received flag still
flips (commit_user_turn remains unblocked) while no preemptive
generation or EOU detection is triggered. START_OF_SPEECH and
END_OF_SPEECH are unaffected.

Covers 16 English disfluency tokens including hyphenated compound
forms (mm-hm, mm-hmm, uh-huh, uh-uh) observed in AssemblyAI
universal-streaming output.
Adds common AssemblyAI spelling variants of "um" and "mmhmm" that were
leaking through the filter. Extends the set to 19 tokens.
Contributor

@devin-ai-integration devin-ai-integration bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 4 additional findings.


Member

@davidzhao davidzhao left a comment


I understand the problem you are trying to solve, but doing so in the plugin isn't the right spot.

This is also a solved problem with adaptive interruption handling, a new model we released in March: https://livekit.com/blog/adaptive-interruption-handling
