feat(assemblyai): add opt-in filter_disfluencies to STT plugin by dlange-aai · Pull Request #5473 · livekit/agents

dlange-aai · 2026-04-17T04:46:34Z

Summary

Adds an opt-in filter_disfluencies flag to the AssemblyAI STT plugin that suppresses transcript events whose content is entirely English backchannel/filler tokens (e.g. mm, mm-hmm, uh-huh, um, oh). It also replaces em-dashes (U+2014) with spaces in the transcript/utterance/word text when enabled. Off by default; no behavior change for existing users.

Motivation. In production voice sessions, short backchannels like "Mm-hmm." routinely reach on_preemptive_generation and _run_eou_detection, causing the agent to start replying (and thereby interrupt its own in-flight TTS) even though the user was only acknowledging. AssemblyAI's universal-streaming-english API doesn't currently expose a server-side disfluency filter, so this is handled in the plugin.

How it works. Inside SpeechStream._process_stream_event on the Turn branch only:

INTERIM_TRANSCRIPT / PREFLIGHT_TRANSCRIPT with pure-disfluency text are not emitted.
FINAL_TRANSCRIPT with pure-disfluency text is emitted with text="" (not skipped). This preserves the self._final_transcript_received.set() signal in AudioRecognition._on_stt_event so commit_user_turn() isn't blocked, while the handler's existing if not transcript: return early-exit suppresses on_final_transcript, preemptive generation, and EOU detection. See comment at the emission site.
END_OF_SPEECH, START_OF_SPEECH, Begin, Termination, RECOGNITION_USAGE paths are untouched — the state machine still reflects that the user made a sound.
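The per-event decision described above can be sketched roughly as follows. This is an illustration only, not the plugin's code: EventKind and text_to_emit are hypothetical names, and the disfluency check is passed in as a predicate so the sketch stays independent of the actual token list.

```python
from enum import Enum, auto
from typing import Callable, Optional


class EventKind(Enum):
    # Hypothetical stand-in for the speech event types named in the PR.
    INTERIM_TRANSCRIPT = auto()
    PREFLIGHT_TRANSCRIPT = auto()
    FINAL_TRANSCRIPT = auto()
    END_OF_SPEECH = auto()


_TRANSCRIPT_KINDS = (
    EventKind.INTERIM_TRANSCRIPT,
    EventKind.PREFLIGHT_TRANSCRIPT,
    EventKind.FINAL_TRANSCRIPT,
)


def text_to_emit(
    kind: EventKind, text: str, is_filler: Callable[[str], bool]
) -> Optional[str]:
    """Return the text to emit for an event, or None to drop the event."""
    # Non-transcript events and substantive transcripts pass through unchanged.
    if kind not in _TRANSCRIPT_KINDS or not is_filler(text):
        return text
    # A pure-disfluency final is still emitted, but with empty text, so the
    # downstream "final transcript received" signal still fires.
    if kind is EventKind.FINAL_TRANSCRIPT:
        return ""
    # Pure-disfluency interim and preflight events are dropped entirely.
    return None
```

Returning an empty string rather than None for the final event mirrors the design choice described in the PR: the final must still arrive so the turn can be committed, while the empty text keeps downstream handlers from acting on it.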

Tokenization. text.lower().split() + strip trailing ASCII punctuation per token → check membership in a 19-token English disfluency frozenset. Multi-word disfluency-only utterances (mm mm) are filtered; any substantive token (mm hello) passes the whole utterance through.
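A minimal sketch of that check, assuming a helper named is_pure_disfluency (a hypothetical name) and a token set limited to the tokens mentioned in this PR. The real set has 19 entries and is not reproduced here; treating empty text as non-disfluency is also an assumption.

```python
import string

# Only the tokens named in this PR description; the actual frozenset
# in the plugin is larger (19 tokens).
_DISFLUENCIES = frozenset(
    {"mm", "mm-hm", "mm-hmm", "uh-huh", "uh-uh", "um", "uhm", "mhmm", "oh"}
)


def is_pure_disfluency(text: str) -> bool:
    """True if every whitespace-separated token is a known disfluency."""
    # Lowercase, split on whitespace, strip trailing ASCII punctuation per token.
    tokens = [tok.rstrip(string.punctuation) for tok in text.lower().split()]
    # Assumption: empty or whitespace-only text is not treated as a disfluency.
    if not tokens:
        return False
    # Any substantive token lets the whole utterance pass through.
    return all(tok in _DISFLUENCIES for tok in tokens)
```

Because only trailing punctuation is stripped, hyphenated forms such as mm-hmm survive intact and must appear in the set as whole tokens.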

Verified in the wild. Exercised across two live sessions against a hotel-booking agent over SIP. Disfluency-only turns ("Um.", "Mm-hmm.") are suppressed at all three event levels and produce no agent reply. Mixed-content utterances ("Um, what?", "Oh, 10/28.") correctly pass through. Interim-level suppression is released the moment real content arrives ("Um" is filtered, then "Um, what?" passes).

Usage

from livekit.plugins import assemblyai

stt = assemblyai.STT(filter_disfluencies=True)

# or toggle at runtime
stt.update_options(filter_disfluencies=True)

Test plan

ruff check / ruff format --check clean
uv run mypy livekit-plugins/livekit-plugins-assemblyai/livekit/plugins/assemblyai/stt.py → no issues
uv run pytest tests/test_plugin_assemblyai_stt.py → 23 passed (16 new tests covering config, helpers, and integration with _process_stream_event)
Integration-tested in two live LiveKit sessions with SIP telephony + Cartesia TTS + AssemblyAI universal-streaming-english

Notes

Feature is fully opt-in; existing users see no change.
The disfluency token list is intentionally hardcoded (19 English tokens) — not user-configurable — to keep the contract narrow. Happy to make it injectable if maintainers prefer.
I'm aware CONTRIBUTING suggests opening an issue for new features first; happy to split into issue + PR if that's preferred for this contribution.

Suppresses transcript events whose content is entirely English backchannel/filler tokens ("mm", "mm-hmm", "uh-huh", "um", ...), and replaces em-dashes with spaces in transcript/utterance/word text. Off by default; enable via filter_disfluencies=True on STT() or at runtime via update_options. Prevents accidental agent interruption: pure-disfluency interim and preflight events are dropped, and the FINAL_TRANSCRIPT is emitted with empty text so AudioRecognition's _final_transcript_received flag still flips (commit_user_turn remains unblocked) while no preemptive generation or EOU detection is triggered. START_OF_SPEECH and END_OF_SPEECH are unaffected. Covers 16 English disfluency tokens including hyphenated compound forms (mm-hm, mm-hmm, uh-huh, uh-uh) observed in AssemblyAI universal-streaming output.

devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

davidzhao

I understand the problem you are trying to solve, but doing so in the plugin isn't the right spot.

This is also a solved problem with adaptive interruption handling, a new model we released in March: https://livekit.com/blog/adaptive-interruption-handling

dlange-aai added 2 commits April 17, 2026 00:13

feat(assemblyai): add "uhm" and "mhmm" disfluency variants

f9021b3

Common AssemblyAI spelling variants of "um" and "mmhmm" that were leaking through the filter. Extends the set to 19 tokens.

devin-ai-integration bot reviewed Apr 17, 2026

davidzhao reviewed Apr 17, 2026

dlange-aai wants to merge 2 commits into livekit:main from dlange-aai:assemblyai-filter_disfluencies
