Skip to content

Improved speaker detection with GLiNER + expanded regex#6735

Open
MithilSaiReddy wants to merge 4 commits intoBasedHardware:mainfrom
MithilSaiReddy:main
Open

Improved speaker detection with GLiNER + expanded regex#6735
MithilSaiReddy wants to merge 4 commits intoBasedHardware:mainfrom
MithilSaiReddy:main

Conversation

@MithilSaiReddy
Copy link
Copy Markdown

@MithilSaiReddy MithilSaiReddy commented Apr 17, 2026

Improve Speaker Detection (Issue #3039)

Description
Improves speaker name extraction accuracy by combining GLiNER (NER) with enhanced regex detection, better handling of lowercase ASR output, and fixes for incorrect extraction in self-introduction phrases.


What this delivers

  • Detects speaker names from natural phrases like “this is bob”, “hey it’s charlie”, etc.
  • Handles lowercase ASR output and normalizes it to properly capitalized names.
  • Prevents GLiNER from returning full phrases instead of names in self-introductions.
  • Improves support for multi-word names (e.g., “John Smith”).
  • Maintains backward compatibility with existing detection logic.

Key implementation

GLiNER filtering

  • Skips NER for self-introduction phrases across 30+ languages.
  • Introduces _contains_intro_phrase() to avoid incorrect entity extraction.

Expanded regex patterns

  • Adds support for common patterns missed by GLiNER:
    • “This is X”
    • “Hey it’s X”
    • “Call me X”
    • “You’re speaking with X”
    • “You’re talking to X”
    • “The name’s X”
  • Supports both single and multi-word names.

Lowercase ASR handling

  • Processes raw inputs like “this is bob”.
  • Removes filler words (e.g., “hi”, “uh”, “um”).
  • Filters non-name tokens.
  • Normalizes output to properly capitalized names.

Multi-word name fix

  • Ensures correct capitalization for names like “John Smith”.

Impact

Improves speaker identification accuracy in real-world ASR scenarios, especially for noisy, lowercase, and conversational inputs, without introducing regressions or additional dependencies.


Test Results

  • ✅ 48/48 tests passing
  • 🌍 Coverage across 33 languages
  • 🔒 No regressions observed

Considerations

  • Performance: No new dependencies; regex operates in O(n); GLiNER usage remains cached.
  • Privacy: No additional data collection; operates on transient ASR text only.
  • Reliability: Fully covered by test suite with fallback handling for edge cases.

Docs

  • No user-facing changes; existing documentation and inline comments are sufficient.

AI Usage

  • Claude and OpenCode were used to assist with regex design, edge case handling, and PR structuring.
  • All logic and changes were manually reviewed and validated.

Files

  • backend/utils/speaker_identification.py
  • .gitignore

Video Recording (Test Cases)

Speaker Detection Demo

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Apr 17, 2026

Greptile Summary

This PR integrates GLiNER NER as a speaker-detection layer in speaker_identification.py, adds expanded English regex patterns ("This is X", "Call me X", etc.), and introduces lowercase ASR fallback handling. The new batch_detect_speakers_from_texts async helper and a 258-line test suite are also included.

  • P1 – in-function import: from gliner import GLiNER inside _get_gliner_model() directly violates the project's "no in-function imports" rule; since gliner is now in requirements.txt, it should be a top-level import.
  • P1 – _contains_intro_phrase false positives: plain substring matching on short tokens ("я", "olen", "sono", "soy") bypasses GLiNER for entire Slavic-language conversations and for common proper nouns, silently degrading detection quality.
  • P1 – _clean_person_name two-letter name bug: the len(first_word) <= 2 guard discards valid two-letter first names (Ed, Bo, Jo) and returns the second word instead.

Confidence Score: 2/5

Not safe to merge — three P1 logic bugs need fixes before landing

The in-function import is a clear rule violation per CLAUDE.md; the substring-matching false-positive in _contains_intro_phrase silently breaks GLiNER bypass for entire languages (e.g., all Russian text containing 'я'); and _clean_person_name incorrectly discards two-letter first names. All three are present-defect correctness issues on the changed code path.

backend/utils/speaker_identification.py (all three P1 issues); backend/requirements.txt (unpinned gliner version)

Important Files Changed

Filename Overview
backend/utils/speaker_identification.py Adds GLiNER-based NER for speaker detection; contains in-function import violation, substring false-positive logic in _contains_intro_phrase for short Slavic tokens, and a 2-letter first-name truncation bug in _clean_person_name
backend/requirements.txt Adds gliner as a new ML dependency with a loose minimum-version constraint (>=0.2.0) contrary to the exact-pin convention used throughout the rest of the file
backend/tests/unit/test_gliner_ner.py New unit test file with reasonable coverage of _clean_person_name and detect_speaker_from_text; tests for 'This is John' actually exercise the lowercase-fallback path, not GLiNER itself
backend/test.sh New test file correctly added to the CI test runner
.gitignore Adds /tmp/, *.pyc, and pycache/ ignore entries; pycache/ is usually already covered by defaults and /tmp/ is very broad

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["detect_speaker_from_text(text)"] --> B{text empty\nor len < 3?}
    B -->|Yes| Z[return None]
    B -->|No| C["_detect_person_entities_cached(text)"]
    C --> D{"_contains_intro_phrase(text)?"}
    D -->|Yes| E["Skip GLiNER\nreturn ([], True)"]
    D -->|No| F["GLiNER model.predict_entities()"]
    F -->|Exception| G["return ([], False)"]
    F -->|Success| H["Filter intro-phrase words\n+ len >= 2"]
    H --> I["return (persons, True)"]
    E --> J{ner_available\nand persons?}
    I --> J
    G --> J
    J -->|Yes| K["_clean_person_name(person)\nfor each person"]
    K --> L{cleaned not None?}
    L -->|Yes| M[return cleaned name]
    L -->|No| N[try next person]
    J -->|No| O["Try regex patterns\n(patterns_to_check)"]
    N --> O
    O --> P{regex match?}
    P -->|Yes| Q[return capitalized name]
    P -->|No| R["Strip filler words\nfrom text_lower"]
    R --> S{"startswith 'this is'?"}
    S -->|Yes| T[return first_word.capitalize]
    S -->|No| V{"startswith intro_phrases?"}
    V -->|Yes| W[return first_word.capitalize]
    V -->|No| Z2[return None]
Loading

Reviews (1): Last reviewed commit: "Merge branch 'BasedHardware:main' into m..." | Re-trigger Greptile

Comment thread backend/utils/speaker_identification.py
Comment on lines +313 to +319
def _contains_intro_phrase(text: str) -> bool:
"""Check if text contains any intro phrase."""
text_lower = text.lower()
for phrase in GLINER_INTRO_PHRASES:
if phrase in text_lower:
return True
return False
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Substring matching on short phrases causes widespread false GLiNER bypasses

_contains_intro_phrase uses plain substring containment (phrase in text_lower) over phrases that include single or very short tokens. For example: "я" (Cyrillic "I") is in GLINER_INTRO_PHRASES and will match as a substring inside меня, моя, твоя, and practically every sentence of Russian text, causing GLiNER to be silently bypassed for the whole language even on non-introduction utterances. Similarly, "olen" matches inside last names like "Bolen", and "sono" matches inside "Sonoma". The function should use word-boundary matching for short tokens:

Suggested change
def _contains_intro_phrase(text: str) -> bool:
"""Check if text contains any intro phrase."""
text_lower = text.lower()
for phrase in GLINER_INTRO_PHRASES:
if phrase in text_lower:
return True
return False
def _contains_intro_phrase(text: str) -> bool:
"""Check if text contains any intro phrase (word-boundary aware)."""
text_lower = text.lower()
for phrase in GLINER_INTRO_PHRASES:
if len(phrase) <= 4:
# Short tokens: require word boundaries to avoid substring false positives
if re.search(r'\b' + re.escape(phrase) + r'\b', text_lower):
return True
else:
if phrase in text_lower:
return True
return False

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this issue is fixed

Comment on lines +122 to +128
if first_word in NAME_PREFIXES and len(words) > 1:
return words[1].capitalize()

if len(first_word) <= 2 and first_word not in NAME_PREFIXES:
if len(words) > 1:
return words[1].capitalize()
return None
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Two-letter first names incorrectly treated as non-name tokens

The <= 2 guard is designed to skip pronouns like "I" or articles like "A", but it also catches valid two-letter first names (Bo, Ed, Al, Jo, Li). For a GLiNER-extracted entity like "Ed Sullivan", first_word = "ed", len("ed") == 2 <= 2, so the function returns "Sullivan" (the last name) instead of "Ed". The condition should only exclude single-character tokens:

Suggested change
if first_word in NAME_PREFIXES and len(words) > 1:
return words[1].capitalize()
if len(first_word) <= 2 and first_word not in NAME_PREFIXES:
if len(words) > 1:
return words[1].capitalize()
return None
if len(first_word) == 1 and first_word not in NAME_PREFIXES:
if len(words) > 1:
return words[1].capitalize()
return None

Comment thread backend/requirements.txt Outdated
fsspec==2024.6.1
gitdb==4.0.11
GitPython==3.1.43
gliner>=0.2.0
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Unpinned version constraint breaks reproducible builds

Every other dependency in this file is pinned to an exact version (==). gliner>=0.2.0 allows any future major version to be installed, which can introduce breaking changes silently. Pin to the specific tested version:

Suggested change
gliner>=0.2.0
gliner==0.2.17

(Replace with whatever version was tested locally.)

MithilSaiReddy and others added 2 commits April 17, 2026 08:37
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
@MithilSaiReddy
Copy link
Copy Markdown
Author

@aaravgarg @beastoin guys i cleared the issues by the bot btw !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant