Improved speaker detection with GLiNER + expanded regex #6735
MithilSaiReddy wants to merge 4 commits into BasedHardware:main
Greptile Summary
This PR integrates GLiNER NER as a speaker-detection layer in
Confidence Score: 2/5
Not safe to merge: three P1 logic bugs need fixes before landing.
The in-function import is a clear rule violation per CLAUDE.md; substring matching in _contains_intro_phrase produces false positives that silently bypass GLiNER for entire languages (e.g., all Russian text containing 'я'); and _clean_person_name incorrectly discards two-letter first names. All three are present-defect correctness issues on the changed code path.
backend/utils/speaker_identification.py (all three P1 issues); backend/requirements.txt (unpinned gliner version)
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["detect_speaker_from_text(text)"] --> B{text empty\nor len < 3?}
B -->|Yes| Z[return None]
B -->|No| C["_detect_person_entities_cached(text)"]
C --> D{"_contains_intro_phrase(text)?"}
D -->|Yes| E["Skip GLiNER\nreturn ([], True)"]
D -->|No| F["GLiNER model.predict_entities()"]
F -->|Exception| G["return ([], False)"]
F -->|Success| H["Filter intro-phrase words\n+ len >= 2"]
H --> I["return (persons, True)"]
E --> J{ner_available\nand persons?}
I --> J
G --> J
J -->|Yes| K["_clean_person_name(person)\nfor each person"]
K --> L{cleaned not None?}
L -->|Yes| M[return cleaned name]
L -->|No| N[try next person]
J -->|No| O["Try regex patterns\n(patterns_to_check)"]
N --> O
O --> P{regex match?}
P -->|Yes| Q[return capitalized name]
P -->|No| R["Strip filler words\nfrom text_lower"]
R --> S{"startswith 'this is'?"}
S -->|Yes| T[return first_word.capitalize]
S -->|No| V{"startswith intro_phrases?"}
V -->|Yes| W[return first_word.capitalize]
V -->|No| Z2[return None]
Reviews (1): Last reviewed commit: "Merge branch 'BasedHardware:main' into m..."
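The cascade in the flowchart (NER first, then regex, then give up) can be sketched roughly as follows. This is a simplified illustration, not the PR's code: `detect_speaker`, `INTRO_PATTERNS`, and the injected `ner` callable are hypothetical names, the two regex patterns are stand-ins for the real `patterns_to_check`, and the filler-word and `startswith` fallbacks at the bottom of the flowchart are left out.

```python
import re
from typing import Callable, List, Optional, Tuple

# Hypothetical stand-ins for the PR's regex patterns (the real list is not shown here).
INTRO_PATTERNS = [
    re.compile(r"\bmy name is (\w+)", re.IGNORECASE),
    re.compile(r"\bi am (\w+)", re.IGNORECASE),
]

# The NER step is injected so the sketch runs without loading a model.
# It returns (person_entities, ner_available), mirroring the flowchart.
NerFn = Callable[[str], Tuple[List[str], bool]]


def detect_speaker(text: str, ner: NerFn) -> Optional[str]:
    """Simplified cascade: NER entities first, then regex, else None."""
    if not text or len(text) < 3:
        return None

    persons, ner_available = ner(text)
    if ner_available and persons:
        for person in persons:
            words = person.strip().split()
            if words:
                # Assumption: take the first word of the entity as the name.
                return words[0].capitalize()

    # Regex fallback; IGNORECASE covers lowercase ASR output.
    for pattern in INTRO_PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group(1).capitalize()
    return None


def ner_unavailable(text: str) -> Tuple[List[str], bool]:
    """Stub for the exception branch: no entities, NER not available."""
    return [], False


def ner_finds_ed(text: str) -> Tuple[List[str], bool]:
    """Stub for a successful NER call that found one person."""
    return ["ed sullivan"], True
```

With the stubs above, `detect_speaker("my name is alice", ner_unavailable)` falls through to the regex branch, while `detect_speaker("that was ed sullivan", ner_finds_ed)` returns from the NER branch.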
    def _contains_intro_phrase(text: str) -> bool:
        """Check if text contains any intro phrase."""
        text_lower = text.lower()
        for phrase in GLINER_INTRO_PHRASES:
            if phrase in text_lower:
                return True
        return False
Substring matching on short phrases causes widespread false GLiNER bypasses
_contains_intro_phrase uses plain substring containment (phrase in text_lower) over phrases that include single or very short tokens. For example: "я" (Cyrillic "I") is in GLINER_INTRO_PHRASES and will match as a substring inside меня, моя, твоя, and practically every sentence of Russian text, causing GLiNER to be silently bypassed for the whole language even on non-introduction utterances. Similarly, "olen" matches inside last names like "Bolen", and "sono" matches inside "Sonoma". The function should use word-boundary matching for short tokens:
    -def _contains_intro_phrase(text: str) -> bool:
    -    """Check if text contains any intro phrase."""
    -    text_lower = text.lower()
    -    for phrase in GLINER_INTRO_PHRASES:
    -        if phrase in text_lower:
    -            return True
    -    return False
    +def _contains_intro_phrase(text: str) -> bool:
    +    """Check if text contains any intro phrase (word-boundary aware)."""
    +    text_lower = text.lower()
    +    for phrase in GLINER_INTRO_PHRASES:
    +        if len(phrase) <= 4:
    +            # Short tokens: require word boundaries to avoid substring false positives
    +            if re.search(r'\b' + re.escape(phrase) + r'\b', text_lower):
    +                return True
    +        else:
    +            if phrase in text_lower:
    +                return True
    +    return False
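To make the false positive concrete, here is a small side-by-side sketch of the two matching strategies. `INTRO_PHRASES` is a hypothetical four-entry subset (the full `GLINER_INTRO_PHRASES` list is not shown in this review), and both function names are illustrative.

```python
import re

# Hypothetical subset of GLINER_INTRO_PHRASES, for illustration only.
INTRO_PHRASES = ["my name is", "я", "olen", "sono"]


def contains_intro_phrase_substring(text: str) -> bool:
    """Original behavior: plain substring containment."""
    text_lower = text.lower()
    return any(phrase in text_lower for phrase in INTRO_PHRASES)


def contains_intro_phrase_bounded(text: str) -> bool:
    """Suggested behavior: word boundaries for tokens of 4 chars or fewer."""
    text_lower = text.lower()
    for phrase in INTRO_PHRASES:
        if len(phrase) <= 4:
            # \b is Unicode-aware for str patterns, so it also works for Cyrillic.
            if re.search(r"\b" + re.escape(phrase) + r"\b", text_lower):
                return True
        elif phrase in text_lower:
            return True
    return False


samples = [
    "у меня есть вопрос",   # contains "я" only inside "меня"
    "We drove to Sonoma",   # contains "sono" only inside "Sonoma"
    "Mr Bolen called",      # contains "olen" only inside "Bolen"
    "Я Иван",               # genuine introduction: standalone "я"
]

for sample in samples:
    print(sample, contains_intro_phrase_substring(sample), contains_intro_phrase_bounded(sample))
```

The substring version flags all four samples, while the bounded version flags only the genuine introduction.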
    if first_word in NAME_PREFIXES and len(words) > 1:
        return words[1].capitalize()

    if len(first_word) <= 2 and first_word not in NAME_PREFIXES:
        if len(words) > 1:
            return words[1].capitalize()
        return None
Two-letter first names incorrectly treated as non-name tokens
The <= 2 guard is designed to skip pronouns like "I" or articles like "A", but it also catches valid two-letter first names (Bo, Ed, Al, Jo, Li). For a GLiNER-extracted entity like "Ed Sullivan", first_word = "ed", len("ed") == 2 <= 2, so the function returns "Sullivan" (the last name) instead of "Ed". The condition should only exclude single-character tokens:
     if first_word in NAME_PREFIXES and len(words) > 1:
         return words[1].capitalize()
    -if len(first_word) <= 2 and first_word not in NAME_PREFIXES:
    +if len(first_word) == 1 and first_word not in NAME_PREFIXES:
         if len(words) > 1:
             return words[1].capitalize()
         return None
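A minimal sketch of how the corrected guard behaves, under some assumptions: `NAME_PREFIXES` here is a hypothetical four-entry set, `clean_person_name` is an illustrative stand-in for `_clean_person_name`, and the final `return first_word.capitalize()` is a guess at the part of the function not shown in this review.

```python
from typing import Optional

# Hypothetical prefix set; the real NAME_PREFIXES is not shown in the review.
NAME_PREFIXES = {"mr", "mrs", "ms", "dr"}


def clean_person_name(entity: str) -> Optional[str]:
    """Return a capitalized first name from a NER entity, or None."""
    words = entity.lower().split()
    if not words:
        return None
    first_word = words[0]

    # Skip honorifics such as "dr" and use the following word.
    if first_word in NAME_PREFIXES and len(words) > 1:
        return words[1].capitalize()

    # Only single-character tokens ("i", "a") are treated as non-names,
    # so two-letter names such as "ed" or "bo" survive.
    if len(first_word) == 1 and first_word not in NAME_PREFIXES:
        if len(words) > 1:
            return words[1].capitalize()
        return None

    # Assumed tail of the function: keep the first word as the name.
    return first_word.capitalize()
```

With the original `<= 2` guard, `"Ed Sullivan"` would come back as `"Sullivan"`; with the `== 1` guard it comes back as `"Ed"`.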
    fsspec==2024.6.1
    gitdb==4.0.11
    GitPython==3.1.43
    gliner>=0.2.0
Unpinned version constraint breaks reproducible builds
Every other dependency in this file is pinned to an exact version (==). gliner>=0.2.0 allows any future major version to be installed, which can introduce breaking changes silently. Pin to the specific tested version:
    -gliner>=0.2.0
    +gliner==0.2.17
(Replace with whatever version was tested locally.)
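A check like this could be automated. The sketch below is a hypothetical helper (`unpinned_requirements` is not part of the repository) that flags any requirement line not pinned with `==`. It is deliberately simplified: lines with extras after the version, environment markers, or URLs would also be reported.

```python
import re
from typing import Iterable, List

# Matches "name==version" with nothing else on the line (simplified on purpose).
PINNED = re.compile(r"^[A-Za-z0-9._\-\[\]]+==[^=\s]+$")


def unpinned_requirements(lines: Iterable[str]) -> List[str]:
    """Return requirement lines that are not pinned to an exact version."""
    unpinned = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("-"):  # skip blanks and pip options like -r
            continue
        if not PINNED.match(line):
            unpinned.append(line)
    return unpinned


requirements = [
    "fsspec==2024.6.1",
    "gitdb==4.0.11",
    "GitPython==3.1.43",
    "gliner>=0.2.0",
]
print(unpinned_requirements(requirements))
```

Run against the four lines quoted in this comment, only the `gliner>=0.2.0` entry is reported.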
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
@aaravgarg @beastoin guys i cleared the issues by the bot btw!
Improve Speaker Detection (Issue #3039)
Description
Improves speaker name extraction accuracy by combining GLiNER (NER) with enhanced regex detection, better handling of lowercase ASR output, and fixes for incorrect extraction in self-introduction phrases.
What this delivers
Key implementation
GLiNER filtering
_contains_intro_phrase() to avoid incorrect entity extraction.
Expanded regex patterns
Lowercase ASR handling
Multi-word name fix
Impact
Improves speaker identification accuracy in real-world ASR scenarios, especially for noisy, lowercase, and conversational inputs, without introducing regressions or additional dependencies.
Test Results
Considerations
Docs
AI Usage
Files
backend/utils/speaker_identification.py
.gitignore
Video Recording (Test Cases)