[Fix] Skip FTS search for queries shorter than 2 characters (#391) #557
…-ai#391) `globalSearch` and `searchArtifacts` turn the query into an FTS5 prefix match by appending `*`. A single-character query like `"a"` becomes `"a*"`, which matches almost every indexed token — effectively a full-table scan that blocks the main process for hundreds of milliseconds on a sizeable workspace. Extract the shared normalization into `toFtsPrefixQuery`, which returns `null` for empty or single-character queries so both functions bail out before preparing and executing the statement. Word and CJK characters are still preserved. Added unit tests for the single-character guard, punctuation-only input, and the normal two-character path. Note: this also skips single-character CJK queries (e.g. a lone `中`), which carry the same short-prefix scan cost. See the PR description for a question on whether single CJK-character search should be exempted. Release note: Fixed a brief UI freeze when typing a single character in search; queries now run from two characters onward. Signed-off-by: kainam <tek@dpan-bug.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Summary of Changes
This pull request addresses a performance issue where single-character search queries triggered inefficient full-table scans, causing UI freezes. By enforcing a minimum length of two characters for FTS5 prefix matches, the application now avoids unnecessary database operations for short or invalid queries. The change includes a refactor to unify query normalization and adds test coverage to prevent regressions.
Code Review
This pull request introduces a minimum query length guard of two characters for FTS5 prefix matches in global and artifact searches to prevent performance degradation, along with corresponding unit tests. The review feedback identifies a critical internationalization bug where non-ASCII and non-CJK characters are stripped, as well as a usability issue where single-character CJK queries are blocked. It is recommended to use JavaScript Unicode property escapes to preserve international characters and dynamically allow single-character searches for CJK scripts, and to add tests covering these scenarios.
/**
 * Normalize a raw user query into an FTS5 prefix expression, or return `null`
 * when it is empty or shorter than {@link MIN_FTS_QUERY_LENGTH}. Word and CJK
 * characters are preserved; everything else is treated as a separator.
 */
function toFtsPrefixQuery(query: string): string | null {
  const normalized = query.replace(/[^\w\u4e00-\u9fff]/g, ' ').trim();
  if (normalized.length < MIN_FTS_QUERY_LENGTH) return null;
  return normalized + '*';
}
🌍 Internationalization & CJK Search Usability Issues
There are two significant issues with the current implementation of toFtsPrefixQuery:
- Internationalization Bug (High Severity): The regex `/[^\w\u4e00-\u9fff]/g` strips out all non-ASCII and non-CJK letters (such as accented characters like `é`, `ö`, Cyrillic, Greek, Japanese Hiragana/Katakana, and Korean Hangul) and replaces them with spaces. This completely breaks search for users in non-English/non-Chinese locales.
- CJK Usability (Medium Severity): As you noted in the PR description, skipping single-character CJK queries (e.g., `中`) degrades usability because single-character words are extremely common and meaningful in Chinese, Japanese, and Korean.
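To make the first point concrete, here is a minimal sketch that applies the original character class to a few inputs. The helper name `oldNormalize` is hypothetical and reproduces only the normalization step, not the length guard.

```typescript
// Hypothetical helper: the normalization step of the original toFtsPrefixQuery,
// without the minimum-length check.
function oldNormalize(query: string): string {
  return query.replace(/[^\w\u4e00-\u9fff]/g, ' ').trim();
}

// Without the `u` flag, \w only covers [A-Za-z0-9_], and \u4e00-\u9fff only
// covers the CJK Unified Ideographs block.
console.log(oldNormalize('Caf\u00e9')); // "Caf" (the accented letter is dropped)
console.log(oldNormalize('привет'));    // ""    (Cyrillic is stripped entirely)
console.log(oldNormalize('かな'));       // ""    (Hiragana lies outside U+4E00–U+9FFF)
console.log(oldNormalize('한글'));       // ""    (Hangul lies outside U+4E00–U+9FFF)
console.log(oldNormalize('中文'));       // "中文" (Han ideographs survive)
```

An empty result then falls below the minimum length, so these queries would return nothing rather than an error.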
💡 Solution
We can use modern JavaScript Unicode property escapes (`/gu` flag) to:
- Correctly preserve letters and numbers across all languages (`\p{L}` and `\p{N}`).
- Detect CJK/Japanese/Korean scripts (`Han`, `Hiragana`, `Katakana`, `Hangul`) to dynamically allow single-character searches for those languages while keeping the 2-character limit for Latin/alphanumeric scripts.
/**
 * Normalize a raw user query into an FTS5 prefix expression, or return `null`
 * when it is empty or shorter than the minimum required length.
 * Letters (including international scripts like CJK, Hiragana, Katakana, Hangul,
 * Cyrillic, etc.) and numbers are preserved; everything else is treated as a separator.
 */
function toFtsPrefixQuery(query: string): string | null {
  // Use Unicode property escapes to preserve letters and numbers across all languages,
  // while collapsing multiple non-alphanumeric characters into a single space.
  const normalized = query.replace(/[^\p{L}\p{N}_]+/gu, ' ').trim();
  // For CJK/Japanese/Korean scripts, single-character queries are highly meaningful
  // and should not be blocked.
  const isCjk = /[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Hangul}]/u.test(normalized);
  const minLength = isCjk ? 1 : MIN_FTS_QUERY_LENGTH;
  if (normalized.length < minLength) return null;
  return normalized + '*';
}

  it('runs the prefix query once the minimum length is reached', () => {
    const { db, all } = mockDb([]);

    globalSearch(db as never, 'ab');

    expect(all).toHaveBeenCalledWith('ab*', 'ab*', 'ab*');
  });
});
🧪 Add Tests for Internationalization and CJK Queries
To ensure that international characters are correctly preserved and single-character CJK queries are allowed, we should add unit tests covering these cases.
  it('runs the prefix query once the minimum length is reached', () => {
    const { db, all } = mockDb([]);
    globalSearch(db as never, 'ab');
    expect(all).toHaveBeenCalledWith('ab*', 'ab*', 'ab*');
  });
  it('allows single-character CJK/Japanese/Korean queries', () => {
    const { db, all } = mockDb([]);
    globalSearch(db as never, '中');
    expect(all).toHaveBeenCalledWith('中*', '中*', '中*');
  });
  it('preserves international and accented characters', () => {
    const { db, all } = mockDb([]);
    globalSearch(db as never, 'Café');
    expect(all).toHaveBeenCalledWith('Café*', 'Café*', 'Café*');
  });
});
/auto-cc
Addresses the Gemini review: the previous `[^\w\u4e00-\u9fff]` character
class kept only ASCII word characters plus CJK ideographs, so it stripped
accented Latin (é, ñ, ü), Cyrillic, Arabic — and, despite the CJK intent,
Japanese kana and Korean Hangul (both outside U+4E00–U+9FFF), which the app
ships locales for.
Switch to `[^\p{L}\p{N}_]` with the `u` flag so letters and digits of any
script are preserved. Added tests for accented Latin and Japanese kana.
The single-character guard still applies uniformly, so a lone CJK/kana/
Hangul character is still skipped — see the PR discussion for the open
question on exempting single-character searches for those scripts.
Signed-off-by: kainam <tek@dpan-bug.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Thanks @gemini-code-assist, good catch on the tokenization. Fixed in df1829d: the character class is now `[^\p{L}\p{N}_]` (with the `u` flag).
Still open for maintainers: the minimum-length guard applies uniformly, so a single-character CJK/kana/Hangul query is skipped along with single ASCII characters. A one-character prefix carries the same near-full-scan cost in any script, so I kept it uniform, but if single-character search should stay enabled for those scripts, I'm happy to special-case it. Let me know your preference.
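For readers following along, the behaviour described above can be sketched as follows. This is a reconstruction from the commit message, not the committed code, which is not shown in this thread; `MIN_FTS_QUERY_LENGTH = 2` is taken from the PR description.

```typescript
// Value stated in the PR description.
const MIN_FTS_QUERY_LENGTH = 2;

// Sketch only: Unicode-aware normalization with a uniform length guard,
// as described in the follow-up commit. The real implementation may differ.
function toFtsPrefixQuery(query: string): string | null {
  // Keep letters and digits of any script plus underscore; everything else
  // becomes a separator.
  const normalized = query.replace(/[^\p{L}\p{N}_]/gu, ' ').trim();
  if (normalized.length < MIN_FTS_QUERY_LENGTH) return null;
  return normalized + '*';
}

console.log(toFtsPrefixQuery('Caf\u00e9')); // accented Latin is now preserved
console.log(toFtsPrefixQuery('かな'));       // two kana characters pass the guard
console.log(toFtsPrefixQuery('中'));         // null: a lone character is skipped in any script
console.log(toFtsPrefixQuery('a'));          // null
```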
Great work on the tokenization fix! Using the …
Regarding your question about single-character CJK queries: keeping the threshold uniform is a sound decision for performance, especially since FTS5 prefix matching on a single character can still be quite expensive depending on the index size and distribution. If you find that users are frequently searching for single CJK characters and it's a critical use case, you could consider a more nuanced approach later (e.g., checking the script type before applying the length guard), but for now, maintaining consistency is cleaner and safer for the UI performance goal.
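One detail that may matter if the guard is ever made script-aware: `String.prototype.length` counts UTF-16 code units, not characters, so a single Han ideograph outside the Basic Multilingual Plane already has length 2. A minimal sketch of the difference, where `codePointLength` is a hypothetical helper that is not part of this PR:

```typescript
// Hypothetical helper: count Unicode code points instead of UTF-16 code units.
// Array.from iterates a string by code point.
function codePointLength(s: string): number {
  return Array.from(s).length;
}

const bmpHan = '中';           // U+4E2D, inside the Basic Multilingual Plane
const astralHan = '\u{20BB7}'; // a Han ideograph encoded as a surrogate pair

console.log(bmpHan.length, codePointLength(bmpHan));       // 1 1
console.log(astralHan.length, codePointLength(astralHan)); // 2 1
```

Whether that difference is worth handling depends on how often such characters appear in real queries; the sketch only shows where `length` and the user-visible character count diverge.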
Closes #391.
Problem
`globalSearch` and `searchArtifacts` turn the query into an FTS5 prefix match by appending `*`. A single-character query like `"a"` becomes `"a*"`, which matches almost every indexed token: effectively a full-table scan that blocks the main process for hundreds of milliseconds on a sizeable workspace.

Fix
Extract the shared normalization into `toFtsPrefixQuery`, which returns `null` for empty or single-character queries so both functions bail out before preparing and executing the statement (`MIN_FTS_QUERY_LENGTH = 2`). Word and CJK characters are still preserved.

Question for maintainers (CJK)
This guard also skips single-character CJK queries (e.g. a lone `中`). A single-character prefix carries the same full-scan cost regardless of script, so I applied the threshold uniformly (matching the issue's suggested `< 3`). If single CJK-character search should stay enabled, I'm happy to special-case it; just let me know your preference.

Tests
Added unit tests covering the single-character guard, punctuation-only input, and the normal two-character path. Search tests pass (`vitest run test/artifact-search.test.ts` → 7 passed).

Release note
Fixed a brief UI freeze when typing a single character in search; queries now run from two characters onward.
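The bail-out described under Fix can be sketched as below. `toFtsPrefixQuery` is copied from the diff in this PR; the `DbLike` interface, the `searchSketch` function, the SQL text, and the `docs_fts` table name are assumptions for illustration and do not reflect the project's real `globalSearch` or schema.

```typescript
const MIN_FTS_QUERY_LENGTH = 2;

// As shown in the original diff of this PR.
function toFtsPrefixQuery(query: string): string | null {
  const normalized = query.replace(/[^\w\u4e00-\u9fff]/g, ' ').trim();
  if (normalized.length < MIN_FTS_QUERY_LENGTH) return null;
  return normalized + '*';
}

// Hypothetical minimal database surface, just enough for the sketch.
interface StatementLike { all(...params: string[]): unknown[]; }
interface DbLike { prepare(sql: string): StatementLike; }

// Hypothetical caller: returns early so no statement is prepared or executed
// for queries below the minimum length.
function searchSketch(db: DbLike, query: string): unknown[] {
  const fts = toFtsPrefixQuery(query);
  if (fts === null) return [];
  return db.prepare('SELECT rowid FROM docs_fts WHERE docs_fts MATCH ?').all(fts);
}

// Fake database that records how often it is touched.
let prepareCalls = 0;
const seenParams: string[] = [];
const fakeDb: DbLike = {
  prepare() {
    prepareCalls++;
    return { all(...params) { seenParams.push(...params); return []; } };
  },
};

searchSketch(fakeDb, 'a');  // skipped: prepare() is never called
searchSketch(fakeDb, 'ab'); // runs with the prefix expression 'ab*'
console.log(prepareCalls, seenParams);
```

Returning before `prepare` keeps the guard cheap: a too-short query costs one regex replace and no database work.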