[Fix] Skip FTS search for queries shorter than 2 characters (#391) #557

Merged
samzong merged 2 commits into clawwork-ai:main from dpanbug:fix/search-min-query-length
Jul 3, 2026

Conversation

@dpanbug dpanbug commented Jul 1, 2026

Closes #391.

Problem

globalSearch and searchArtifacts turn the query into an FTS5 prefix match by appending *. A single-character query like "a" becomes "a*", which matches almost every indexed token — effectively a full-table scan that blocks the main process for hundreds of milliseconds on a sizeable workspace.

Fix

Extract the shared normalization into toFtsPrefixQuery, which returns null for empty or single-character queries so both functions bail out before preparing and executing the statement (MIN_FTS_QUERY_LENGTH = 2). Word and CJK characters are still preserved.
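
A minimal sketch of the bail-out described above, not the merged code. `FakeDb`, `searchSketch`, and the SQL string are hypothetical stand-ins for the real database handle and for `globalSearch`; only `toFtsPrefixQuery`, `MIN_FTS_QUERY_LENGTH`, and the character class come from this PR.

```typescript
const MIN_FTS_QUERY_LENGTH = 2;

// Mirrors the helper quoted in the review further down (first commit of this PR).
function toFtsPrefixQuery(query: string): string | null {
  const normalized = query.replace(/[^\w\u4e00-\u9fff]/g, ' ').trim();
  if (normalized.length < MIN_FTS_QUERY_LENGTH) return null;
  return normalized + '*';
}

// Hypothetical stand-in for the database handle: it only records calls.
class FakeDb {
  prepared: string[] = [];
  executed: string[][] = [];

  prepare(sql: string) {
    this.prepared.push(sql);
    return {
      all: (...params: string[]): unknown[] => {
        this.executed.push(params);
        return [];
      },
    };
  }
}

// Hypothetical caller; the real globalSearch and its SQL are not shown on this page.
function searchSketch(db: FakeDb, query: string): unknown[] {
  const ftsQuery = toFtsPrefixQuery(query);
  if (ftsQuery === null) return []; // bail out before prepare() and all()
  return db
    .prepare('SELECT rowid FROM docs_fts WHERE docs_fts MATCH ?')
    .all(ftsQuery);
}

const db = new FakeDb();
searchSketch(db, 'a');  // too short: nothing is prepared or executed
searchSketch(db, '?!'); // punctuation only: normalizes to an empty string
searchSketch(db, 'ab'); // reaches the database as the prefix query 'ab*'
```

Returning early on `null` keeps the guard in one place, so every caller that shares the helper gets the same minimum-length behavior.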

Question for maintainers (CJK)

This guard also skips single-character CJK queries (e.g. a lone 中). A single-character prefix carries the same full-scan cost regardless of script, so I applied the threshold uniformly (matching the issue's suggested < 2). If single CJK-character search should stay enabled, I'm happy to special-case it; just let me know your preference.

Tests

Added unit tests covering the single-character guard, punctuation-only input, and the normal two-character path. Search tests pass (vitest run test/artifact-search.test.ts → 7 passed).

Release note

Fixed a brief UI freeze when typing a single character in search; queries now run from two characters onward.

…-ai#391)

`globalSearch` and `searchArtifacts` turn the query into an FTS5 prefix
match by appending `*`. A single-character query like `"a"` becomes
`"a*"`, which matches almost every indexed token — effectively a
full-table scan that blocks the main process for hundreds of
milliseconds on a sizeable workspace.

Extract the shared normalization into `toFtsPrefixQuery`, which returns
`null` for empty or single-character queries so both functions bail out
before preparing and executing the statement. Word and CJK characters
are still preserved. Added unit tests for the single-character guard,
punctuation-only input, and the normal two-character path.

Note: this also skips single-character CJK queries (e.g. a lone `中`),
which carry the same short-prefix scan cost. See the PR description for a
question on whether single CJK-character search should be exempted.

Release note: Fixed a brief UI freeze when typing a single character in
search; queries now run from two characters onward.

Signed-off-by: kainam <tek@dpan-bug.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jul 1, 2026

Hi @dpanbug,
Thanks for your pull request!
If the PR is ready, use the /auto-cc command to assign Reviewer to Review.
We will review it shortly.

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a performance issue where single-character search queries triggered inefficient full-table scans, causing UI freezes. By enforcing a minimum length of two characters for FTS5 prefix matches, the application now avoids unnecessary database operations for short or invalid queries. The change includes a refactor to unify query normalization and adds robust test coverage to prevent regressions.

Highlights

  • Search Performance Optimization: Introduced a minimum query length requirement of 2 characters for FTS5 prefix searches to prevent expensive full-table scans.
  • Code Refactoring: Centralized query normalization logic into a new toFtsPrefixQuery helper function used by both global and artifact search modules.
  • Test Coverage: Added comprehensive unit tests to verify the new length guard, ensuring single-character and punctuation-only queries are correctly ignored.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a minimum query length guard of two characters for FTS5 prefix matches in global and artifact searches to prevent performance degradation, along with corresponding unit tests. The review feedback identifies a critical internationalization bug where non-ASCII and non-CJK characters are stripped, as well as a usability issue where single-character CJK queries are blocked. It is recommended to use JavaScript Unicode property escapes to preserve international characters and dynamically allow single-character searches for CJK scripts, and to add tests covering these scenarios.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +43 to +52
/**
 * Normalize a raw user query into an FTS5 prefix expression, or return `null`
 * when it is empty or shorter than {@link MIN_FTS_QUERY_LENGTH}. Word and CJK
 * characters are preserved; everything else is treated as a separator.
 */
function toFtsPrefixQuery(query: string): string | null {
  const normalized = query.replace(/[^\w\u4e00-\u9fff]/g, ' ').trim();
  if (normalized.length < MIN_FTS_QUERY_LENGTH) return null;
  return normalized + '*';
}

high

🌍 Internationalization & CJK Search Usability Issues

There are two significant issues with the current implementation of toFtsPrefixQuery:

  1. Internationalization Bug (High Severity): The regex /[^\w\u4e00-\u9fff]/g strips out all non-ASCII and non-CJK letters (such as accented characters like é, ö, Cyrillic, Greek, Japanese Hiragana/Katakana, and Korean Hangul) and replaces them with spaces. This completely breaks search for users in non-English/non-Chinese locales.
  2. CJK Usability (Medium Severity): As you noted in the PR description, skipping single-character CJK queries (e.g., 中) degrades usability because single-character words are extremely common and meaningful in Chinese, Japanese, and Korean.
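
To make the first point concrete, here is a small check of what the original character class keeps. The function name `normalizeAsciiCjk` is made up for this illustration; the regex is the one from the diff under review.

```typescript
// Hypothetical helper name; the regex is copied from the diff under review.
function normalizeAsciiCjk(query: string): string {
  return query.replace(/[^\w\u4e00-\u9fff]/g, ' ').trim();
}

const samples = ['Café', 'ひらがな', '한글', 'поиск', '中文'];
for (const sample of samples) {
  console.log(JSON.stringify(sample), '->', JSON.stringify(normalizeAsciiCjk(sample)));
}
// Without the `u` flag, \w covers only [A-Za-z0-9_], so the accented letter in
// "Café" is dropped, kana, Hangul and Cyrillic collapse to an empty string, and
// only the Han ideographs in "中文" pass through unchanged.
```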

💡 Solution

We can use modern JavaScript Unicode property escapes (enabled by the u flag) to:

  • Correctly preserve letters and numbers across all languages (\p{L} and \p{N}).
  • Detect CJK/Japanese/Korean scripts (Han, Hiragana, Katakana, Hangul) to dynamically allow single-character searches for those languages while keeping the 2-character limit for Latin/alphanumeric scripts.
/**
 * Normalize a raw user query into an FTS5 prefix expression, or return `null`
 * when it is empty or shorter than the minimum required length.
 * Letters (including international scripts like CJK, Hiragana, Katakana, Hangul,
 * Cyrillic, etc.) and numbers are preserved; everything else is treated as a separator.
 */
function toFtsPrefixQuery(query: string): string | null {
  // Use Unicode property escapes to preserve letters and numbers across all languages,
  // while collapsing multiple non-alphanumeric characters into a single space.
  const normalized = query.replace(/[^\p{L}\p{N}_]+/gu, ' ').trim();

  // For CJK/Japanese/Korean scripts, single-character queries are highly meaningful
  // and should not be blocked.
  const isCjk = /[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Hangul}]/u.test(normalized);
  const minLength = isCjk ? 1 : MIN_FTS_QUERY_LENGTH;

  if (normalized.length < minLength) return null;
  return normalized + '*';
}

Comment on lines +107 to +114
  it('runs the prefix query once the minimum length is reached', () => {
    const { db, all } = mockDb([]);

    globalSearch(db as never, 'ab');

    expect(all).toHaveBeenCalledWith('ab*', 'ab*', 'ab*');
  });
});

medium

🧪 Add Tests for Internationalization and CJK Queries

To ensure that international characters are correctly preserved and single-character CJK queries are allowed, we should add unit tests covering these cases.

  it('runs the prefix query once the minimum length is reached', () => {
    const { db, all } = mockDb([]);

    globalSearch(db as never, 'ab');

    expect(all).toHaveBeenCalledWith('ab*', 'ab*', 'ab*');
  });

  it('allows single-character CJK/Japanese/Korean queries', () => {
    const { db, all } = mockDb([]);

    globalSearch(db as never, '中');

    expect(all).toHaveBeenCalledWith('中*', '中*', '中*');
  });

  it('preserves international and accented characters', () => {
    const { db, all } = mockDb([]);

    globalSearch(db as never, 'Café');

    expect(all).toHaveBeenCalledWith('Café*', 'Café*', 'Café*');
  });
});

@dpanbug

dpanbug commented Jul 1, 2026

/auto-cc

@github-actions github-actions Bot requested a review from yankay July 1, 2026 06:07
Addresses the Gemini review: the previous `[^\w\u4e00-\u9fff]` character
class kept only ASCII word characters plus CJK ideographs, so it stripped
accented Latin (é, ñ, ü), Cyrillic, Arabic — and, despite the CJK intent,
Japanese kana and Korean Hangul (both outside U+4E00–U+9FFF), which the app
ships locales for.

Switch to `[^\p{L}\p{N}_]` with the `u` flag so letters and digits of any
script are preserved. Added tests for accented Latin and Japanese kana.

The single-character guard still applies uniformly, so a lone CJK/kana/
Hangul character is still skipped — see the PR discussion for the open
question on exempting single-character searches for those scripts.

Signed-off-by: kainam <tek@dpan-bug.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@dpanbug

dpanbug commented Jul 1, 2026

Thanks @gemini-code-assist — good catch on the tokenization. Fixed in df1829d: the character class is now [^\p{L}\p{N}_] (with the u flag), which preserves accented Latin, Cyrillic, Japanese kana, and Korean Hangul instead of stripping them. The previous \w-based class only kept ASCII word chars plus CJK ideographs, so kana/Hangul were dropped despite the CJK intent — even though the app ships ja/ko locales. Added tests for accented Latin (café) and kana (ひらがな).

Still open for maintainers: the minimum-length guard applies uniformly, so a single-character CJK/kana/Hangul query is skipped along with single ASCII characters. A one-character prefix carries the same near-full-scan cost in any script, so I kept it uniform — but if single-character search should stay enabled for those scripts, I'm happy to special-case it. Let me know your preference.
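
As a quick illustration of the two paragraphs above, the sketch below applies the updated character class and the uniform length guard to a few inputs. It is reconstructed from the description in this thread rather than copied from df1829d, so the exact function body is an assumption.

```typescript
const MIN_FTS_QUERY_LENGTH = 2;

// Assumed shape of the helper after the follow-up commit. The pattern is built
// with the RegExp constructor only so the sketch compiles under any TypeScript
// target; it is equivalent to the literal /[^\p{L}\p{N}_]/gu.
const NON_WORD = new RegExp('[^\\p{L}\\p{N}_]', 'gu');

function toFtsPrefixQuery(query: string): string | null {
  const normalized = query.replace(NON_WORD, ' ').trim();
  if (normalized.length < MIN_FTS_QUERY_LENGTH) return null;
  return normalized + '*';
}

console.log(toFtsPrefixQuery('café'));     // café*
console.log(toFtsPrefixQuery('ひらがな')); // ひらがな*
console.log(toFtsPrefixQuery('中'));       // null: one character, whatever the script
console.log(toFtsPrefixQuery('a'));        // null
```

Letters and digits of any script now survive normalization, while the length check is unchanged, which is why the lone `中` is still skipped.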

@gemini-code-assist

Great work on the tokenization fix! Using the u flag with the \p{L} and \p{N} Unicode property escapes is definitely the right approach for robust, script-agnostic normalization. It correctly handles the wide range of characters your users might input across different locales.

Regarding your question about single-character CJK queries: keeping the threshold uniform is a sound decision for performance, especially since FTS5 prefix matching on a single character can still be quite expensive depending on the index size and distribution. If you find that users are frequently searching for single CJK characters and it's a critical use case, you could consider a more nuanced approach later (e.g., checking the script type before applying the length guard), but for now, maintaining consistency is cleaner and safer for the UI performance goal.

@samzong samzong merged commit 182a861 into clawwork-ai:main Jul 3, 2026
11 checks passed
Development

Successfully merging this pull request may close these issues.

[Bug] globalSearch allows single-character queries that trigger full-table FTS scans

2 participants