dfm-search performance optimization by Johnson-zs · Pull Request #299 · linuxdeepin/util-dfm

Johnson-zs · 2026-05-28T05:37:40Z

feat: add filename search to content and OCR search
feat: add natural language semantic search
feat: add relative time support for Chinese search
feat: implement file size range filtering
feat: add file size constraint support in semantic search
feat: add action-based time field search support
fix: automatically handle hidden path search conditions
feat: add location-based search support for Chinese NLP
feat: add semantic query detection and multi-path search support
feat: add file size range filter to search strategies
fix: unify dfm-search library and path names
feat: add file metadata attributes to search results
fix: improve Chinese NLP search functionality
feat: add semantic search with detailed results
feat: enhance semantic search with explicit directories
feat: add max results limit for semantic search
test: add search target control tests
feat: add chinese NLP parsing for relative time and size constraints
feat: add NGram analyzer and tokenizer for Lucene++
fix: improve content search engine validation and analyzer
refactor: optimize search filtering and query building
feat: add on-demand content highlight retrieval
refactor: improve NGramTokenizer and search factory
refactor: improve OCR text search validation and analyzer selection
perf: optimize search performance with field selector
refactor: disable unit tests in release builds
feat: optimize ngram search query building
refactor: remove NGram analyzer and tokenizer components
fix: adjust N-gram token position calculation
feat: enhance ContentRetriever with content fetching capabilities
test: add test utility libraries for content search
perf: optimize OCR text search document loading
perf: replace chinese analyzer with ngram search
test: add filename search engine test cases
docs: update license files and cleanup

feat: add filename search to content and OCR search

1. Added filename keyword search capability to both content and OCR text search
2. Implemented new API methods setFilenameKeyword() and filenameKeyword() in TextSearchOptionsAPI
3. Added --filename command line option for filename searches
4. Modified search validation to allow short queries when filename search is used
5. Implemented Lucene query building logic that combines filename and content searches with AND logic
6. Added filename search support to command line interface and API

Log: Added filename search capability to file content and OCR text search

Influence:
1. Test filename-only searches with --filename parameter
2. Test combined content+filename searches
3. Verify handling of short queries when filename search is used
4. Test command line help output for new --filename option
5. Verify API-level filename search functionality

feat: add natural language semantic search

This commit introduces a natural language processing (NLP) based semantic search system for the file manager. Key changes include:

1. Added SemanticRuleEngine to load and process regex-based rules from JSON config files. Rules are organized into groups (time, filetype, keyword, noise) with priorities and metadata.
2. Implemented DimensionExtractor base class and concrete extractors:
   - TimeExtractor: parses relative/preset time ("today", "last week") and specific dates
   - FileTypeExtractor: maps terms to file extensions (e.g. "pdf", "document", "images")
   - KeywordExtractor: handles structured patterns and unconsumed text as keywords
3. Added IntentParser to coordinate extractors and produce ParsedIntent with:
   - Time constraints
   - File type filters
   - Search keywords
   - Consumed text spans
4. Implemented SemanticQueryBuilder to convert ParsedIntent into concrete:
   - SearchQuery objects (filename, content, OCR)
   - SearchOptions (time ranges, pinyin matching etc)
5. Added SemanticSearcher as main entry point with async/sync search APIs:
   - Parses natural language queries
   - Parallel searches across filename/content/OCR indexes
   - Deduplicates results
   - Timeout handling
6. Includes over 200 test cases covering Chinese NLP parsing for:
   - Time expressions
   - File type synonyms
   - Keyword patterns
   - Combined scenarios
   - Error cases
7. Adds rule files for Chinese language support:
   - Time expressions (relative/absolute)
   - File type mappings (precise/general)
   - Keyword patterns (contains/named/content)
   - Noise words (actions/polite/suffix)

Log: Added semantic search with natural language support for Chinese

Influence:
1. Test "find yesterday's pdf documents" ("查找昨天的pdf文档") with various time expressions
2. Verify file type mappings work for precise (pdf) and general (document) terms
3. Check keyword extraction from patterns and unconsumed text
4. Test combination searches with time+type+keywords
5. Validate results deduplication across search paths
6. Confirm timeout and cancellation work properly
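
The rule-driven flow described in this commit (regex rules grouped by dimension, priorities, consumed spans, and unconsumed text falling back to keywords) can be sketched roughly as follows. This is an illustrative Python sketch, not the project's C++ code: the rule table, the priorities, and the function name `parse_intent` are all hypothetical, and the real rules live in JSON files whose schema is not shown here.

```python
import re
from typing import Dict, List

# Hypothetical rules: (group, priority, regex). The real JSON rule schema is not shown in the PR.
RULES = [
    ("time", 30, r"昨天|今天|上周"),
    ("filetype", 20, r"pdf|图片|文档"),
    ("noise", 10, r"查找|帮我找|的"),
]


def parse_intent(query: str) -> Dict[str, List[str]]:
    """Apply rules in priority order; text no rule consumed becomes keywords."""
    consumed = [False] * len(query)
    intent: Dict[str, List[str]] = {"time": [], "filetype": [], "noise": [], "keywords": []}

    for group, _priority, pattern in sorted(RULES, key=lambda rule: -rule[1]):
        for match in re.finditer(pattern, query, re.IGNORECASE):
            # Skip matches that overlap text already claimed by a higher-priority rule.
            if any(consumed[match.start():match.end()]):
                continue
            for i in range(match.start(), match.end()):
                consumed[i] = True
            intent[group].append(match.group(0))

    # Collect runs of unconsumed, non-whitespace characters as keywords.
    run = ""
    for ch, used in zip(query, consumed):
        if used or ch.isspace():
            if run:
                intent["keywords"].append(run)
                run = ""
        else:
            run += ch
    if run:
        intent["keywords"].append(run)
    return intent
```

For the query "查找昨天的pdf报告", this sketch assigns "昨天" to the time group and "pdf" to the file type group, discards "查找" and "的" as noise, and leaves "报告" as the keyword.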

feat: add relative time support for Chinese search

1. Added new relative time parsing rules in Chinese NLP with 4 categories: just now (2h), recent days (3d), past few days (3-7d), and a while ago (30+ days)
2. Implemented test cases for relative time queries with detailed time range validation
3. Added support for various synonyms for each time category
4. Implemented priority handling when relative and preset time rules conflict
5. Extended TimeExtractor to properly handle relative time metadata
6. Updated query builder to use custom time ranges for relative time constraints

Log: Added support for "just now", "recent days", "past few days" and "a while ago" time ranges in Chinese file search

Influence:
1. Test Chinese queries with "刚刚", "最近", "前几天", "之前" and synonyms
2. Verify time ranges match expected periods (2h, 3d, 3-7d, 30d+)
3. Verify priority when combined with preset times like "今天之前"
4. Check file type filtering still works with relative time queries
5. Test edge cases like exact minute boundaries for "just now"
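
As a rough illustration of how the four categories might translate into concrete ranges, the Python sketch below maps each one to a (start, end) pair relative to the current time. The category keys and the function name are hypothetical, and treating "a while ago" as an open-ended range older than 30 days is an assumption based on the "30+ days" wording.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple

# Hypothetical category keys -> (how far back the range starts, how far back it ends).
# None as the first element means "no lower bound" (assumed for "a while ago").
RELATIVE_RANGES = {
    "just_now": (timedelta(hours=2), timedelta(0)),          # 刚刚: within 2 hours
    "recent_days": (timedelta(days=3), timedelta(0)),        # 最近: within 3 days
    "past_few_days": (timedelta(days=7), timedelta(days=3)), # 前几天: 3 to 7 days ago
    "a_while_ago": (None, timedelta(days=30)),               # 之前: older than 30 days
}


def resolve_relative_time(category: str, now: datetime) -> Tuple[Optional[datetime], datetime]:
    """Return (start, end) for a category; start is None when the range is open-ended."""
    oldest, newest = RELATIVE_RANGES[category]
    start = None if oldest is None else now - oldest
    end = now - newest
    return start, end
```

Passing `now` explicitly keeps the function deterministic, which is convenient for the minute-boundary tests mentioned under Influence.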

feat: implement file size range filtering

1. Added SizeRangeFilter class for file size range filtering functionality
2. Added SizeParser utility for human-readable size string parsing (e.g. "1K", "10M")
3. Implemented file size filtering in both indexed and real-time search strategies
4. Added file size numeric field support in search results
5. Integrated size filtering into CLI with --size-min and --size-max options
6. Added comprehensive unit tests for all new functionality

Log: Added file size range filtering support for search operations

Influence:
1. Test searching with various size ranges (min, max, both)
2. Test different size formats (K, M, G, T suffixes)
3. Verify boundary cases (0-size, max qint64 values)
4. Test combination with other filters like time range
5. Verify CLI interface with --size-min and --size-max
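
A minimal Python sketch of parsing size strings such as "1K" or "10M" is shown below. It is not the SizeParser implementation: the function name `parse_size` is hypothetical, and two behaviours are assumptions the commit does not state, namely 1024-based multipliers and an optional trailing "B".

```python
import re

# Assumption: binary (1024-based) multipliers; the commit does not say which base is used.
_SIZE_UNITS = {"": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}
_SIZE_RE = re.compile(r"^\s*(\d+)\s*([KMGT]?)B?\s*$", re.IGNORECASE)


def parse_size(text: str) -> int:
    """Convert a string like '1K' or '10M' to a byte count; raise ValueError if malformed."""
    match = _SIZE_RE.match(text)
    if match is None:
        raise ValueError(f"invalid size string: {text!r}")
    number, unit = match.group(1), match.group(2).upper()
    return int(number) * _SIZE_UNITS[unit]
```

Rejecting malformed input with an exception, rather than returning 0, keeps an invalid --size-min value distinguishable from a legitimate zero-byte bound.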

feat: add file size constraint support in semantic search

Added support for file size constraints in the semantic search system. Implemented size-related NLP rules (preset ranges like "large files", exact sizes like "greater than 100MB"), a new SizeExtractor class, size constraint parsing, and integration with query building. The changes include:

1. New SizeConstraint type in semantic_types.h to represent parsed size constraints
2. SizeExtractor class to process size expressions in natural language
3. Over 20 test cases covering all size constraint variations
4. Size constraint rules in size_rules.json with fuzzy ranges, exact values, and range expressions
5. Integration with SemanticQueryBuilder to build size filters during search

Log: Added support for natural language size constraints in file search (e.g. "large files", ">100MB")

Influence:
1. Test fuzzy size expressions ("large files", "small files")
2. Test exact size expressions (">500M", "<100K")
3. Test ranges ("1M-10M files")
4. Test combined constraints with time/type ("yesterday's large videos")
5. Verify all size units (B, KB, MB, GB)
6. Test edge cases (invalid formats, zero sizes)

feat: add action-based time field search support

Added full support for action-based time field queries in semantic search:

1. Added new action_rules.json with create/modify action patterns
2. Added ActionExtractor to parse action rules
3. Implemented TimeField enum with Unspecified/Both options
4. Updated search builder and searcher to handle multiple time fields
5. Added comprehensive test cases for action-time field interaction

Log: Added support for searching by file create/modify time using natural language actions like "新建的图片" or "修改过的文档"

Influence:
1. Test natural language searches with create/modify action words (新/修改)
2. Verify single and compound time searches (e.g. "昨天修改过的文件")
3. Check default behavior with unspecified time fields
4. Test search performance with different time field combinations
5. Verify search results accuracy for both creation and modification times

fix: automatically handle hidden path search conditions

The changes implement automatic handling of hidden file search when the user provides a path that contains hidden directories. Previously, users needed to explicitly use the --include-hidden flag even when searching within hidden directories (like ~/.local/share/Trash). The modification:

1. Moves the includeHidden flag processing earlier in the parsing sequence
2. Adds auto-enable logic when detecting hidden path components
3. Preserves explicit user settings when --include-hidden is used

Log: Search now automatically includes hidden files when searching within hidden directories

Influence:
1. Test searches in hidden directories without --include-hidden flag
2. Verify searches in normal directories still respect includeHidden setting
3. Test explicit --include-hidden flag overrides the auto-detection
4. Verify mixed path cases (both hidden and non-hidden components)
5. Check backward compatibility with existing config files
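
The auto-enable rule can be sketched in a few lines of Python. Both function names are hypothetical, and the sketch assumes that a "hidden component" means any path segment starting with a dot (other than "." and ".."), which is the usual Unix convention but is not spelled out in the commit.

```python
from pathlib import PurePosixPath


def has_hidden_component(path: str) -> bool:
    """True if any segment of the path starts with '.', ignoring '.' and '..'."""
    return any(
        part.startswith(".") and part not in (".", "..")
        for part in PurePosixPath(path).parts
    )


def resolve_include_hidden(search_path: str, explicit_include_hidden: bool = False) -> bool:
    """An explicit --include-hidden always wins; otherwise infer from the search path."""
    return explicit_include_hidden or has_hidden_component(search_path)
```

With this logic, a search under ~/.local/share/Trash includes hidden files without the flag, while a search under a normal directory still excludes them unless the flag is given.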

feat: add location-based search support for Chinese NLP

1. Implemented LocationExtractor to parse and resolve standard directory names in Chinese
2. Updated ParsedIntent to include searchDirectories and includeHidden fields
3. Added location_rules.json with mappings for desktop, downloads, documents etc.
4. Included trash/recycle bin location handling with includeHidden flag
5. Modified search builder and searcher to support multiple directory searches
6. Added comprehensive test cases for all location scenarios

Log: Added support for searching in specific directories like Desktop and Download with natural language queries in Chinese

Influence:
1. Test natural language searches with location terms (e.g., "桌面上的文档")
2. Verify correct path resolution for all standard directories
3. Test combined searches with locations and other filters
4. Verify trash/recycle bin searches include hidden files
5. Test multiple directory searches (e.g., "桌面和下载的图片")
6. Ensure backward compatibility with non-location queries

feat: add semantic query detection and multi-path search support

1. Added isSemanticQuery() API to check for semantic intent in search queries
2. Implemented multiple path search support via setSearchPaths() interface
3. Added test cases for isSemanticQuery functionality
4. Optimized path prefix queries for multi-path scenarios
5. Improved handling of TimeField::Both in time range filters
6. Restructured query building to avoid creating duplicate engines

Log: Improved time range filter handling for both creation and modification times

Influence:
1. Test semantic query detection with various inputs (time, size, file types, locations)
2. Verify multi-path search functionality with different path combinations
3. Test path prefix optimization with single and multiple paths
4. Verify time range filtering with TimeField::Both option
5. Test edge cases in isSemanticQuery (empty, whitespace, pure keywords)

feat: add file size range filter to search strategies

1. Implemented file size range filtering for both content and OCR text search strategies
2. Added logic to build numeric range queries for file sizes using TimeRangeUtils
3. Integrated size filter with existing search queries using BooleanQuery
4. The filter works both in combination with other queries and as standalone query
5. Supports inclusive/exclusive bounds through SizeRangeFilter settings

Log: Added file size range filtering in search results

Influence:
1. Test search with various size filters (min, max, both)
2. Verify inclusive/exclusive bounds work correctly
3. Test in combination with other search criteria (filename, content, etc)
4. Check performance impact with large file collections
5. Verify edge cases (empty range, invalid values)
6. Test with extreme size values (0, very large files)
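
The inclusive/exclusive bound behaviour can be illustrated with a small Python sketch. The class below is a stand-in written for this example, not the project's SizeRangeFilter: its field names are hypothetical, and it checks a size in memory rather than building a Lucene numeric range query.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SizeRange:
    """Hypothetical stand-in for a size range with optional, configurable bounds."""
    min_size: Optional[int] = None   # None means no lower bound
    max_size: Optional[int] = None   # None means no upper bound
    include_min: bool = True
    include_max: bool = True

    def contains(self, size: int) -> bool:
        if self.min_size is not None:
            if size < self.min_size or (size == self.min_size and not self.include_min):
                return False
        if self.max_size is not None:
            if size > self.max_size or (size == self.max_size and not self.include_max):
                return False
        return True
```

An unbounded range accepts every size, which matches the idea that the filter only constrains results when at least one bound is set.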

fix: unify dfm-search library and path names

1. Changed installation paths to consistently use 'dfm-search' instead of dynamic library name
2. Removed DFM_SEARCH_LIB_NAME compile definition as it's no longer needed
3. Updated path references in comments to match new standardized naming
4. Simplified semantic rules path construction by removing library name dependency

These changes standardize the naming scheme and paths across the codebase, making configuration and maintenance simpler. Previously, different versions (dfm-search vs dfm6-search) had slightly different paths and configuration, which could lead to inconsistencies.

Influence:
1. Verify semantic rule loading from standard /usr/share/deepin/dfm-search path
2. Check user configuration fallback in ~/.config/deepin/dfm-search
3. Test search functionality to ensure rules are properly loaded from new paths
4. Verify installation creates correct directory structure under /usr/share/deepin/dfm-search

feat: add file metadata attributes to search results

1. Added checksum support for OCR search results via new methods checksum() and setChecksum()
2. Added file size attribute support for both text and OCR search results
3. Implemented file size retrieval from Lucene index
4. Updated output formatting to display both checksum and file size
5. Added file size processing in indexed search strategies

These changes enhance search result metadata by incorporating file verification (checksums) and size information, which improves file identification and management capabilities.

Log: Added file checksum and size information to search results

Influence:
1. Test that OCR searches display checksums
2. Verify file size appears correctly in both text and OCR search results
3. Test with files of various sizes
4. Verify empty/null cases when attributes are not available
5. Test search performance with new metadata attributes

fix: improve Chinese NLP search functionality

1. Added support for Chinese unit size queries (兆/GB/MB/KB) in SizeExtractor
2. Renamed ruleFileNames() to ruleFilePaths() and improved rule file loading
3. Removed QFileSystemWatcher for rule file changes to simplify implementation
4. Updated noise rules with more common search lead-in phrases
5. Consolidated document type synonyms in filetype_rules.json
6. Improved test coverage for Chinese unit size queries

Log: Enhanced Chinese NLP search with better size unit support and simplified rule management

Influence:
1. Test searching with Chinese size units (e.g. "大于100兆的文件")
2. Verify document type search after synonym updates
3. Check handling of raw byte size queries
4. Test with various noise/lead-in phrases
5. Verify rule loading from both user and system directories

feat: add semantic search with detailed results

1. Added SearchType::Semantic enum value for semantic/natural-language search
2. Implemented detailed results control in SemanticSearcher with setDetailedResultsEnabled()
3. Modified search engines to always use path prefix query optimization for faster results
4. Changed semantic search to collect all results at once instead of streaming for consistency
5. Updated JSON and text output formatters to handle semantic search results with custom attributes

Log: Added semantic search type and detailed results configuration

Influence:
1. Test semantic search with various natural language queries
2. Verify detailed results contain all expected metadata when enabled
3. Check JSON output format for semantic search results
4. Verify path prefix optimization works for all search paths
5. Test performance impact of detailed results collection
6. Verify semantic search works with existing filters (time, size)

feat: enhance semantic search with explicit directories

1. Added new search() and searchSync() overloads that accept explicit search directories parameter
2. Implemented search directory priority: explicit > NLP-parsed > home directory
3. Added intentParsed() signal to emit NLP parsing results before search starts
4. Extended JsonOutput to serialize ParsedIntent to JSON
5. Updated CLI client to support search path in semantic mode
6. Improved semantic search infrastructure with better directory handling

Log: Added support for explicit search directories in semantic search
Log: Added intentParsed signal showing NLP parsing results

Influence:
1. Test semantic search with explicit directories vs natural-language-parsed directories
2. Verify JSON output contains ParsedIntent details when available
3. Check search directory priority handling
4. Test intentParsed signal timing relative to searchStarted
5. Verify backward compatibility with single-parameter search() calls
6. Test JSON output format in both streaming and complete modes
7. Validate CLI behavior with and without specified search paths
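
The directory priority (explicit > NLP-parsed > home directory) amounts to a simple fallback chain. The Python sketch below shows that chain; the function and parameter names are hypothetical and do not correspond to the actual search() overloads.

```python
from typing import List, Sequence


def choose_search_directories(
    explicit: Sequence[str],
    nlp_parsed: Sequence[str],
    home: str,
) -> List[str]:
    """Pick search directories: caller-supplied first, then NLP-parsed, then home."""
    if explicit:
        return list(explicit)
    if nlp_parsed:
        return list(nlp_parsed)
    return [home]
```

Note that in this sketch explicit directories replace, rather than merge with, any location parsed from the query; whether the real implementation merges the two is not stated in the commit.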

feat: add max results limit for semantic search

1. Added max results parameter to control search result volume
2. Implemented result truncation after deduplication in semantic search
3. Added CLI option to set max results from command line
4. Results are now limited in both individual engines and final output
5. Search options are properly forwarded to all sub-engines

Log: Added ability to limit maximum search results in semantic search

Influence:
1. Test search with max results set to various values (0, 10, 1000)
2. Verify results are properly truncated while maintaining deduplication
3. Check command line option --max-results functionality
4. Test interaction with detailed results mode
5. Verify engine-level and final-level result limiting
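
The ordering matters here: truncating before deduplication could return fewer unique results than requested. A minimal Python sketch of "deduplicate, then truncate" follows. The function name is hypothetical, results are represented as plain path strings, and `None` is used for "no limit" because the commit does not say how a value of 0 is interpreted.

```python
from typing import Iterable, List, Optional


def merge_results(result_lists: Iterable[Iterable[str]], max_results: Optional[int] = None) -> List[str]:
    """Merge per-engine result lists, drop duplicate paths, then apply the limit."""
    seen = set()
    merged: List[str] = []
    for results in result_lists:
        for path in results:
            if path in seen:
                continue  # the same file was already found by another engine
            seen.add(path)
            merged.append(path)
    if max_results is None:
        return merged
    return merged[:max_results]
```

First-seen order is preserved, so the engine listed first (for example filename search) effectively takes precedence when the limit cuts the list.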

test: add search target control tests

1. Added SearchTarget enum to specify where to search (filename/content/all) in semantic search
2. Implemented keyword extraction rule metadata to detect target from user query
3. Modified SemanticQueryBuilder to selectively enable search paths based on target
4. Added comprehensive test cases for:
   - Search target detection from user queries
   - Query builder behavior for different targets
   - Default fallback behavior
5. Updated Chinese keyword rules with search_target metadata

Influence:
1. Test Chinese queries with filename/content keywords
2. Verify default fallback to all search paths
3. Check query builder produces correct search plans
4. Test boundary cases like empty keywords or invalid rules
5. Verify search target metadata handling in rules

feat: add chinese NLP parsing for relative time and size constraints

1. Added support for suffix-only size constraints in Chinese (e.g. "10M以上", "1G以下")
2. Implemented parsing for dynamic relative time expressions (e.g. "最近3天", "近2小时")
3. Added unit tests for various combinations: size + filetype, time + filetype
4. Supported Chinese numerals in time and size expressions (e.g. "一百兆", "近一周")
5. Added locale-aware number conversion for Chinese numerals (e.g. "一" to 1, "二" to 2)

Log: Added Chinese NLP support for relative time and size constraint parsing

Influence:
1. Test various size constraint patterns with Chinese characters
2. Verify combinations of size/date constraints with different file types
3. Check edge cases like maximum allowed values
4. Test Chinese numeral conversions in all contexts
5. Verify time calculations for relative time expressions
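
To illustrate the numeral conversion step, here is a small Python sketch that turns simple Chinese numerals into integers. It is a simplified stand-in, not the project's converter: the function name is hypothetical, and it only handles digits with the units 十, 百 and 千 (no 万 or 亿), which is enough for examples such as "一百兆" or "近一周".

```python
_CN_DIGITS = {"零": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
              "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
_CN_UNITS = {"十": 10, "百": 100, "千": 1000}


def chinese_to_int(text: str) -> int:
    """Convert '一百', '二十三' or plain ASCII digits like '100' to an int."""
    if not text:
        raise ValueError("empty numeral")
    if text.isascii() and text.isdigit():
        return int(text)
    total = 0
    digit = 0
    for ch in text:
        if ch in _CN_DIGITS:
            digit = _CN_DIGITS[ch]
        elif ch in _CN_UNITS:
            # A bare unit such as the leading 十 in 十五 implies a multiplier of one.
            total += (digit or 1) * _CN_UNITS[ch]
            digit = 0
        else:
            raise ValueError(f"unsupported numeral character: {ch!r}")
    return total + digit
```

With this in place, an expression like "一百兆" would reduce to the number 100 plus the unit 兆, after which the usual size parsing applies.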

feat: add NGram analyzer and tokenizer for Lucene++

1. Implemented NGramAnalyzer class for Lucene++ that generates overlapping word n-grams
2. Implemented NGramTokenizer class that performs the actual n-gram generation
3. Added support for configurable min/max n-gram sizes (default 2-4)
4. Includes buffer management for efficient text processing
5. Implements position handling and attribute management for search integration

Log: Added NGram analyzer and tokenizer for enhanced text searching

Influence:
1. Test search functionality with different min/max gram sizes
2. Verify correct token generation for various input lengths
3. Check buffer handling with large input texts
4. Test position increment behavior in search results
5. Verify case insensitivity in token generation
6. Test edge cases with very short/long input strings
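
The core idea of overlapping n-gram generation can be shown in a few lines of Python. This sketch is not the Lucene++ tokenizer: it works on a whole in-memory string instead of a buffered reader, treats every character (including whitespace) as part of a gram, and lowercases the input as a simple way to get case-insensitive tokens. The function name and the (token, start offset) output shape are assumptions made for this example.

```python
from typing import List, Tuple


def ngrams(text: str, min_gram: int = 2, max_gram: int = 4) -> List[Tuple[str, int]]:
    """Return overlapping character n-grams as (token, start_offset) pairs."""
    if min_gram < 1 or max_gram < min_gram:
        raise ValueError("require 1 <= min_gram <= max_gram")
    text = text.lower()
    tokens: List[Tuple[str, int]] = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                tokens.append((text[start:start + size], start))
    return tokens
```

With min_gram = max_gram = 2, the string "文件搜索" yields the bigrams "文件", "件搜" and "搜索", which is why n-gram indexing can match Chinese text without a dictionary-based word segmenter.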

fix: improve content search engine validation and analyzer

1. Changed keyword length validation from UTF-8 byte count to character count in ContentSearchEngine
2. Replaced ChineseAnalyzer with NGramAnalyzer(2,2) in IndexedStrategy for better search performance
3. Removed unused highlight function and related ChineseAnalyzer dependency
4. Cleaned up unnecessary headers and code

The changes improve search accuracy by validating keyword length based on characters rather than bytes, and enhance search performance by using NGramAnalyzer instead of ChineseAnalyzer. The removed highlight functionality was unused and potentially problematic.

Influence:
1. Test search with short keywords to verify proper validation
2. Verify search accuracy with different keyword lengths
3. Test performance with various search queries
4. Ensure search results are still properly highlighted when applicable
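
The difference between the two length measures is easy to see in a short Python sketch. A single CJK character occupies three bytes in UTF-8, so a byte-based minimum would let a one-character keyword through where a character-based one would not. The function names and the minimum of 2 characters are assumptions for illustration; the commit does not state the actual threshold.

```python
def utf8_byte_length(keyword: str) -> int:
    """Length in bytes after UTF-8 encoding (the old, misleading measure)."""
    return len(keyword.encode("utf-8"))


def is_valid_keyword(keyword: str, min_chars: int = 2) -> bool:
    """Validate by character count; min_chars=2 is an assumed threshold."""
    return len(keyword.strip()) >= min_chars
```

For example, "文" has a byte length of 3 but a character count of 1, so it fails the character-based check even though a naive byte-based check with the same threshold would accept it.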

refactor: optimize search filtering and query building

1. Move path filtering, exclusion and hidden file checks to Lucene query layer for better performance
2. Remove obsolete SearchUtility ancestor paths support checks since all indexes now support this feature
3. Pre-allocate result vectors to avoid reallocations during append
4. Use move semantics for SearchResult objects where applicable
5. Remove unused searchPath parameter from query building methods
6. Consolidate path and permission checks into single filtering step
7. Remove unused SearchUtility headers and functionality

Log: Optimized search performance by moving filtering to query layer

Influence:
1. Test content search with various path filters and exclusions
2. Verify hidden file filtering works correctly
3. Test OCR text search performance with large results
4. Verify filename search maintains all previous functionality
5. Check all types of searches with verbose mode enabled
6. Test with multiple search paths and complex exclusion paths

feat: add on-demand content highlight retrieval

1. Implement ContentRetriever class for fetching highlighted content from Lucene index
2. Add subcommand support in CLI with "highlight" mode
3. Support both single file and batch highlight fetching
4. Implement text and JSON output formats
5. Add configuration options for snippet length and HTML wrapping
6. Handle OCR content and regular content search separately
7. Include error handling and graceful fallbacks

Log:
1. Added standalone highlight extraction feature via new ContentRetriever class
2. Added CLI subcommand: "highlight" mode supports fetching snippets without full search
3. Supports both text and machine-readable JSON output formats

Influence:
1. Test highlight retrieval with various file types (txt, pdf, images with OCR)
2. Verify CLI highlight subcommand with different combinations of parameters
3. Test boundary cases: empty input, non-existent files, invalid paths
4. Verify JSON output format is valid and complete
5. Test error handling when index is corrupted or unavailable
6. Validate performance with large batches of paths

refactor: improve NGramTokenizer and search factory

1. Added normalizeGramSize method to enforce valid n-gram sizes between 1 and kIoBufferSize
2. Added resetState method to consolidate common reset logic
3. Added new reset method with ReaderPtr parameter for better resource management
4. Improved buffer initialization using std::fill_n instead of memset for consistency
5. Added Semantic search type support in SearchFactory
6. Fixed code formatting and alignment in header file
7. Enhanced NGramTokenizer robustness by validating gram sizes upon construction

Log: Added Semantic search type support in search factory

Influence:
1. Test NGramTokenizer with various min/max gram inputs including edge cases
2. Verify search factory properly handles Semantic search type
3. Ensure tokenizer correctly processes input after reset operations
4. Validate normalization of extreme gram size values
5. Test buffer handling with different input sizes

refactor: improve OCR text search validation and analyzer selection

1. Changed keyword length validation to use the Unicode character count instead of the UTF-8 byte size for more accurate measurement
2. Replaced ChineseAnalyzer with NGramAnalyzer(2,2) for better OCR text fuzzy matching
3. Removed the dependency on 3rdparty fulltext/chineseanalyzer.h
4. Added the dfm-search/lucene++/ngramanalyzer.h include instead

These changes were made because:
1. Measuring text length in characters is more appropriate than bytes for validation
2. The NGram analyzer gives better results for OCR text, which often contains recognition errors
3. Using the built-in analyzer removes an external dependency and improves maintainability

Log: Improved accuracy of OCR text search results

Influence:
1. Test search with various Unicode characters and check validation behavior
2. Verify search result quality with partially recognized OCR text
3. Test with both short and long search queries
4. Check the performance impact of the new analyzer
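To see why the two measurements differ, the following sketch counts Unicode code points in a UTF-8 string by skipping continuation bytes. It assumes well-formed UTF-8 input; the function names and the two-character minimum are hypothetical and not taken from the project.

```cpp
#include <cstddef>
#include <string>

// Count Unicode code points in a UTF-8 string.
// Every byte that is NOT a continuation byte (10xxxxxx) starts a new code point.
inline std::size_t utf8CodePointCount(const std::string &utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80)
            ++count;
    }
    return count;
}

// Hypothetical validation rule: require at least minChars characters.
inline bool isKeywordLongEnough(const std::string &utf8, std::size_t minChars = 2)
{
    return utf8CodePointCount(utf8) >= minChars;
}
```

A two-character Chinese keyword occupies six bytes in UTF-8, so a byte-based threshold and a character-based threshold accept different inputs.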

perf: optimize search performance with field selector

Added a field selector that loads only the necessary fields during search operations to reduce disk I/O.

1. Implemented MapFieldSelector to selectively load document fields
2. The contents field is now loaded only when full text retrieval is enabled
3. Detailed metadata fields are conditionally loaded based on the detailedResults option
4. Optimized path field loading by making it always included

This significantly reduces memory usage and disk access when:
1. Only searching paths without content previews
2. Showing basic results without detailed metadata
3. Operating on large indexes with many documents

Influence:
1. Test search performance with detailed results enabled and disabled
2. Verify that content preview still works when enabled
3. Check that basic path-only searches are faster
4. Verify that all detailed metadata fields appear when requested
5. Test with large document collections to confirm reduced memory usage
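The selection rule in this commit can be sketched without Lucene++ as a function that decides which stored fields to request. This is a rough illustration: the options struct, the function, and the metadata field names are placeholders, and the resulting list is what one would hand to a field selector.

```cpp
#include <string>
#include <vector>

// Hypothetical mirror of the two search options named in the commit message.
struct FieldLoadOptions {
    bool fullTextRetrieval = false;
    bool detailedResults = false;
};

// Decide which stored fields are worth loading for one search hit.
inline std::vector<std::string> fieldsToLoad(const FieldLoadOptions &options,
                                             const std::string &contentField = "contents")
{
    std::vector<std::string> fields { "path" };   // always needed to report a hit
    if (options.fullTextRetrieval)
        fields.push_back(contentField);            // large field, loaded on demand only
    if (options.detailedResults) {
        // Placeholder metadata names, not taken from the project.
        for (const char *name : { "filename", "modified_time", "file_size" })
            fields.emplace_back(name);
    }
    return fields;
}
```

With both options off, only the path is read from disk, which is where the reduction in I/O comes from.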

refactor: disable unit tests in release builds

1. Modified the CMake configuration to automatically disable unit tests in Release and MinSizeRel builds
2. Moved the BUILD_UNIT_TESTS option definition to the main CMakeLists.txt for better visibility
3. Added a safety guard in autotests/CMakeLists.txt to prevent accidental inclusion when tests are disabled
4. Improved build system organization by centralizing test configuration logic

Influence:
1. Verify that unit tests are excluded from Release/MinSizeRel builds
2. Check that Debug/RelWithDebInfo builds still include tests
3. Confirm that builds complete successfully in all configurations
4. Test that the build system behaves correctly when the autotests directory is accessed directly

feat: optimize ngram search query building

The changes refactor how ngram search queries are built to improve performance and simplify the query building process:

1. Added a new buildNGramSearchQuery utility function that directly constructs TermQuery and PhraseQuery objects for ngram searching
2. Removed the QueryParser and NGramAnalyzer dependencies from the content and OCR search strategies
3. Unified the query building logic between content and OCR search
4. Modified keyword length validation to use the UTF-8 byte count instead of the character count

These changes eliminate the need for real-time analysis during searches and provide more precise control over query generation. The implementation specifically handles:
- Single and two-character terms as TermQuery
- Longer terms as PhraseQuery with proper positions
- Case sensitivity handling

Log: Optimized ngram search query building for better performance

Influence:
1. Test content search with various keyword lengths
2. Verify special character handling in search terms
3. Check case sensitivity behavior
4. Verify mixed filename and content search results
5. Test OCR text search functionality

refactor: remove NGram analyzer and tokenizer components

Removed the NGram analyzer and tokenizer implementation from the Lucene++ integration. These components were responsible for generating n-gram tokens from input text but are no longer needed in the search functionality.

The removal includes:
1. NGramAnalyzer header and implementation
2. NGramTokenizer header and implementation
3. All associated utility functions and constants

This cleanup is part of ongoing efforts to simplify the Lucene++ integration and remove unused components. The n-gram token generation functionality was not being used by the current search implementation and was complicating the codebase.

Influence:
Testing should verify that basic search functionality still works as expected, particularly:
1. File content searching
2. File name searching
3. Special character handling in searches
4. Various query types (exact match, wildcard, etc.)

fix: adjust N-gram token position calculation

The change fixes the position calculation in N-gram search queries so that it aligns with the behavior of Lucene++'s NGramTokenizer. The tokenizer emits a 1-gram and a 2-gram token at each character offset, advancing the position by 1 for each token emitted. The new position calculation uses the formula 2*i+1, where i is the start offset, to accurately reflect token positions.

1. Added the helper function phrasePositionForStandardNGram2 to calculate proper positions
2. Updated test cases to expect the new position values (1,5,9 for even length, 1,5,7 for odd)
3. Modified the query building logic to use the new position calculation

Influence:
1. Test N-gram search functionality with various input lengths
2. Verify search accuracy with different character combinations
3. Check position-dependent search operations
4. Validate edge cases with very short and very long search terms
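The formula follows from the emission order: offset 0 yields a 1-gram at position 0 and a 2-gram at position 1, offset 1 yields positions 2 and 3, and so on, so the 2-gram starting at offset i sits at position 2*i+1. The sketch below reproduces the position values quoted in the commit. The helper's signature is an assumption, and the offset selection (non-overlapping 2-grams with a final gram anchored at the end for odd lengths) is inferred from those test values rather than taken from the source.

```cpp
#include <cstddef>
#include <vector>

// Position of the 2-gram starting at character offset i: 2*i + 1.
constexpr int phrasePositionForStandardNGram2(std::size_t startOffset)
{
    return static_cast<int>(2 * startOffset + 1);
}

struct GramSlot {
    std::size_t offset;   // character offset where the 2-gram starts
    int position;         // token position used in the phrase query
};

// Assumed selection: step through the keyword two characters at a time,
// and for odd lengths add one last 2-gram covering the final two characters.
inline std::vector<GramSlot> bigramSlots(std::size_t charCount)
{
    std::vector<GramSlot> slots;
    if (charCount < 2)
        return slots;
    for (std::size_t i = 0; i + 2 <= charCount; i += 2)
        slots.push_back({ i, phrasePositionForStandardNGram2(i) });
    if (charCount % 2 == 1) {
        const std::size_t last = charCount - 2;
        slots.push_back({ last, phrasePositionForStandardNGram2(last) });
    }
    return slots;
}
```

Under these assumptions a six-character keyword uses offsets 0, 2, 4 and a five-character keyword uses offsets 0, 2, 3, which matches the expected positions 1,5,9 and 1,5,7.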

feat: enhance ContentRetriever with content fetching capabilities

1. Added fetchContent() and fetchContents() methods to retrieve full stored content from the Lucene index
2. Implemented an index directory override and a caching mechanism
3. Improved thread safety with mutex protection for index reader operations
4. Added unit tests for the new functionality
5. Refactored the existing highlight methods to use common internal APIs
6. Added a kCheckSum constant to field_names.h for future use

The changes enable:
- Retrieving full document content for both text and OCR search types
- Flexible index directory configuration for testing scenarios
- Better performance through index reader caching
- Thread-safe operations in concurrent environments
- A maintainable code structure with shared internal APIs

Log: Added content retrieval to text search capabilities

Influence:
1. Test fetchContent() with valid and invalid file paths
2. Verify batch content retrieval with fetchContents()
3. Test the index directory override functionality
4. Verify thread safety in concurrent access scenarios
5. Check performance with large content sets
6. Verify backward compatibility with the existing highlight methods

test: add test utility libraries for content search

1. Added the elfio library for ELF file parsing and manipulation
2. Implemented addr_any.h for address resolution across executable sections
3. Added addr_pri.h for accessing private class members
4. Added stub.h for function hooking/replacement
5. Added stub-ext extensions with stub-shadow support
6. Integrated the utilities into the test framework

Log: Implemented test utilities including ELF parsing, function hooking and private member access for content search testing

Influence:
1. Test content search index building and querying
2. Verify search result accuracy with controlled inputs
3. Test private API behaviors through reflection
4. Validate text and OCR search functionality

perf: optimize OCR text search document loading

Added selective field loading for OCR text search results to significantly reduce disk I/O when detailed results are not needed. The change introduces a MapFieldSelector that loads only the necessary fields: path is always loaded, ocr_contents only when full text retrieval is enabled, and additional metadata fields only when detailed results are requested.

Technical details:
1. Implemented a field selector that skips loading the large OCR text content (ocr_contents) unless full text retrieval is enabled
2. Loads additional metadata fields (filename, timestamps, etc.) only when detailed results are requested
3. Maintains all existing functionality while reducing memory usage and disk I/O
4. Preserves backward compatibility with existing search options

Influence:
1. Test search functionality with both simple and detailed result requests
2. Verify the performance improvement when detailed results are disabled
3. Check the memory usage reduction during searches with large result sets
4. Validate that all required fields are correctly loaded when detailed results are enabled
5. Test edge cases with empty fields or missing document attributes

perf: replace chinese analyzer with ngram search

1. Removed the chinese analyzer and tokenizer files from the fulltext 3rdparty directory
2. Updated CMakeLists.txt to exclude the fulltext 3rdparty files
3. Modified QueryBuilder to use ngram search instead of the chinese analyzer
4. Simplified the query building methods and removed analyzer dependencies
5. Improved pinyin and acronym search using ngram queries

This change replaces the complex chinese analyzer implementation with a simpler ngram-based search approach, which should provide more consistent results while being easier to maintain. The ngram search is now used for all text matching, including pinyin and acronym searches.

Log: Improved search functionality using ngram matching instead of chinese analyzer

Influence:
1. Test basic search functionality with Chinese characters
2. Verify pinyin and acronym search results
3. Test combined search queries with multiple terms
4. Verify that wildcard searches work correctly
5. Test case-sensitive and case-insensitive searches
6. Check the performance impact of the analyzer removal

test: add filename search engine test cases

Added comprehensive test cases for FileNameSearchEngine functionality, including:
1. Basic keyword search with indexed and realtime modes
2. Boolean AND/OR queries and wildcard pattern matching
3. File type and extension filters
4. Hidden files and excluded paths handling
5. Size and time range filters
6. Pinyin and acronym search support
7. Detailed result attributes verification
8. Error handling for invalid inputs

The tests validate both the indexed (Lucene-based) and realtime (filesystem scan) search modes to ensure consistent behavior across the two search methods.

Influence:
1. Verify that all test cases pass with different combinations of search parameters
2. Test with various file types and naming patterns
3. Check behavior with hidden files and excluded directories
4. Validate time and size filter boundaries
5. Confirm that detailed result attributes are accurate
6. Test error conditions such as empty queries and invalid file types

deepin-ci-robot · 2026-05-28T05:37:50Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Johnson-zs

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

debian/deepin/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

github-actions · 2026-05-28T05:37:53Z

Warning

[Debian check]

Changes detected in files under the debian directory: debian/libdfm-search.install, debian/libdfm6-search.install

docs: update license files and cleanup

1. Updated .reuse/dep5 with new copyright entries for:
   - cpp-stub (MIT)
   - ELFIO (MIT)
   - semantic rules (GPL-3.0-or-later)
2. Removed LGPL-3.0-or-later.txt, which is no longer needed
3. Added the MIT.txt license file for the new MIT-licensed components
4. Removed the unused .gitkeep from the tools directory

These changes reflect updated copyright and license information for third-party components used in the project, plus cleanup of unused files.

Influence:
1. Verify that the build system continues to work with the new licensing information
2. Check that all new license files are properly referenced
3. Confirm that the removed files were actually obsolete
4. Verify that project documentation references the correct licenses

github-actions · 2026-05-28T05:41:53Z

Warning

[Debian check]

Changes detected in files under the debian directory: debian/libdfm-search.install, debian/libdfm6-search.install

1.3.57 Log:

github-actions · 2026-05-28T06:04:20Z

TAG Bot

TAG: 1.3.57
EXISTED: no
DISTRIBUTION: unstable

github-actions · 2026-05-28T06:04:27Z

Warning

[Debian check]

Changes detected in files under the debian directory: debian/libdfm-search.install, debian/libdfm6-search.install

deepin-ci-robot · 2026-05-28T06:05:36Z

deepin pr auto review

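This change set is very large. It covers the refactoring of the underlying search strategies, the new semantic search feature, the introduction of test utilities, and build system adjustments. I will review the core code in detail along four dimensions (syntax and logic, code quality, performance, and security) and offer suggestions for improvement.

I. Syntax and logic issues

1. `addr_any.h` - Potential out-of-bounds memory access and resource leak

Issue: In get_lib_pathname_and_baseaddr, if regcomp succeeds but an exception or early return occurs before the function returns, pathname_regex is never freed, which leaks the resource.
Issue: In get_func_addr, if regcomp(&pathname_regex, func_name_regex_str.c_str(), 0); fails (returns non-zero), the subsequent regexec call is undefined behavior.
Suggestion: Check the return value of regcomp, and manage the regex_t resource with an RAII wrapper to guarantee exception safety.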

```cpp
// addr_any.h - example fix for get_func_addr
if (0 != regcomp(&pathname_regex, func_name_regex_str.c_str(), 0)) {
    return -1; // the return value must be checked
}
// ... make sure to free it after use
regfree(&pathname_regex);
```
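The RAII wrapper suggested above could look roughly like the following sketch. The class name and interface are hypothetical. It keeps the return-code style of addr_any.h by exposing `ok()` instead of throwing, and it calls regfree only when regcomp succeeded.

```cpp
#include <regex.h>

// Minimal RAII holder for a POSIX regex_t (hypothetical helper, not from the PR).
class ScopedRegex {
public:
    ScopedRegex(const char *pattern, int cflags)
        : m_compiled(regcomp(&m_regex, pattern, cflags) == 0)
    {
    }

    // Free the compiled pattern on every exit path, but only if it was compiled.
    ~ScopedRegex()
    {
        if (m_compiled)
            regfree(&m_regex);
    }

    ScopedRegex(const ScopedRegex &) = delete;
    ScopedRegex &operator=(const ScopedRegex &) = delete;

    // True when regcomp succeeded; callers must check this before matching.
    bool ok() const { return m_compiled; }

    // Returns false both for "no match" and for an uncompiled pattern.
    bool matches(const char *text) const
    {
        return m_compiled && regexec(&m_regex, text, 0, nullptr, 0) == 0;
    }

private:
    regex_t m_regex {};   // declared before m_compiled so it exists when regcomp runs
    bool m_compiled;
};
```

With a holder like this, an early return or exception between compilation and use no longer leaks the pattern, and a failed compilation can never reach regexec.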

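2. `stub.h` - Serious security and logic flaws

Issue: In Stub::set, if mprotect fails the code directly executes throw("stub set memory protect to w+r+x faild");. Throwing a const char* exception is dangerous, and continuing execution after the memory permission change has failed would crash the program.
Issue: Stub::reset and Stub::clear have the same problem of throwing directly when mprotect fails.
Suggestion: Replace throw("...") with a standard exception (such as std::runtime_error), or log the error and return safely when the permission change fails, rather than throwing an exception whose stack unwinding may touch memory that is not accessible.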
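As one possible shape for the first option, the sketch below wraps mprotect in a helper that throws std::runtime_error with the errno text. The helper name is hypothetical and this is not the stub.h code.

```cpp
#include <sys/mman.h>

#include <cerrno>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical helper: change page protection or throw a typed exception.
// pageStart must be page-aligned, as mprotect requires.
inline void setProtectionOrThrow(void *pageStart, std::size_t length, int prot)
{
    if (mprotect(pageStart, length, prot) != 0) {
        // Capture errno immediately so the message reflects this failure.
        const int savedErrno = errno;
        throw std::runtime_error(std::string("mprotect failed: ")
                                 + std::strerror(savedErrno));
    }
}
```

Callers can then catch `const std::exception &` like any other standard error, which a thrown string literal does not allow.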

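3. `stub-shadow.h` - Risk of truncation in type conversion

Issue: FuncShadow::call contains long id = (long)shadow;. On 64-bit systems, casting a pointer to long and using it as a lookup key usually works, but the standard does not guarantee that long is wide enough to hold a pointer (under the Windows LLP64 model long is 32 bits).
Suggestion: Use intptr_t or uintptr_t instead of long for pointer-to-integer conversions.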
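A minimal illustration of the suggested conversion, assuming the key is derived from an object pointer. The `pointerKey` helper is hypothetical, and the actual type of `shadow` is not shown in the review.

```cpp
#include <cstdint>

// std::uintptr_t is defined to round-trip a void pointer; long is not.
static_assert(sizeof(std::uintptr_t) >= sizeof(void *),
              "uintptr_t must be able to hold an object pointer");

// Hypothetical helper: turn an object pointer into an integer lookup key.
template <typename T>
std::uintptr_t pointerKey(const T *ptr)
{
    return reinterpret_cast<std::uintptr_t>(ptr);
}
```

The key can be used directly in a `std::unordered_map<std::uintptr_t, ...>` and converted back with `reinterpret_cast` without losing bits on LLP64 platforms.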

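4. `tst_chinese_nlp.cpp` - Hard-coded time in test cases

Issue: Many test cases depend on the current time (for example QDateTime::currentDateTime()) and allow only a 2-second error (endDelta < 2). This easily produces random failures in heavily loaded CI environments.
Suggestion: For relative-time tests, consider injecting a fixed time source, or relax the tolerance to 5-10 seconds.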
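Injecting the time source might look like the sketch below. It uses std::chrono in place of QDateTime to stay self-contained, and every name in it (`NowProvider`, `TimeRange`, `lastDays`) is a hypothetical stand-in for the project's relative-time parsing.

```cpp
#include <chrono>
#include <functional>

using TimePoint = std::chrono::system_clock::time_point;
using NowProvider = std::function<TimePoint()>;

struct TimeRange {
    TimePoint begin;
    TimePoint end;
};

// Default time source used in production code.
inline TimePoint systemNow()
{
    return std::chrono::system_clock::now();
}

// Resolve "the last N days" relative to whatever clock the caller injects.
inline TimeRange lastDays(int days, const NowProvider &now = systemNow)
{
    const TimePoint end = now();
    return { end - std::chrono::hours(24 * days), end };
}
```

A test passes a lambda returning a fixed instant and can then compare the range boundaries exactly, so no tolerance window is needed.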

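II. Code quality and maintainability

1. `elfio.hpp` - Huge inlined file

Issue: elfio.hpp, at nearly 5000 lines, is inlined directly into the code base. This makes the diff extremely hard to read and also increases compile time and binary size.
Suggestion: As a third-party library it should be brought in through CMake's FetchContent or add_subdirectory rather than by copying the source. If copying is unavoidable, at least keep the files separate instead of inlining everything into a single header.

2. `addr_pri.h` - Hack that violates the C++ standard

Issue: This file uses a template specialization hack to access private members. It relies on compiler-specific implementation details, violates C++ encapsulation, and may break across compilers or compiler versions.
Suggestion: If the goal is to test private members, consider a friend class or a test interface. If the hack is unavoidable, add an explicit warning comment stating that it is not portable.

3. Inconsistent code style

Issue: The code mixes NULL with nullptr and typedef with using.
Suggestion: Standardize on C++11/14 features: nullptr instead of NULL, using instead of typedef.

4. Inconsistent error handling strategy

Issue: Some code uses exceptions (stub.h), some uses return values (addr_any.h), and some uses qWarning.
Suggestion: Unify the error handling strategy in library code. Use exceptions for unrecoverable errors, and return values or a std::expected/Expected pattern for anticipated errors.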

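III. Code performance

1. `addr_any.h` - Unnecessary string copies and inefficient lookup

Issue: get_func_addr calls demangle on every symbol and then runs the regex match. demangle can be an expensive operation.
Suggestion: Match the regex against the raw symbol name first and demangle only after a match, or cache the demangle results.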
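The caching option could be sketched as follows, assuming a GCC or Clang toolchain where `abi::__cxa_demangle` from `<cxxabi.h>` is available. The `DemangleCache` class is hypothetical and not part of addr_any.h.

```cpp
#include <cxxabi.h>

#include <cstddef>
#include <cstdlib>
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical cache: each mangled symbol is demangled at most once.
class DemangleCache {
public:
    const std::string &demangled(const std::string &mangled)
    {
        auto it = m_cache.find(mangled);
        if (it != m_cache.end())
            return it->second;

        int status = 0;
        // __cxa_demangle returns a malloc'ed buffer that must be released with free.
        std::unique_ptr<char, decltype(&std::free)> raw(
            abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status), &std::free);

        // Fall back to the raw symbol name when demangling fails.
        std::string readable = (status == 0 && raw) ? std::string(raw.get())
                                                    : std::string(mangled);
        return m_cache.emplace(mangled, std::move(readable)).first->second;
    }

    std::size_t size() const { return m_cache.size(); }

private:
    std::unordered_map<std::string, std::string> m_cache;
};
```

The cache only pays off when the same symbols are looked up repeatedly, for example across several get_func_addr calls against one binary.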

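2. `indexedstrategy.cpp` - Redundant object creation in Lucene query building

Issue: buildLuceneQuery creates a new BooleanQuery object on every call, even when only a single clause is added.
Suggestion: Optimize the query building logic to reduce unnecessary object creation. Consider a query builder pattern.

3. `semanticruleengine.cpp` - Sorting during rule matching

Issue: match and matchAll run std::stable_sort on the rule list on every call.
Suggestion: Sort once when the rules are loaded, and use the already sorted list directly when matching.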
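A sketch of the sort-once approach, under the assumption that rules carry a numeric priority and higher priorities should be tried first. The `Rule` and `RuleSet` types are hypothetical and do not reflect the actual semanticruleengine.cpp data model.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Hypothetical rule shape for illustration.
struct Rule {
    std::string id;
    int priority;
};

class RuleSet {
public:
    // Sort once at load time; stable_sort keeps file order among equal priorities.
    explicit RuleSet(std::vector<Rule> rules)
        : m_rules(std::move(rules))
    {
        std::stable_sort(m_rules.begin(), m_rules.end(),
                         [](const Rule &a, const Rule &b) { return a.priority > b.priority; });
    }

    // Matching code iterates this list directly, with no per-call sorting.
    const std::vector<Rule> &ordered() const { return m_rules; }

private:
    std::vector<Rule> m_rules;
};
```

Because the list is immutable after construction, concurrent readers can share it without re-sorting or locking.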

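IV. Code security

1. `addr_any.h` - Symbol address exposure

Issue: get_func_addr allows the address of any function to be obtained through a regular expression, which could be abused for function hooking or ROP attacks.
Suggestion: Restrict the range of symbols that can be queried, or add a security check mechanism.

2. `stub.h` - Memory permission modification

Issue: The Stub class uses mprotect to make code segment memory writable and executable. This is an extremely dangerous operation that could be exploited to inject malicious code.
Suggestion: Enable this feature only in test builds and remove it entirely from production builds. Add a compile-time guard.

3. `contentretriever.cpp` - Cache invalidation strategy

Issue: ensureIndexContext uses reader->isCurrent() to check whether the index has been updated and reopens it if so. Under high concurrency this may cause frequent index reopening and hurt performance.
Suggestion: Add an exponential backoff strategy or a time-based cache invalidation mechanism to avoid frequent index reopening.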
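The time-based variant could be as simple as a throttle that limits how often the freshness check runs. The class below is a hypothetical sketch, not the ContentRetriever code. It is not thread-safe by itself and would need to be called under the same mutex that guards the index reader.

```cpp
#include <chrono>

// Hypothetical throttle: allow a freshness check at most once per interval.
class RefreshThrottle {
public:
    using Clock = std::chrono::steady_clock;

    explicit RefreshThrottle(Clock::duration minInterval)
        : m_minInterval(minInterval)
    {
    }

    // Returns true when enough time has passed to check the index again.
    // The time point is a parameter so tests can drive it deterministically.
    bool shouldCheck(Clock::time_point now = Clock::now())
    {
        if (m_hasChecked && now - m_lastCheck < m_minInterval)
            return false;
        m_lastCheck = now;
        m_hasChecked = true;
        return true;
    }

private:
    Clock::duration m_minInterval;
    Clock::time_point m_lastCheck {};
    bool m_hasChecked = false;
};
```

The trade-off is staleness: results may lag behind the index by up to one interval, so the interval should be chosen with that in mind.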

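4. `semantic_types.h` - Data structure safety

Issue: ParsedIntent contains fields such as searchDirectories and keywords. If this data comes from untrusted input (such as a network request), it may lead to path traversal or injection attacks.
Suggestion: Before building the search query, normalize searchDirectories (QDir::cleanPath) and check permissions, and escape keywords.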
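The normalization step can be illustrated with std::filesystem standing in for QDir::cleanPath. This is a sketch under stated assumptions: `isWithinRoot` is a hypothetical helper, both paths are expected to be absolute, and the check is purely lexical, so symbolic links are not resolved.

```cpp
#include <algorithm>
#include <filesystem>

namespace fs = std::filesystem;

// Hypothetical check: after lexical normalization, does candidate stay under root?
inline bool isWithinRoot(const fs::path &root, const fs::path &candidate)
{
    fs::path normRoot = root.lexically_normal();
    // A trailing slash leaves an empty last component; drop it before comparing.
    if (!normRoot.has_filename())
        normRoot = normRoot.parent_path();
    const fs::path normCandidate = candidate.lexically_normal();

    // Compare component by component so "/home/user" does not match "/home/username".
    const auto diff = std::mismatch(normRoot.begin(), normRoot.end(),
                                    normCandidate.begin(), normCandidate.end());
    return diff.first == normRoot.end();
}
```

Comparing whole path components rather than string prefixes is the important design choice here, since a plain prefix test would accept sibling directories that merely share a name prefix.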

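Summary and key improvement suggestions

Remove dangerous code: Remove stub.h and addr_pri.h from production code, or add strict compile-time guards.
Fix resource leaks: Ensure that C resources such as regex_t are released correctly on every path.
Optimize performance hot spots: Avoid repeated sorting or expensive operations in hot paths such as rule matching and query building.
Unify error handling: Adopt a consistent error handling strategy and avoid mixing exceptions and return values.
Improve test stability: Reduce the dependence on the current time by using a fixed time source or a larger tolerance.
Rework how third-party libraries are brought in: Avoid inlining large third-party library sources, and use a package manager or submodules instead.

These improvements will significantly improve the robustness, security, and maintainability of the code.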

Johnson-zs · 2026-05-28T06:14:55Z

/forcemerge

deepin-bot · 2026-05-28T06:15:07Z

This pr force merged! (status: blocked)

deepin-bot · 2026-05-28T06:15:29Z

TAG Bot

✅ Tag created successfully

📋 Tag Details

Tag Name: 1.3.57
Tag SHA: 52232b037f378501cb5090725c3ea047cd3e59f7
Commit SHA: 81872385f13d41ff1e3fef11f44f29e0f8d77d6f
Tag Message:
```
Release util-dfm 1.3.57
```
Tagger:
- Name: Johnson-zs
Distribution: unstable

Johnson-zs added 30 commits May 26, 2026 11:07

Johnson-zs added 4 commits May 26, 2026 11:07

github-actions Bot requested a review from liujianqiang-niu May 28, 2026 05:37

Johnson-zs force-pushed the feat_semantic branch from e2052b7 to 5b9b379 Compare May 28, 2026 05:41

chore: bump version to 1.3.57

fad1858

1.3.57 Log:

deepin-bot Bot merged commit 8187238 into linuxdeepin:master May 28, 2026
19 of 22 checks passed

deepin-bot[bot] merged 36 commits into linuxdeepin:master from Johnson-zs:feat_semantic
