Opsi indexes#613

Open
FrancescAlted wants to merge 43 commits into main from opsi-indexes
Conversation

@FrancescAlted
Member

No description provided.

  Add a first modern indexing engine for 1-D NDArray objects and
  structured fields, inspired by OPSI but adapted to Blosc2 chunk/block
  storage.

  Introduce four index kinds:

  - ultralight: chunk zone maps
  - light: chunk + block zone maps
  - medium: block-partitioned reduced-order exact index
  - full: global sorted values + logical positions
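  The "ultralight" idea above can be sketched in a few lines of NumPy. This is an illustrative model, not the python-blosc2 implementation: `build_zone_map` and `prune_chunks` are hypothetical names, and the real index stores its summaries in persistent sidecars.

```python
import numpy as np

# Hypothetical sketch of an "ultralight"-style chunk zone map:
# keep per-chunk (min, max) summaries and prune chunks that
# cannot possibly contain hits for a range query.

def build_zone_map(data, chunk_len):
    """Per-chunk (min, max) summaries."""
    chunks = [data[i:i + chunk_len] for i in range(0, len(data), chunk_len)]
    return np.array([(c.min(), c.max()) for c in chunks])

def prune_chunks(zone_map, lo, hi):
    """Indices of chunks whose [min, max] intersects [lo, hi]."""
    return np.flatnonzero((zone_map[:, 1] >= lo) & (zone_map[:, 0] <= hi))

data = np.arange(100)               # sorted data -> zone maps prune aggressively
zmap = build_zone_map(data, 10)     # chunk i covers [10*i, 10*i + 9]
print(prune_chunks(zmap, 42, 45))   # only chunk 4 can contain values in [42, 45]
```

  The "light" kind refines this by keeping the same summaries per block as well as per chunk, trading a larger sidecar for finer pruning.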

  Improve query execution by:

  - making full retrieval chunk-aware for scattered hits
  - making medium use per-block sorted values plus compact local offsets
  - integrating index planning into LazyExpr.where(...)
  - exposing will_use_index() and explain() helpers

  Add correctness coverage for scalar, structured, persistent, mutation,
  and random-distribution cases.

  Extend the benchmark to compare index kinds across distributions,
  report cold vs warm query timings and footprint metrics, produce
  reusable on-disk outputs, and support configurable query width /
  repeat counts.

  Cache persisted index descriptors per array to avoid repeated vlmeta
  loads during indexed queries, and keep lazy-chunk span reads for the
  block-aware gather path.

  This reduces planner overhead substantially for tiny exact-hit queries:

  - _load_store() becomes effectively free after the first lookup
  - plan_query() drops from about 0.27 ms to about 0.02 ms
  - arr[cond][:] on 10M random/full point queries drops to ~0.24 ms

  Update the benchmark to measure the clearer public indexed idiom:

  - keep scan baseline with cond.where(arr).compute(_use_index=False)[:]
  - use arr[cond][:] for indexed timings

  This makes benchmark results closer to real user code and shows the
  actual public-query latency improvements more accurately.

  Replace the experimental metadata-only light path with a real
  block-local reduced/coarse index more in line with OPSI.

  The new light stores:

  - block-local sorted values
  - coarse physical bucket positions for those sorted values
  - block offsets into the flattened sidecars

  Query execution now:

  - prunes with chunk/block summaries
  - does exact searchsorted() inside each surviving block
  - builds a coarse bucket mask from matching sorted rows
  - rechecks only those physical buckets against base data
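  The steps above can be sketched for a single surviving block. This is an illustrative model of the flow, not the python-blosc2 internals: given the block's locally sorted values and, for each sorted row, the coarse bucket it came from, `searchsorted` finds exact matches and only the contributing buckets are rechecked against base data.

```python
import numpy as np

# Illustrative flow for one surviving block (hypothetical helper name):
# exact searchsorted on block-local sorted values, coarse bucket mask,
# then an exact recheck of only those physical buckets.

def query_block(base, sorted_vals, bucket_of_sorted, bucket_len, lo, hi):
    i0 = np.searchsorted(sorted_vals, lo, side="left")
    i1 = np.searchsorted(sorted_vals, hi, side="right")
    if i0 == i1:
        return np.empty(0, dtype=np.int64)          # block pruned exactly
    buckets = np.unique(bucket_of_sorted[i0:i1])    # coarse bucket mask
    hits = []
    for b in buckets:                               # recheck base rows
        s = slice(b * bucket_len, (b + 1) * bucket_len)
        rows = np.flatnonzero((base[s] >= lo) & (base[s] <= hi))
        hits.append(rows + b * bucket_len)
    return np.concatenate(hits)

base = np.array([7, 3, 9, 1, 4, 8, 2, 6])   # one block, bucket_len = 2
order = np.argsort(base, kind="stable")
sorted_vals = base[order]
bucket_of_sorted = order // 2                # coarse bucket of each sorted row
print(query_block(base, sorted_vals, bucket_of_sorted, 2, 3, 4))
```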

  Add an integer-only lossy compression knob for light.values:

  - light_value_lossy_bits = min(9 - optlevel, dtype.itemsize)
  - capped to one eighth of the integer width
  - default optlevel=5
  - exact base-row recheck preserves correctness
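  The key property behind the lossy knob can be shown with a small sketch. The quantizer here is an assumed floor-to-step model, not the exact stored encoding: what matters is that quantization is monotone and never increases a value, so index pruning can only widen candidate ranges, and the exact base-row recheck restores precise results.

```python
import numpy as np

# Sketch of the integer lossy-value idea (assumed semantics):
# drop the low `bits` bits of each stored value.

def lossy_bits(optlevel, dtype):
    # mirrors: light_value_lossy_bits = min(9 - optlevel, dtype.itemsize)
    return min(9 - optlevel, np.dtype(dtype).itemsize)

def quantize_down(vals, bits):
    step = np.int64(1) << bits
    return (vals.astype(np.int64) // step) * step   # floor -> monotone, q(x) <= x

vals = np.array([100, 101, 130, 255], dtype=np.uint8)
bits = lossy_bits(5, np.uint8)                      # optlevel=5 -> min(4, 1) = 1
q = quantize_down(vals, bits)
assert np.all(q <= vals)                            # never rounds up
assert np.all(np.diff(q[np.argsort(vals)]) >= 0)    # order preserved
print(bits, q)
```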

  Extend the benchmark with --optlevel and make index reuse optlevel-aware.

  Update tests to cover:

  - persistent light indexes
  - lossy integer light correctness

  Extend the light lossy-value experiment from integers to float32
  and float64, while keeping all other non-integer dtypes exact.

  Use monotonic downward quantization for finite float values so light
  can still widen bounds safely and preserve correctness via exact
  base-row rechecks.
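  The float case relies on the same invariant. The exact bit-level scheme is not shown in this PR description, so the sketch below uses a fixed-step floor quantizer as an assumed stand-in: any monotone, downward quantizer (q(x) <= x, and x <= y implies q(x) <= q(y)) lets the index widen range bounds safely, with the exact base-row recheck keeping results correct.

```python
import numpy as np

# Assumed illustration of monotone downward quantization for floats;
# non-finite values are kept exact, matching the "finite float values"
# wording above.

def quantize_down(x, step=0.25):
    x = np.asarray(x, dtype=np.float64)
    q = np.floor(x / step) * step
    return np.where(np.isfinite(x), q, x)   # keep non-finite values exact

x = np.array([1.3, -1.3, 2.0, np.inf])
q = quantize_down(x)
finite = np.isfinite(x)
assert np.all(q[finite] <= x[finite])            # downward: bounds only widen
assert np.all(np.diff(q[np.argsort(x)]) >= 0)    # monotone: order preserved
print(q)
```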

  Update benchmark coverage in both:

  - bench/ndarray/index_query_bench.py
  - bench/ndarray/index_query_bench_tables.py

  Add --dtype to both scripts, defaulting to float64, and make data
  generation, query construction, and persisted output reuse dtype-aware.

  This lets us benchmark indexing behavior consistently across boolean,
  integer, and floating-point columns in both python-blosc2 and
  PyTables.

  - add streaming/out-of-core builders for light, medium, and full indexes
  - keep in_mem=True as the explicit switch back to in-memory builds
  - persist and rebuild the chosen build mode in index descriptors
  - speed up the OOC full builder with chunked external merge runs
  - fix persistent index cache reuse across reopened arrays
  - add coverage for OOC persistence, rebuilds, and in-memory override
  - switch index benchmark CLI to --in-mem with OOC as the default
  - reuse full indexes for direct sort(order=...) and indices(order=...)
  - add itersorted(...) for streaming ordered traversal via full indexes
  - teach filtered ordered queries to reuse full indexes on the order key
  - intersect exact positions across multiple indexed fields for AND predicates
  - add NDArray.append(...) for 1-D arrays
  - keep light, medium, and full indexes current on append
  - preserve sorted reads and indexed filtering after append without rebuild
  - add regression coverage for ordered access, cross-field exact filters, and append maintenance
  - add examples for sorted iteration and append-aware index maintenance
  - clarify that one active index is supported per field
  - keep name as a descriptor label rather than index identity
  - add target-aware descriptor metadata for field-backed indexes
  - document ordered access semantics as ascending and stable
  - document secondary-key tie refinement after primary full-index order
  - document append-maintained vs stale-on-mutation index behavior
  - add ordered-access planner introspection to will_use_index() and explain()
  - report ordered reuse, missing full-index cases, and filter/ordering reasons
  - simplify append-maintenance example to use a single csindex
  - add intent comments to the new indexing examples
  - update the follow-up indexing plan with the current implementation state
  - add a concrete plan section for materialized expression indexes
  - add regression coverage for target metadata and ordered explain behavior
  - add create_expr_index(...) for explicit derived-value indexes
  - generalize index descriptors and sidecars to target field or expression streams
  - normalize expression targets by canonical expression keys and dependencies
  - reuse expression indexes for where(...) filtering on matching predicates
  - reuse full expression indexes for sort(order=...) and indices(order=...)
  - keep expression indexes current across append operations
  - persist and reopen expression indexes with target metadata intact
  - raise clear errors when expression ordering lacks a matching full index
  - add regression coverage for filtering, ordered reuse, persistence, and append maintenance
  - add an examples/ndarray/expression_index.py example
  - add bench/ndarray/expression_index_bench.py for expression-index timing comparisons
  - update examples to prefer the expr[:] idiom over expr.compute()[:]

  Keep append-heavy full indexes cheap by storing each appended tail as a
  sorted run instead of rewriting the compact base sidecars on every append.

  Teach full loads to merge compact base + append runs on demand, with
  cache reuse for repeated reads, and clean up run sidecars correctly on
  replace/drop.

  Extend full descriptor metadata with run tracking while keeping the
  prototype index format version unchanged.

  Add regression tests for repeated appends on field and expression full
  indexes, including persistent reopen.
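  The run-based layout amounts to a lazy k-way merge at read time. This is a minimal sketch of that idea under an assumed in-memory layout, not the on-disk sidecar format: the compact base stays sorted, each append lands as its own sorted run, and reads merge base plus runs on demand instead of rewriting the base.

```python
import heapq

# Assumed layout: one compact sorted base, one sorted run per append.
base = [1, 4, 9, 16]
runs = [[3, 20], [2, 25]]

def merged_view(base, runs):
    # heapq.merge does a lazy k-way merge over already-sorted inputs,
    # so each read costs O(total elements) without rewriting the base.
    return heapq.merge(base, *runs)

print(list(merged_view(base, runs)))  # [1, 2, 3, 4, 9, 16, 20, 25]
```

  Caching the merged view for repeated reads, as the PR does, amortizes even that cost.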

  Replace the old block-local persistent payload format for light and
  medium with a chunk-local canonical layout using fully sorted chunk
  payloads, per-chunk offsets, chunk-level L1 boundaries, and persistent
  intrachunk L2 navigation sidecars.

  Update the builders, loaders, rebuild/append paths, and descriptor
  validation so rebuilt light and medium indexes only use the new
  chunk-local-v1 format and drop reliance on the old block-flattened
  payload assumptions.

  Add new persistent exact-query paths for light and medium that use
  chunk-level pruning plus L2-guided selective reads through sidecar span
  helpers, while preserving scan-equivalent output order.

  Switch light to chunk-local bucket geometry derived from the payload
  block length, allow wider bucket dtypes, and keep medium positions
  chunk-local instead of block-local.

  Improve explain() reporting for the new OOC lookup path with
  lookup_path="chunk-nav-ooc" and navigation candidate counts.
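  The chunk-local layout can be illustrated with a two-level lookup. Everything below is an assumed in-memory model of the described format, not the real sidecars: L1 is a per-chunk [min, max] boundary table for pruning, and L2 is a searchsorted over each surviving chunk's sorted payload, read through per-chunk offsets into the flattened sidecar.

```python
import numpy as np

payload = np.array([1, 4, 7,   2, 3, 9,   0, 5, 6])  # 3 chunks, each sorted
offsets = np.array([0, 3, 6, 9])                     # per-chunk offsets
l1 = np.array([[1, 7], [2, 9], [0, 6]])              # chunk-level boundaries

def lookup(lo, hi):
    hits = []
    for c in range(len(offsets) - 1):
        if l1[c, 1] < lo or l1[c, 0] > hi:           # L1 pruning
            continue
        vals = payload[offsets[c]:offsets[c + 1]]    # selective chunk read
        i0 = np.searchsorted(vals, lo, side="left")  # L2 navigation
        i1 = np.searchsorted(vals, hi, side="right")
        hits.extend(vals[i0:i1].tolist())
    return hits

print(lookup(4, 6))   # [4, 5, 6]
```

  Because chunks are visited in order and each chunk's payload is sorted locally, the output order here stays deterministic, mirroring the scan-equivalent ordering guarantee above.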

  Move indexing-specific Cython helpers out of blosc2_ext.pyx into the new
  src/blosc2/indexing_ext.pyx module and wire indexing.py plus the CMake
  build to use the dedicated extension.

  Keep the accelerated query paths for light and medium but extend their
  typed dispatch beyond float64/int64 to cover the core numeric family:
  float32, float64, int8/16/32/64, and uint8/16/32/64.

  Retain the existing Python/NumPy fallback for unsupported dtypes.
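  The dispatch shape described above can be sketched in plain Python. This is an illustrative pattern, not the indexing_ext module: supported numeric dtypes route to a fast typed kernel, everything else takes a generic fallback.

```python
import numpy as np

FAST_DTYPES = {np.dtype(t) for t in (
    np.float32, np.float64,
    np.int8, np.int16, np.int32, np.int64,
    np.uint8, np.uint16, np.uint32, np.uint64,
)}

def count_in_range(vals, lo, hi):
    if vals.dtype in FAST_DTYPES:
        # stand-in for the accelerated typed kernel
        return int(np.count_nonzero((vals >= lo) & (vals <= hi)))
    # generic fallback path, e.g. for float16
    return sum(lo <= float(v) <= hi for v in vals)

print(count_in_range(np.arange(10, dtype=np.int32), 3, 7))      # 5, fast path
print(count_in_range(np.arange(10, dtype=np.float16), 3, 7))    # 5, fallback
```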

  Add dispatch-focused indexing tests covering the accelerated numeric dtypes
  for medium, representative light numeric paths, and an unsupported
  float16 fallback case.

  Fix unsigned light lossy quantization masks so uint* dtypes do not
  overflow during index build.
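  The unsigned-overflow pitfall is easy to reproduce (this sketch is illustrative, not the fixed code): building a "drop the low bits" mask with Python's `~` yields a negative int that cannot be represented in an unsigned dtype, while computing the complement inside the dtype with `np.invert` stays well-defined.

```python
import numpy as np

def low_bits_mask(dtype, bits):
    # Safe per-dtype complement: ~((1 << bits) - 1) as a Python int is
    # negative, which overflows on conversion to uint*; np.invert on a
    # dtype-typed scalar does not.
    return np.invert(np.asarray((1 << bits) - 1, dtype=dtype))

vals = np.array([100, 101, 255], dtype=np.uint8)
mask = low_bits_mask(np.uint8, 2)
print(mask, vals & mask)    # mask keeps the high 6 bits of each uint8
```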

  Add chunk-batch threading to the OOC query path for light and the Python
  fallback path for medium, then extend threading to the shared downstream
  execution layer used by ultralight and light.

  Keep scan-equivalent row order by processing contiguous chunk batches and
  merging batch results strictly in scheduled chunk order.
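  The ordering rule above can be sketched with the standard library. This is an illustrative pattern, not the shared execution layer: contiguous chunk batches run in parallel, but results are concatenated strictly in scheduled batch order, so the output matches a sequential scan.

```python
from concurrent.futures import ThreadPoolExecutor

def query_batch(batch):
    # stand-in for per-batch chunk work (here: keep even values)
    return [x for x in batch if x % 2 == 0]

batches = [list(range(i, i + 4)) for i in range(0, 16, 4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    # Executor.map yields results in submission order regardless of
    # which worker finishes first, preserving scan-equivalent order.
    results = list(pool.map(query_batch, batches))
merged = [row for batch in results for row in batch]
print(merged)   # [0, 2, 4, 6, 8, 10, 12, 14]
```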

  - add native intra-chunk sort and linear merge in indexing_ext
  - keep safe NumPy fallbacks for unsupported dtypes
  - simplify build path to a single intra-chunk implementation
  - use BLOSC2_INDEX_BUILD_THREADS to control build parallelism
  - document that BLOSC2_INDEX_BUILD_THREADS=1 disables parallel sorting

  - build persistent benchmark arrays chunk by chunk
  - avoid materializing the full base array in memory
  - generate permuted ids directly without temp disk scratch
  - compute query bounds analytically instead of building ordered arrays
  - stream cold benchmark rows as each index kind finishes
  - simplify cold output to a single aligned table
  - keep warm timings in the final summary table
  - rename the low-memory distribution from random to permuted

   On Windows, C-Blosc2 cannot update vlmeta while another file handle
   holds the same path open. Tests that created `arr` with a urlpath and
   then opened the same file as `reopened` triggered a RuntimeError in
   blosc2_vlmeta_update when write operations (e.g. rebuild_index) were
   attempted through the second handle.

   Add `del arr` before every `reopened = blosc2.open(path, mode="a")`
   call in tests/ndarray/test_indexing.py (9 sites), following the
   pattern already established in test_open.py, test_mmap.py and
   test_schunk.py.
