Opsi indexes#613

Open
FrancescAlted wants to merge 43 commits into main from opsi-indexes
Conversation

@FrancescAlted
Member

No description provided.

  Add a first modern indexing engine for 1-D NDArray objects and
  structured fields, inspired by OPSI but adapted to Blosc2 chunk/block
  storage.

  Introduce four index kinds:

  - ultralight: chunk zone maps
  - light: chunk + block zone maps
  - medium: block-partitioned reduced-order exact index
  - full: global sorted values + logical positions
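  The "ultralight" idea above can be sketched in a few lines of NumPy. This is an illustrative model, not the python-blosc2 implementation: `build_zone_map` and `prune_chunks` are hypothetical names, and the real index stores its summaries in persistent sidecars.

```python
import numpy as np

# Hypothetical sketch of an "ultralight"-style chunk zone map:
# keep per-chunk (min, max) summaries and prune chunks that
# cannot possibly contain hits for a range query.

def build_zone_map(data, chunk_len):
    """Per-chunk (min, max) summaries."""
    chunks = [data[i:i + chunk_len] for i in range(0, len(data), chunk_len)]
    return np.array([(c.min(), c.max()) for c in chunks])

def prune_chunks(zone_map, lo, hi):
    """Indices of chunks whose [min, max] intersects [lo, hi]."""
    return np.flatnonzero((zone_map[:, 1] >= lo) & (zone_map[:, 0] <= hi))

data = np.arange(100)               # sorted data -> zone maps prune aggressively
zmap = build_zone_map(data, 10)     # chunk i covers [10*i, 10*i + 9]
print(prune_chunks(zmap, 42, 45))   # only chunk 4 can contain values in [42, 45]
```

  The "light" kind refines this by keeping the same summaries per block as well as per chunk, trading a larger sidecar for finer pruning.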

  Improve query execution by:

  - making full retrieval chunk-aware for scattered hits
  - making medium use per-block sorted values plus compact local offsets
  - integrating index planning into LazyExpr.where(...)
  - exposing will_use_index() and explain() helpers

  Add correctness coverage for scalar, structured, persistent, mutation,
  and random-distribution cases.

  Extend the benchmark to compare index kinds across distributions,
  report cold vs warm query timings and footprint metrics, produce
  reusable on-disk outputs, and support configurable query width /
  repeat counts.

  Cache persisted index descriptors per array to avoid repeated vlmeta
  loads during indexed queries, and keep lazy-chunk span reads for the
  block-aware gather path.

  This reduces planner overhead substantially for tiny exact-hit queries:

  - _load_store() becomes effectively free after the first lookup
  - plan_query() drops from about 0.27 ms to about 0.02 ms
  - arr[cond][:] on 10M random/full point queries drops to ~0.24 ms

  Update the benchmark to measure the clearer public indexed idiom:

  - keep scan baseline with cond.where(arr).compute(_use_index=False)[:]
  - use arr[cond][:] for indexed timings

  This makes benchmark results closer to real user code and shows the
  actual public-query latency improvements more accurately.

  Replace the experimental metadata-only light path with a real
  block-local reduced/coarse index more in line with OPSI.

  The new light stores:

  - block-local sorted values
  - coarse physical bucket positions for those sorted values
  - block offsets into the flattened sidecars

  Query execution now:

  - prunes with chunk/block summaries
  - does exact searchsorted() inside each surviving block
  - builds a coarse bucket mask from matching sorted rows
  - rechecks only those physical buckets against base data
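  The steps above can be sketched for a single surviving block. This is an illustrative model of the flow, not the python-blosc2 internals: given the block's locally sorted values and, for each sorted row, the coarse bucket it came from, `searchsorted` finds exact matches and only the contributing buckets are rechecked against base data.

```python
import numpy as np

# Illustrative flow for one surviving block (hypothetical helper name):
# exact searchsorted on block-local sorted values, coarse bucket mask,
# then an exact recheck of only those physical buckets.

def query_block(base, sorted_vals, bucket_of_sorted, bucket_len, lo, hi):
    i0 = np.searchsorted(sorted_vals, lo, side="left")
    i1 = np.searchsorted(sorted_vals, hi, side="right")
    if i0 == i1:
        return np.empty(0, dtype=np.int64)          # block pruned exactly
    buckets = np.unique(bucket_of_sorted[i0:i1])    # coarse bucket mask
    hits = []
    for b in buckets:                               # recheck base rows
        s = slice(b * bucket_len, (b + 1) * bucket_len)
        rows = np.flatnonzero((base[s] >= lo) & (base[s] <= hi))
        hits.append(rows + b * bucket_len)
    return np.concatenate(hits)

base = np.array([7, 3, 9, 1, 4, 8, 2, 6])   # one block, bucket_len = 2
order = np.argsort(base, kind="stable")
sorted_vals = base[order]
bucket_of_sorted = order // 2                # coarse bucket of each sorted row
print(query_block(base, sorted_vals, bucket_of_sorted, 2, 3, 4))
```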

  Add an integer-only lossy compression knob for light.values:

  - light_value_lossy_bits = min(9 - optlevel, dtype.itemsize)
  - capped to one eighth of the integer width
  - default optlevel=5
  - exact base-row recheck preserves correctness
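  The key property behind the lossy knob can be shown with a small sketch. The quantizer here is an assumed floor-to-step model, not the exact stored encoding: what matters is that quantization is monotone and never increases a value, so index pruning can only widen candidate ranges, and the exact base-row recheck restores precise results.

```python
import numpy as np

# Sketch of the integer lossy-value idea (assumed semantics):
# drop the low `bits` bits of each stored value.

def lossy_bits(optlevel, dtype):
    # mirrors: light_value_lossy_bits = min(9 - optlevel, dtype.itemsize)
    return min(9 - optlevel, np.dtype(dtype).itemsize)

def quantize_down(vals, bits):
    step = np.int64(1) << bits
    return (vals.astype(np.int64) // step) * step   # floor -> monotone, q(x) <= x

vals = np.array([100, 101, 130, 255], dtype=np.uint8)
bits = lossy_bits(5, np.uint8)                      # optlevel=5 -> min(4, 1) = 1
q = quantize_down(vals, bits)
assert np.all(q <= vals)                            # never rounds up
assert np.all(np.diff(q[np.argsort(vals)]) >= 0)    # order preserved
print(bits, q)
```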

  Extend the benchmark with --optlevel and make index reuse optlevel-aware.

  Update tests to cover:

  - persistent light indexes
  - lossy integer light correctness

  Extend the light lossy-value experiment from integers to float32
  and float64, while keeping all other non-integer dtypes exact.

  Use monotonic downward quantization for finite float values so light
  can still widen bounds safely and preserve correctness via exact
  base-row rechecks.
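  The float case relies on the same invariant. The exact bit-level scheme is not shown in this PR description, so the sketch below uses a fixed-step floor quantizer as an assumed stand-in: any monotone, downward quantizer (q(x) <= x, and x <= y implies q(x) <= q(y)) lets the index widen range bounds safely, with the exact base-row recheck keeping results correct.

```python
import numpy as np

# Assumed illustration of monotone downward quantization for floats;
# non-finite values are kept exact, matching the "finite float values"
# wording above.

def quantize_down(x, step=0.25):
    x = np.asarray(x, dtype=np.float64)
    q = np.floor(x / step) * step
    return np.where(np.isfinite(x), q, x)   # keep non-finite values exact

x = np.array([1.3, -1.3, 2.0, np.inf])
q = quantize_down(x)
finite = np.isfinite(x)
assert np.all(q[finite] <= x[finite])            # downward: bounds only widen
assert np.all(np.diff(q[np.argsort(x)]) >= 0)    # monotone: order preserved
print(q)
```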

  Update benchmark coverage in both:

  - bench/ndarray/index_query_bench.py
  - bench/ndarray/index_query_bench_tables.py

  Add --dtype to both scripts, defaulting to float64, and make data
  generation, query construction, and persisted output reuse dtype-aware.

  This lets us benchmark indexing behavior consistently across boolean,
  integer, and floating-point columns in both python-blosc2 and
  PyTables.

  - add streaming/out-of-core builders for light, medium, and full indexes
  - keep in_mem=True as the explicit switch back to in-memory builds
  - persist and rebuild the chosen build mode in index descriptors
  - speed up the OOC full builder with chunked external merge runs
  - fix persistent index cache reuse across reopened arrays
  - add coverage for OOC persistence, rebuilds, and in-memory override
  - switch index benchmark CLI to --in-mem with OOC as the default
  - reuse full indexes for direct sort(order=...) and indices(order=...)
  - add itersorted(...) for streaming ordered traversal via full indexes
  - teach filtered ordered queries to reuse full indexes on the order key
  - intersect exact positions across multiple indexed fields for AND predicates
  - add NDArray.append(...) for 1-D arrays
  - keep light, medium, and full indexes current on append
  - preserve sorted reads and indexed filtering after append without rebuild
  - add regression coverage for ordered access, cross-field exact filters, and append maintenance
  - add examples for sorted iteration and append-aware index maintenance
  - clarify that one active index is supported per field
  - keep name as a descriptor label rather than index identity
  - add target-aware descriptor metadata for field-backed indexes
  - document ordered access semantics as ascending and stable
  - document secondary-key tie refinement after primary full-index order
  - document append-maintained vs stale-on-mutation index behavior
  - add ordered-access planner introspection to will_use_index() and explain()
  - report ordered reuse, missing full-index cases, and filter/ordering reasons
  - simplify append-maintenance example to use a single csindex
  - add intent comments to the new indexing examples
  - update the follow-up indexing plan with the current implementation state
  - add a concrete plan section for materialized expression indexes
  - add regression coverage for target metadata and ordered explain behavior
  - add create_expr_index(...) for explicit derived-value indexes
  - generalize index descriptors and sidecars to target field or expression streams
  - normalize expression targets by canonical expression keys and dependencies
  - reuse expression indexes for where(...) filtering on matching predicates
  - reuse full expression indexes for sort(order=...) and indices(order=...)
  - keep expression indexes current across append operations
  - persist and reopen expression indexes with target metadata intact
  - raise clear errors when expression ordering lacks a matching full index
  - add regression coverage for filtering, ordered reuse, persistence, and append maintenance
  - add an examples/ndarray/expression_index.py example
  - add bench/ndarray/expression_index_bench.py for expression-index timing comparisons
  - update examples to prefer the expr[:] idiom over expr.compute()[:]

  Keep append-heavy full indexes cheap by storing each appended tail as a
  sorted run instead of rewriting the compact base sidecars on every append.

  Teach full loads to merge compact base + append runs on demand, with
  cache reuse for repeated reads, and clean up run sidecars correctly on
  replace/drop.

  Extend full descriptor metadata with run tracking while keeping the
  prototype index format version unchanged.

  Add regression tests for repeated appends on field and expression full
  indexes, including persistent reopen.
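  The run-based layout amounts to a lazy k-way merge at read time. This is a minimal sketch of that idea under an assumed in-memory layout, not the on-disk sidecar format: the compact base stays sorted, each append lands as its own sorted run, and reads merge base plus runs on demand instead of rewriting the base.

```python
import heapq

# Assumed layout: one compact sorted base, one sorted run per append.
base = [1, 4, 9, 16]
runs = [[3, 20], [2, 25]]

def merged_view(base, runs):
    # heapq.merge does a lazy k-way merge over already-sorted inputs,
    # so each read costs O(total elements) without rewriting the base.
    return heapq.merge(base, *runs)

print(list(merged_view(base, runs)))  # [1, 2, 3, 4, 9, 16, 20, 25]
```

  Caching the merged view for repeated reads, as the PR does, amortizes even that cost.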

  Replace the old block-local persistent payload format for light and
  medium with a chunk-local canonical layout using fully sorted chunk
  payloads, per-chunk offsets, chunk-level L1 boundaries, and persistent
  intrachunk L2 navigation sidecars.

  Update the builders, loaders, rebuild/append paths, and descriptor
  validation so rebuilt light and medium indexes only use the new
  chunk-local-v1 format and drop reliance on the old block-flattened
  payload assumptions.

  Add new persistent exact-query paths for light and medium that use
  chunk-level pruning plus L2-guided selective reads through sidecar span
  helpers, while preserving scan-equivalent output order.

  Switch light to chunk-local bucket geometry derived from the payload
  block length, allow wider bucket dtypes, and keep medium positions
  chunk-local instead of block-local.

  Improve explain() reporting for the new OOC lookup path with
  lookup_path="chunk-nav-ooc" and navigation candidate counts.
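  The chunk-local layout can be illustrated with a two-level lookup. Everything below is an assumed in-memory model of the described format, not the real sidecars: L1 is a per-chunk [min, max] boundary table for pruning, and L2 is a searchsorted over each surviving chunk's sorted payload, read through per-chunk offsets into the flattened sidecar.

```python
import numpy as np

payload = np.array([1, 4, 7,   2, 3, 9,   0, 5, 6])  # 3 chunks, each sorted
offsets = np.array([0, 3, 6, 9])                     # per-chunk offsets
l1 = np.array([[1, 7], [2, 9], [0, 6]])              # chunk-level boundaries

def lookup(lo, hi):
    hits = []
    for c in range(len(offsets) - 1):
        if l1[c, 1] < lo or l1[c, 0] > hi:           # L1 pruning
            continue
        vals = payload[offsets[c]:offsets[c + 1]]    # selective chunk read
        i0 = np.searchsorted(vals, lo, side="left")  # L2 navigation
        i1 = np.searchsorted(vals, hi, side="right")
        hits.extend(vals[i0:i1].tolist())
    return hits

print(lookup(4, 6))   # [4, 5, 6]
```

  Because chunks are visited in order and each chunk's payload is sorted locally, the output order here stays deterministic, mirroring the scan-equivalent ordering guarantee above.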

  Move indexing-specific Cython helpers out of blosc2_ext.pyx into the new
  src/blosc2/indexing_ext.pyx module and wire indexing.py plus the CMake
  build to use the dedicated extension.

  Keep the accelerated query paths for light and medium but extend their
  typed dispatch beyond float64/int64 to cover the core numeric family:
  float32, float64, int8/16/32/64, and uint8/16/32/64.

  Retain the existing Python/NumPy fallback for unsupported dtypes.
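  The dispatch shape described above can be sketched in plain Python. This is an illustrative pattern, not the indexing_ext module: supported numeric dtypes route to a fast typed kernel, everything else takes a generic fallback.

```python
import numpy as np

FAST_DTYPES = {np.dtype(t) for t in (
    np.float32, np.float64,
    np.int8, np.int16, np.int32, np.int64,
    np.uint8, np.uint16, np.uint32, np.uint64,
)}

def count_in_range(vals, lo, hi):
    if vals.dtype in FAST_DTYPES:
        # stand-in for the accelerated typed kernel
        return int(np.count_nonzero((vals >= lo) & (vals <= hi)))
    # generic fallback path, e.g. for float16
    return sum(lo <= float(v) <= hi for v in vals)

print(count_in_range(np.arange(10, dtype=np.int32), 3, 7))      # 5, fast path
print(count_in_range(np.arange(10, dtype=np.float16), 3, 7))    # 5, fallback
```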

  Add dispatch-focused indexing tests covering the accelerated numeric dtypes
  for medium, representative light numeric paths, and an unsupported
  float16 fallback case.

  Fix unsigned light lossy quantization masks so uint* dtypes do not
  overflow during index build.
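  The unsigned-overflow pitfall is easy to reproduce (this sketch is illustrative, not the fixed code): building a "drop the low bits" mask with Python's `~` yields a negative int that cannot be represented in an unsigned dtype, while computing the complement inside the dtype with `np.invert` stays well-defined.

```python
import numpy as np

def low_bits_mask(dtype, bits):
    # Safe per-dtype complement: ~((1 << bits) - 1) as a Python int is
    # negative, which overflows on conversion to uint*; np.invert on a
    # dtype-typed scalar does not.
    return np.invert(np.asarray((1 << bits) - 1, dtype=dtype))

vals = np.array([100, 101, 255], dtype=np.uint8)
mask = low_bits_mask(np.uint8, 2)
print(mask, vals & mask)    # mask keeps the high 6 bits of each uint8
```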

  Add chunk-batch threading to the OOC query path for light and the Python
  fallback path for medium, then extend threading to the shared downstream
  execution layer used by ultralight and light.

  Keep scan-equivalent row order by processing contiguous chunk batches and
  merging batch results strictly in scheduled chunk order.
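  The ordering rule above can be sketched with the standard library. This is an illustrative pattern, not the shared execution layer: contiguous chunk batches run in parallel, but results are concatenated strictly in scheduled batch order, so the output matches a sequential scan.

```python
from concurrent.futures import ThreadPoolExecutor

def query_batch(batch):
    # stand-in for per-batch chunk work (here: keep even values)
    return [x for x in batch if x % 2 == 0]

batches = [list(range(i, i + 4)) for i in range(0, 16, 4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    # Executor.map yields results in submission order regardless of
    # which worker finishes first, preserving scan-equivalent order.
    results = list(pool.map(query_batch, batches))
merged = [row for batch in results for row in batch]
print(merged)   # [0, 2, 4, 6, 8, 10, 12, 14]
```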

  - add native intra-chunk sort and linear merge in indexing_ext
  - keep safe NumPy fallbacks for unsupported dtypes
  - simplify build path to a single intra-chunk implementation
  - use BLOSC2_INDEX_BUILD_THREADS to control build parallelism
  - document that BLOSC2_INDEX_BUILD_THREADS=1 disables parallel sorting

  - build persistent benchmark arrays chunk by chunk
  - avoid materializing the full base array in memory
  - generate permuted ids directly without temp disk scratch
  - compute query bounds analytically instead of building ordered arrays
  - stream cold benchmark rows as each index kind finishes
  - simplify cold output to a single aligned table
  - keep warm timings in the final summary table
  - rename the low-memory distribution from random to permuted

   On Windows, C-Blosc2 cannot update vlmeta while another file handle
   holds the same path open. Tests that created `arr` with a urlpath and
   then opened the same file as `reopened` triggered a RuntimeError in
   blosc2_vlmeta_update when write operations (e.g. rebuild_index) were
   attempted through the second handle.

   Add `del arr` before every `reopened = blosc2.open(path, mode="a")`
   call in tests/ndarray/test_indexing.py (9 sites), following the
   pattern already established in test_open.py, test_mmap.py and
   test_schunk.py.
