Stop analysis tools from reading binary blobs #49

Merged
derekwisong merged 1 commit into main from fix-analysis-binary-blob-read
Jun 6, 2026

@derekwisong

Owner

Stacked on #48.

Problem

Opening a large partitioned dataset and running an analysis tool (Describe, Distribution, or Correlation) made the program uninterruptible: `q`, Ctrl-C, and Ctrl-Z all did nothing, and the process had to be killed by PID. On a cloud-backed dataset this is also expensive, because it egresses every blob.

The freeze is not an event-loop bug — the loop stays responsive and allows quit while busy. The cause is that all three analysis tools collected the full `LazyFrame` including `Binary` columns:

  • Describe — `build_describe_aggregation_exprs` aggregates every column
  • Distribution — `compute_statistics_with_options` collects/samples the whole frame
  • Correlation — `collect_lazy(lf)` over all columns

Those `Binary` columns hold multi-GB blobs, so the collect read every blob across every partition, exhausting memory until the process thrashed and could no longer service input. (Ctrl-C/Ctrl-Z are no-ops in raw mode regardless.)

#48 stubbed binary columns in the display buffer but not in the statistics path.

Fix

Route the analysis `LazyFrame` through the same binary-stub expressions the display buffer already uses (renamed `display_column_exprs` → `binary_stub_exprs`, now `pub(crate)`). Binary columns become a constant `‹binary›` string that is never read from disk:

  • No blob reads → no OOM → no freeze, across all three analysis tools
  • No cloud egress of blob bytes
  • Binary columns still appear in describe, as a stub (real count, min/max = `‹binary›`) — no changes needed in `statistics.rs` or rendering
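
To illustrate the idea of the stub projection, here is a minimal std-only sketch. It does not use Polars and is not the code in this PR: `DType`, `Expr`, `sketch_binary_stub_exprs`, and `columns_read` are hypothetical stand-ins invented for this example. The point it shows is that a binary column replaced by a constant never appears among the columns the projection has to read. It does not model how the real literal is broadcast to the frame's row count.

```rust
/// Hypothetical stand-in for a column data type (not Polars' `DataType`).
#[derive(Debug, Clone, PartialEq)]
enum DType {
    Int64,
    Utf8,
    Binary,
}

/// Hypothetical stand-in for a lazy projection expression.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    /// Pass the named column through unchanged.
    Col(String),
    /// Replace the column with a constant string, keeping its name.
    LitAs { value: String, alias: String },
}

/// The placeholder shown in place of binary data.
const BINARY_STUB: &str = "‹binary›";

/// Build one expression per column: binary columns become a constant
/// stub under the same name, every other column passes through.
fn sketch_binary_stub_exprs(schema: &[(String, DType)]) -> Vec<Expr> {
    schema
        .iter()
        .map(|(name, dtype)| match dtype {
            DType::Binary => Expr::LitAs {
                value: BINARY_STUB.to_string(),
                alias: name.clone(),
            },
            _ => Expr::Col(name.clone()),
        })
        .collect()
}

/// Names of the source columns a projection actually needs to read.
/// Constant expressions need no source column at all.
fn columns_read(exprs: &[Expr]) -> Vec<&str> {
    exprs
        .iter()
        .filter_map(|e| match e {
            Expr::Col(name) => Some(name.as_str()),
            Expr::LitAs { .. } => None,
        })
        .collect()
}
```

With a schema of `id: Int64`, `name: Utf8`, `payload: Binary`, `columns_read` returns only `id` and `name`, which is why an engine with projection pushdown would have no reason to fetch the blob column.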

Scope note

This does not make a legitimately expensive analysis (e.g. describe over a huge all-numeric dataset) cancellable: Polars `collect()` can't be aborted mid-flight today. That is deferred to a follow-up.
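
The distinction between "the loop stays responsive" and "the work is cancellable" can be sketched with std only. This is an assumption-laden illustration, not how datui runs its analyses: `spawn_blocking_job` and `poll_until` are hypothetical helpers, and the closure stands in for a blocking call such as a collect. The caller can stop waiting, but the worker still runs to completion.

```rust
use std::sync::mpsc::{self, Receiver, TryRecvError};
use std::thread;
use std::time::Duration;

/// Run a blocking job on a worker thread and return a receiver for its result.
fn spawn_blocking_job<T, F>(job: F) -> Receiver<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // If the receiver was dropped, the send fails and the result is
        // discarded. The job itself has still run to completion by then.
        let _ = tx.send(job());
    });
    rx
}

/// Poll for the result up to `max_ticks` times, sleeping `tick` between polls.
/// Returns `None` if the caller gives up or the worker died without sending.
/// Giving up does not stop the worker.
fn poll_until<T>(rx: &Receiver<T>, max_ticks: usize, tick: Duration) -> Option<T> {
    for _ in 0..max_ticks {
        match rx.try_recv() {
            Ok(value) => return Some(value),
            // A real UI loop would handle input and redraw here instead of sleeping.
            Err(TryRecvError::Empty) => thread::sleep(tick),
            Err(TryRecvError::Disconnected) => return None,
        }
    }
    None
}
```

Returning `None` from `poll_until` only abandons the result; the memory and I/O cost of the job is still paid, which is the gap the scope note leaves open.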

Test plan

  • Added `analysis_describe_stubs_binary_columns_without_reading_blobs` — asserts describe yields the stub instead of reading the bytes
  • `cargo test -p datui-lib` — 187 passed
  • Manual: open the partitioned dataset with blob columns, run Describe — completes promptly, binary column shows `‹binary›`
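
The shape of such a test can be sketched without Polars by counting reads. Everything here is hypothetical and for illustration only: `BlobColumn`, `Describe`, `describe_stubbed`, and `describe_naive` are not types or functions from datui, and the real test asserts against the actual describe output.

```rust
use std::cell::Cell;

/// The placeholder shown in place of binary data.
const BINARY_STUB: &str = "‹binary›";

/// Hypothetical blob-backed column that counts how often its bytes are read.
struct BlobColumn {
    blobs: Vec<Vec<u8>>,
    reads: Cell<usize>,
}

impl BlobColumn {
    fn new(blobs: Vec<Vec<u8>>) -> Self {
        BlobColumn { blobs, reads: Cell::new(0) }
    }

    /// Row count, available without touching any blob bytes.
    fn len(&self) -> usize {
        self.blobs.len()
    }

    /// Read one blob and record that a read happened.
    fn read(&self, row: usize) -> &[u8] {
        self.reads.set(self.reads.get() + 1);
        &self.blobs[row]
    }
}

/// A tiny subset of describe output.
struct Describe {
    count: usize,
    min: String,
    max: String,
}

/// Stubbed path: real count, constant min/max, no blob reads.
fn describe_stubbed(col: &BlobColumn) -> Describe {
    Describe {
        count: col.len(),
        min: BINARY_STUB.to_string(),
        max: BINARY_STUB.to_string(),
    }
}

/// Naive path: reads every blob to compute a byte-wise min and max.
fn describe_naive(col: &BlobColumn) -> Describe {
    let rows: Vec<&[u8]> = (0..col.len()).map(|i| col.read(i)).collect();
    let min = rows.iter().min().map(|b| format!("{:?}", b)).unwrap_or_default();
    let max = rows.iter().max().map(|b| format!("{:?}", b)).unwrap_or_default();
    Describe { count: rows.len(), min, max }
}
```

A test in this style asserts two things about the stubbed path: the output contains the stub with the real row count, and the read counter is still zero afterwards.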

Base automatically changed from binary-column-stub to main June 5, 2026 21:22
Describe, distribution and correlation all collected the full LazyFrame,
including Binary columns. On a large partitioned dataset those columns hold
multi-GB blobs, so the collect read every blob across every partition,
exhausting memory until the process thrashed and could no longer service
input — the UI appeared frozen and had to be killed by PID.

Route the analysis LazyFrame through the same binary-stub expressions the
display buffer already uses (renamed display_column_exprs -> binary_stub_exprs),
so binary columns are replaced by a literal stub and their blobs are never
materialized. Binary columns still appear in describe, as a stub.
@derekwisong derekwisong force-pushed the fix-analysis-binary-blob-read branch from 643d787 to 7c0ea36 Compare June 5, 2026 21:24
@derekwisong derekwisong merged commit adaf187 into main Jun 6, 2026
4 checks passed
@derekwisong derekwisong deleted the fix-analysis-binary-blob-read branch June 6, 2026 10:31