Stop analysis tools from reading binary blobs #49
Merged
Describe, distribution and correlation all collected the full LazyFrame, including Binary columns. On a large partitioned dataset those columns hold multi-GB blobs, so the collect read every blob across every partition, exhausting memory until the process thrashed and could no longer service input — the UI appeared frozen and had to be killed by PID. Route the analysis LazyFrame through the same binary-stub expressions the display buffer already uses (renamed display_column_exprs -> binary_stub_exprs), so binary columns are replaced by a literal stub and their blobs are never materialized. Binary columns still appear in describe, as a stub.
643d787 to 7c0ea36
Stacked on #48.
Problem
Opening a large partitioned dataset and running an analysis tool (Describe, Distribution, or Correlation) made the program uninterruptible: q, Ctrl-C, and Ctrl-Z all did nothing, and the process had to be killed by PID. On a cloud-backed dataset this is also expensive, because it egresses every blob.
The freeze is not an event-loop bug: the loop stays responsive and allows quit while busy. The cause is that all three analysis tools collected the full `LazyFrame`, including its `Binary` columns.
Those `Binary` columns hold multi-GB blobs, so the collect read every blob across every partition, exhausting memory until the process thrashed and could no longer service input. (Ctrl-C/Ctrl-Z are no-ops in raw mode regardless.)
`#48` stubbed binary columns in the display buffer but never the statistics path.
Fix
Route the analysis `LazyFrame` through the same binary-stub expressions the display buffer already uses (renamed `display_column_exprs` → `binary_stub_exprs`, now `pub(crate)`). Binary columns become a constant `‹binary›` string, so their contents are never read from disk.
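For readers without the diff open, here is a minimal, std-only sketch of the selection logic described above. It is not the code from this PR and does not use Polars: `DType`, `StubExpr`, and `columns_read` are hypothetical stand-ins, and the signature of `binary_stub_exprs` is an assumption. In the real code the two expression variants would presumably correspond to Polars expressions along the lines of `col(name)` and `lit("‹binary›").alias(name)`.

```rust
// Simplified stand-in for a column dtype (the real code would use Polars' own type).
#[derive(Debug, Clone, PartialEq)]
enum DType {
    Int64,
    Utf8,
    Binary,
}

// Simplified stand-in for a projection expression.
#[derive(Debug, Clone, PartialEq)]
enum StubExpr {
    // Read the named column from the source unchanged.
    Column(String),
    // Emit a constant string under the column's name; the source column is not read.
    Literal { name: String, value: &'static str },
}

// Placeholder text shown in place of binary contents.
const BINARY_STUB: &str = "‹binary›";

// Assumed shape of `binary_stub_exprs`: one expression per schema column,
// with every Binary column swapped for a literal stub of the same name.
fn binary_stub_exprs(schema: &[(String, DType)]) -> Vec<StubExpr> {
    schema
        .iter()
        .map(|(name, dtype)| match dtype {
            DType::Binary => StubExpr::Literal {
                name: name.clone(),
                value: BINARY_STUB,
            },
            _ => StubExpr::Column(name.clone()),
        })
        .collect()
}

// Names of the columns a projection built from `exprs` still has to read.
fn columns_read(exprs: &[StubExpr]) -> Vec<&str> {
    exprs
        .iter()
        .filter_map(|expr| match expr {
            StubExpr::Column(name) => Some(name.as_str()),
            StubExpr::Literal { .. } => None,
        })
        .collect()
}
```

Applied to a schema with columns `id` (Int64), `label` (Utf8), and `payload` (Binary), this yields three expressions, so `payload` still shows up in describe, while `columns_read` reports only `id` and `label`. The point of building the stub as a projection on the lazy plan, rather than dropping blobs after the collect, is that the binary column is never requested from the source in the first place.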
Scope note
This does not make a legitimately expensive analysis (e.g. describe over a huge all-numeric dataset) cancellable: Polars `collect()` can't be aborted mid-flight today. That is deferred to a follow-up.
Test plan