Skip to content

test(sqllogictest): stabilize parquet output_rows_skew with WITH ORDER#21898

Open
RatulDawar wants to merge 1 commit intoapache:mainfrom
RatulDawar:fix/explain-analyze-parquet-skew-with-order
Open

test(sqllogictest): stabilize parquet output_rows_skew with WITH ORDER#21898
RatulDawar wants to merge 1 commit intoapache:mainfrom
RatulDawar:fix/explain-analyze-parquet-skew-with-order

Conversation

@RatulDawar
Copy link
Copy Markdown
Contributor

Summary

Adds WITH ORDER (x) to CREATE EXTERNAL TABLE skew_parquet / skew_parquet_single in explain_analyze.slt so FileScanConfig preserves scan ordering (preserve_order), keeping per-partition output_rows stable under dynamic file scheduling (PR #21351).

Related

Testing

  • cargo test -p datafusion-sqllogictest --test sqllogictests -- explain_analyze (recommended before merge)

Sqllogictest-only change.

Made with Cursor

CREATE EXTERNAL TABLE skew_parquet now uses WITH ORDER (x) so FileScanConfig
preserves scan ordering and per-partition row counts stay deterministic with
dynamic file scheduling.

Refs discussion after PR apache#21351 / apache#21866.

Made-with: Cursor
@github-actions github-actions Bot added the sqllogictest SQL Logic Tests (.slt) label Apr 28, 2026
@2010YOUY01
Copy link
Copy Markdown
Contributor

Those tests are already stable, and I think the with order statement has no effect.

We have to fix the skipped one. The comments explains the root cause, though I'm not sure about the solution yet.

@RatulDawar
Copy link
Copy Markdown
Contributor Author

Yes, I think I worded myself incorrectly. I wanted to improved the coverage as mentioned in the comment by @alamb.
For I checked the code and saw that dynamic file stream task scheduling worked only when preserve_order is false.
So I was thinking that to make the test deterministic we can change tests to trigger preserve order.

Copy link
Copy Markdown
Contributor

@2010YOUY01 2010YOUY01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see, this makes sense. Thank you.

It's not obvious why this fix works, let's add more comments.

# Test parquet-only output_rows_skew metric (WITH ORDER(x) → preserve_order scan)
# --------------------------------------------

statement ok
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
statement ok
# Parquet supports dynamic work scheduling by default; per-partition data
# sources might steal work from siblings, making row counts per partition
# non-deterministic. Forcing `WITH ORDER` disables this behavior.
statement ok

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants