docs(optimizer): add generated optimizer rules reference #21824

Open
kumarUjjawal wants to merge 2 commits into apache:main from kumarUjjawal:feat/optimizer_rule_docs

Conversation

@kumarUjjawal
Contributor

Which issue does this PR close?

Rationale for this change

There is no reference page that lists the built-in analyzer, logical optimizer, and physical optimizer rules. This PR adds that missing reference.

What changes are included in this PR?

  • Add a generated optimizer rules reference page.
  • Add a renderer and generator binary for the optimizer rules docs.
  • Keep the rule documentation metadata private in core so this does not add new public optimizer APIs.
  • Link the new page from the query optimizer guide and the docs index.
  • Add tests that check the documented rule order matches the default analyzer, logical optimizer, and physical optimizer pipelines.
  • Fix the eliminate_join description so it matches the actual rule behavior.
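
A minimal sketch of the kind of order check described in the test bullet above. The `Rule` trait, `Named`, `DOCUMENTED_LOGICAL_RULES`, and `default_pipeline` are hypothetical stand-ins rather than the PR's code or DataFusion's API, and the rule order shown is illustrative only.

```rust
/// Stand-in for an optimizer rule; only the name matters for this check.
trait Rule {
    fn name(&self) -> &str;
}

/// Hypothetical rule type that simply carries its name.
struct Named(&'static str);

impl Rule for Named {
    fn name(&self) -> &str {
        self.0
    }
}

/// Hypothetical documented order, as a docs metadata table might list it.
const DOCUMENTED_LOGICAL_RULES: &[&str] = &["eliminate_join", "push_down_filter"];

/// Hypothetical default pipeline; a real test would build the actual optimizer.
fn default_pipeline() -> Vec<Box<dyn Rule>> {
    vec![
        Box::new(Named("eliminate_join")),
        Box::new(Named("push_down_filter")),
    ]
}

/// Returns the first position where the documented and actual orders differ,
/// or `None` when both lists have the same names in the same order.
fn first_mismatch(documented: &[&str], pipeline: &[Box<dyn Rule>]) -> Option<usize> {
    let actual: Vec<&str> = pipeline.iter().map(|rule| rule.name()).collect();
    let shared = documented.len().min(actual.len());
    (0..shared)
        .find(|&i| documented[i] != actual[i])
        // Same prefix but different lengths: the mismatch starts where one list ends.
        .or_else(|| (documented.len() != actual.len()).then_some(shared))
}
```

A test built on this would assert that `first_mismatch` returns `None` for each of the analyzer, logical, and physical pipelines, so adding or reordering a rule without updating the docs fails CI.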

Are these changes tested?

Yes

Are there any user-facing changes?

No Public API Change

@github-actions bot added the documentation (Improvements or additions to documentation), development-process (Related to development process of DataFusion), and core (Core DataFusion crate) labels on Apr 24, 2026
@kumarUjjawal
Contributor Author

cc @comphead

@Adez017
Contributor

Adez017 commented Apr 24, 2026

Hi @kumarUjjawal, I don't know if I am right, but as per @comphead's comment:

> The gap: There is no reference doc that lists and describes what each built-in rule actually does. The codebase has 27 logical optimizer rules (e.g., push_down_filter, eliminate_cross_join, common_subexpr_eliminate) and 21 physical optimizer rules (e.g., join_selection, enforce_sorting, topk_aggregation), but the only way to learn what each one does is to read the source or use EXPLAIN VERBOSE.

I think what we really need is an end-to-end doc that describes all of the optimizer rules, both logical and physical. Could you look into this?

@kumarUjjawal
Contributor Author

Thanks for the feedback, @Adez017. The missing doc here is a reference page for the built-in rules, not another end-to-end optimizer guide. We already have docs for how the optimizer works and how to extend it.

This PR adds the missing reference: which built-in rules exist and what each one does, and links it from the existing optimizer guide.

@Adez017
Contributor

Adez017 commented Apr 24, 2026

> Thanks for the feedback, @Adez017. The missing doc here is a reference page for the built-in rules, not another end-to-end optimizer guide. We already have docs for how the optimizer works and how to extend it.
>
> This PR adds the missing reference: which built-in rules exist and what each one does, and links it from the existing optimizer guide.

Thanks for the clarification, @kumarUjjawal.


The plan contains limits that can be proven redundant or can collapse to simpler forms.

(logical-rule-14)=
Contributor

What are those logical-rule entries?

Contributor Author

Those are just MyST/Sphinx labels for the intra-page links. I had raw HTML anchors there first, but they caused docs warnings and the docs build failed in CI, so I switched to this syntax, which other docs already use.
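
As a rough illustration of how a renderer could emit such a label, here is a minimal sketch. `render_rule_entry` is a hypothetical helper, not the PR's renderer; the `(logical-rule-N)=` shape follows the snippet under review, while the heading level and layout are assumptions.

```rust
/// Renders one rule entry: a MyST target label, then a heading and a summary.
/// Hypothetical helper; the heading level and blank-line layout are assumptions.
fn render_rule_entry(stage: &str, index: usize, name: &str, summary: &str) -> String {
    // `(label)=` on the line before a heading is MyST's explicit target syntax,
    // so no raw HTML anchor is needed for intra-page links.
    format!("({stage}-rule-{index})=\n\n### `{name}`\n\n{summary}\n")
}
```

Calling it with `"logical"` and `14` produces an entry whose first line is `(logical-rule-14)=`.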

Contributor

@comphead left a comment

Thanks @kumarUjjawal, great PR

Tbh, I'm not sure we need a separate doc generator for optimization rules. Unlike built-in functions, we don't have that many rules, and new rules are not introduced very frequently.

Frankly speaking, I was thinking we could just feed the optimizer crates to an LLM, generate the docs, then review and improve them. WDYT?

@kumarUjjawal
Contributor Author

Thanks @comphead, I agree. A fully hand-written or, as you said, LLM-generated page could also work, but in that case I would still want some kind of sync check to stop it from drifting. Given that, a small generator felt like the safer option.

The reason I leaned toward generation here is that this page is meant to be reference docs, so the rule names, order, stages, and repeated physical passes should stay aligned with the actual default pipelines and with what users see in EXPLAIN VERBOSE.

@kumarUjjawal
Contributor Author

I can rework this if a doc generator is too much for this PR.

@comphead
Contributor

comphead commented Apr 24, 2026

> I can rework this if a doc generator is too much for this PR.

Keeping the documentation in sync with the optimizer rules makes a lot of sense; there is no guard right now.

Regarding the generator: we currently have automated, proc-macro-based generation for built-in functions, and having another generator would be confusing IMO. However, you triggered an awesome idea: a unified documentation framework that works for the entire core. With that, we would cover all aspects of core documentation plus automatic md generation.

@kumarUjjawal
Contributor Author

> currently we have automated generation based on proc macros for built-in functions, and having another generator would be confusing IMO

I see, we do not want to end up with a bunch of one-off doc generators.

> having a unified documentation framework that works for the entire core. With that, we would cover all aspects of core documentation + automatic md generation

I think we should work toward this rather than having a generator for each piece, like this PR does.

I will remove the optimizer-specific generator. Maybe you can start the discussion around the shared docs generation framework.

@comphead
Contributor

> I think we should work toward this rather than having a generator for each piece, like this PR does.

We are probably overcomplicating things: generators are a last resort. For built-in functions one was built because a standard tool like rustdoc was not enough. I feel we need to stick to rustdoc as much as possible, and this particular case should fit IMO.

For example, see how datafusion/core/src/lib.rs documents Streaming Execution; here is the rendered version: https://docs.rs/datafusion/latest/datafusion/#streaming-execution

@kumarUjjawal
Contributor Author

> For example refer to datafusion/core/src/lib.rs how it documents the Streaming Execution and here is the

Fair point, rustdoc with some sync guard would be a better option. Thanks!
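
One way such a guard might look, as a sketch only: keep the rule list as rustdoc-style Markdown and have a test parse the rule names back out of it. `OPTIMIZER_RULES_DOC`, its bullet format, and `documented_rule_names` are all hypothetical, and the summaries are placeholders. In a real crate the text could live in a Markdown file attached with `#[doc = include_str!(...)]`, which is also an assumption about layout.

```rust
/// Hypothetical rustdoc-style section, kept as a string so a test can read it.
const OPTIMIZER_RULES_DOC: &str = "\
# Logical optimizer rules

- `eliminate_join`: illustrative summary.
- `push_down_filter`: illustrative summary.
";

/// Extracts rule names from bullet lines of the assumed form "- `name`: summary".
/// Lines that do not match that shape (headings, blank lines) are skipped.
fn documented_rule_names(doc: &str) -> Vec<&str> {
    doc.lines()
        .filter_map(|line| line.trim_start().strip_prefix("- `"))
        .filter_map(|rest| rest.split_once('`').map(|(name, _)| name))
        .collect()
}
```

A guard test would then compare the returned names, in order, against the names reported by the default analyzer, logical, and physical pipelines.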

Labels

core (Core DataFusion crate), development-process (Related to development process of DataFusion), documentation (Improvements or additions to documentation)

Development

Successfully merging this pull request may close these issues.

Document DataFusion optimizer rules

3 participants