chore: Add existence (semi / anti) benchmarks for hashjoinexec #21821
coderfender wants to merge 4 commits into apache:main
@Dandandan, @2010YOUY01, please take a look at these benchmarks, which I plan to use as a reference for bitmap-based optimizations: #21817. This essentially has a cargo bench (for faster / simpler bench tests through
Thank you for working on this! I have some suggestions for you to consider.

High-level issue

I think the main issue is using

A good benchmark should reflect realistic workloads. To achieve that, we should define a set of core axes and vary them systematically. I think for equi-joins, it could be:

In contrast, I believe we'd better remove

For this PR

For this PR, I suggest keeping the end-to-end

For the Criterion micro-benchmarks, it would be better to first focus on a few representative workloads (e.g., join size, type), and then optionally add a small number of targeted cases for specific fast paths, such as right semi/anti joins with

In short, fewer end-to-end queries should be sufficient for this PR. We could add criterion micro-benches later based on the above design.
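To make the "define core axes and vary them systematically" suggestion concrete, here is a minimal sketch of enumerating a workload matrix. It is only an illustration under stated assumptions: the axis names (`density_pct`, `hit_rate_pct`) are borrowed from the benchmark comments in this PR and may not match the reviewer's intended list, and `JoinKind`, `Workload`, `workload_matrix`, and `label` are hypothetical names, not DataFusion or Criterion APIs.

```rust
// Hypothetical types for illustration only; not part of DataFusion.
#[derive(Debug, Clone, Copy, PartialEq)]
enum JoinKind {
    LeftSemi,
    LeftAnti,
    RightSemi,
    RightAnti,
}

// One benchmark case: a join type plus the two assumed data axes.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Workload {
    kind: JoinKind,
    density_pct: u32,  // assumed axis: key density, in percent
    hit_rate_pct: u32, // assumed axis: share of probe keys that match
}

// Build the full cartesian product of the given axis values.
fn workload_matrix(kinds: &[JoinKind], densities: &[u32], hit_rates: &[u32]) -> Vec<Workload> {
    let mut out = Vec::with_capacity(kinds.len() * densities.len() * hit_rates.len());
    for &kind in kinds {
        for &density_pct in densities {
            for &hit_rate_pct in hit_rates {
                out.push(Workload { kind, density_pct, hit_rate_pct });
            }
        }
    }
    out
}

// A stable, readable identifier that could serve as a benchmark name.
fn label(w: &Workload) -> String {
    format!("{:?}/density={}%/hit={}%", w.kind, w.density_pct, w.hit_rate_pct)
}
```

Generating cases from a few axis values keeps the set small and makes it obvious which dimension a regression belongs to, as opposed to hand-writing one numbered query per combination.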
// RightSemi Join benchmarks with Int32 keys
// Q16: RightSemi, 100% Density, 100% Hit rate
HashJoinQuery {
    sql: r###"SELECT l.k
It might be clearer to express these directly using RIGHT SEMI JOIN, for example:
DataFusion CLI v53.1.0
> select count(*)
from generate_series(100) as t1(v1)
right semi join generate_series(100000) as t2(v1)
on t1.v1 > t2.v1;
+----------+
| count(*) |
+----------+
| 100 |
+----------+
1 row(s) fetched.
Elapsed 0.077 seconds.
> select count(*)
from generate_series(100) as t1(v1)
right anti join generate_series(100000) as t2(v1)
on t1.v1 > t2.v1;
+----------+
| count(*) |
+----------+
| 99901 |
+----------+
1 row(s) fetched.
Elapsed 0.007 seconds.
Though I'm not sure if it's standard SQL 🤔, DataFusion has them and they're easier to read.
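For readers less familiar with existence joins, the semantics of the two join types can be sketched in a few lines of Rust. This is a simplified model, not DataFusion's implementation: it assumes a single non-null `i32` key compared by equality (the session above uses an inequality predicate instead), and `right_semi` and `right_anti` are hypothetical helper names.

```rust
use std::collections::HashSet;

// Right semi join on an equality key: keep each right row whose key
// appears at least once on the left. A right row is emitted at most once,
// no matter how many left rows match it.
fn right_semi(left: &[i32], right: &[i32]) -> Vec<i32> {
    // Build side: the distinct left keys.
    let build: HashSet<i32> = left.iter().copied().collect();
    // Probe side: filter right rows by membership.
    right.iter().copied().filter(|k| build.contains(k)).collect()
}

// Right anti join on an equality key: keep each right row whose key
// does not appear on the left at all.
fn right_anti(left: &[i32], right: &[i32]) -> Vec<i32> {
    let build: HashSet<i32> = left.iter().copied().collect();
    right.iter().copied().filter(|k| !build.contains(k)).collect()
}
```

With `left = [1, 2, 2, 3]` and `right = [2, 3, 4, 4]`, `right_semi` keeps `[2, 3]` and `right_anti` keeps `[4, 4]`: every right row lands in exactly one of the two results, which is why the two counts in the session above add up to the size of `t2`.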
Which issue does this PR close?
Add existence benchmarks
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?