feat: Support PiecewiseMergeJoin to speed up single range predicate joins #16660
Conversation
cc @alamb @ozankabak Seems like you guys were part of the discussion for range joins; this is a nice start to it. @Dandandan @comphead @my-vegetable-has-exploded might be interested.
I will try and review this PR later this week.
Thanks @jonathanc-n, let me first get familiar with this kind of join.
Co-authored-by: Oleks V <[email protected]>
let left_indices = (0..left_size as u64).collect::<UInt64Array>();
let right_indices = (0..left_size)
    .map(|idx| left_bit_map.get_bit(idx).then_some(0))
    .collect::<UInt32Array>();
return (left_indices, right_indices);
}
let left_indices = if join_type == JoinType::LeftSemi {
let left_indices = if join_type == JoinType::LeftSemi
    || (join_type == JoinType::RightSemi && piecewise)
so piecewise works only together with RightSemi?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it falls through with left semi as well. In left and right semi/anti/mark joins we use the bitmap to mark all matched rows on the buffered side (this is done in process_unmatched_buffered_batch); we use the flag to only allow right semi/anti/mark to follow through when called from the piecewise join. Usually the bitmap is only used to mark the unmatched rows on the left side, which is why it originally only supports left semi/anti/mark. I'll add a comment at the beginning of process_unmatched_buffered_batch to explain this.
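A minimal sketch of the bitmap usage described above, with illustrative names (not the PR's actual helpers): one bit per buffered row, set when that row matched; semi joins keep the set bits, anti joins keep the unset ones.

```rust
use arrow::array::BooleanBufferBuilder;

// One bit per buffered row; a set bit means the row found at least one match.
// Semi joins emit rows whose bit is set, anti joins emit rows whose bit is unset.
fn buffered_indices_for_existence(bitmap: &BooleanBufferBuilder, keep_matched: bool) -> Vec<u64> {
    (0..bitmap.len())
        .filter(|&i| bitmap.get_bit(i) == keep_matched)
        .map(|i| i as u64)
        .collect()
}
```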
Thanks @jonathanc-n -- unfortunately I am not likely to have time to review this, as my focus hasn't been on the join implementation. @comphead do you know anyone who is more focused on joins these days who might be able to help review this?
Hey, sorry I missed this. This join is quite an interesting concept; I'm planning to finish reviewing #16996 this week, and switch to this PR next.
Thanks for this great work, as always! I have some high-level questions. Is it possible that IE Join could be faster for the same workload where PMJ performs well? Should we implement IE Join directly? (I'm willing to help review or implement it at this point.) There was an old attempt (#12754), but unfortunately it didn't get reviewed at the time.
Yes, PMJ should perform better than IE join; they are used to tackle different things regardless. IE joins are used for multiple range predicates while PMJ is for a single range. You can see that DuckDB does the same thing here: https://github.com/duckdb/duckdb/tree/main/src/execution/operator/join
Thanks for the great doc explaining the high-level ideas! I took a read and left some comments on it; looking forward to your feedback.
I'll continue reviewing the implementation soon.
/// predicate.
///
/// # Execution Plan Inputs
/// For `PiecewiseMergeJoin` we label all left inputs as the 'streamed' side and the right inputs as the
Now in DF all join operators seem to use the left side as the buffer side 🤔 we should follow this convention.
Ah yes, this should be done; I had mixed them up in my head. Is it fine to do this in a follow-up pull request? It shouldn't be too big of an issue for review as the inputs are labelled as streamed and buffered throughout the code.
/// For `PiecewiseMergeJoin` we label all left inputs as the 'streamed' side and the right inputs as the
/// 'buffered' side.
///
/// `PiecewiseMergeJoin` takes a sorted input for the side to be buffered and is able to sort streamed record
Just a clarification: this means for the buffer side, its input is a SortExec, right?
Yes
/// For Left Semi, Anti, and Mark joins we swap the inputs so that the marked side is on the buffered side.
///
/// Here is an example:
/// We perform a `JoinType::Left` with these two batches and the operator being `Operator::Lt` (<). Because
This line's join type seems to be a typo? It should be one of the existence joins.
One suggestion to make this example even clearer is to include the join expression, such as t1 LEFT JOIN t2 ON (t1.c1 < t2.c1), and also mark the column names in the figures.
Please bear with me -- I might need more time to digest and outline the tasks required to get this great operator merged. Here are some updates. I looked through the implementation, and I think the high-level idea is great! It's definitely efficient for joins with an IE predicate.

Question
Theoretically, as long as both the probe and build sides are sorted on the compare key (regardless of ASC or DESC), this operator should be able to execute, right? It would be more efficient to preserve any existing order, though it would require more logic in the implementation, which can be hard to wrap your head around. I'll think about whether there's a simpler way to implement this idea.

Suggestion
I think the biggest problem in the implementation right now is that it might be buffering too many output results. At the moment, this operator only tracks the memory size of all buffered batches, so the extra memory usage can be as large as the entire join output. One approach to solve this is to add some state management after joining a single row on the probe side: for example, once the accumulated join result reaches a target batch size, emit it before processing the next row. This util can be helpful: https://docs.rs/arrow/latest/arrow/compute/struct.BatchCoalescer.html

Summary
I still need some time to understand and think about the existence join cases, but so far I suggest including those two changes in the initial PR; I think they would be very hard to do as follow-up patches (they need significant structural changes).
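As a rough illustration of the incremental-emission idea above, here is a minimal sketch using the linked BatchCoalescer utility (API as I read it from docs.rs; names and structure are mine, not the PR's actual code):

```rust
use std::sync::Arc;
use arrow::compute::BatchCoalescer;
use arrow::datatypes::Schema;
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

/// Push small per-probe-row join results into a coalescer and emit
/// target-sized batches as soon as they are ready, instead of buffering
/// the whole join output in memory.
fn emit_incrementally(
    schema: Arc<Schema>,
    joined_chunks: impl Iterator<Item = RecordBatch>,
    target_batch_size: usize,
) -> Result<Vec<RecordBatch>, ArrowError> {
    let mut coalescer = BatchCoalescer::new(schema, target_batch_size);
    let mut out = Vec::new();
    for chunk in joined_chunks {
        coalescer.push_batch(chunk)?;
        // Emit completed batches eagerly to cap the buffered output size.
        while let Some(batch) = coalescer.next_completed_batch() {
            out.push(batch);
        }
    }
    // Flush whatever is left below the target size.
    coalescer.finish_buffered_batch()?;
    while let Some(batch) = coalescer.next_completed_batch() {
        out.push(batch);
    }
    Ok(out)
}
```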
I think this is a good idea; the logic may be a bit difficult to follow for someone looking into the implementation.
I don't think this should be done. I looked into it, and the overhead + complexity brought by checking selectivity isn't worth it for a workload which is very unlikely (the equijoin filtering less than the single range filter). DuckDB also doesn't support this, if that is an indicator at all.
Perhaps I didn't express it clearly -- the idea was not to check the selectivity and reorder the filters; it's to always evaluate the primary join predicate first (equality for HJ, and a single IE for PMJ), then support optional remaining ANDed filters:

SELECT *
FROM generate_series(10000) AS t1(v1)
JOIN generate_series(10000) AS t2(v1)
ON ((t1.v1 + t2.v1) % 1000 = 0) AND (t1.v1 > t2.v1);

This should be executed as a PMJ with the IE predicate t1.v1 > t2.v1, with the remaining condition applied afterwards. The SMJ and HJ in DataFusion are all implemented this way: they fuse the general join conditions into the operator. DuckDB, however, breaks this post-filtering step out into a separate filter. The DuckDB approach is probably a good idea to simplify the join operators. Example: DuckDB evaluates the remaining join filter outside the join operator, whereas DF evaluates it inside the join operator.
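A minimal sketch of that split, assuming DataFusion's expression API; split_for_pmj is a hypothetical helper, not something in this PR:

```rust
use datafusion_expr::utils::split_conjunction;
use datafusion_expr::{BinaryExpr, Expr, Operator};

/// Partition an ANDed join condition into one range predicate (which would
/// drive the PiecewiseMergeJoin) and the residual filters evaluated afterwards.
fn split_for_pmj(on: &Expr) -> (Option<&Expr>, Vec<&Expr>) {
    let mut range_predicate = None;
    let mut residual = Vec::new();
    for conjunct in split_conjunction(on) {
        let is_range = matches!(
            conjunct,
            Expr::BinaryExpr(BinaryExpr { op, .. })
                if matches!(op, Operator::Lt | Operator::LtEq | Operator::Gt | Operator::GtEq)
        );
        if is_range && range_predicate.is_none() {
            range_predicate = Some(conjunct); // primary PMJ predicate
        } else {
            residual.push(conjunct); // applied as a post-join filter
        }
    }
    (range_predicate, residual)
}
```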
@2010YOUY01 Wouldn't you need to use a cost model to estimate which one to use, though, when both are viable? For example, the Hash Join (do the equi-condition then the residual filter) vs. PWMJ (do the range filter, then the equi-condition residual). You could estimate the selectivity of the equi predicate vs. the residual predicate, factor in whether the key is sorted, etc. when making the decision. Sorry if I misinterpreted this; thanks for bearing with me.
Ah, I get it now. How about using the following simple heuristic: if the predicate contains an equality check, e.g. [...]. I was thinking PWMJ should cover more cases originally handled by NLJ, since in the general case it should be faster than NLJ. I don't want to implement a cost model beyond the above rule at the moment.
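One plausible reading of that heuristic, written out as a sketch (the enum and function are illustrative assumptions, not the PR's planner rule, and the exact rule may differ from what was intended):

```rust
/// Assumed decision rule: any equality predicate -> HashJoin;
/// exactly one range predicate -> PiecewiseMergeJoin; otherwise -> NestedLoopJoin.
enum ChosenJoin {
    Hash,
    PiecewiseMerge,
    NestedLoop,
}

fn choose_join(num_equi_predicates: usize, num_range_predicates: usize) -> ChosenJoin {
    if num_equi_predicates > 0 {
        ChosenJoin::Hash
    } else if num_range_predicates == 1 {
        ChosenJoin::PiecewiseMerge
    } else {
        ChosenJoin::NestedLoop
    }
}
```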
Yes, I think so too. I don't know if it will be worth the complexity though, since this is a very niche workload (single range filter + higher selectivity for the filter). I think the plan for now is that I can implement support for AND expressions in this pull request, but when we include the planner changes in a follow-up PR we can discuss this there.
I think in-the-wild join workloads most commonly involve lengthy ANDed expressions, so it’s indeed challenging to make a smart planner.
@jonathanc-n please correct my understanding of the PMJ join, it's fairly new to me.
On a separate note, would it be possible to find a formula to calculate the cost? Referring to https://cs186berkeley.net/resources/static/notes/n09-Joins.pdf
for the simple NLJ case, without optimizations (no left prebuffering, looking up S on every row from R).
Having a cost formula would give people a better understanding of the benefits of using PMJ.
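For reference, the simple (tuple-at-a-time) nested loop join cost from the linked notes, paraphrased, where $[R]$ and $[S]$ are the page counts of $R$ and $S$ and $|R|$ is the tuple count of $R$:

$$\text{Cost}_{\text{SNLJ}} = [R] + |R| \cdot [S] \ \text{I/Os}$$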
get_final_indices_from_shared_bitmap(visited_left_side, self.join_type);
let (left_side, right_side) = get_final_indices_from_shared_bitmap(
    visited_left_side,
visited_left_side
awesome name
Thanks @jonathanc-n, I made an initial review. Thanks for the comments; again, you and @2010YOUY01 saved reviewing hours by commenting on the logic.
Before going forward we definitely need to include PMJ in the join_fuzz.rs fuzz testing; do you think it can also be tested by slt?
// Tracks the state of the `PiecewiseMergeJoin`
state: PiecewiseMergeJoinStreamState,
// Flag for whether or not the join_type is an existence join.
existence_join: bool,
Do we really need this flag? It can be calculated on the fly, or would it be too expensive?
Not expensive, but it'd just make more sense to use it since the calculation is done everywhere, plus you get that little bit of speedup.
I suggest calculating it on the fly (or at least pointing to the util function in the comment); I think it makes the logic easier to follow, and it can't be the bottleneck.
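A minimal sketch of the on-the-fly check being suggested; the helper name is illustrative, not necessarily what the PR uses:

```rust
use datafusion_common::JoinType;

/// Existence joins only report whether a match exists (semi/anti/mark)
/// rather than emitting joined rows.
fn is_existence_join(join_type: JoinType) -> bool {
    matches!(
        join_type,
        JoinType::LeftSemi
            | JoinType::LeftAnti
            | JoinType::LeftMark
            | JoinType::RightSemi
            | JoinType::RightAnti
    )
}
```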
@comphead @2010YOUY01 I have added incremental processing; the left/right sides have been swapped to correspond to buffered/streamed, plus performance cost docs and diagrams in the tests to make verifying correctness easier. I will make the incremental processing batch limit configurable as a runtime config once the planner is implemented. I will do the following as follow-up PRs + put it in an EPIC issue (this pull request was getting quite large and cluttered, probably better to merge this for now if it looks ready):
Thank you very much so far for the reviews, really appreciate it.
Awesome! It's on my list. To ensure correctness, I recommend writing a POC planner first so this operator can run through the SQL interface, and ensuring the extended tests pass. (The planner part is expected to be done in a separate PR.)
I think the strongest test for join edge cases so far is the sqlite test suite -- in my recent PR, it passed all DF tests + fuzz tests, but the sqlite test suite found 3 additional bugs.
@2010YOUY01 I'll verify that it works on my local setup, then make a pull request after this is merged.
I have gone through part of it and left some suggestions. Will continue soon.
/// - |R|, |S|: number of tuples in `R` and `S`
/// - `B`: number of buffer pages
///
/// # Performance (cost)
I think this cost model is not correct -- in ancient times, because disks were super slow, the number of page fetches was used to model performance. However, today for OLAP systems the bottleneck has shifted to the CPU, so I think it's better to model performance by the work done by the CPU, e.g.:
NLJ cost = buffer-side scan * probe-side row count
PWMJ cost = buffer-side scan * probe-side batch count
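To make the difference concrete with assumed numbers (my own illustration, not from the PR): with $10^6$ probe-side rows and a batch size of 8192, the probe side holds about 123 batches, so under this model PWMJ re-scans the buffered side roughly

$$\frac{\text{probe-side row count}}{\text{probe-side batch count}} = \frac{10^6}{\lceil 10^6 / 8192 \rceil} \approx 8130$$

times less often than NLJ.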
// For left existence joins the inputs will be swapped so the sort
// options are switched
if is_right_existence_join(join_type) {
    SortOptions::new(false, false)
Here it defines a null order; I think those nulls should be specially handled later. Let's add a comment pointing to the location where those nulls are handled.
};

/// Number of rows emitted per batch during incremental processing
pub const DEFAULT_INCREMENTAL_BATCH_VALUE: usize = 8192;
I recommend using the batch_size configuration from the context.
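A minimal sketch of that suggestion, using DataFusion's existing TaskContext/SessionConfig accessors (the surrounding function is hypothetical):

```rust
use std::sync::Arc;
use datafusion_execution::TaskContext;

/// Read the configured target batch size from the task context instead of
/// relying on a hard-coded DEFAULT_INCREMENTAL_BATCH_VALUE constant.
fn target_batch_size(context: &Arc<TaskContext>) -> usize {
    context.session_config().batch_size()
}
```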
I tend to think it's better to include the planner part in the initial PR; the reason is that if we do it in two steps, the executor can be incompatible with other operators, so the follow-on PR would also have a large diff. To make this task easier we want to shrink this PR; here are some ideas:
Rationale for this change
PiecewiseMergeJoin is a nice precursor to the implementation of ASOF, inequality, etc. joins (multiple range predicates). PiecewiseMergeJoin is specialized for when there is only one range filter and can perform much faster in this case, especially for semi, anti, and mark joins.
What changes are included in this PR?
- PiecewiseMergeJoin implementation only; there is no physical planner -> PiecewiseMergeJoinExec.
- ExecutionPlan has been implemented for PiecewiseMergeJoinExec (compute_properties and swap_inputs are not implemented).
- PiecewiseMergeJoinStream has been implemented for the actual batch emission logic.
Examples have been provided for the PiecewiseMergeJoinExec and PiecewiseMergeJoinStream implementations.
Benchmark Results
The benchmarks were tested on a random batch of values (streamed side) against a sorted batch (buffered side).
When compared to NestedLoopJoin, the queries for classic joins (left, right, inner, full) were about 10x faster 🚀
For existence joins (semi, anti), the join performed about 1000x faster 🚀
Benchmark Results for normal joins
Benchmark Results for existence joins
If you want to replicate:
Code
Next Steps
The pull request was getting large; here are the follow-up steps:
Are these changes tested?
Yes, unit tests.