Rewrite Nested Loop Join executor for 5× speed and 1% memory usage #16996
Conversation
Draft PR and early discussions: #16889. Thanks @UBarney @ding-young and @jonathanc-n for the help 🙏🏼
01)NestedLoopJoinExec: join_type=Inner, filter=a@0 < x@1
02)--DataSourceExec: partitions=1, partition_sizes=[0]
03)--SortExec: expr=[x@0 ASC NULLS LAST], preserve_partitioning=[false]
01)SortExec: expr=[x@3 ASC NULLS LAST], preserve_partitioning=[false]
See the PR write-up's 'breaking change' section for the reason.
Memory Usage Benchmark: current vs. nlj-rewrite (reproducer branch & comments to be added)
Thanks @2010YOUY01, this is a great PR. I'm planning to have a look next week.
original_left_array.as_ref(),
build_side_index,
)?;
scalar_value.to_array_of_size(filtered_probe_batch.num_rows())?
I think it would be neat if we could avoid generating this array and directly apply the filter based on the scalar, rather than converting it to an array of the same size.
I was expecting this optimization to bring > 10% speedup to the NLJ micro-bench, based on the flamegraphs.
Though I think this change requires some transformation to JoinFilter, and we don't have such a utility function yet. I'll open an issue to track this idea later.
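To make the idea concrete, here is a rough sketch of what such a rewrite could look like, assuming DataFusion's TreeNode transform API. The function name, the build_row representation, and the column-index convention are made up for illustration; this is not an existing utility nor the code from my experiment branch.

```rust
// Hypothetical sketch: replace build-side column references in the join
// filter with literals taken from the current build-side row, so the filter
// can be evaluated against the probe batch alone (no N-row scalar array).
use std::sync::Arc;

use datafusion_common::tree_node::{Transformed, TreeNode};
use datafusion_common::{Result, ScalarValue};
use datafusion_physical_expr::expressions::{Column, Literal};
use datafusion_physical_expr::PhysicalExpr;

fn bind_build_side_row(
    filter_expr: Arc<dyn PhysicalExpr>,
    // Scalars of the current build-side row, indexed by the filter's
    // intermediate-schema column index; probe-side positions are `None`
    // (an assumed convention for this sketch).
    build_row: &[Option<ScalarValue>],
) -> Result<Arc<dyn PhysicalExpr>> {
    let rewritten = filter_expr.transform(|expr| {
        if let Some(col) = expr.as_any().downcast_ref::<Column>() {
            if let Some(Some(scalar)) = build_row.get(col.index()) {
                // Build-side column: substitute the row's value as a literal.
                return Ok(Transformed::yes(
                    Arc::new(Literal::new(scalar.clone())) as Arc<dyn PhysicalExpr>,
                ));
            }
        }
        // Probe-side column or non-column node: keep as is.
        Ok(Transformed::no(expr))
    })?;
    Ok(rewritten.data)
}
```

The rewritten expression could then be evaluated against the probe batch directly, avoiding the scalar-to-array materialization.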
I tried, but it only gets slightly faster; I think this optimization might not be worth the extra complexity.
The reasons for not getting significantly faster:
- This step is not the bottleneck (for this simple NLJ micro-bench, the actual bottleneck is now evaluating the expression, especially the modulo)
- I used a utility in the optimizer to rewrite the join filter, and this step is slow (6% of total time)
Benchmark result:
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query ┃ nlj-rewrite ┃ optimize-join-filter ┃ Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1 │ 87.14 ms │ 83.12 ms │ no change │
│ QQuery 2 │ 118.03 ms │ 110.51 ms │ +1.07x faster │
│ QQuery 3 │ 164.04 ms │ 160.60 ms │ no change │
│ QQuery 4 │ 373.67 ms │ 358.43 ms │ no change │
│ QQuery 5 │ 232.66 ms │ 259.75 ms │ 1.12x slower │
│ QQuery 6 │ 1708.19 ms │ 1633.82 ms │ no change │
│ QQuery 7 │ 236.27 ms │ 257.24 ms │ 1.09x slower │
│ QQuery 8 │ 1710.80 ms │ 1634.83 ms │ no change │
│ QQuery 9 │ 270.86 ms │ 261.67 ms │ no change │
│ QQuery 10 │ 499.70 ms │ 493.96 ms │ no change │
└──────────────┴─────────────┴──────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary ┃ ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (nlj-rewrite) │ 5401.37ms │
│ Total Time (optimize-join-filter) │ 5253.94ms │
│ Average Time (nlj-rewrite) │ 540.14ms │
│ Average Time (optimize-join-filter) │ 525.39ms │
│ Queries Faster │ 1 │
│ Queries Slower │ 2 │
│ Queries with No Change │ 7 │
│ Queries with Failure │ 0 │
└─────────────────────────────────────┴───────────┘
My experiment branch: https://github.com/2010YOUY01/arrow-datafusion/tree/optimize-join-filter
Very nice, thanks for trying.
I wonder if we can try with a different expression, e.g. greater than?
For such simple expressions, it's 20% faster on top of this PR. I think we should include this optimization as a follow-up.
After that, the bottleneck shifts to filtering plus concatenating the final output; this TODO might help:
https://github.com/apache/arrow-rs/blob/a9b6077b2d8d5b5aee0e97e7d4335878e8da1876/arrow-select/src/coalesce.rs#L206
Hi @2010YOUY01 Thanks for creating this amazing PR and for the detailed explanation of why we don't need to maintain the right_side order. This is a great optimization!

Just to provide some context for future collaboration: this approach (joining only one left row at a time) was actually the initial direction I considered. Critically, I was hesitant to proceed because I wanted to avoid making what seemed like a potential breaking change without getting confirmation. I believe we should be very cautious about such changes. That's precisely why I raised this concern and @-mentioned you in my comment, seeking clarification. Since I didn't get a response, I proceeded with the alternative implementation to keep things moving. A quick clarification on that point back then would have saved the effort on my merged PR and allowed us to get to this optimal solution directly.

No worries, the main thing is that we've landed the best solution for the project. Just thought this feedback on the process could be helpful for our future interactions. Great work on this!
@UBarney I'm so sorry about that — I completely let that issue discussion slip through.
Thanks @2010YOUY01 I'll continue later, I'm out of my mental capacity :)
Appreciate the comments, they made it much easier to navigate.
It's a good sign that we got fuzz testing working with the new implementation.
For the post-filtering, it would probably be possible to split the filter-stage evaluation for certain types of join:
Inner and outer join filters can be evaluated early, whereas SEMI and ANTI would stay at a late stage like now.
/// unmatched build-side data.
/// ## 1. Buffering Left Input
/// - The operator eagerly buffers all left-side input batches into memory,
/// unless a memory limit is reached (see 'Memory-limited Execution').
What would happen if the memory limit is reached?
I have updated the comment to explain it.
pub(crate) join_metrics: BuildProbeJoinMetrics,

/// `batch_size` from configuration
cfg_batch_size: usize,
Let's just use batch_size?
I actually want to enforce this naming style at the project scope.
For rare configuration options that have been propagated through many layers to some utility function, it would be easier to figure out the purpose of such an argument if it is prefixed with cfg_; usually we can find a more detailed explanation on the config page.
We can open an issue and write a style guide for it, WDYT?
It might be a good idea to standardize naming, although it is typically a painful process. IMO I'd prefer batch_size for now, as people are used to it and it would be confusing for the community to figure out whether there is any difference between batch_size and cfg_batch_size.
Makes sense, let's use batch_size for now.
I really appreciate your detailed review. I have addressed the comments in 4c111c3. BTW, I take code understandability very seriously, and I'm happy to make any small changes that improve it. I encourage others to do similar reviews -- point out anything that doesn't make sense and share even small nitpicks.
I don't get this idea of changing when the filter is evaluated, could you elaborate?
As far as I understood, the filter evaluation happens in a post phase, after tuples are joined by key, and this makes a lot of sense, especially for SEMI and ANTI joins where you have to track filter evaluation results for the same key across all batches. For example:
So for the right record batch you can evaluate the filter
Btw, would it be possible to calculate the cost of the join like in https://www.youtube.com/watch?v=RcEW0P8iVTc ? The video shows multiple implementations of NLJ and how to calculate their cost, with pseudo code; it would be super useful for the community and further improvements. From what I understood, the left side is scanned once and entirely saved in memory; what about right scans?
I see, but this step should be handled below the join operator by filter pushdown:
Perhaps it's better to rename the variable. I'm not sure if such filter pushdowns would be incorrect for SEMI/ANTI joins; I'll double-check that.
Yes, that's exactly the idea for the future memory-limited NLJ implementation -- for each group of buffered left batches (under the memory limit), do one round of right scan.
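For reference, a minimal sketch of that block-style strategy is below. Under the classic textbook cost model from the linked video, it performs roughly |L| + ceil(|L| / B) * |R| page reads, where B is how much of the left side fits in the memory budget. The function, closures, and memory-accounting policy are illustrative only, not code from this PR.

```rust
// Hypothetical sketch of a memory-limited (block) nested loop join driver;
// the function, closures, and memory-accounting policy are illustrative only.
use arrow::compute::concat_batches;
use arrow::datatypes::SchemaRef;
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

fn block_nested_loop_join(
    left_schema: &SchemaRef,
    left_batches: &[RecordBatch],
    memory_budget: usize, // max bytes of left input buffered at once
    // Re-scans the right side once and joins it against one buffered left block.
    mut join_block_with_right_scan: impl FnMut(&RecordBatch) -> Result<(), ArrowError>,
) -> Result<(), ArrowError> {
    let mut block: Vec<RecordBatch> = vec![];
    let mut block_bytes = 0;

    for batch in left_batches {
        block_bytes += batch.get_array_memory_size();
        block.push(batch.clone());

        // Once the buffered left block hits the memory budget, do one full
        // scan of the right side for this block, then start a new block.
        if block_bytes >= memory_budget {
            join_block_with_right_scan(&concat_batches(left_schema, &block)?)?;
            block.clear();
            block_bytes = 0;
        }
    }
    // The final (possibly partial) block also needs one right scan.
    if !block.is_empty() {
        join_block_with_right_scan(&concat_batches(left_schema, &block)?)?;
    }
    Ok(())
}
```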
Thanks @2010YOUY01, I think this PR is good the way it is.
Given the change, I would probably ask for another pair of eyes, like @jonathanc-n, who is also working on joins in #16660.
// PROPERTIES:
// Operator's properties that remain constant
//
// Note: The implementation uses the terms left/inner/build-side table and
I don't agree with this. "Inner" and "outer" table, in particular for a nested loop join, have fairly strict definitions. We should not confuse the issue by referring to what is historically referred to as the outer table as the "inner" table in this implementation. It makes it harder for contributors to understand the implementation.
I see. What do you think about using only the terms left/build and right/probe in the implementation?
I think build and probe are unambiguous, and in DataFusion, left == build is a convention: it always swaps the smaller table to the left as the build side.
Yes, I think we should just keep it as left/build + right/probe. I can see the appeal of referring to it as inner/outer, but it gets confusing because in academia (as @mbutrovich says), the term 'outer table' refers to the table which drives the outer loop. Despite this, I've noticed that people do associate the left table with being the inner one, so just removing that ambiguity is probably the best idea.
I believe Snowflake flips the convention and calls the left table "inner" which confuses the issue, regardless of physical join implementation. I think left/build and right/probe is consistent and fine, even though build and probe don't have as much meaning in a NLJ. I just was concerned about inner and outer getting confused particularly for a NLJ. Thanks!
Updated in 8cc3654. Thanks!
Just feedback on the comments, but I'd like a discussion before merging. The implementation looked clean to me. Nicely done!
I'll try to take a look today
This LGTM, thank you @2010YOUY01. Nice clean design!
There are no new changes included. The speedup reaches 5× simply because the NLJ micro-benchmark is extended with cases where the join predicate is very cheap to evaluate (see #16996 (comment)), and such cases favor the new implementation more. I changed the title to make it more click-baity 🤣
Thanks for updating the comments and variable names. This LGTM now.
Thanks @2010YOUY01 @mbutrovich @jonathanc-n @UBarney @ding-young for the fantastic teamwork. I'm going to merge this PR at the end of the day.
Can someone please run the extended tests on this branch? I am pretty sure that the extended tests are failing against this branch, but I am having issues with git locally so I'd like confirmation. Edit: you'll likely have to merge the latest main to get the latest datafusion-testing pin and update your submodules to make sure they reflect that pin. Then:
Here is an example of a query that passes in main and fails on this branch:
CREATE TABLE tab0(col0 INTEGER, col1 INTEGER, col2 INTEGER);
CREATE TABLE tab1(col0 INTEGER, col1 INTEGER, col2 INTEGER);
CREATE TABLE tab2(col0 INTEGER, col1 INTEGER, col2 INTEGER);
INSERT INTO tab0 VALUES(97,1,99);
INSERT INTO tab0 VALUES(15,81,47);
INSERT INTO tab0 VALUES(87,21,10);
INSERT INTO tab1 VALUES(51,14,96);
INSERT INTO tab1 VALUES(85,5,59);
INSERT INTO tab1 VALUES(91,47,68);
INSERT INTO tab2 VALUES(64,77,40);
INSERT INTO tab2 VALUES(75,67,58);
INSERT INTO tab2 VALUES(46,51,23);
SELECT DISTINCT 32 AS col2 FROM tab0 AS cor0 LEFT OUTER JOIN tab2 AS cor1 ON ( NULL ) IS NULL;
Arrow error: Invalid argument error: must either specify a row count or at least one column
There are a bunch of other failures where the result counts between main and this branch do not match. I verified that a bunch of the queries do have NestedLoopJoinExec in the explain result.
Thanks for the report, I'll work on that today. I ran the extended tests on an early version; those failures must have been introduced by my recent changes 🤔
Run extended tests
That was removed recently, unfortunately.
I have run the sqlite test suite locally, and it's passing now.
This PR is ready to merge again.
@2010YOUY01 I think you can run extended tests on your forked repo from |
Ah, yes. I forgot that pushing to a private clone's main branch can also trigger it. I have started it:
It passed. I think it's safe to merge now, thanks everyone!
Which issue does this PR close?
Rationale for this change
Summary
This PR rewrites the NLJ operator from scratch with a different approach, to limit the extra intermediate data overhead to only one RecordBatch and to eliminate other redundant conversions in the old implementation. Using the NLJ micro-bench introduced in this PR, this rewrite provides up to a 5X speed-up and uses only 1% of the memory in extreme cases.
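At a high level, the rewritten probe path processes one left row against one right batch at a time and accumulates output until the batch_size threshold (see the comparison under "What changes are included in this PR?"). The sketch below is an illustrative outline only -- hypothetical names, closures, and simplified buffering, not the actual executor code.

```rust
// Illustrative outline of the per-left-row probe strategy; not the PR's code.
use arrow::array::BooleanArray;
use arrow::compute::{concat_batches, filter_record_batch};
use arrow::datatypes::SchemaRef;
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

/// Probes one right batch against all buffered build-side rows, one left row
/// at a time, accumulating output until `batch_size` rows can be emitted.
fn probe_one_right_batch(
    output_schema: &SchemaRef,
    num_left_rows: usize,      // rows buffered on the build side
    right_batch: &RecordBatch, // current probe-side batch
    // Evaluates the join predicate for (one left row x right batch) and
    // returns one boolean per right row.
    eval_filter: impl Fn(usize, &RecordBatch) -> Result<BooleanArray, ArrowError>,
    // Builds joined output rows from one left row and its matching right rows.
    build_output: impl Fn(usize, &RecordBatch) -> Result<RecordBatch, ArrowError>,
    batch_size: usize,
) -> Result<Vec<RecordBatch>, ArrowError> {
    let mut pending: Vec<RecordBatch> = vec![];
    let mut pending_rows = 0;
    let mut completed = vec![];

    for left_idx in 0..num_left_rows {
        // The intermediate here is bounded by one right batch, never by the
        // full (all left rows x right batch) cross product.
        let mask = eval_filter(left_idx, right_batch)?;
        let matched = filter_record_batch(right_batch, &mask)?;
        if matched.num_rows() == 0 {
            continue;
        }
        pending_rows += matched.num_rows();
        pending.push(build_output(left_idx, &matched)?);

        // Emit an output batch once enough rows have accumulated.
        if pending_rows >= batch_size {
            completed.push(concat_batches(output_schema, &pending)?);
            pending.clear();
            pending_rows = 0;
        }
    }
    if !pending.is_empty() {
        completed.push(concat_batches(output_schema, &pending)?);
    }
    Ok(completed)
}
```

The key point is that intermediate data is bounded by a single probe batch, and output is emitted in batch_size-sized chunks instead of one potentially huge batch.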
Speed benchmark
Memory Usage Benchmark
Main
PR
Next Steps
I think it is ready for review; the major potential optimizations have all been done. Though there is one minor chore to be done:
Follow-ups
Why a Rewrite?
(TLDR: it's the easiest way to address the existing problem)
The original implementation performs a Cartesian product of (all-left-batches x right-batch), materializes that intermediate result for predicate evaluation, and then materializes the (potentially very large) final result all at once. This design is inherently inefficient, and although many patches have attempted to alleviate the problem, the fundamental issue remains.
A key challenge is that the original design and the ideal design (i.e., one that produces small intermediates during execution) are fundamentally different. As a result, it's practically impossible to make small incremental changes that fully address the inefficiency. These patches may also increase code complexity, making long-term maintenance more difficult.
Example of Prior Work
Here's a recent example of a small patch intended to improve the situation:
#16443
Even with careful engineering, I still feel the entropy in the code increases.
Since NLJ is a relatively straightforward operator, a full rewrite seemed worthwhile. This allows for a clean, simplified design focused on current goals—performance and memory efficiency—without being constrained by the legacy implementation.
What changes are included in this PR?
Implementation
The implementation/design doc can be found in the source code.
A brief comparison between the old implementation and this PR:
Old implementation
- Performs a Cartesian product (all_buffered_left_batch x one_right_batch) to calculate indices
This PR
- Works on one small intermediate at a time (one_left_row x one_right_batch), and does the filtering and output construction directly on this small intermediate
- Accumulates output until it reaches the batch_size threshold
Old v.s. PR
- The old implementation has to convert indices <--> batch, while this PR can use the right batch directly. This avoids unnecessary transformations and makes the implementation more cache-friendly
- The old implementation's intermediate index buffer is left_buffered_batches_total_rows * right_batch_size (default 8192) * 12 Bytes (left and right indices are represented by uint64 and uint32), which can be significant for a large dataset (for example, one million buffered left rows works out to about 1,000,000 * 8,192 * 12 B ≈ 98 GB of index data per probe batch)
Are these changes tested?
Existing tests
Are there any user-facing changes?
The old implementation can maintain the right input order in certain cases; this PR's new design is not able to maintain that property -- preserving it would require a different design, with significant performance and memory overhead.
If this property is important to some users, we can keep the old implementation (maybe rename it to RightOrderPreservingNLJ and use a configuration option to control it).