
Conversation

@2010YOUY01 (Contributor) commented Jul 31, 2025

Which issue does this PR close?

  • Closes #.

Rationale for this change

Summary

This PR rewrites the NLJ operator from scratch with a different approach, to limit the extra intermediate data overhead to only 1 RecordBatch, and eliminate other redundant conversions in the old implementation.

Using the NLJ micro-bench introduced in this PR, this rewrite shows up to a 5X speed-up, and in extreme cases uses only 1% of the memory.

Speed benchmark

┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       main ┃ nlj-rewrite ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │  220.42 ms │    86.53 ms │ +2.55x faster │
│ QQuery 2     │  241.19 ms │   117.37 ms │ +2.05x faster │
│ QQuery 3     │  335.39 ms │   165.34 ms │ +2.03x faster │
│ QQuery 4     │  761.47 ms │   367.77 ms │ +2.07x faster │
│ QQuery 5     │  540.70 ms │   231.72 ms │ +2.33x faster │
│ QQuery 6     │ 5344.76 ms │  1678.18 ms │ +3.18x faster │
│ QQuery 7     │  546.76 ms │   233.03 ms │ +2.35x faster │
│ QQuery 8     │ 5218.51 ms │  1680.25 ms │ +3.11x faster │
│ QQuery 9     │  677.36 ms │   266.70 ms │ +2.54x faster │
│ QQuery 10    │ 1698.53 ms │   500.02 ms │ +3.40x faster │
│ QQuery 11    │ 1144.87 ms │   222.46 ms │ +5.15x faster │
│ QQuery 12    │ 1138.59 ms │   225.84 ms │ +5.04x faster │
└──────────────┴────────────┴─────────────┴───────────────┘

Memory Usage Benchmark

Main

Query     Time (ms)     Peak RSS  Peak Commit  Major Page Faults
----------------------------------------------------------------
1            761.84    1042.7 MB       5.9 GB                  0
2            973.85    1486.3 MB       8.3 GB                  0
3           1197.72       2.2 GB      11.9 GB                  0
4           3644.36      12.2 GB      20.8 GB                  0
5           3729.40       3.9 GB      13.4 GB                  0
6          18882.52      22.9 GB      24.5 GB                  0
7           3128.09       3.9 GB      13.4 GB                  0
8          19002.03      22.9 GB      24.6 GB                  0
9           2968.10      10.3 GB      15.1 GB                  0
10          6064.70      19.4 GB      28.3 GB                  0

PR

Query     Time (ms)     Peak RSS  Peak Commit  Major Page Faults
----------------------------------------------------------------
1            218.46      78.6 MB    1024.1 MB                  0
2            318.69      76.7 MB    1024.2 MB                  0
3            352.38      77.0 MB    1024.2 MB                  0
4           1384.05      85.1 MB    1024.2 MB                  0
5           2217.99      82.4 MB    1024.2 MB                  0
6           4331.11      77.0 MB    1024.1 MB                  0
7           2295.73      79.1 MB    1024.1 MB                  0
8           4372.36      78.6 MB    1024.1 MB                  0
9            911.73      81.1 MB    1024.2 MB                  0
10          1212.94      71.6 MB    1024.1 MB                  0

Next Steps

I think it is ready for review; the major potential optimizations have all been done. One minor chore remains:

  • Add join metrics tracking

Follow-ups

Why a Rewrite?

(TLDR: it's the easiest way to address the existing problem)

The original implementation performs a Cartesian product of (all-left-batches x right-batch), materializes that intermediate result for predicate evaluation, and then materializes the (potentially very large) final result all at once. This design is inherently inefficient, and although many patches have attempted to alleviate the problem, the fundamental issue remains.

A key challenge is that the original design and the ideal design (i.e., one that produces small intermediates during execution) are fundamentally different. As a result, it's practically impossible to make small incremental changes that fully address the inefficiency. These patches may also increase code complexity, making long-term maintenance more difficult.

Example of Prior Work

Here's a recent example of a small patch intended to improve the situation:
#16443
Even with careful engineering, I still feel the entropy in the code increases.

Since NLJ is a relatively straightforward operator, a full rewrite seemed worthwhile. This allows for a clean, simplified design focused on current goals—performance and memory efficiency—without being constrained by the legacy implementation.

What changes are included in this PR?

Implementation

The implementation/design doc can be found in the source code.

A brief comparison between the old implementation and this PR:

Old implementation

  1. Do a Cartesian product of (all_buffered_left_batch x one_right_batch) to calculate indices
  2. Construct intermediate batch and output it chunk by chunk

This PR

  • For the inner loop, only perform (one_left_row x one_right_batch), and do filtering and output construction directly on this small intermediate.
  • Eagerly yield output once the output buffer reaches the batch_size threshold
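A minimal sketch of the per-row probe loop described above (plain Rust over integer slices, purely illustrative -- the real operator works on Arrow RecordBatches, and names like `nlj_sketch` are hypothetical):

```rust
/// Illustrative sketch: join one left row at a time against one right
/// batch, filter immediately, and flush the output buffer once it
/// reaches the `batch_size` threshold.
fn nlj_sketch(
    left_rows: &[i32],
    right_batches: &[Vec<i32>],
    batch_size: usize,
    predicate: impl Fn(i32, i32) -> bool,
) -> Vec<Vec<(i32, i32)>> {
    let mut output = Vec::new();
    let mut buffer: Vec<(i32, i32)> = Vec::new();
    for right_batch in right_batches {
        for &l in left_rows {
            // Inner step: (one_left_row x one_right_batch), filtered directly
            for &r in right_batch {
                if predicate(l, r) {
                    buffer.push((l, r));
                }
            }
            // Eagerly yield output once the buffer reaches batch_size
            if buffer.len() >= batch_size {
                output.push(std::mem::take(&mut buffer));
            }
        }
    }
    if !buffer.is_empty() {
        output.push(buffer);
    }
    output
}

fn main() {
    let out = nlj_sketch(&[1, 2, 3], &[vec![1, 2, 3], vec![4, 5]], 4, |l, r| l < r);
    // 3 matches from the first right batch + 6 from the second
    let total: usize = out.iter().map(|b| b.len()).sum();
    assert_eq!(total, 9);
}
```

The intermediate held at any moment is only one (row x batch) worth of candidates plus the bounded output buffer, rather than the full Cartesian product.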

Old vs. PR

  • The old implementation requires multiple conversions between indices <--> batch, while this PR can use the right batch directly. This avoids unnecessary transformations and makes the implementation more cache-friendly.
  • The old implementation has an extra memory overhead of left_buffered_batches_total_rows * right_batch_size (default 8192) * 12 bytes (left and right indices are represented as uint64 and uint32), which can be significant for large datasets.
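For a sense of scale, that formula can be worked through with illustrative numbers (not taken from the benchmarks above):

```rust
// Back-of-envelope for the old implementation's transient index overhead:
// one (uint64 left index, uint32 right index) pair = 12 bytes per
// candidate row pair. Numbers here are illustrative.
fn index_overhead_bytes(left_total_rows: u64, right_batch_size: u64) -> u64 {
    left_total_rows * right_batch_size * 12
}

fn main() {
    // Just 100k buffered left rows against the default 8192-row right
    // batch already implies ~9.8 GB of transient index data.
    assert_eq!(index_overhead_bytes(100_000, 8192), 9_830_400_000);
}
```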

Are these changes tested?

Existing tests

Are there any user-facing changes?

The old implementation can maintain right input order in certain cases; this PR's new design is not able to maintain that property -- preserving it would require a different design with significant performance and memory overhead.

If this property is important to some users, we can keep the old implementation (perhaps renamed to RightOrderPreservingNLJ and gated behind a configuration option)

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) physical-plan Changes to the physical-plan crate labels Jul 31, 2025
@2010YOUY01 (Contributor Author)

Draft PR and early discussions: #16889

Thanks @UBarney @ding-young and @jonathanc-n for the help 🙏🏼

01)NestedLoopJoinExec: join_type=Inner, filter=a@0 < x@1
02)--DataSourceExec: partitions=1, partition_sizes=[0]
03)--SortExec: expr=[x@0 ASC NULLS LAST], preserve_partitioning=[false]
01)SortExec: expr=[x@3 ASC NULLS LAST], preserve_partitioning=[false]
@2010YOUY01 (Contributor Author):
See the PR write-up's 'breaking change' section for the reason.

@ding-young (Contributor):

Memory Usage Benchmark

(The "current" vs. "nlj-rewrite" memory benchmark tables here are identical to the ones reproduced in the PR description above: peak RSS drops from up to 22.9 GB on main to under 100 MB on this branch.)

(reproducer branch & comments to be added)

@comphead (Contributor) left a comment:
Thanks @2010YOUY01, this is a great PR; I'm planning to have a look next week

original_left_array.as_ref(),
build_side_index,
)?;
scalar_value.to_array_of_size(filtered_probe_batch.num_rows())?
Contributor:
I think it would be neat if we could avoid generating this array and apply the filter directly based on the scalar, rather than converting it to an array of the same size.

@2010YOUY01 (Contributor Author):
I was expecting this optimization to bring a >10% speedup on the NLJ micro-bench, based on the flamegraphs.

Though I think this change requires some transformation of the JoinFilter, and we don't have such a utility function yet. I'll open an issue to track this idea later.

@2010YOUY01 (Contributor Author):
I tried, but it only gets slightly faster; I think this optimization might not be worth the extra complexity.

The reasons for not getting significantly faster:

  1. This step is not the bottleneck (for this simple NLJ micro-bench, the actual bottleneck is now evaluating the expression, especially the modulo)
  2. I used a utility in the optimizer to rewrite the join filter, and this step is slow (6% of total time)

Benchmark result:

┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃ nlj-rewrite ┃ optimize-join-filter ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │    87.14 ms │             83.12 ms │     no change │
│ QQuery 2     │   118.03 ms │            110.51 ms │ +1.07x faster │
│ QQuery 3     │   164.04 ms │            160.60 ms │     no change │
│ QQuery 4     │   373.67 ms │            358.43 ms │     no change │
│ QQuery 5     │   232.66 ms │            259.75 ms │  1.12x slower │
│ QQuery 6     │  1708.19 ms │           1633.82 ms │     no change │
│ QQuery 7     │   236.27 ms │            257.24 ms │  1.09x slower │
│ QQuery 8     │  1710.80 ms │           1634.83 ms │     no change │
│ QQuery 9     │   270.86 ms │            261.67 ms │     no change │
│ QQuery 10    │   499.70 ms │            493.96 ms │     no change │
└──────────────┴─────────────┴──────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                   ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (nlj-rewrite)            │ 5401.37ms │
│ Total Time (optimize-join-filter)   │ 5253.94ms │
│ Average Time (nlj-rewrite)          │  540.14ms │
│ Average Time (optimize-join-filter) │  525.39ms │
│ Queries Faster                      │         1 │
│ Queries Slower                      │         2 │
│ Queries with No Change              │         7 │
│ Queries with Failure                │         0 │
└─────────────────────────────────────┴───────────┘

My experiment branch: https://github.com/2010YOUY01/arrow-datafusion/tree/optimize-join-filter
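The general idea explored in that branch -- once the single build-side row is fixed, the join filter becomes a function of the probe side alone -- can be illustrated with plain closures (this is not DataFusion's actual JoinFilter API; all names here are hypothetical):

```rust
// Bind the build-side value into the join filter so the probe batch can
// be filtered against a constant, instead of expanding the scalar into a
// full-length array first. Hypothetical illustration only.
fn bind_left_value(
    left_value: i64,
    filter: impl Fn(i64, i64) -> bool,
) -> impl Fn(i64) -> bool {
    move |right_value| filter(left_value, right_value)
}

fn main() {
    let join_filter = |l: i64, r: i64| l < r;
    // Fix the current left row's value once per inner-loop iteration
    let bound = bind_left_value(5, join_filter);
    let matched: Vec<i64> = (0..10).filter(|&r| bound(r)).collect();
    assert_eq!(matched, vec![6, 7, 8, 9]);
}
```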

Contributor:

> I tried but it only gets slightly faster, I think this optimization might not worth the extra complexity. [...]
>
> My experiment branch: https://github.com/2010YOUY01/arrow-datafusion/tree/optimize-join-filter

Very nice, thanks for trying.
I wonder if we can try with a different expression, e.g. greater than?

@2010YOUY01 (Contributor Author):
For such simple expressions, it's 20% faster on top of this PR. I think we should include this optimization as a follow-up.
After that, the bottleneck shifts to filtering + concatenating the final output; this TODO might help:
https://github.com/apache/arrow-rs/blob/a9b6077b2d8d5b5aee0e97e7d4335878e8da1876/arrow-select/src/coalesce.rs#L206

@UBarney (Contributor) commented Aug 2, 2025

Hi @2010YOUY01

Thanks for creating this amazing PR and for the detailed explanation on why we don't need to maintain the right_side order. This is a great optimization!

Just to provide some context for future collaboration, this approach (joining only one left row at a time) was actually the initial direction I considered. Critically, I was hesitant to proceed because I wanted to avoid making what seemed like a potential breaking change without getting confirmation. I believe we should be very cautious about such changes.

That's precisely why I raised this concern and @-mentioned you in my comment, seeking clarification.

Since I didn't get a response, I proceeded with the alternative implementation to keep things moving. A quick clarification on that point back then would have saved the effort on my merged PR and allowed us to get to this optimal solution directly.

No worries, the main thing is we've landed the best solution for the project. Just thought this feedback on the process could be helpful for our future interactions.

Great work on this!

@2010YOUY01 (Contributor Author):

@UBarney I'm so sorry about that — I completely let that issue discussion slip through.
For important matters, feel free to ping me multiple times or bring it up again in a PR discussion.
I have to admit I often forget to respond to issue replies, but I usually follow up on PR discussions. (I suspect that's probably true for others as well.)

@comphead (Contributor) left a comment:

Thanks @2010YOUY01, I'll continue later; I'm out of my mental capacity :)
I appreciate the comments, they made it much easier to navigate.

It's a good sign that we got fuzz testing working with the new implementation.
For the post-filtering, it would probably be possible to split the filter-stage evaluation for certain types of join:
inner and outer join filters can be evaluated early, whereas SEMI and ANTI stay at a late stage like now.

/// unmatched build-side data.
/// ## 1. Buffering Left Input
/// - The operator eagerly buffers all left-side input batches into memory,
/// unless a memory limit is reached (see 'Memory-limited Execution').
Contributor:
What would happen if the memory limit is reached?

@2010YOUY01 (Contributor Author):
I have updated the comment to explain it.

pub(crate) join_metrics: BuildProbeJoinMetrics,

/// `batch_size` from configuration
cfg_batch_size: usize,
Contributor:

Let's use just batch_size?

@2010YOUY01 (Contributor Author):

I actually want to enforce this naming style at the project scope.
For rare configuration options that have been propagated through many layers to some utility function, it is easier to figure out the purpose of such an argument if it is prefixed with cfg_; usually the config page has a more detailed explanation.

We can open an issue and write a style guide for it, WDYT?

Contributor:

It might be a good idea to standardize naming, although it is typically a painful process. IMO I'd prefer batch_size for now, as people are used to it, and it would confuse the community as to whether there is any difference between batch_size and cfg_batch_size.

@2010YOUY01 (Contributor Author):

Makes sense, let's use batch_size for now.

@2010YOUY01 (Contributor Author)

I really appreciate your detailed review. I have addressed them in 4c111c3

BTW, I take code understandability very seriously, and I'm happy to make any small changes that improve it. I encourage others to do similar reviews -- point out anything that doesn't make sense and share even small nitpicks.

> Good sign we got fuzz testing working with the new implementation. For the post filtering it would be probably possible to split filter stage evalulation for certain types of join. Inner, outer joins filter can be evaluated early whereas SEMI, ANTI on late stage like now.

I don't get this changing filter evaluation time idea, could you elaborate?

@comphead (Contributor) commented Aug 9, 2025

> I don't get this changing filter evaluation time idea, could you elaborate?

As far as I understood, the filter evaluation happens in a post phase, after tuples are joined by key, and this makes a lot of sense, especially for SEMI and ANTI joins where you have to track filter evaluation results for the same key across all batches.
However, for inner and outer joins it may be beneficial to evaluate record batches early, avoid the join calculation, and save some ticks.

For example

> with t as (select 1 a, 2 b), t1 as (select 2 a, 3 b) select * from t left join t1 on (t.a < t1.b and t1.b = 4);
+---+---+------+------+
| a | b | a    | b    |
+---+---+------+------+
| 1 | 2 | NULL | NULL |
+---+---+------+------+

So for the right record batch you can evaluate the filter t1.b = 4 and figure out that the rows in the batch don't satisfy the join condition, even without performing the actual join.

@comphead (Contributor) commented Aug 9, 2025

Btw, would it be possible to calculate the cost of the join like in https://www.youtube.com/watch?v=RcEW0P8iVTc?

The video shows multiple implementations of NLJ, how to calculate their cost, and pseudocode; it would be super useful for the community and further improvements.

From what I understood, the left side is scanned once and entirely saved in memory -- what about right scans?
Perhaps in the future we can play with blocks of input left batches to prevent OOM.

@2010YOUY01 (Contributor Author)

> So the right record batch you can evaluate filter t1.b = 4 and figure out that indices in the batch doesn't fit to join condition even without having actual join

I see, but this step should be handled below the join operator by filter pushdown;
see the FilterExec in the left input in the example below:

> explain SELECT *
        FROM range(10000) AS t1
        JOIN range(10000) AS t2
        ON ((t1.value + t2.value) % 1000 = 0) AND (t1.value > 5000);
+---------------+------------------------------------------------------------+
| plan_type     | plan                                                       |
+---------------+------------------------------------------------------------+
| physical_plan | ┌───────────────────────────┐                              |
|               | │     NestedLoopJoinExec    ├──────────────┐               |
|               | └─────────────┬─────────────┘              │               |
|               | ┌─────────────┴─────────────┐┌─────────────┴─────────────┐ |
|               | │   CoalescePartitionsExec  ││      RepartitionExec      │ |
|               | │                           ││    --------------------   │ |
|               | │                           ││ partition_count(in->out): │ |
|               | │                           ││          1 -> 14          │ |
|               | │                           ││                           │ |
|               | │                           ││    partitioning_scheme:   │ |
|               | │                           ││    RoundRobinBatch(14)    │ |
|               | └─────────────┬─────────────┘└─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐┌─────────────┴─────────────┐ |
|               | │    CoalesceBatchesExec    ││       LazyMemoryExec      │ |
|               | │    --------------------   ││    --------------------   │ |
|               | │     target_batch_size:    ││     batch_generators:     │ |
|               | │            8192           ││ range: start=0, end=10000,│ |
|               | │                           ││       batch_size=8192     │ |
|               | └─────────────┬─────────────┘└───────────────────────────┘ |
|               | ┌─────────────┴─────────────┐                              |
|               | │         FilterExec        │                              |
|               | │    --------------------   │                              |
|               | │         predicate:        │                              |
|               | │        value > 5000       │                              |
|               | └─────────────┬─────────────┘                              |
|               | ┌─────────────┴─────────────┐                              |
|               | │      RepartitionExec      │                              |
|               | │    --------------------   │                              |
|               | │ partition_count(in->out): │                              |
|               | │          1 -> 14          │                              |
|               | │                           │                              |
|               | │    partitioning_scheme:   │                              |
|               | │    RoundRobinBatch(14)    │                              |
|               | └─────────────┬─────────────┘                              |
|               | ┌─────────────┴─────────────┐                              |
|               | │       LazyMemoryExec      │                              |
|               | │    --------------------   │                              |
|               | │     batch_generators:     │                              |
|               | │ range: start=0, end=10000,│                              |
|               | │       batch_size=8192     │                              |
|               | └───────────────────────────┘                              |
|               |                                                            |
+---------------+------------------------------------------------------------+

Perhaps it's better to rename the variable from filter to join_predicate to make it clearer.

I'm not sure if such filter pushdowns would be incorrect for SEMI/ANTI joins, I'll double check that.

@2010YOUY01 (Contributor Author)

> From what I understood, the left side scanned once, and entirely saved in memory, what about right scans? Perhaps in future we can play with blocks of input left batches to prevent OOM

Yes, that's exactly the idea for the future memory-limited NLJ implementation -- for each block of buffered left batches (within the memory limit), do one round of right-side scans.
Though I don't think there are many tuning opportunities here: input scanning can be expensive if the input is a Parquet file, so the goal is to minimize the number of right scans, and we should buffer as many left batches as possible.
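As a sketch of that trade-off (a hypothetical helper, not code from this PR): the number of full right-side scans equals the number of left-side blocks that fit in the memory budget, so buffering more per block directly reduces rescans.

```rust
// Hypothetical sketch of block-based, memory-limited NLJ: buffer as many
// left batches as the memory budget allows, then do one full right-side
// scan per buffered block.
fn right_scan_count(total_left_batches: usize, batches_per_block: usize) -> usize {
    // ceil division: one right scan per block of buffered left batches
    (total_left_batches + batches_per_block - 1) / batches_per_block
}

fn main() {
    // A 10x larger memory budget means 10x fewer right-side scans.
    assert_eq!(right_scan_count(100, 10), 10);
    assert_eq!(right_scan_count(100, 100), 1);
}
```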

@comphead (Contributor) left a comment:

Thanks @2010YOUY01, I think this PR is good the way it is.
Given the change, I would probably ask for another pair of eyes, like @jonathanc-n, who is also working on joins in #16660

@mbutrovich mbutrovich self-requested a review August 11, 2025 17:08
// PROPERTIES:
// Operator's properties that remain constant
//
// Note: The implementation uses the terms left/inner/build-side table and
Contributor:

I don't agree with this. "Inner" and "outer" table, in particular for a nested loop join, have fairly strict definitions. We should not confuse the issue by referring to what is historically referred to as the outer table as the "inner" table in this implementation. It makes it harder for contributors to understand the implementation.

@2010YOUY01 (Contributor Author):

I see. What do you think about using only the terms left/build and right/probe in the implementation?
I think build and probe are unambiguous, and in DataFusion, left == build is a convention: it always swaps the smaller table to the left as the build side.

Contributor:

Yes, I think we should just keep it as left/build + right/probe. I can see the use of referring to it as inner/outer, but it gets confusing because in academia (as @mbutrovich says) the term 'outer table' means the table which drives the outer loop. Despite this, I've noticed that people do associate the left table with being the inner one, so removing that ambiguity is probably the best idea.

@mbutrovich (Contributor) commented Aug 12, 2025:

I believe Snowflake flips the convention and calls the left table "inner" which confuses the issue, regardless of physical join implementation. I think left/build and right/probe is consistent and fine, even though build and probe don't have as much meaning in a NLJ. I just was concerned about inner and outer getting confused particularly for a NLJ. Thanks!

@2010YOUY01 (Contributor Author):

Updated in 8cc3654. Thanks!

@mbutrovich (Contributor) left a comment:

Just feedback on the comments, but I'd like a discussion before merging. The implementation looked clean to me. Nicely done!

@jonathanc-n (Contributor)

> Thanks @2010YOUY01 I think this PR is good the way it is. Given the change I would probably ask another pair of eyes, like @jonathanc-n who is also working on join in #16660

I'll try to take a look today

@jonathanc-n (Contributor) left a comment:

This LGTM, thank you @2010YOUY01. Nice clean design!

@2010YOUY01 2010YOUY01 changed the title Rewrite Nested Loop Join executor for 3.5× speed and 1% memory usage Rewrite Nested Loop Join executor for 5× speed and 1% memory usage Aug 13, 2025
@2010YOUY01
Contributor Author

2010YOUY01 commented Aug 13, 2025

There are no new changes included. The speedup reaches 5× simply because the NLJ micro-benchmark was extended with cases where the join predicate is very cheap to evaluate (see #16996 (comment)), and such cases favor the new implementation more. I changed the title to make it more click-baity 🤣

Contributor

@mbutrovich mbutrovich left a comment

Thanks for updating the comments and variable names. This LGTM now.

@comphead
Contributor

Thanks @2010YOUY01 @mbutrovich @jonathanc-n @UBarney @ding-young for the fantastic teamwork. I'm going to merge this PR at the end of the day.

@Omega359
Contributor

Omega359 commented Aug 14, 2025

Can someone please run the extended tests on this branch? I am pretty sure the extended tests are failing against this branch, but I am having issues with git locally, so I'd like confirmation.

Edit: you'll likely have to merge the latest main to get the latest datafusion-testing pin, and update your submodules to make sure they reflect that pin. Then:

INCLUDE_SQLITE=true cargo test --profile release-nonlto --test sqllogictests

@Omega359
Contributor

Here is an example of a query that passes in main and fails on this branch:

CREATE TABLE tab0(col0 INTEGER, col1 INTEGER, col2 INTEGER);
CREATE TABLE tab1(col0 INTEGER, col1 INTEGER, col2 INTEGER);
CREATE TABLE tab2(col0 INTEGER, col1 INTEGER, col2 INTEGER);
INSERT INTO tab0 VALUES(97,1,99);
INSERT INTO tab0 VALUES(15,81,47);
INSERT INTO tab0 VALUES(87,21,10);
INSERT INTO tab1 VALUES(51,14,96);
INSERT INTO tab1 VALUES(85,5,59);
INSERT INTO tab1 VALUES(91,47,68);
INSERT INTO tab2 VALUES(64,77,40);
INSERT INTO tab2 VALUES(75,67,58);
INSERT INTO tab2 VALUES(46,51,23);

SELECT DISTINCT 32 AS col2 FROM tab0 AS cor0 LEFT OUTER JOIN tab2 AS cor1 ON ( NULL ) IS NULL;

Arrow error: Invalid argument error: must either specify a row count or at least one column

There are a bunch of other failures where the result count between main and this branch do not match. I verified a bunch of the queries do have NestedLoopJoinExec in the explain result.
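The reported error is Arrow's complaint about a RecordBatch with zero columns: `SELECT DISTINCT 32 AS col2` projects no columns from either join input, and a column-less batch cannot infer its own row count, so it must be given one explicitly (in arrow-rs this is what `RecordBatchOptions::with_row_count` is for; that API name is from arrow-rs, not this PR's code). A minimal pure-Rust model of the invariant, not arrow-rs itself:

```rust
// Illustrative model of Arrow's rule "must either specify a row count or at
// least one column": with zero columns the row count cannot be inferred.
#[derive(Debug)]
struct Batch {
    columns: Vec<Vec<i64>>, // each inner Vec is one column
    row_count: usize,
}

fn try_new_batch(columns: Vec<Vec<i64>>, explicit_rows: Option<usize>) -> Result<Batch, String> {
    // Infer the row count from the first column if there is one.
    let inferred = columns.first().map(|c| c.len());
    match inferred.or(explicit_rows) {
        Some(n) => Ok(Batch { columns, row_count: n }),
        // Zero columns and no explicit row count: the reported error.
        None => Err("must either specify a row count or at least one column".into()),
    }
}

fn main() {
    // Zero columns, no explicit row count: fails like the query above.
    assert!(try_new_batch(vec![], None).is_err());
    // Zero columns, but the join knows how many rows matched: OK.
    assert_eq!(try_new_batch(vec![], Some(3)).unwrap().row_count, 3);
    // With at least one column, the row count is inferred from it.
    assert_eq!(try_new_batch(vec![vec![10, 20]], None).unwrap().row_count, 2);
    println!("invariant holds");
}
```

So a likely failure mode for a join rewrite is producing its intermediate batches without carrying the row count through the zero-projected-columns path.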

@2010YOUY01
Contributor Author

Here is an example of a query that passes in main and fails on this branch: […] Arrow error: Invalid argument error: must either specify a row count or at least one column

Thanks for reporting, I'll work on that today. I ran the extended tests on an early version, so those failures must have been introduced by my recent changes 🤔

@2010YOUY01
Contributor Author

Run extended tests

@Omega359
Contributor

Run extended tests

That was removed recently, unfortunately.

@2010YOUY01
Contributor Author

2010YOUY01 commented Aug 14, 2025

I have run the sqlite test suite locally, and it's passing now.

yongting@Yongtings-MacBook-Pro-2 ~/C/datafusion (pr-16996)> INCLUDE_SQLITE=true cargo test --profile release-nonlto --test sqllogictests

Completed 943 test files in 4 minutes

This PR is ready to merge again.

@comphead
Contributor

@2010YOUY01 I think you can run the extended tests on your forked repo from Actions, providing a branch?

@2010YOUY01
Contributor Author

@2010YOUY01 I think you can run extended tests on your forked repo from Actions providing a branch?

Ah, yes. I forgot that pushing to a private clone's main branch can also trigger it.

I have started it:
https://github.com/2010YOUY01/arrow-datafusion/actions/runs/16970623908

@comphead
Contributor

The extended test run passed. I think it's safe to merge now, thanks everyone!

@comphead comphead merged commit 1206019 into apache:main Aug 14, 2025
27 checks passed
Labels
physical-plan Changes to the physical-plan crate sqllogictest SQL Logic Tests (.slt)
8 participants