Skip to content

Conversation

feilong-liu
Copy link
Contributor

@feilong-liu feilong-liu commented Oct 19, 2025

Description

The current MergeJoinForSortedInputOptimizer rule does sort merge join only when both inputs are sorted. In this PR, I made the following changes:

  • Make the optimizer to work only for native execution, as java does not have merge join support in worker side
  • Have the optimizer to work for all types of joins rather than just inner join, as velox supports all types of merge joins
  • Do sort merge join if one side is sorted, and add sort on the other side, when the query is running in presto on spark
  • In GroupedExecutionTagger, do not fail for merge join node when grouped execution is not available for presto on spark. This is because presto on spark spawn as many tasks as the number of partitions of data. When one side is bucketed, the other side even not bucketed, the join will get as many tasks as the number of buckets, hence still equivalent to a bucket by bucket execution

Motivation and Context

Sort merge join for more cases

Impact

Improve performance

Test Plan

Unit tests, and local end to end tests

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.
  • If adding new dependencies, verified they have an OpenSSF Scorecard score of 5.0 or higher (or obtained explicit TSC approval for lower scores).

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* Improve  `MergeJoinForSortedInputOptimizer` to do sort merge join when one side of the input is sorted

@prestodb-ci prestodb-ci added the from:Meta PR from Meta label Oct 19, 2025
@feilong-liu feilong-liu marked this pull request as draft October 19, 2025 06:36
@sourcery-ai
Copy link
Contributor

sourcery-ai bot commented Oct 19, 2025

Reviewer's Guide

Extends the MergeJoinForSortedInputOptimizer to run under native execution and Presto on Spark, support all join types, and enable merge join when only one side is sorted by introducing a prestoOnSpark flag, injecting sorts dynamically, propagating the flag throughout the planner and execution components, updating grouped execution logic, and broadening test coverage.

Sequence diagram for enhanced sort merge join optimization in Presto on Spark

sequenceDiagram
    participant "Query Planner"
    participant "MergeJoinForSortedInputOptimizer"
    participant "PlanNode"
    participant "SortNode"
    participant "MergeJoinNode"
    participant "Execution Engine"
    "Query Planner"->>"MergeJoinForSortedInputOptimizer": Optimize JoinNode
    "MergeJoinForSortedInputOptimizer"->>"PlanNode": Check if left/right input is sorted
    alt Only one side is sorted and prestoOnSpark is true
        "MergeJoinForSortedInputOptimizer"->>"SortNode": Inject sort on unsorted side
    end
    "MergeJoinForSortedInputOptimizer"->>"MergeJoinNode": Replace JoinNode with MergeJoinNode
    "MergeJoinNode"->>"Execution Engine": Execute merge join
Loading

ER diagram for new and updated planner flags and their propagation

erDiagram
    PlanFragmenter {
        bool isPrestoOnSpark
    }
    IterativePlanFragmenter {
        bool isPrestoOnSpark
    }
    PlanFragmenterUtils {
        bool isPrestoOnSpark
    }
    GroupedExecutionTagger {
        bool isPrestoOnSpark
    }
    PlanFragmenter ||--o| PlanFragmenterUtils : uses
    IterativePlanFragmenter ||--o| PlanFragmenterUtils : uses
    PlanFragmenterUtils ||--o| GroupedExecutionTagger : propagates
    PlanFragmenterUtils ||--o| MergeJoinForSortedInputOptimizer : propagates
    MergeJoinForSortedInputOptimizer {
        bool prestoOnSpark
    }
Loading

Class diagram for updated MergeJoinForSortedInputOptimizer and related planner classes

classDiagram
    class MergeJoinForSortedInputOptimizer {
        - Metadata metadata
        - boolean nativeExecution
        - boolean prestoOnSpark
        - boolean isEnabledForTesting
        + MergeJoinForSortedInputOptimizer(metadata, nativeExecution, prestoOnSpark)
        + boolean isEnabled(Session session)
        + PlanOptimizerResult optimize(PlanNode plan, Session session, TypeProvider types, VariableAllocator variableAllocator, PlanNodeIdAllocator idAllocator)
    }
    class Rewriter {
        - PlanNodeIdAllocator idAllocator
        - Metadata metadata
        - Session session
        - boolean prestoOnSpark
        - boolean planChanged
        + Rewriter(idAllocator, metadata, session, prestoOnSpark)
        + boolean isPlanChanged()
        + PlanNode visitJoin(JoinNode node, RewriteContext<Void> context)
        + boolean isPlanOutputSortedByColumns(PlanNode plan, List<VariableReferenceExpression> columns)
    }
    MergeJoinForSortedInputOptimizer o-- Rewriter
    class GroupedExecutionTagger {
        - Metadata metadata
        - NodePartitioningManager nodePartitioningManager
        - boolean groupedExecutionEnabled
        - boolean isPrestoOnSpark
        + GroupedExecutionTagger(Session session, Metadata metadata, NodePartitioningManager nodePartitioningManager, boolean groupedExecutionEnabled, boolean isPrestoOnSpark)
    }
    class PlanFragmenter {
        - boolean isPrestoOnSpark
    }
    class IterativePlanFragmenter {
        - boolean isPrestoOnSpark
    }
    class PlanFragmenterUtils {
        + static SubPlan finalizeSubPlan(..., boolean isPrestoOnSpark)
    }
    MergeJoinForSortedInputOptimizer <.. PlanOptimizers
    PlanFragmenter <.. PlanFragmenterUtils
    IterativePlanFragmenter <.. PlanFragmenterUtils
    GroupedExecutionTagger <.. PlanFragmenterUtils
Loading

File-Level Changes

Change Details Files
Enhance MergeJoinForSortedInputOptimizer for one-sided sorted inputs and all join types
  • Add prestoOnSpark flag and constructor parameter
  • Update isEnabled() to require nativeExecution and allow prestoOnSpark
  • Remove inner-only restriction, derive sortedness for each input
  • Inject SortNode on the unsorted side when running Presto on Spark
  • Create MergeJoinNode with optional hash variables
presto-main-base/src/main/java/com/facebook/presto/sql/planner/optimizations/MergeJoinForSortedInputOptimizer.java
Allow merge join under Presto on Spark in GroupedExecutionTagger
  • Introduce isPrestoOnSpark field
  • Bypass failure on merge join node when grouped execution disabled in Presto on Spark
presto-main-base/src/main/java/com/facebook/presto/sql/planner/GroupedExecutionTagger.java
Propagate isPrestoOnSpark flag across plan fragmentation
  • Add boolean isPrestoOnSpark parameter to finalizeSubPlan and analyzeGroupedExecution
  • Update PlanFragmenterUtils, PlanFragmenter, IterativePlanFragmenter, PrestoSparkAdaptiveQueryExecution constructors and calls
presto-main-base/src/main/java/com/facebook/presto/sql/planner/PlanFragmenterUtils.java
presto-main-base/src/main/java/com/facebook/presto/sql/planner/PlanFragmenter.java
presto-spark-base/src/main/java/com/facebook/presto/spark/planner/IterativePlanFragmenter.java
presto-spark-base/src/main/java/com/facebook/presto/spark/execution/PrestoSparkAdaptiveQueryExecution.java
Instantiate MergeJoinForSortedInputOptimizer with Presto-Spark flag
  • Pass featuresConfig.isPrestoSparkExecutionEnvironment() into constructor
presto-main-base/src/main/java/com/facebook/presto/sql/planner/PlanOptimizers.java
Expand and adjust tests for Presto on Spark merge join
  • Update existing Hive tests for left/right/full join mergeJoinEnabled flags
  • Add TestMergeJoinPlanPrestoOnSpark for various join/sort/bucket scenarios
  • Modify TestIterativePlanFragmenter to include prestoOnSpark flag
presto-hive/src/test/java/com/facebook/presto/hive/TestMergeJoinPlan.java
presto-hive/src/test/java/com/facebook/presto/hive/TestMergeJoinPlanPrestoOnSpark.java
presto-spark-base/src/test/java/com/facebook/presto/spark/planner/TestIterativePlanFragmenter.java

Possibly linked issues


Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

Copy link
Contributor

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey there - I've reviewed your changes - here's some feedback:

  • Rename the typo in ConstatnCheck to ConstantCheck to improve readability and consistency.
  • The new constant‐check visitor only handles Project, Filter, and TableScan nodes—either document this limitation or extend it to explicitly handle other PlanNode types.
  • Threading the isPrestoOnSpark flag through many methods makes signatures verbose—consider bundling environment flags into a single context object to simplify method parameters.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- Rename the typo in `ConstatnCheck` to `ConstantCheck` to improve readability and consistency.
- The new constant‐check visitor only handles Project, Filter, and TableScan nodes—either document this limitation or extend it to explicitly handle other PlanNode types.
- Threading the `isPrestoOnSpark` flag through many methods makes signatures verbose—consider bundling environment flags into a single context object to simplify method parameters.

## Individual Comments

### Comment 1
<location> `presto-main-base/src/main/java/com/facebook/presto/sql/planner/optimizations/SortMergeJoinOptimizer.java:106` </location>
<code_context>
+                && !joinNode.getCriteria().isEmpty();
+    }
+
+    private static class ConstatnCheck
+            extends InternalPlanVisitor<Boolean, Set<VariableReferenceExpression>>
+    {
</code_context>

<issue_to_address>
**nitpick (typo):** Typo in class name 'ConstatnCheck' should be 'ConstantCheck'.

Please rename 'ConstatnCheck' to 'ConstantCheck' for consistency.

Suggested implementation:

```java
    private static class ConstantCheck
            extends InternalPlanVisitor<Boolean, Set<VariableReferenceExpression>>
    {

```

If there are any usages of `ConstatnCheck` elsewhere in this file (e.g., instantiations or references), they should also be renamed to `ConstantCheck`.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

@feilong-liu feilong-liu changed the title Fix rule to do sort merge join fix: Fix rule to do sort merge join Oct 19, 2025
@feilong-liu feilong-liu force-pushed the sort_merge branch 3 times, most recently from ac27911 to a225204 Compare October 22, 2025 17:04
@feilong-liu feilong-liu changed the title fix: Fix rule to do sort merge join fix: Fix sort merge join rule to add sort only when input is not sorted Oct 22, 2025
@feilong-liu feilong-liu force-pushed the sort_merge branch 2 times, most recently from 775c049 to a08cb35 Compare October 22, 2025 21:28
sourcery-ai[bot]
sourcery-ai bot previously requested changes Oct 22, 2025
Copy link
Contributor

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New security issues found

// merge join can't be enabled
assertPlan(
mergeJoinEnabled(),
"select * from test_join_customer2 join test_join_order2 on test_join_customer2.custkey = test_join_order2.custkey",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security (generic-api-key): Detected a Generic API Key, potentially exposing access to various services and sensitive operations.

Source: gitleaks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this custkey somehow trigger this gitleaks check

@feilong-liu feilong-liu force-pushed the sort_merge branch 3 times, most recently from ac32798 to 4ce55e7 Compare October 22, 2025 21:56
@feilong-liu feilong-liu changed the title fix: Fix sort merge join rule to add sort only when input is not sorted fix: enhance sort merge join rule to do merge join when one sided is sorted Oct 22, 2025
@feilong-liu feilong-liu dismissed sourcery-ai[bot]’s stale review October 22, 2025 22:40

The key in the text is just a test text

@Override
protected FeaturesConfig createFeaturesConfig()
{
return new FeaturesConfig().setNativeExecutionEnabled(true);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Set the test environment to be native execution

mergeJoinEnabled(),
"select * from test_join_customer_join_type left join test_join_order_join_type on test_join_customer_join_type.custkey = test_join_order_join_type.custkey",
joinPlan("test_join_customer_join_type", "test_join_order_join_type", ImmutableList.of("custkey"), ImmutableList.of("custkey"), LEFT, false));
joinPlan("test_join_customer_join_type", "test_join_order_join_type", ImmutableList.of("custkey"), ImmutableList.of("custkey"), LEFT, true));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now not only inner joins are supported

@Override
protected FeaturesConfig createFeaturesConfig()
{
return new FeaturesConfig().setNativeExecutionEnabled(true).setPrestoSparkExecutionEnvironment(true);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test for presto on spark environment

// merge join can't be enabled
assertPlan(
mergeJoinEnabled(),
"select * from test_join_customer2 join test_join_order2 on test_join_customer2.custkey = test_join_order2.custkey",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this custkey somehow trigger this gitleaks check

return GroupedExecutionTagger.GroupedExecutionProperties.notCapable();
}

if (isPrestoOnSpark) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When presto on spark is the environment, if one side is grouped execution capable, even if the other side is not, the data will still be partitioned by buckets, and have a bucket by bucket execution in presto on spark

public boolean isEnabled(Session session)
{
return isEnabledForTesting || isGroupedExecutionEnabled(session) && preferMergeJoinForSortedInputs(session) && !isSingleNodeExecutionEnabled(session);
return isEnabledForTesting || nativeExecution && (isGroupedExecutionEnabled(session) || prestoOnSpark) && preferMergeJoinForSortedInputs(session) && !isSingleNodeExecutionEnabled(session);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. only run for nativeExecution
  2. if presto on spark, do not need group execution enabled

boolean leftInputSorted = isPlanOutputSortedByColumns(rewrittenLeft, node.getCriteria().stream().map(EquiJoinClause::getLeft).collect(toImmutableList()));
boolean rightInputSorted = isPlanOutputSortedByColumns(rewrittenRight, node.getCriteria().stream().map(EquiJoinClause::getRight).collect(toImmutableList()));

if ((!leftInputSorted && !rightInputSorted) || (!prestoOnSpark && (!leftInputSorted || !rightInputSorted))) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Do not do merge join is no input is sorted
  2. If only one side is sorted, do not do merge join if not presto on spark

@feilong-liu feilong-liu marked this pull request as ready for review October 22, 2025 22:48
sourcery-ai[bot]
sourcery-ai bot previously requested changes Oct 22, 2025
Copy link
Contributor

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey there - I've reviewed your changes - here's some feedback:

Blocking issues:

  • Detected a Generic API Key, potentially exposing access to various services and sensitive operations. (link)
  • Detected a Generic API Key, potentially exposing access to various services and sensitive operations. (link)
  • Detected a Generic API Key, potentially exposing access to various services and sensitive operations. (link)
  • Detected a Generic API Key, potentially exposing access to various services and sensitive operations. (link)
  • Detected a Generic API Key, potentially exposing access to various services and sensitive operations. (link)

General comments:

  • The prestoOnSpark flag is propagated through many constructors and static methods; consider centralizing that flag in a shared execution context or reading it directly from session/FeaturesConfig to reduce boilerplate and improve maintainability.
  • The inversion logic in isPlanOutputSortedByColumns around LocalProperties.match(...) can be confusing—extract it into a clearly named helper (e.g. isSortedBy(…)) or invert the matcher itself to make the intent more readable.
  • When adding on‐demand SortNodes for one‐sided sorting, ensure that redundant or no‐op sorts are pruned downstream or guarded against here, so you don’t introduce unnecessary sort operators into the plan.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The `prestoOnSpark` flag is propagated through many constructors and static methods; consider centralizing that flag in a shared execution context or reading it directly from session/FeaturesConfig to reduce boilerplate and improve maintainability.
- The inversion logic in `isPlanOutputSortedByColumns` around `LocalProperties.match(...)` can be confusing—extract it into a clearly named helper (e.g. `isSortedBy(…)`) or invert the matcher itself to make the intent more readable.
- When adding on‐demand `SortNode`s for one‐sided sorting, ensure that redundant or no‐op sorts are pruned downstream or guarded against here, so you don’t introduce unnecessary sort operators into the plan.

## Individual Comments

### Comment 1
<location> `presto-hive/src/test/java/com/facebook/presto/hive/TestMergeJoinPlan.java:90-93` </location>
<code_context>
             assertPlan(
                     mergeJoinEnabled(),
                     "select * from test_join_customer_join_type left join test_join_order_join_type on test_join_customer_join_type.custkey = test_join_order_join_type.custkey",
-                    joinPlan("test_join_customer_join_type", "test_join_order_join_type", ImmutableList.of("custkey"), ImmutableList.of("custkey"), LEFT, false));
+                    joinPlan("test_join_customer_join_type", "test_join_order_join_type", ImmutableList.of("custkey"), ImmutableList.of("custkey"), LEFT, true));

             // Right join
</code_context>

<issue_to_address>
**suggestion (testing):** Updated assertions to check for merge join enablement for all join types.

Also, please add tests to verify that merge join is not enabled when inputs are unsorted.
</issue_to_address>

### Comment 2
<location> `presto-hive/src/test/java/com/facebook/presto/hive/TestMergeJoinPlanPrestoOnSpark.java:178` </location>
<code_context>
test_join_order2.custkey
</code_context>

<issue_to_address>
**security (generic-api-key):** Detected a Generic API Key, potentially exposing access to various services and sensitive operations.

*Source: gitleaks*
</issue_to_address>

### Comment 3
<location> `presto-hive/src/test/java/com/facebook/presto/hive/TestMergeJoinPlanPrestoOnSpark.java:210` </location>
<code_context>
test_join_order3.custkey
</code_context>

<issue_to_address>
**security (generic-api-key):** Detected a Generic API Key, potentially exposing access to various services and sensitive operations.

*Source: gitleaks*
</issue_to_address>

### Comment 4
<location> `presto-hive/src/test/java/com/facebook/presto/hive/TestMergeJoinPlanPrestoOnSpark.java:242` </location>
<code_context>
test_join_order4.custkey
</code_context>

<issue_to_address>
**security (generic-api-key):** Detected a Generic API Key, potentially exposing access to various services and sensitive operations.

*Source: gitleaks*
</issue_to_address>

### Comment 5
<location> `presto-hive/src/test/java/com/facebook/presto/hive/TestMergeJoinPlanPrestoOnSpark.java:277` </location>
<code_context>
test_join_order5.custkey
</code_context>

<issue_to_address>
**security (generic-api-key):** Detected a Generic API Key, potentially exposing access to various services and sensitive operations.

*Source: gitleaks*
</issue_to_address>

### Comment 6
<location> `presto-hive/src/test/java/com/facebook/presto/hive/TestMergeJoinPlanPrestoOnSpark.java:277` </location>
<code_context>
test_join_order5.orderkey
</code_context>

<issue_to_address>
**security (generic-api-key):** Detected a Generic API Key, potentially exposing access to various services and sensitive operations.

*Source: gitleaks*
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

from:Meta PR from Meta

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants