[SPARK-52823][SQL] Support DSv2 Join pushdown for Oracle connector #51519

PetarVasiljevic-DB · 2025-07-16T19:31:06Z

What changes were proposed in this pull request?

In #50921, join pushdown was added for DSv2, but it was enabled only for the H2 dialect.
This PR enables DSv2 join pushdown for the Oracle connector as well.

For this purpose, OracleDialect now has supportsJoin set to true.
SQL query generation also now uses the tableOrQuery method instead of options.tableOrQuery.

The rest of the change is test only:

Extracted pushdown util methods from V2JDBCTest to new trait V2JDBCPushdownTestUtils
Created new integration trait JDBCJoinPushdownIntegrationSuite that can be used for testing other connectors as well
Added OracleJoinPushdownIntegrationSuite as the first implementation of the trait
Changed JDBCV2JoinPushdownSuite to inherit JDBCJoinPushdownIntegrationSuite

Why are the changes needed?

Does this PR introduce any user-facing change?

Inner joins will be pushed down to the Oracle data source only if the spark.sql.optimizer.datasourceV2JoinPushdown SQL conf is set to true. Currently, the default value is false.
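As a rough sketch of how the flag might be turned on in a test or application, assuming a JDBC catalog registered under the name oracleCatalog: the connection URL and driver below are placeholders, and the per-catalog pushDownJoin option is assumed to work for Oracle the same way the H2 suite in this PR sets it.

```scala
import org.apache.spark.SparkConf

object JoinPushdownConfSketch {
  // Catalog name, URL, and driver are illustrative placeholders, not values from this PR.
  val conf: SparkConf = new SparkConf()
    .set("spark.sql.catalog.oracleCatalog",
      "org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog")
    .set("spark.sql.catalog.oracleCatalog.url", "jdbc:oracle:thin:@//localhost:1521/FREEPDB1")
    .set("spark.sql.catalog.oracleCatalog.driver", "oracle.jdbc.OracleDriver")
    // Per-catalog switch, mirroring spark.sql.catalog.h2.pushDownJoin in the H2 suite.
    .set("spark.sql.catalog.oracleCatalog.pushDownJoin", "true")
    // Global SQL conf described above; it defaults to false.
    .set("spark.sql.optimizer.datasourceV2JoinPushdown", "true")
}
```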

Previously, the Spark SQL query

SELECT t1.id, t1.name, t2.id
FROM oracleCatalog.tbl1 t1
JOIN oracleCatalog.tbl2 t2
ON t1.id = t2.id + 1

would produce the following Optimized plan:

== Optimized Logical Plan ==
Join Inner, (id#0 = (id#1 + 1))
:- Filter isnotnull(id#0)
:  +- RelationV2[id#0] oracleCatalog.tbl1
+- Filter isnotnull(id#1, name#2)
   +- RelationV2[id#1, name#2] oracleCatalog.tbl2

Now, with join pushdown enabled, the plan would be:

Project [ID_974bb0c2_a32c_4d5b_b6ee_745efa1f3a0c#3 AS id#0, ID#4 AS id#1, NAME#5 AS name#2]
+- RelationV2[ID_974bb0c2_a32c_4d5b_b6ee_745efa1f3a0c#3, ID#4, NAME#5] oracleCatalog.tbl1

When a join is pushed down, the physical plan will contain PushedJoins information, which is an array of all the joined tables. For example, in the above case it would be:

PushedJoins: [oracleCatalog.tbl1, oracleCatalog.tbl2]

The generated SQL query would be:

SELECT
    "ID_974bb0c2_a32c_4d5b_b6ee_745efa1f3a0c",
    "ID",
    "NAME"
FROM
    (
        SELECT
            "ID_974bb0c2_a32c_4d5b_b6ee_745efa1f3a0c",
            "ID",
            "NAME"
        FROM
            (
                SELECT
                    "ID_974bb0c2_a32c_4d5b_b6ee_745efa1f3a0c",
                    "ID",
                    "NAME"
                FROM
                    (
                        SELECT
                            "ID" AS "ID_974bb0c2_a32c_4d5b_b6ee_745efa1f3a0c",
                            "NAME"
                        FROM
                            "SYSTEM"."TBL1"
                        WHERE
                            ("ID" IS NOT NULL)
                    ) join_subquery_4
                    INNER JOIN (
                        SELECT
                            "ID"
                        FROM
                            "SYSTEM"."TBL2"
                        WHERE
                            ("ID" IS NOT NULL)
                    ) join_subquery_5 ON "ID_974bb0c2_a32c_4d5b_b6ee_745efa1f3a0c" = "ID"
            )
    ) SPARK_GEN_SUBQ_30

How was this patch tested?

New tests.

Was this patch authored or co-authored using generative AI tooling?

sql/core/src/main/scala/org/apache/spark/sql/jdbc/OracleDialect.scala

andrej-db · 2025-07-17T08:11:49Z

...ests/src/test/scala/org/apache/spark/sql/jdbc/v2/join/JDBCJoinPushdownIntegrationSuite.scala

+      checkAnswer(df, rows)
+    }
+  }
+}


can we have an ANTI JOIN as well?
or is that unsupported?

anti join is not supported for pushdown, but yes, Spark has anti join

urosstan-db · 2025-07-17T15:09:57Z

sql/core/src/test/scala/org/apache/spark/sql/jdbc/v2/V2JDBCPushdownTestUtils.scala

+import org.apache.spark.sql.execution.datasources.v2.{DataSourceV2ScanRelation, V1ScanWrapper}
+import org.apache.spark.sql.internal.SQLConf
+
+trait V2JDBCPushdownTestUtils extends ExplainSuiteHelper {


What do you think about DataSourcePushdownTestUtils?

Out of v2 folder, in connector folder

I've put it in sql/connector

sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcSQLQueryBuilder.scala

urosstan-db · 2025-07-17T15:17:45Z

connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala

-    assert(aggregates.isEmpty)
-  }
-
-  private def checkAggregatePushed(df: DataFrame, funcName: String): Unit = {


Good change

urosstan-db · 2025-07-17T15:19:59Z

sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2JoinPushdownSuite.scala

 import org.apache.spark.sql.test.SharedSparkSession
 import org.apache.spark.util.Utils

-class JDBCV2JoinPushdownSuite extends QueryTest with SharedSparkSession with ExplainSuiteHelper {
+class JDBCV2JoinPushdownSuite


It is confusing to have jdbc.JDBCV2JoinPushdownSuite and v2.JDBCJoinPushdownIntegrationSuite. Can we name them somehow better?

sql/core/src/test/scala/org/apache/spark/sql/jdbc/v2/JDBCJoinPushdownIntegrationSuite.scala

urosstan-db · 2025-07-17T15:25:12Z

sql/core/src/test/scala/org/apache/spark/sql/jdbc/v2/JDBCJoinPushdownIntegrationSuite.scala

+  def qualifyTableName(tableName: String): String = namespaceOpt
+    .map(namespace => s"$namespace.$tableName").getOrElse(tableName)
+
+  private val fullyQualifiedTableName1: String = qualifyTableName(joinTableName1)


If someone overrides qualifyTableName, our val would be calculated during initialization of the object, so would it use the overridden method?

it might take the base method, so I changed it to lazy instead
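A minimal sketch of the initialization-order concern, using hypothetical names rather than the suite's real members. In Scala, a strict val in a trait is evaluated while the trait is initialized, which happens before the body of the implementing class runs, so values supplied by the subclass may not be set yet. A lazy val defers evaluation to first access.

```scala
// Hypothetical stand-in for the suite under review; none of these names come from the PR.
trait TableNames {
  def namespace: String // supplied by the concrete suite

  def qualify(table: String): String = s"$namespace.$table"

  // Strict val: evaluated during trait initialization, before subclass fields are assigned.
  val eagerName: String = qualify("tbl1")

  // Lazy val: evaluated on first access, after the instance is fully constructed.
  lazy val lazyName: String = qualify("tbl1")
}

class OracleNames extends TableNames {
  // Assigned in the class body, which runs after the trait has been initialized.
  override val namespace: String = "SYSTEM"
}

object InitOrderDemo {
  def main(args: Array[String]): Unit = {
    val names = new OracleNames
    println(names.eagerName) // null.tbl1
    println(names.lazyName)  // SYSTEM.tbl1
  }
}
```

The strict val captures the namespace before it is assigned, which is why switching the field to lazy sidesteps the problem.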

urosstan-db · 2025-07-17T15:28:10Z

sql/core/src/test/scala/org/apache/spark/sql/jdbc/v2/JDBCJoinPushdownIntegrationSuite.scala

+
+  protected def caseConvert(tableName: String): String = tableName
+
+  protected def withConnection[T](f: Connection => T): T = {


Can we make this suite more generic and decouple it from JDBC, so that non-JDBC connectors can reuse it?

Maybe some class hierarchy like:

JoinPushdownIntegrationSuiteBase

JDBCJoinPushdownIntegrationSuiteBase extends JoinPushdownIntegrationSuiteBase

OracleJoinPushdownIntegrationSuiteBase extends JDBCJoinPushdownIntegrationSuiteBase

I would do it in a separate PR if it's fine with you.
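For illustration only, the hierarchy suggested above could look roughly like the sketch below. Only the three type names come from the review comment; the members, and the idea of keeping connection details in the JDBC layer, are assumptions.

```scala
// Connector-agnostic layer: nothing here refers to JDBC.
trait JoinPushdownIntegrationSuiteBase {
  def catalogName: String
  def namespaceOpt: Option[String]

  // Qualifies a table name with the namespace when one is defined.
  def qualifyTableName(tableName: String): String =
    namespaceOpt.map(ns => s"$ns.$tableName").getOrElse(tableName)
}

// JDBC layer: adds what only JDBC-backed connectors need (assumed to be the URL).
trait JDBCJoinPushdownIntegrationSuiteBase extends JoinPushdownIntegrationSuiteBase {
  def url: String
}

// Oracle layer: fixes the dialect-specific defaults; the values are placeholders.
trait OracleJoinPushdownIntegrationSuiteBase extends JDBCJoinPushdownIntegrationSuiteBase {
  override def catalogName: String = "oracle"
  override def namespaceOpt: Option[String] = Some("SYSTEM")
}

object HierarchySketch extends OracleJoinPushdownIntegrationSuiteBase {
  override def url: String = "jdbc:oracle:thin:@//localhost:1521/FREEPDB1" // placeholder

  def main(args: Array[String]): Unit =
    println(qualifyTableName("TBL1")) // SYSTEM.TBL1
}
```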

urosstan-db · 2025-07-17T15:29:18Z

sql/core/src/test/scala/org/apache/spark/sql/jdbc/v2/JDBCJoinPushdownIntegrationSuite.scala

+    val random = new java.util.Random(42)
+    val table1Data = (1 to 100).map { i =>
+      val id = i % 11
+      val amount = BigDecimal.valueOf(random.nextDouble() * 10000)
+        .setScale(2, BigDecimal.RoundingMode.HALF_UP)
+      val address = s"address_$i"
+      (id, amount, address)
+    }
+    val table2Data = (1 to 100).map { i =>
+      val id = (i % 17)
+      val next_id = (id + 1) % 17
+      val salary = BigDecimal.valueOf(random.nextDouble() * 50000)
+        .setScale(2, BigDecimal.RoundingMode.HALF_UP)
+      val surname = s"surname_$i"
+      (id, next_id, salary, surname)
+    }


Can this cause flakiness?

Is 42 the seed of the Random function?

42 is the seed, so it shouldn't be flaky
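A small sketch of why a fixed seed keeps the generated data stable: two java.util.Random instances created with the same seed produce the same sequence of values. The helper name amounts is hypothetical; the rounding mirrors the snippet under review.

```scala
import java.util.Random

object SeededRandomDemo {
  // Generates n amounts rounded to two decimals, as in the test data above.
  def amounts(seed: Long, n: Int): Seq[BigDecimal] = {
    val random = new Random(seed)
    (1 to n).map { _ =>
      BigDecimal.valueOf(random.nextDouble() * 10000)
        .setScale(2, BigDecimal.RoundingMode.HALF_UP)
    }
  }

  def main(args: Array[String]): Unit = {
    // Same seed, same sequence, so both runs build identical rows.
    println(amounts(42L, 100) == amounts(42L, 100)) // true
  }
}
```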

cloud-fan · 2025-07-18T03:33:45Z

sql/core/src/test/scala/org/apache/spark/sql/jdbc/v2/JDBCV2JoinPushdownSuite.scala

+  override val url = s"jdbc:h2:${tempDir.getCanonicalPath};user=testUser;password=testPass"
+
+  override val catalogName: String = "h2"
+  override val namespaceOpt: Option[String] = Some("test")


why is it an opt? most dialects must have a schema, right?

cloud-fan · 2025-07-18T03:35:49Z

sql/core/src/test/scala/org/apache/spark/sql/jdbc/v2/JDBCV2JoinPushdownSuite.scala

+    .set("spark.sql.catalog.h2.pushDownAggregate", "true")
+    .set("spark.sql.catalog.h2.pushDownLimit", "true")
+    .set("spark.sql.catalog.h2.pushDownOffset", "true")
+    .set("spark.sql.catalog.h2.pushDownJoin", "true")


shall we move most of the conf settings to the parent class? then here we only need

override def sparkConf: SparkConf = super.sparkConf.set("spark.sql.catalog.h2.driver", "org.h2.Driver")

cloud-fan · 2025-07-18T03:36:28Z

sql/core/src/test/scala/org/apache/spark/sql/jdbc/v2/JDBCV2JoinPushdownSuite.scala

+  val tempDir = Utils.createTempDir()
+  override val url = s"jdbc:h2:${tempDir.getCanonicalPath};user=testUser;password=testPass"
+
+  override val catalogName: String = "h2"


seems the catalog name doesn't matter for the test cases, shall we just hardcode jdbc_test in the parent class?

support join pushdown for oracle

5f00457

github-actions bot added the SQL label Jul 16, 2025

andrej-db approved these changes Jul 17, 2025

refactor tests

ff6d2e2