[SPARK-53805][SQL] Push Variant into DSv2 scan #52522
Conversation
dongjoon-hyun left a comment
+1, LGTM from my side.
cc @peter-toth, too.
```scala
      hadoopFsRelation @ HadoopFsRelation(_, _, _, _, _: ParquetFileFormat, _), _)) =>
    rewritePlan(p, projectList, filters, relation, hadoopFsRelation)
  case p @ PhysicalOperation(projectList, filters, relation: DataSourceV2Relation) =>
    rewriteV2RelationPlan(p, projectList, filters, relation.output, relation)
```
If we are already passing the relation, do we need to pass `relation.output` separately?
I overlooked this. Removed.
```scala
  SchemaPruning,
  GroupBasedRowLevelOperationScanPlanning,
  V1Writes,
  PushVariantIntoScan,
```
Now `PushVariantIntoScan` runs before `PruneFileSourcePartitions`, which I think is for v1 sources. Does this ordering matter, or was the rule just added later in the list because it was new?
I don't think variant columns will ever be used in the partition schema. Schema transformations by PushVariantIntoScan shouldn't affect partition pruning in v1 sources.
```scala
    relation @ LogicalRelationWithTable(
      hadoopFsRelation @ HadoopFsRelation(_, _, _, _, _: ParquetFileFormat, _), _)) =>
    rewritePlan(p, projectList, filters, relation, hadoopFsRelation)
  case p @ PhysicalOperation(projectList, filters, relation: DataSourceV2Relation) =>
```
Is there any code we can share between the v1 rewritePlan and the v2 rewriteV2RelationPlan?
Yes, there’s shared logic. I intentionally left the v1 rewritePlan unchanged in this PR to keep the diff small and easier to review. After this merges, I’ll do a small follow-up to have v1 rewritePlan reuse the common code. If you prefer, I can fold that refactor into this PR.
It's actually harder to review this way, as I can't tell what the key difference is between the v1 and v2 versions in the current PR...
Sorry for the confusion. I have updated the code.
The logic for transforming variant columns to structs is identical between DSv1 and DSv2; both now use the same helper methods (`collectAndRewriteVariants`, `buildAttributeMap`, `buildFilterAndProject`).
The only difference is how the transformed schema is communicated to the data source. DSv1 stores the new schema in `HadoopFsRelation.dataSchema`, and the file source reads this field directly; DSv2 has no schema field to update, so the schema is communicated later, when `V2ScanRelationPushDown` calls `pruneColumns`.
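To make the two paths concrete, here is a minimal, self-contained sketch. Every type in it is a simplified stand-in for the Spark internals named above (`HadoopFsRelation`, a scan builder with `SupportsPushDownRequiredColumns`-style `pruneColumns`), not the PR's actual code:

```scala
// Self-contained sketch; all types are simplified stand-ins, not Spark classes.
case class Field(name: String, dataType: String)
case class Schema(fields: Seq[Field])

// DSv1: the relation carries its data schema, so the variant rewrite can
// store the new struct schema directly on the relation.
case class HadoopFsRelationLike(dataSchema: Schema)

// DSv2: the relation has no schema field to update; the rewritten schema
// reaches the source later, when scan planning calls pruneColumns on the
// scan builder.
trait PruneColumnsLike {
  def pruneColumns(requiredSchema: Schema): Unit
}

object SchemaCommunicationDemo extends App {
  // variant column `v` rewritten to a struct of only the accessed fields
  val rewritten = Schema(Seq(Field("v", "struct<a:int>")))

  // v1 path: a new relation with the rewritten schema baked in
  val v1 = HadoopFsRelationLike(dataSchema = rewritten)
  println(s"v1 relation schema: ${v1.dataSchema}")

  // v2 path: schema handed over during pushdown, not stored on the relation
  val v2ScanBuilder = new PruneColumnsLike {
    def pruneColumns(requiredSchema: Schema): Unit =
      println(s"v2 scan builder received: $requiredSchema")
  }
  v2ScanBuilder.pruneColumns(rewritten)
}
```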
Merged to master. Thanks everyone for the review!

Thank you, @huaxingao and all!
…PushDownVariants

### What changes were proposed in this pull request?

This patch adds DSv2 support to the optimization rule `PushVariantIntoScan`. The `PushVariantIntoScan` rule only supports the DSv1 Parquet (`ParquetFileFormat`) source, which limits the effectiveness of the variant type on DSv2.

### Why are the changes needed?

Although #52522 recently tried to add DSv2 support, the implementation implicitly binds `pruneColumns` to this variant access pushdown, which could cause unexpected errors on DSv2 datasources that don't support it, and it breaks the API semantics. We need an explicit API between Spark and DSv2 datasources for this feature. #52522 also didn't exercise the DSv2 variant pushdown on the built-in DSv2 Parquet datasource, only on `InMemoryTable`. This patch reverts #52522 and proposes a new approach with comprehensive test coverage.

### Does this PR introduce _any_ user-facing change?

Yes. After this PR, if users enable `spark.sql.variant.pushVariantIntoScan`, variant column accesses can be pushed down into DSv2 datasources that support it.

### How was this patch tested?

Added new unit test suites `PushVariantIntoScanV2Suite` and `PushVariantIntoScanV2VectorizedSuite`.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code v2.0.13

Closes #52578 from viirya/pushvariantdsv2-pr.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
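That explicit API became `SupportsPushDownVariants` (later marked `Experimental`; see the commit message further down). Its exact shape isn't shown in this thread, so the following is only a hypothetical sketch of what an explicit, opt-in variant-pushdown mix-in could look like; the names and types are illustrative guesses, not the merged interface:

```scala
// Hypothetical sketch only; names and types are guesses, not the interface
// actually merged in #52578.

// One requested variant access, e.g. column `v`, path `$.a`, read as int.
case class VariantAccessSketch(column: String, path: String, targetType: String)

trait ScanBuilderSketch

// An explicit mix-in keeps pruneColumns' contract untouched: sources that
// don't implement it never see variant pushdown, instead of failing on a
// rewritten schema they can't handle.
trait SupportsPushDownVariantsSketch extends ScanBuilderSketch {
  // Returns the accesses this source can serve; Spark evaluates the rest.
  def pushVariantAccesses(accesses: Seq[VariantAccessSketch]): Seq[VariantAccessSketch]
}
```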
### What changes were proposed in this pull request?

Push Variant into DSv2 scan.

### Why are the changes needed?

With this change, the DSv2 scan only needs to fetch the shredded columns required by the plan.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#52522 from huaxingao/variant-v2-pushdown.

Authored-by: Huaxin Gao <[email protected]>
Signed-off-by: Huaxin Gao <[email protected]>
For the record, Apache Spark 4.1.0 RC2 received a -1 because of this whole feature, which spans three PRs including this one.
Inevitably, to unblock Apache Spark 4.1.0, we are re-evaluating the entire feature.
…tal`

### What changes were proposed in this pull request?

This PR aims to mark `SupportsPushDownVariants` as `Experimental` instead of `Evolving` in Apache Spark 4.1.x.

### Why are the changes needed?

During Apache Spark 4.1.0 RC2, it turned out that this new `Variant` improvement feature still needs more time to stabilize.

- #52522
- #52578
- #53276
- [[VOTE] Release Spark 4.1.0 (RC2)](https://lists.apache.org/thread/og4dn0g7r92qj22fdsmqoqs518k324q5)

We had better mark this interface itself as `Experimental` in Apache Spark 4.1.0 while keeping it `Evolving` in the `master` branch.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #53354 from dongjoon-hyun/SPARK-54616.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
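For reference, the change itself is just a stability-annotation swap on the interface. A minimal stand-in sketch (the dummy annotations below are local stand-ins for `org.apache.spark.annotation.Experimental` and `Evolving`, where Spark's real markers live; the real interface is in the Java connector API):

```scala
import scala.annotation.StaticAnnotation

// Local dummy markers standing in for org.apache.spark.annotation.*
class Experimental extends StaticAnnotation
class Evolving extends StaticAnnotation

// branch-4.1: flagged as not yet stable
@Experimental
trait SupportsPushDownVariants

// master keeps the usual new-API marker instead:
// @Evolving
// trait SupportsPushDownVariants
```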
What changes were proposed in this pull request?
Push Variant into DSv2 scan
Why are the changes needed?
With this change, the DSv2 scan only needs to fetch the shredded columns that the plan actually requires.
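As a hedged usage sketch of that claim (the table path and data are hypothetical, and whether a Parquet read goes through DSv2 depends on `spark.sql.sources.useV1SourceList`):

```scala
import org.apache.spark.sql.SparkSession

object VariantPushdownExample extends App {
  val spark = SparkSession.builder().master("local[*]").getOrCreate()

  // Enable the variant pushdown rule (flag name quoted earlier in the thread).
  spark.conf.set("spark.sql.variant.pushVariantIntoScan", "true")
  // Route parquet reads through DSv2 rather than the v1 file source.
  spark.conf.set("spark.sql.sources.useV1SourceList", "")

  // With pushdown, only the accessed path ($.a) of variant column `v`
  // should need to be fetched by the scan. Hypothetical path and column.
  spark.read.parquet("/tmp/variant_table")
    .selectExpr("variant_get(v, '$.a', 'int') AS a")
    .show()
}
```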
Does this PR introduce any user-facing change?
No
How was this patch tested?
New tests.
Was this patch authored or co-authored using generative AI tooling?
No