Skip to content

Conversation

@zifeif2
Copy link

@zifeif2 zifeif2 commented Dec 2, 2025

What changes were proposed in this pull request?

Introducing StatePartitionAllColumnFamiliesWriter as part of the offline repartition project. In this PR, we only support a single-column-family operator.

This writer takes the repartitioned DataFrame returned from StatePartitionAllColumnFamiliesReader and writes it to a new version in the state store. See the comments for the DataFrame schema. In addition, this writer does not load previous state (since we are overwriting the state with the repartitioned data), and when committing, it will always commit a snapshot.

Major Changes

  • Introduce StatePartitionAllColumnFamiliesWriter
  • Introduce a new parameter loadEmpty for StateStoreProvider.getStore()
  • Introduce a new function loadEmpty for RocksDB

Why are the changes needed?

This will be used in offline repartitioning to allow OfflineRepartitioningRunner to directly write data to state store

Does this PR introduce any user-facing change?

No

How was this patch tested?

Integration tests in sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/state/StatePartitionAllColumnFamiliesWriterSuite.scala
Unit tests in sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/RocksDBSuite.scala

Was this patch authored or co-authored using generative AI tooling?

Yes. Sonnet 4.5

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant