@ganeshashree
### What changes were proposed in this pull request?

Refactor MemoryStream to use SparkSession instead of SQLContext.

### Why are the changes needed?

SQLContext is deprecated in newer versions of Spark.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Verified that the affected tests are passing successfully.

### Was this patch authored or co-authored using generative AI tooling?

No


```scala
test("three hop pipeline") {
  val session = spark
  implicit val sparkSession: SparkSession = spark
```

Contributor: where was the previous implicit SQLContext defined?

@ganeshashree (Author) commented Sep 22, 2025:
It seems the implicit sqlContext was defined in SharedSparkSession. Explicitly defining an implicit SparkSession is required because the existing SparkSession was assigned to a non-implicit `session` variable, so the compiler could not locate an implicit SparkSession within the anonymous block.
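To illustrate the resolution issue described above, here is a minimal, Spark-free sketch (all names are hypothetical stand-ins for SparkSession and the test suite's session; this is not the actual Spark code):

```scala
// Hypothetical stand-in for SparkSession; Spark itself is not needed for the sketch.
class Session(val name: String)

object ImplicitScopeSketch {
  // Stand-in for MemoryStream.apply, which needs an implicit session in scope.
  def makeStream()(implicit session: Session): String = s"stream on ${session.name}"

  val spark: Session = new Session("spark") // like the suite-level session

  def demo(): String = {
    // Assigning the session to a plain val does NOT make it available implicitly:
    val session = spark
    // so an explicit implicit definition is required for makeStream() to compile.
    implicit val sparkSession: Session = spark
    makeStream()
  }
}
```

Without the `implicit val`, the call to `makeStream()` fails with "could not find implicit value", which matches the behavior described in the comment above.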

@cloud-fan (Contributor):

cc @HeartSaVioR

@HeartSaVioR (Contributor):

@ganeshashree
Thanks for the proposal. The change looks OK to me.

Have we checked the warning (build/log) message when SQLContext is used here? If we aren't providing a message that makes migration easy, it might be beneficial to defer replacing apply() and add an intermediate migration step (deprecate the existing methods now and remove them in Spark 5.0.0).

@HeartSaVioR HeartSaVioR changed the title [SPARK-53656][SQL] Refactor MemoryStream to use SparkSession instead of SQLContext [SPARK-53656][SS] Refactor MemoryStream to use SparkSession instead of SQLContext Sep 22, 2025
@ganeshashree (Author) commented Sep 29, 2025:
> @ganeshashree Thanks for the proposal. The change looks OK to me.
>
> Have we checked the warning (build/log) message when SQLContext is used here? If we aren't providing a message that makes migration easy, it might be beneficial to defer replacing apply() and add an intermediate migration step (deprecate the existing methods now and remove them in Spark 5.0.0).

@HeartSaVioR Thanks for reviewing. Currently, no warning appears in the build log when SQLContext is used. Creating two versions of MemoryStream.apply, one for SparkSession and one for SQLContext, with a warning for the SQLContext version, would require resolving the ambiguity that arises when both sparkSession and sqlContext are in scope as implicit variables. Since this is an internal API, please review whether it is acceptable to make this change and update the callers to use MemoryStream with an implicit SparkSession instead of SQLContext, where applicable. I am exploring further how to resolve the ambiguity by preferring SparkSession over SQLContext.

@ganeshashree force-pushed the SPARK-53656 branch 2 times, most recently from 7a8de69 to ae36d87 on October 5, 2025.
@ganeshashree (Author):

Made changes to support two versions of MemoryStream.apply, for SparkSession and SQLContext, with a warning for the SQLContext version, and also addressed the ambiguity when both sparkSession and sqlContext are in scope as implicit variables by defining a low-priority trait.
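As a rough illustration of the low-priority trait mentioned above, here is a Spark-free sketch with hypothetical names (not the actual Spark code). The idea is that Scala's overload resolution prefers an alternative defined on the object itself over one inherited from a parent trait, so the SQLContext overload can be demoted without creating an ambiguity error:

```scala
// Hypothetical stand-ins; the real code uses SparkSession and SQLContext.
class SessionLike
class ContextLike

trait LowPriorityMemoryStreamImplicits {
  // Lower-priority overload: during overload resolution, an inherited member
  // loses to a member defined directly on the object.
  def apply(implicit ctx: ContextLike): String =
    "SQLContext overload (would log a deprecation warning here)"
}

object MemoryStreamSketch extends LowPriorityMemoryStreamImplicits {
  // Preferred overload: defined on the object itself, so it wins even when
  // both an implicit SessionLike and an implicit ContextLike are in scope.
  def apply(implicit session: SessionLike): String = "SparkSession overload"
}

object Demo {
  implicit val session: SessionLike = new SessionLike
  implicit val ctx: ContextLike = new ContextLike
  // Resolves to the SparkSession overload rather than failing as ambiguous.
  val chosen: String = MemoryStreamSketch.apply
}
```

Callers that only have an implicit SQLContext in scope still compile, picking up the trait's overload, which is where a deprecation warning can be emitted.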

```scala
  override def commit(end: Offset): Unit = {}
}

object ContinuousMemoryStream {
```

Contributor: shall we do the same low priority implicit trick here?

@ganeshashree (Author):
Done.

@cloud-fan (Contributor):

@HeartSaVioR do you have any other concerns with this change?

@HeartSaVioR (Contributor) left a comment:

+1

Could you please resolve the conflict?

@ganeshashree (Author) commented Oct 20, 2025:

> Could you please resolve the conflict?

Done.

@HeartSaVioR (Contributor) left a comment:
+1 pending CI

@HeartSaVioR (Contributor):

Thanks! Merging to master.

@manuzhang (Member) commented Oct 27, 2025:

Note this is not an internal API, as Iceberg uses it in tests. Of course, we can easily change it on the caller side.

https://github.com/apache/iceberg/blob/68e555b94f4706a2af41dcb561c84007230c0bc1/spark/v4.0/spark/src/test/java/org/apache/iceberg/spark/source/TestForwardCompatibility.java#L222-L224

@HeartSaVioR (Contributor):

I understand it is a "hard-to-understand" protocol, but historically the Apache Spark project has considered classes not documented in the Scala/Java/Python docs to be non-public APIs. I was not part of that discussion/decision, but IIUC there is an established protocol for it.

@cloud-fan (Contributor):

@manuzhang we didn't remove the old method, how does it break iceberg tests?

@manuzhang (Member):

@cloud-fan which old method do you mean? The constructor has changed and that's breaking for Java code.

@ganeshashree (Author):

> @cloud-fan which old method do you mean? The constructor has changed and that's breaking for Java code.

@manuzhang Thanks for reporting this. The current changes are backward compatible for Scala but not for Java. I see that two tests in Iceberg 4.0 are breaking because of this change. Is it fine to modify those tests to use SparkSession instead of sqlContext? Please let me know if you rely on the old version of the constructor that takes a sqlContext parameter; I can consider making this change backward compatible with Java as well. However, since sqlContext is deprecated, the best practice is to use the new version of the constructor and pass a sparkSession parameter.
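One possible way to keep Java callers working, sketched here with hypothetical stand-in names (the real classes are MemoryStream, SparkSession, and SQLContext), would be to retain a deprecated auxiliary constructor that forwards to the new one:

```scala
// Hypothetical stand-ins for the real Spark types.
class SessionLike
class ContextLike(val sparkSession: SessionLike)

// The primary constructor takes the session; the deprecated auxiliary
// constructor keeps the old context-based signature compiling and forwards on.
class StreamSketch(val id: Int, val session: SessionLike) {
  @deprecated("Use the SparkSession-based constructor instead", "4.1.0")
  def this(id: Int, ctx: ContextLike) = this(id, ctx.sparkSession)
}
```

Because Scala auxiliary constructors compile to ordinary overloaded Java constructors, Java test code calling the old signature would keep compiling, merely with a deprecation warning.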

@manuzhang (Member):

@ganeshashree Yes, I've already made the change in the 4.1.0 support and the tests passed for 4.1.0-preview3 (RC1). I just want to call out that this should not be considered an internal API, especially for downstream Java projects.

@cloud-fan (Contributor) commented Oct 28, 2025:

We have a clear definition of public APIs: the APIs listed in the public doc such as https://spark.apache.org/docs/latest/api/scala/org/apache/spark/index.html are public.

Spark does not use modifiers like private[sql] to hide all of its internal APIs, so that Spark plugins can do powerful things easily without resorting to reflection. But that does not mean Spark guarantees backward compatibility for all of these compile-time-public internal APIs; that is simply not possible. Spark plugins are responsible for updating their code to keep up with changes to Spark's internal APIs.

And this is also handled case by case. If an internal API is unfortunately widely used by many Spark plugins, Spark should try its best to keep backward compatibility.

Yicong-Huang pushed a commit to Yicong-Huang/spark that referenced this pull request Oct 30, 2025
…f SQLContext

### What changes were proposed in this pull request?

Refactor MemoryStream to use SparkSession instead of SQLContext.

### Why are the changes needed?

SQLContext is deprecated in newer versions of Spark.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Verified that the affected tests are passing successfully.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#52402 from ganeshashree/SPARK-53656.

Authored-by: Ganesha S <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>