Skip to content

Conversation

@zhengruifeng
Copy link
Contributor

What changes were proposed in this pull request?

save/load models in pyspark.ml.connect with arrow-based parquet format

Why are the changes needed?

pickle.load can run arbitrary codes and cause security issues

Does this PR introduce any user-facing change?

no, the whole module pyspark.ml.connect is never documented, so should not affect any end users

How was this patch tested?

ci

Was this patch authored or co-authored using generative AI tooling?

no

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, @zhengruifeng .

Do you mean this is about storing and loading a transient and temporary data?

no, the whole module pyspark.ml.connect is never documented, so should not affect any end users

I'm wondering if we need to keep the existing code path. Is there any chance for the existing users to try to load old data format?

Copy link
Contributor

@allisonwang-db allisonwang-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this introduce any behavior changes?

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my understanding, it should be fine. Arrow is a required dependency for connect in ML, and the data being used here has the explicit type.

pickle itself isn't considered as a stable type for storage (it does change even between minor Python versions)

@HyukjinKwon
Copy link
Member

I am not against adding the legacy code path but TBH ML connect is still sort of immature (hence not documented) cc @WeichenXu123 for second opinion if you have

@zhengruifeng
Copy link
Contributor Author

@dongjoon-hyun @allisonwang-db the affected algorithms were never documented in the API Reference so far, they were added in 3.5 as the first attempt to support ml on connect, then they were deprecated in 4.0. (To be conservative, we didn't remove them in 4.0)

This issue was raised by apache security team, I guess no end users will be affected.

@dongjoon-hyun
Copy link
Member

It sounds good to me.

@dongjoon-hyun dongjoon-hyun marked this pull request as draft December 2, 2025 00:56
@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Dec 2, 2025

BTW, please remove [WIP] before you are going to land this. I just switched to Draft status to prevent accidental merging with [WIP] in the title because this PR got approval already.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants