[WIP][PYTHON][ML] Don't use pickle to save/load models in pyspark.ml.connect
#53269
base: master
dongjoon-hyun
left a comment
Thank you, @zhengruifeng .
Do you mean this is about storing and loading transient, temporary data?
no, the whole module pyspark.ml.connect is never documented, so should not affect any end users
I'm wondering if we need to keep the existing code path. Is there any chance for the existing users to try to load old data format?
allisonwang-db
left a comment
Will this introduce any behavior changes?
HyukjinKwon
left a comment
From my understanding, it should be fine. Arrow is a required dependency for Connect in ML, and the data used here is explicitly typed.
pickle itself isn't considered a stable format for storage (it changes even between minor Python versions).
I am not against adding the legacy code path, but TBH ML Connect is still sort of immature (hence not documented). cc @WeichenXu123 for a second opinion, if you have one.
@dongjoon-hyun @allisonwang-db The affected algorithms have never been documented in the API Reference. They were added in 3.5 as the first attempt to support ML on Connect, then deprecated in 4.0. (To be conservative, we didn't remove them in 4.0.) This issue was raised by the Apache security team; I guess no end users will be affected.
It sounds good to me.
BTW, please remove
What changes were proposed in this pull request?
Save/load models in pyspark.ml.connect with an Arrow-based Parquet format.
Why are the changes needed?
pickle.load can run arbitrary code and cause security issues.
Does this PR introduce any user-facing change?
No, the whole module pyspark.ml.connect was never documented, so this should not affect any end users.
How was this patch tested?
CI
Was this patch authored or co-authored using generative AI tooling?
No
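For context on the security concern stated in the description, here is a minimal sketch of how unpickling can run a callable chosen by whoever wrote the bytes. It uses only the standard library and is not code from this PR. The class name UntrustedPayload is hypothetical, and the harmless expression passed to eval stands in for whatever an attacker would actually run.

```python
import pickle


class UntrustedPayload:
    """Hypothetical class, for illustration only (not part of pyspark.ml.connect)."""

    def __reduce__(self):
        # Instead of describing object state, tell pickle to call
        # eval("6 * 7") when the bytes are loaded. A real attacker
        # would substitute a harmful callable and arguments here.
        return (eval, ("6 * 7",))


# Bytes like these could arrive in a model file from an untrusted source.
blob = pickle.dumps(UntrustedPayload())

# Loading does not rebuild an UntrustedPayload; it executes the stored call
# and returns its result.
loaded = pickle.loads(blob)
print(loaded)  # 42
```

A data-only format with an explicit schema, such as the Arrow-based Parquet files this PR proposes, stores typed values rather than instructions for reconstructing objects, so reading it does not involve calling anything named in the file.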