[SPARK-52791][PS] Fix error when inferring a UDT with a null first element #51475
Conversation
Code under review:

```python
first_idx = pser.first_valid_index()
if first_idx is not None and hasattr(pser.loc[first_idx], "__UDT__"):
    return pser.loc[first_idx].__UDT__
```
Good catch! But using `.loc` sounds dangerous, e.g.:

```python
>>> pser = pd.Series([None, None, 3], index=[1, 0, 1])
>>> i = pser.first_valid_index()
>>> pser.loc[i]
1    NaN
1    3.0
dtype: float64
```

How about using `notnull()`?
Suggested change:

```diff
-first_idx = pser.first_valid_index()
-if first_idx is not None and hasattr(pser.loc[first_idx], "__UDT__"):
-    return pser.loc[first_idx].__UDT__
+notnull = pser[pser.notnull()]
+if hasattr(notnull.iloc[0], "__UDT__"):
+    return notnull.iloc[0].__UDT__
```
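To make the concern concrete, here is a minimal, self-contained sketch (plain pandas, no Spark needed) of the duplicate-index pitfall and why the `notnull()` version avoids it:

```python
# Sketch of the reviewer's example: with duplicate index labels,
# .loc[first_valid_index()] can return a Series rather than a scalar.
import pandas as pd

pser = pd.Series([None, None, 3], index=[1, 0, 1])

first_idx = pser.first_valid_index()  # label 1, which appears twice
print(type(pser.loc[first_idx]))      # <class 'pandas.core.series.Series'>

# The suggested fix is positional, so it always yields a single element.
notnull = pser[pser.notnull()]
print(type(notnull.iloc[0]))          # a scalar (numpy.float64)
```

Because `hasattr` would then be probing a Series rather than the element itself, the `.loc`-based check could silently misbehave with duplicate labels; the positional `iloc[0]` on the filtered Series sidesteps label semantics entirely.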
Nice! Accepted the change, and all tests are passing.
Co-authored-by: Takuya UESHIN <[email protected]>
Merged to master, branch-4.0 and branch-3.5.
@HyukjinKwon Can we backport this to 3.5 too?
👌 |
Cherry-pick commit record:

[SPARK-52791][PS] Fix error when inferring a UDT with a null first element

Closes #51475 from petern48/fix_infer_spark_type.

Authored-by: Peter Nguyen <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 5182eb4)
Signed-off-by: Hyukjin Kwon <[email protected]>
What changes were proposed in this pull request?
I modified the UDT condition to check the first non-null element instead of the first element (which might be null).
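As a simplified illustration of the change (the standalone helper below is hypothetical; the actual fix lives in pyspark.pandas' type-inference code shown in the diff above):

```python
# Hypothetical, self-contained sketch of the before/after logic.
import pandas as pd

def infer_udt(pser: pd.Series):
    # Before (sketch): inspecting only the first element misses the UDT
    # whenever that element is null:
    #     if hasattr(pser.iloc[0], "__UDT__"):
    #         return pser.iloc[0].__UDT__
    # After: inspect the first *non-null* element instead.
    notnull = pser[pser.notnull()]
    if len(notnull) > 0 and hasattr(notnull.iloc[0], "__UDT__"):
        return notnull.iloc[0].__UDT__
    return None
```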
Why are the changes needed?
Running the following:

```python
import pyspark.pandas as ps
from pyspark.ml.linalg import SparseVector

sparse_values = {0: 0.1, 1: 1.1}
ps_series = ps.Series([None, SparseVector(1, {0: 1.2}), SparseVector(1, {0: 3})])
```

fails with:

```
pyarrow.lib.ArrowInvalid: Could not convert SparseVector(1, {0: 1.2}) with type SparseVector: did not recognize Python value type when inferring an Arrow data type
```

This should work as normal, but it fails because the first element is None.
Does this PR introduce any user-facing change?
Yes. Previously it would error, but now it works properly. This is a behavior change from all previous Spark versions, and should probably be backported.
How was this patch tested?
Added a test.
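For reference, a minimal sketch of the kind of regression test this implies (the test name and assertion are illustrative, not the actual test added by this PR):

```python
# Hypothetical test sketch; the PR's real test may differ in name and location.
import pyspark.pandas as ps
from pyspark.ml.linalg import SparseVector

def test_infer_udt_with_null_first_element():
    # Before the fix, constructing this Series raised pyarrow.lib.ArrowInvalid
    # because type inference inspected only the (null) first element.
    psser = ps.Series([None, SparseVector(1, {0: 1.2}), SparseVector(1, {0: 3})])
    assert psser.to_pandas()[2] == SparseVector(1, {0: 3})
```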
Was this patch authored or co-authored using generative AI tooling?
No