
Conversation

petern48 (Contributor)

What changes were proposed in this pull request?

I modified the udt condition to check the first non-null element instead of the first element (which might be null).
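For orientation, a minimal sketch of the shape of the fix (the helper name and the "old behavior" code are illustrative assumptions, not the actual pyspark.pandas internals; the real diff is discussed in the review below):

```
import pandas as pd

def infer_udt_sketch(pser: pd.Series):
    # Old behavior (assumed): only the very first element was inspected,
    # so a leading None hid the UDT.
    #   if hasattr(pser.iloc[0], "__UDT__"):
    #       return pser.iloc[0].__UDT__

    # New behavior: inspect the first non-null element instead.
    notnull = pser[pser.notnull()]
    if len(notnull) > 0 and hasattr(notnull.iloc[0], "__UDT__"):
        return notnull.iloc[0].__UDT__
    return None
```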

Why are the changes needed?

```
import pyspark.pandas as ps
from pyspark.ml.linalg import SparseVector
sparse_values = {0: 0.1, 1: 1.1}
ps_series = ps.Series([None, SparseVector(1, {0: 1.2}), SparseVector(1, {0: 3})])
```

Error:

```
pyarrow.lib.ArrowInvalid: Could not convert SparseVector(1, {0: 1.2}) with type SparseVector: did not recognize Python value type when inferring an Arrow data type
```

This should work, but it fails because the first element is None.

Does this PR introduce any user-facing change?

Yes. Previously this would error; now it works properly. This is a behavior change from all previous Spark versions and should probably be backported.

How was this patch tested?

Added a test
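A sketch of what such a regression test could look like (the test name and assertion are illustrative assumptions, not necessarily the exact test added in this PR):

```
import pyspark.pandas as ps
from pyspark.ml.linalg import SparseVector

def test_infer_udt_with_null_first_element():
    # Before the fix, this construction raised pyarrow.lib.ArrowInvalid
    # because type inference only looked at the first (null) element.
    psser = ps.Series([None, SparseVector(1, {0: 1.2}), SparseVector(1, {0: 3})])
    # Sanity check: the non-null vectors survive the round trip to pandas.
    assert list(psser.to_pandas())[1] == SparseVector(1, {0: 1.2})
```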

Was this patch authored or co-authored using generative AI tooling?

No

petern48 marked this pull request as ready for review July 14, 2025 14:47
petern48 changed the title from "[SPARK-52791][PS] Fix inferring a UDT errors when first element is null" to "[SPARK-52791][PS] Fix error when inferring a UDT with a null first element" on Jul 14, 2025
petern48 (Contributor, Author)

@ueshin @HyukjinKwon

Comment on lines 365 to 367:

```
first_idx = pser.first_valid_index()
if first_idx is not None and hasattr(pser.loc[first_idx], "__UDT__"):
    return pser.loc[first_idx].__UDT__
```
Member


Good catch! But using .loc sounds dangerous, e.g.:

```
>>> pser = pd.Series([None, None, 3], index=[1, 0, 1])
>>> i = pser.first_valid_index()
>>> pser.loc[i]
1    NaN
1    3.0
dtype: float64
```

How about using notnull()?

Suggested change:

```
-first_idx = pser.first_valid_index()
-if first_idx is not None and hasattr(pser.loc[first_idx], "__UDT__"):
-    return pser.loc[first_idx].__UDT__
+notnull = pser[pser.notnull()]
+if hasattr(notnull.iloc[0], "__UDT__"):
+    return notnull.iloc[0].__UDT__
```
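For readers following along, a self-contained plain-pandas sketch replaying the reviewer's point (nothing here is Spark-specific): label-based .loc can return a Series when index labels repeat, while filtering nulls and indexing by position always yields a scalar.

```
import pandas as pd

pser = pd.Series([None, None, 3], index=[1, 0, 1])

# first_valid_index() returns a *label*; with duplicate labels,
# .loc selects every matching row and returns a Series, not a scalar.
i = pser.first_valid_index()
print(pser.loc[i])
# 1    NaN
# 1    3.0
# dtype: float64

# Filtering out nulls and indexing by *position* is unambiguous.
notnull = pser[pser.notnull()]
print(notnull.iloc[0])  # 3.0
```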

petern48 (Contributor, Author)

Nice! Accepted the change, and all tests are passing.

petern48 requested a review from ueshin July 22, 2025 16:14
@HyukjinKwon (Member)

HyukjinKwon commented Jul 23, 2025

Merged to master, branch-4.0 and branch-3.5.

petern48 deleted the fix_infer_spark_type branch July 23, 2025 04:06
petern48 (Contributor, Author)

@HyukjinKwon Can we backport this to 3.5 too?

@HyukjinKwon (Member)

👌

HyukjinKwon pushed a commit that referenced this pull request Jul 23, 2025
[SPARK-52791][PS] Fix error when inferring a UDT with a null first element

I modified the udt condition to check the first non-null element instead of the first element (which might be null).

```
import pyspark.pandas as ps
from pyspark.ml.linalg import SparseVector
sparse_values = {0: 0.1, 1: 1.1}
ps_series = ps.Series([None, SparseVector(1, {0: 1.2}), SparseVector(1, {0: 3})])
```
Error:
```
pyarrow.lib.ArrowInvalid: Could not convert SparseVector(1, {0: 1.2}) with type SparseVector: did not recognize Python value type when inferring an Arrow data type
```
This should work, but it fails because the first element is None.

Yes. Previously this would error; now it works properly. This is a behavior change from all previous Spark versions and should probably be backported.

Added a test

No

Closes #51475 from petern48/fix_infer_spark_type.

Authored-by: Peter Nguyen <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 5182eb4)
Signed-off-by: Hyukjin Kwon <[email protected]>