
Conversation

petern48 (Contributor)

What changes were proposed in this pull request?

I modified the udt condition to check the first non-null element instead of the first element (which might be null).
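For orientation, a minimal sketch of the shape of the fix (the helper name and the "old behavior" code are illustrative assumptions, not the actual pyspark.pandas internals; the real diff is discussed in the review below):

```
import pandas as pd

def infer_udt_sketch(pser: pd.Series):
    # Old behavior (assumed): only the very first element was inspected,
    # so a leading None hid the UDT.
    #   if hasattr(pser.iloc[0], "__UDT__"):
    #       return pser.iloc[0].__UDT__

    # New behavior: inspect the first non-null element instead.
    notnull = pser[pser.notnull()]
    if len(notnull) > 0 and hasattr(notnull.iloc[0], "__UDT__"):
        return notnull.iloc[0].__UDT__
    return None
```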

Why are the changes needed?

```
import pyspark.pandas as ps
from pyspark.ml.linalg import SparseVector
sparse_values = {0: 0.1, 1: 1.1}
ps_series = ps.Series([None, SparseVector(1, {0: 1.2}), SparseVector(1, {0: 3})])
```

Error:

```
pyarrow.lib.ArrowInvalid: Could not convert SparseVector(1, {0: 1.2}) with type SparseVector: did not recognize Python value type when inferring an Arrow data type
```

This should work, but it fails because the first element is None.

Does this PR introduce any user-facing change?

Yes. Previously this would error; now it works properly. This is a behavior change from all previous Spark versions and should probably be backported.

How was this patch tested?

Added a test
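A sketch of what such a regression test could look like (the test name and assertion are illustrative assumptions, not necessarily the exact test added in this PR):

```
import pyspark.pandas as ps
from pyspark.ml.linalg import SparseVector

def test_infer_udt_with_null_first_element():
    # Before the fix, this construction raised pyarrow.lib.ArrowInvalid
    # because type inference only looked at the first (null) element.
    psser = ps.Series([None, SparseVector(1, {0: 1.2}), SparseVector(1, {0: 3})])
    # Sanity check: the non-null vectors survive the round trip to pandas.
    assert list(psser.to_pandas())[1] == SparseVector(1, {0: 1.2})
```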

Was this patch authored or co-authored using generative AI tooling?

No

petern48 marked this pull request as ready for review July 14, 2025 14:47
petern48 changed the title from "[SPARK-52791][PS] Fix inferring a UDT errors when first element is null" to "[SPARK-52791][PS] Fix error when inferring a UDT with a null first element" on Jul 14, 2025
petern48 (Contributor, Author)

@ueshin @HyukjinKwon

Comment on lines 365 to 367:

```
first_idx = pser.first_valid_index()
if first_idx is not None and hasattr(pser.loc[first_idx], "__UDT__"):
    return pser.loc[first_idx].__UDT__
```
Member


Good catch! But using .loc sounds dangerous, e.g.:

```
>>> pser = pd.Series([None, None, 3], index=[1, 0, 1])
>>> i = pser.first_valid_index()
>>> pser.loc[i]
1    NaN
1    3.0
dtype: float64
```

How about using notnull()?

Suggested change:

```
-first_idx = pser.first_valid_index()
-if first_idx is not None and hasattr(pser.loc[first_idx], "__UDT__"):
-    return pser.loc[first_idx].__UDT__
+notnull = pser[pser.notnull()]
+if hasattr(notnull.iloc[0], "__UDT__"):
+    return notnull.iloc[0].__UDT__
```
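For readers following along, a self-contained plain-pandas sketch replaying the reviewer's point (nothing here is Spark-specific): label-based .loc can return a Series when index labels repeat, while filtering nulls and indexing by position always yields a scalar.

```
import pandas as pd

pser = pd.Series([None, None, 3], index=[1, 0, 1])

# first_valid_index() returns a *label*; with duplicate labels,
# .loc selects every matching row and returns a Series, not a scalar.
i = pser.first_valid_index()
print(pser.loc[i])
# 1    NaN
# 1    3.0
# dtype: float64

# Filtering out nulls and indexing by *position* is unambiguous.
notnull = pser[pser.notnull()]
print(notnull.iloc[0])  # 3.0
```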

petern48 (Contributor, Author)

Nice! Accepted the change, and all tests are passing.

petern48 requested a review from ueshin July 22, 2025 16:14
@HyukjinKwon (Member)

HyukjinKwon commented Jul 23, 2025

Merged to master, branch-4.0 and branch-3.5.

petern48 deleted the fix_infer_spark_type branch July 23, 2025 04:06
petern48 (Contributor, Author)

@HyukjinKwon Can we backport this to 3.5 too?

@HyukjinKwon (Member)

👌

HyukjinKwon pushed a commit that referenced this pull request Jul 23, 2025
[SPARK-52791][PS] Fix error when inferring a UDT with a null first element

I modified the udt condition to check the first non-null element instead of the first element (which might be null).

```
import pyspark.pandas as ps
from pyspark.ml.linalg import SparseVector
sparse_values = {0: 0.1, 1: 1.1}
ps_series = ps.Series([None, SparseVector(1, {0: 1.2}), SparseVector(1, {0: 3})])
```
Error:
```
pyarrow.lib.ArrowInvalid: Could not convert SparseVector(1, {0: 1.2}) with type SparseVector: did not recognize Python value type when inferring an Arrow data type
```
This should work, but it fails because the first element is None.

Yes. Previously this would error; now it works properly. This is a behavior change from all previous Spark versions and should probably be backported.

Added a test

No

Closes #51475 from petern48/fix_infer_spark_type.

Authored-by: Peter Nguyen <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 5182eb4)
Signed-off-by: Hyukjin Kwon <[email protected]>