Conversation

@xinrong-meng (Member) commented Jul 23, 2025

What changes were proposed in this pull request?

Avoid the CAST_INVALID_INPUT error raised by `astype` in ANSI mode.

Why are the changes needed?

Ensure pandas on Spark works well with ANSI mode on.
Part of https://issues.apache.org/jira/browse/SPARK-52556.

Does this PR introduce any user-facing change?

Yes. Casting malformed strings with `astype` now returns NaN instead of raising, as shown below:

```py
>>> import pandas as pd
>>> import numpy as np
>>> import pyspark.pandas as ps
>>>
>>> ps.set_option("compute.fail_on_ansi_mode", False)
>>> ps.set_option("compute.ansi_mode_support", True)
```

BEFORE

```py
>>> ps.Series(["abc"]).astype(int)
...
org.apache.spark.SparkNumberFormatException: [CAST_INVALID_INPUT] The value 'abc' of the type "STRING" cannot be cast to "BIGINT" because it is malformed. Correct the value as per the syntax, or change its target type. Use `try_cast` to tolerate malformed input and return NULL instead. SQLSTATE: 22018
...
```

AFTER

```py
>>> ps.Series(["abc"]).astype(int)
0   NaN
dtype: float64
```

This matches the non-ANSI behavior:

```py
>>> spark.conf.set("spark.sql.ansi.enabled", False)
>>> ps.Series(["abc"]).astype(int)
0   NaN
dtype: float64
```
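For context, the error message itself suggests `try_cast`; a minimal sketch of the idea (assuming Spark 4.0+, where `Column.try_cast` is available; the DataFrame and column name are illustrative, not the actual patch) is:

```py
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([("abc",), ("42",)], ["value"])

# Under ANSI mode, a plain CAST of "abc" to BIGINT raises CAST_INVALID_INPUT;
# TRY_CAST returns NULL instead, which pandas-on-Spark surfaces as NaN.
sdf.select(F.col("value").try_cast("bigint").alias("value")).show()
# +-----+
# |value|
# +-----+
# | NULL|
# |   42|
# +-----+
```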

How was this patch tested?

Existing tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@xinrong-meng xinrong-meng changed the title [WIP] Avoid CAST_INVALID_INPUT of "astype" in ANSI mode [SPARK-52922][PS] Avoid CAST_INVALID_INPUT of "astype" in ANSI mode Jul 24, 2025
@xinrong-meng xinrong-meng marked this pull request as ready for review July 24, 2025 00:45
@xinrong-meng (Member, Author)

@ueshin may I get a review please?

@ueshin (Member) commented Jul 26, 2025

I'm wondering which we should follow here, pandas or non-ANSI:

```py
>>> pd.Series(["abc"]).astype(int)
Traceback (most recent call last):
...
ValueError: invalid literal for int() with base 10: 'abc'
```

@xinrong-meng (Member, Author) commented Jul 28, 2025

Hi @ueshin, as for `astype`, the behavior reaches parity with non-ANSI mode, as shown below:

```py
>>> spark.conf.set("spark.sql.ansi.enabled", True)
>>> ps.Series(["abc"]).astype(int)
0   NaN
dtype: float64

>>> spark.conf.set("spark.sql.ansi.enabled", False)
>>> ps.Series(["abc"]).astype(int)
0   NaN
dtype: float64
```

To mimic pandas, we could check whether all values of the string column represent valid integers or decimals, for example:

```py
numeric_regex = r"^\s*-?\d+(\.\d+)?\s*$"
sdf.where(~col.rlike(numeric_regex)).count() == 0  # True only if every value is numeric
```

but that is costly, and may not cover all corner cases.
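For reference, a self-contained version of that check might look like this (a sketch; the DataFrame and column name are illustrative):

```py
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([("abc",), ("42",), ("-3.14",)], ["value"])

numeric_regex = r"^\s*-?\d+(\.\d+)?\s*$"
# The column is safely castable only if no row fails the regex;
# this requires a full scan of the data, hence the cost concern above.
all_numeric = sdf.where(~F.col("value").rlike(numeric_regex)).count() == 0
print(all_numeric)  # False, because of "abc"
```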

WDYT?

@xinrong-meng xinrong-meng requested a review from ueshin July 29, 2025 18:03
@ueshin (Member) commented Jul 30, 2025

I guess we can leave it as is, raising the error, as pandas also raises an error, although the error type is different.

@xinrong-meng (Member, Author)

Thank you, @ueshin!

For now, the error is thrown inside the JVM executor as `SparkNumberFormatException`:

```
org.apache.spark.SparkNumberFormatException: [CAST_INVALID_INPUT] The value 'abc' of the type "STRING" cannot be cast to "BIGINT" because it is malformed. Correct the value as per the syntax, or change its target type. Use `try_cast` to tolerate malformed input and return NULL instead. SQLSTATE: 22018
```

We might want to propagate the exception to the Python driver as a follow-up for error improvement.
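For illustration, such propagation might look roughly like the following on the Python side (a hypothetical sketch, not part of this PR; it assumes `pyspark.errors.NumberFormatException` is the class the captured JVM error maps to):

```py
# Hypothetical follow-up sketch: re-raise the captured JVM cast error
# as the ValueError pandas users expect from `astype`.
from pyspark.errors import NumberFormatException  # assumed mapping

try:
    ps.Series(["abc"]).astype(int).to_pandas()  # the action triggers the cast
except NumberFormatException as e:
    raise ValueError(f"invalid literal for int(): {e}") from None
```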
I will close the PR for now.
