Conversation

@xinrong-meng (Member) commented Jul 23, 2025

What changes were proposed in this pull request?

Avoid the CAST_INVALID_INPUT error raised by `astype` in ANSI mode.

Why are the changes needed?

Ensure pandas on Spark works well with ANSI mode on.
Part of https://issues.apache.org/jira/browse/SPARK-52556.

Does this PR introduce any user-facing change?

Yes. Casting malformed strings with `astype` now returns NaN instead of raising, as shown below:

```py
>>> import pandas as pd
>>> import numpy as np
>>> import pyspark.pandas as ps
>>>
>>> ps.set_option("compute.fail_on_ansi_mode", False)
>>> ps.set_option("compute.ansi_mode_support", True)
```

BEFORE

```py
>>> ps.Series(["abc"]).astype(int)
...
org.apache.spark.SparkNumberFormatException: [CAST_INVALID_INPUT] The value 'abc' of the type "STRING" cannot be cast to "BIGINT" because it is malformed. Correct the value as per the syntax, or change its target type. Use `try_cast` to tolerate malformed input and return NULL instead. SQLSTATE: 22018
...
```

AFTER

```py
>>> ps.Series(["abc"]).astype(int)
0   NaN
dtype: float64
```

This matches the non-ANSI behavior:

```py
>>> spark.conf.set("spark.sql.ansi.enabled", False)
>>> ps.Series(["abc"]).astype(int)
0   NaN
dtype: float64
```
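For context, the error message itself suggests `try_cast`; a minimal sketch of the idea (assuming Spark 4.0+, where `Column.try_cast` is available; the DataFrame and column name are illustrative, not the actual patch) is:

```py
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([("abc",), ("42",)], ["value"])

# Under ANSI mode, a plain CAST of "abc" to BIGINT raises CAST_INVALID_INPUT;
# TRY_CAST returns NULL instead, which pandas-on-Spark surfaces as NaN.
sdf.select(F.col("value").try_cast("bigint").alias("value")).show()
# +-----+
# |value|
# +-----+
# | NULL|
# |   42|
# +-----+
```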

How was this patch tested?

Existing tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@xinrong-meng xinrong-meng changed the title [WIP] Avoid CAST_INVALID_INPUT of "astype" in ANSI mode [SPARK-52922][PS] Avoid CAST_INVALID_INPUT of "astype" in ANSI mode Jul 24, 2025
@xinrong-meng xinrong-meng marked this pull request as ready for review July 24, 2025 00:45
@xinrong-meng (Member, Author)

@ueshin may I get a review please?

@ueshin (Member) commented Jul 26, 2025

I'm wondering which we should follow here, pandas or non-ANSI:

```py
>>> pd.Series(["abc"]).astype(int)
Traceback (most recent call last):
...
ValueError: invalid literal for int() with base 10: 'abc'
```

@xinrong-meng (Member, Author) commented Jul 28, 2025

Hi @ueshin, as for `astype`, the behavior reaches parity with non-ANSI mode, as shown below:

```py
>>> spark.conf.set("spark.sql.ansi.enabled", True)
>>> ps.Series(["abc"]).astype(int)
0   NaN
dtype: float64

>>> spark.conf.set("spark.sql.ansi.enabled", False)
>>> ps.Series(["abc"]).astype(int)
0   NaN
dtype: float64
```

To mimic pandas, we could check whether all values of the string column represent valid integers or decimals, for example:

```py
numeric_regex = r"^\s*-?\d+(\.\d+)?\s*$"
sdf.where(~col.rlike(numeric_regex)).count() == 0  # True only if every value is numeric
```

but that is costly, and may not cover all corner cases.
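For reference, a self-contained version of that check might look like this (a sketch; the DataFrame and column name are illustrative):

```py
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([("abc",), ("42",), ("-3.14",)], ["value"])

numeric_regex = r"^\s*-?\d+(\.\d+)?\s*$"
# The column is safely castable only if no row fails the regex;
# this requires a full scan of the data, hence the cost concern above.
all_numeric = sdf.where(~F.col("value").rlike(numeric_regex)).count() == 0
print(all_numeric)  # False, because of "abc"
```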

WDYT?

@xinrong-meng xinrong-meng requested a review from ueshin July 29, 2025 18:03
@ueshin (Member) commented Jul 30, 2025

I guess we can leave it as is, raising the error, as pandas also raises an error, although the error type is different.

@xinrong-meng (Member, Author)

Thank you, @ueshin!

For now, the error is thrown inside the JVM executor as `SparkNumberFormatException`:

```
org.apache.spark.SparkNumberFormatException: [CAST_INVALID_INPUT] The value 'abc' of the type "STRING" cannot be cast to "BIGINT" because it is malformed. Correct the value as per the syntax, or change its target type. Use `try_cast` to tolerate malformed input and return NULL instead. SQLSTATE: 22018
```

We might want to propagate the exception to the Python driver as a follow-up for error improvement.
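For illustration, such propagation might look roughly like the following on the Python side (a hypothetical sketch, not part of this PR; it assumes `pyspark.errors.NumberFormatException` is the class the captured JVM error maps to):

```py
# Hypothetical follow-up sketch: re-raise the captured JVM cast error
# as the ValueError pandas users expect from `astype`.
from pyspark.errors import NumberFormatException  # assumed mapping

try:
    ps.Series(["abc"]).astype(int).to_pandas()  # the action triggers the cast
except NumberFormatException as e:
    raise ValueError(f"invalid literal for int(): {e}") from None
```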
I will close the PR for now.
