
Conversation

xinrong-meng (Member) commented Jun 26, 2025

What changes were proposed in this pull request?

Avoid CAST_INVALID_INPUT errors raised by replace in ANSI mode.

Specifically, under ANSI mode (see the sketch below):

  • use try_cast() to cast values safely, so an invalid cast yields NULL instead of raising
  • for NaN checks, avoid calling F.isnan() on non-numeric types
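
A minimal sketch of both points, with illustrative helper names (ansi_safe_lit and safe_isnan are not the actual patch):

from pyspark.sql import functions as F
from pyspark.sql.types import NumericType

def ansi_safe_lit(value, spark_type, ansi_mode):
    # Under ANSI mode, try_cast() yields NULL on an invalid cast instead of
    # raising CAST_INVALID_INPUT; with ANSI off, a plain literal suffices.
    lit_col = F.lit(value)
    return lit_col.try_cast(spark_type) if ansi_mode else lit_col

def safe_isnan(col, spark_type):
    # F.isnan() is only defined for numeric columns; non-numeric columns can
    # never hold NaN, so the check degenerates to a constant False.
    if isinstance(spark_type, NumericType):
        return F.isnan(col)
    return F.lit(False)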

An example of the Spark plan difference between ANSI mode off and on:

# if the original column is of StringType
# ANSI off
Column<'CASE WHEN in(C, 0, 1, 2, 3, 5, 6) THEN 4 ELSE C END'>

# ANSI on
Column<'CASE WHEN in(C, TRY_CAST(0 AS STRING), TRY_CAST(1 AS STRING), TRY_CAST(2 AS STRING), TRY_CAST(3 AS STRING), TRY_CAST(5 AS STRING), TRY_CAST(6 AS STRING)) THEN TRY_CAST(4 AS STRING) ELSE TRY_CAST(C AS STRING) END'>
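
To inspect the resulting expression yourself, a sketch using the public Spark accessor on a pandas-on-Spark Series:

import pyspark.pandas as ps

ps.set_option("compute.fail_on_ansi_mode", False)
ps.set_option("compute.ansi_mode_support", True)
psdf = ps.DataFrame({"C": ["a", "b", "c"]})
replaced = psdf["C"].replace([0, 1, 2, 3, 5, 6], 4)
# Series.spark.column exposes the underlying Spark Column expression.
print(replaced.spark.column)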

Why are the changes needed?

Ensure the pandas API on Spark works well with ANSI mode enabled.
Part of https://issues.apache.org/jira/browse/SPARK-52556.

Does this PR introduce any user-facing change?

Yes, replace now works with ANSI mode enabled, for example:

>>> ps.set_option("compute.fail_on_ansi_mode", False)
>>> ps.set_option("compute.ansi_mode_support", True)
>>> pdf = pd.DataFrame(
...             {"A": [0, 1, 2, 3, np.nan], "B": [5, 6, 7, 8, np.nan], "C": ["a", "b", "c", "d", None]},
...             index=np.random.rand(5),
...         )
>>> psdf = ps.from_pandas(pdf)
>>> psdf["C"].replace([0, 1, 2, 3, 5, 6], 4)
0.458472       a
0.749773       b
0.222904       c
0.397280       d
0.293933    None
Name: C, dtype: object
>>> psdf.replace([0, 1, 2, 3, 5, 6], [6, 5, 4, 3, 2, 1])
            A    B     C                                                        
0.458472  6.0  2.0     a
0.749773  5.0  1.0     b
0.222904  4.0  7.0     c
0.397280  3.0  8.0     d
0.293933  NaN  NaN  None

How was this patch tested?

Unit tests

Was this patch authored or co-authored using generative AI tooling?

No

xinrong-meng (Member, Author) commented

@ueshin may I get a review please?

xinrong-meng requested a review from ueshin on July 29, 2025 at 18:08
xinrong-meng (Member, Author) commented

cc @HyukjinKwon @zhengruifeng please

ueshin (Member) left a comment

Otherwise, LGTM.

        else:
            cond = self.spark.column.isNull()
    else:
        lit = (
ueshin (Member) commented

nit: let's avoid using the same name as a well-known function; it may unexpectedly shadow the global function if one exists.
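
For illustration, the kind of shadowing the nit warns about (hypothetical snippet, not the patch):

from pyspark.sql.functions import lit

def build_expr(value):
    lit = value  # shadows pyspark.sql.functions.lit within this scope
    # A later call like lit("x") here would raise TypeError, because `lit`
    # is now the local value rather than the PySpark function.
    return lit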

xinrong-meng (Member, Author) replied

Makes sense, good catch! Renamed.

Comment on lines +5121 to +5126
to_replace_values = (
[to_replace]
if not is_list_like(to_replace) or isinstance(to_replace, str)
else to_replace
)
to_replace_values = cast(List[Any], to_replace_values)
ueshin (Member) commented

Could you try:

to_replace_values: List[Any] = (
    ...
)

to see if mypy is happy with it? If not, it's fine as-is.

xinrong-meng (Member, Author) replied

                    to_replace_values: List[Any] = (
                        [to_replace]
                        if not is_list_like(to_replace) or isinstance(to_replace, str)
                        else to_replace
                    )

causes

python/pyspark/pandas/series.py:5122: error: Incompatible types in assignment (expression has type "list[Any] | int | float", variable has type "list[Any]")  [assignment]

I'm afraid we might have to keep to_replace_values = cast(List[Any], to_replace_values).
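
The error is expected: mypy does not narrow to_replace on the is_list_like() check, because that helper is not declared with a TypeGuard return type (at least in the stubs mypy sees here), so the else branch still carries the full union type. A hypothetical illustration of the kind of signature that would let the annotation type-check (not a proposed change):

from typing import Any, List, TypeGuard

def is_list_like_typed(obj: Any) -> TypeGuard[List[Any]]:
    # Hypothetical re-declaration; pandas' is_list_like is not typed this way.
    return hasattr(obj, "__iter__") and not isinstance(obj, (str, bytes))

Since that signature is out of this PR's hands, cast(), a runtime no-op that only informs the type checker, is the pragmatic fix.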

xinrong-meng (Member, Author) commented

CondaHTTPError: HTTP 000 CONNECTION FAILED for url https://conda.anaconda.org/conda-forge/linux-64/repodata.json

Retriggering tests

xinrong-meng (Member, Author) commented

Merged to master, thanks!
