[SPARK-52580][PS] Avoid CAST_INVALID_INPUT of replace in ANSI mode #51297
@ueshin may I get a review please?
cc @HyukjinKwon @zhengruifeng please
Otherwise, LGTM.
python/pyspark/pandas/series.py
Outdated
    else:
        cond = self.spark.column.isNull()
else:
    lit = (
nit: let's avoid using the same name as a well-known function. It may unexpectedly overwrite the global function if one exists.
Makes sense, good catch! Renamed.
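A minimal sketch of the kind of problem this nit guards against. The module-level `lit` below is a hypothetical stand-in for a well-known helper of that name (for example `pyspark.sql.functions.lit`); the function names are made up for illustration and are not taken from the PR. Assigning to `lit` anywhere inside a function makes the name local to the whole function, so an earlier call to the helper no longer resolves to the global.

```python
def lit(value):
    """Hypothetical stand-in for a well-known module-level helper named `lit`."""
    return ("literal", value)


def build_condition_shadowed(value):
    # The assignment to `lit` further down makes `lit` a local name for the
    # entire function body, so this call cannot reach the module-level helper.
    try:
        lit(None)
    except UnboundLocalError as exc:
        return f"failed: {type(exc).__name__}"
    lit = ("literal", value)  # shadows the module-level function
    return lit


def build_condition_renamed(value):
    # With a distinct local name, the module-level helper stays reachable.
    null_marker = lit(None)
    replacement_lit = ("literal", value)
    return null_marker, replacement_lit
```

Renaming the local, as done in the PR, keeps the helper usable in the same scope.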
to_replace_values = (
    [to_replace]
    if not is_list_like(to_replace) or isinstance(to_replace, str)
    else to_replace
)
to_replace_values = cast(List[Any], to_replace_values)
Could you try:
to_replace_values: List[Any] = (
    ...
)
to see if mypy is happy with it? If not, it's fine as-is.
to_replace_values: List[Any] = (
    [to_replace]
    if not is_list_like(to_replace) or isinstance(to_replace, str)
    else to_replace
)
causes
python/pyspark/pandas/series.py:5122: error: Incompatible types in assignment (expression has type "list[Any] | int | float", variable has type "list[Any]") [assignment]
I'm afraid we might have to keep to_replace_values = cast(List[Any], to_replace_values).
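A minimal, self-contained sketch of the pattern discussed in this thread, under stated assumptions. `_is_list_like` is a hypothetical stand-in for `pandas.api.types.is_list_like`, and `normalize_to_replace` and its parameter annotation are illustrative names, not the PR's actual signature. A likely reason for the mypy error is that a helper returning a plain `bool` does not narrow the union type in the `else` branch, so the scalar members remain in the expression's type and `cast` is used to state the intended result type.

```python
from collections.abc import Iterable
from typing import Any, List, Union, cast


def _is_list_like(obj: Any) -> bool:
    """Hypothetical stand-in for pandas.api.types.is_list_like."""
    return isinstance(obj, Iterable) and not isinstance(obj, (str, bytes))


def normalize_to_replace(to_replace: Union[List[Any], int, float, str]) -> List[Any]:
    # Wrap scalars and strings in a one-element list; keep list-likes as they are.
    values = (
        [to_replace]
        if not _is_list_like(to_replace) or isinstance(to_replace, str)
        else to_replace
    )
    # The bool-returning helper gives the type checker nothing to narrow on,
    # so the cast records what the runtime check already guarantees.
    return cast(List[Any], values)
```

`cast` has no runtime effect; it only informs the type checker, so the runtime behavior is the same as in the uncast version.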
CondaHTTPError: HTTP 000 CONNECTION FAILED for url https://conda.anaconda.org/conda-forge/linux-64/repodata.json
Retriggering tests.
Merged to master, thanks!
What changes were proposed in this pull request?
Avoid CAST_INVALID_INPUT of replace in ANSI mode.
Specifically, under ANSI mode
An example of the spark plan difference between ANSI on/off is:
Why are the changes needed?
Ensure pandas on Spark works well with ANSI mode on.
Part of https://issues.apache.org/jira/browse/SPARK-52556.
Does this PR introduce any user-facing change?
Yes, replace works in ANSI, for example
How was this patch tested?
Unit tests
Was this patch authored or co-authored using generative AI tooling?
No