-
-
Notifications
You must be signed in to change notification settings - Fork 18.9k
API: mode.nan_is_na to consistently distinguish NaN-vs-NA #62040
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
1d85ad8
to
1ccaaa4
Compare
f0e5e34
to
71d1c03
Compare
71d1c03
to
b7ea9ae
Compare
Discussed in the dev call before last where I, @mroeschke, and @Dr-Irv were +1. Joris was unenthused but "not necessarily opposed". On slack @rhshadrach expressed a +1. All those opinions were to the concept, not the execution. |
gentle ping @mroeschke |
result = self._evaluate_op_method(other, op, ARROW_ARITHMETIC_FUNCS) | ||
if is_nan_na() and result.dtype.kind == "f": | ||
parr = result._pa_array | ||
mask = pc.is_nan(parr).to_numpy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be cast to_numpy()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That surprised me too, but yes. Without it pc.replace_with_mask raises
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious, what was the error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pandas/core/arrays/arrow/array.py:1008: in _arith_method
arr = pc.replace_with_mask(parr, mask, pa.scalar(None, type=parr.type))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
../../../.pyenv/versions/3.11.13/lib/python3.11/site-packages/pyarrow/compute.py:252: in wrapper
return func.call(args, None, memory_pool)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pyarrow/_compute.pyx:399: in pyarrow._compute.Function.call
???
pyarrow/error.pxi:155: in pyarrow.lib.pyarrow_internal_check_status
arrays = [np.nan] * len(columns) | ||
if dtype is not None and not isinstance(dtype, np.dtype): | ||
# e.g. test_dataframe_from_dict_of_series | ||
arrays = [NA] * len(columns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would we want the placeholder here to be nan
for StringDtype(na_value=nan)
, i.e.
if ... and isinstance(dtype, ExtensionDtype):
arrays = [dtype.na_value] * len(columns)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that'd probably be benign. would we expect pd.NA to ever not-work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah not sure if something like
df = pd.DataFrame({"a": ["b"]}, columns=["a", "b"], dtype=pd.StringDtype(na_value=np.nan))
df.loc[0, "b"]
Would correctly return nan here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes it does
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.As discussed on the last dev call, this implements
"mode.nan_is_na"
(defaultTrue
) to consider NaN as either always-equivalent or never-equivalent to NA.This sits on top of
DataFrame.where
Still need to