-
-
Notifications
You must be signed in to change notification settings - Fork 18.7k
BUG FIX: to_json() with JSON Table Schema work correctly with string dtype. #61900
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
@@ -197,7 +195,7 @@ def convert_json_field_to_pandas_type(field) -> str | CategoricalDtype: | |||
""" | |||
typ = field["type"] | |||
if typ == "string": | |||
return "object" | |||
return field.get("extDtype", "object") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return field.get("extDtype", "object") | |
return field.get("extDtype", "object") |
Should that now default to "str" instead of "object"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, while converting back to pandas from JSON, extDtype will be missing for object type, so it will return object. For the str case, extDtype will be equal to str.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return field.get("extDtype", "object") | |
return field.get("extDtype", None) |
As another alternative: keep the dtype as None
here, which I think will mean in practice that we keep the inferred dtype from the construction.
Because right now (because of the above logic to use object
for "string" type), data written by older pandas will not have the extDtype
, and so will convert string columns to object when reading:
>>> pd.options.future.infer_string = False
>>> df = pd.DataFrame({"col_str": ["a", "b", "c", None], "col_object": ["str", 1, 1.5, None]})
>>> output = df.to_json(orient="table")
>>> output
'{"schema":{"fields":[{"name":"index","type":"integer"},{"name":"col_str","type":"string"},{"name":"col_object","type":"string"}],"primaryKey":["index"],"pandas_version":"1.4.0"},"data":[{"index":0,"col_str":"a","col_object":"str"},{"index":1,"col_str":"b","col_object":1},{"index":2,"col_str":"c","col_object":1.5},{"index":3,"col_str":null,"col_object":null}]}'
>>> pd.options.future.infer_string = True
>>> df2 = pd.read_json(StringIO(output), orient="table")
>>> df2
col_str col_object
0 a str
1 b 1
2 c 1.5
3 NaN NaN
>>> df2.dtypes
col_str object
col_object object
dtype: object
while ideally the above should have df2
to have str
dtype for the col_str column.
@@ -70,7 +70,7 @@ def test_build_table_schema(self, df_schema, using_infer_string): | |||
"primaryKey": ["idx"], | |||
} | |||
if using_infer_string: | |||
expected["fields"][2] = {"name": "B", "type": "any", "extDtype": "str"} | |||
expected["fields"][2] = {"name": "B", "type": "string", "extDtype": "str"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering if we should still include the "extDtype": "str"
here for the default string dtype. While having this extDtype
for actual (opt-in) extension dtypes is very useful to support proper roundtripping, I am not sure it is useful for a default data type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I first removed it, but then converting back to pandas caused issues. I was not able to distinguish between object and str dtype because both had the exact same JSON.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, that is a good point.
I am wondering for newly written data, if we shouldn't use "type": "any"
for true object dtype, so we can make that distinction that way.
Fixes: #61889
To ensure consistent behavior for to_json(), when dtype="str" is used, it will now output "type": "string" instead of "type": "any".
Before Fix:
After Fix:
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.