Skip to content

BUG FIX: to_json() with JSON Table Schema work correctly with string dtype. #61900

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

khemkaran10
Copy link
Contributor

Fixes: #61889

To ensure consistent behavior for to_json(), when dtype="str" is used, it will now output "type": "string" instead of "type": "any".

Before Fix:

>>> pd.Series(["a", "b", None], dtype="str").to_json(orient="table", index=False)
'{"schema":{"fields":[{"name":"values","type":"any","extDtype":"str"}],"pandas_version":"1.4.0"},"data":[{"values":"a"},{"values":"b"},{"values":null}]}'

After Fix:

>>> pd.Series(["a", "b", None], dtype="str").to_json(orient="table", index=False)
'{"schema":{"fields":[{"name":"values","type":"string","extDtype":"str"}],"pandas_version":"1.4.0"},"data":[{"values":"a"},{"values":"b"},{"values":null}]}'

@mroeschke mroeschke added IO JSON read_json, to_json, json_normalize Strings String extension data type and string data labels Jul 21, 2025
@@ -197,7 +195,7 @@ def convert_json_field_to_pandas_type(field) -> str | CategoricalDtype:
"""
typ = field["type"]
if typ == "string":
return "object"
return field.get("extDtype", "object")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return field.get("extDtype", "object")
return field.get("extDtype", "object")

Should that now default to "str" instead of "object"?

Copy link
Contributor Author

@khemkaran10 khemkaran10 Jul 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, while converting back to pandas from JSON, extDtype will be missing for object type, so it will return object. For the str case, extDtype will be equal to str.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return field.get("extDtype", "object")
return field.get("extDtype", None)

As another alternative: keep the dtype as None here, which I think will mean in practice that we keep the inferred dtype from the construction.

Because right now (because of the above logic to use object for "string" type), data written by older pandas will not have the extDtype, and so will convert string columns to object when reading:

>>> pd.options.future.infer_string = False
>>> df = pd.DataFrame({"col_str": ["a", "b", "c", None], "col_object": ["str", 1, 1.5, None]})
>>> output = df.to_json(orient="table")
>>> output
'{"schema":{"fields":[{"name":"index","type":"integer"},{"name":"col_str","type":"string"},{"name":"col_object","type":"string"}],"primaryKey":["index"],"pandas_version":"1.4.0"},"data":[{"index":0,"col_str":"a","col_object":"str"},{"index":1,"col_str":"b","col_object":1},{"index":2,"col_str":"c","col_object":1.5},{"index":3,"col_str":null,"col_object":null}]}'
>>> pd.options.future.infer_string = True
>>> df2 = pd.read_json(StringIO(output), orient="table")
>>> df2
  col_str col_object
0       a        str
1       b          1
2       c        1.5
3     NaN        NaN
>>> df2.dtypes
col_str       object
col_object    object
dtype: object

while ideally the above should have df2 to have str dtype for the col_str column.

@@ -70,7 +70,7 @@ def test_build_table_schema(self, df_schema, using_infer_string):
"primaryKey": ["idx"],
}
if using_infer_string:
expected["fields"][2] = {"name": "B", "type": "any", "extDtype": "str"}
expected["fields"][2] = {"name": "B", "type": "string", "extDtype": "str"}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering if we should still include the "extDtype": "str" here for the default string dtype. While having this extDtype for actual (opt-in) extension dtypes is very useful to support proper roundtripping, I am not sure it is useful for a default data type.

Copy link
Contributor Author

@khemkaran10 khemkaran10 Jul 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I first removed it, but then converting back to pandas caused issues. I was not able to distinguish between object and str dtype because both had the exact same JSON.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that is a good point.
I am wondering for newly written data, if we shouldn't use "type": "any" for true object dtype, so we can make that distinction that way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO JSON read_json, to_json, json_normalize Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: make to_json with JSON Table Schema work correctly with string dtype
3 participants