Skip to content

Conversation

@YukunR
Copy link

@YukunR YukunR commented Nov 4, 2025

Description

read_parquet documentation specifies that only str / os.PathLike or file-like objects are supported.
Previously, passing unexpected container types such as list would reach backends and fail inconsistently.

This PR adds an early TypeError check at the public API boundary to provide clear, consistent behaviour across pyarrow / fastparquet engines.

Tests are added for both valid and invalid path types.

@YukunR YukunR changed the title BUG: Validate path type in read_parquet, reject non-path/file-like (#62964) BUG: Validate path type in read_parquet, reject non-path/file-like Nov 4, 2025
Comment on lines +271 to +284
def test_read_parquet_valid_path_types(tmp_path, engine):
# GH #62922
df = pd.DataFrame({"a": [1]})
path = tmp_path / "test_read_parquet.parquet"
df.to_parquet(path, engine=engine)
# str
read_parquet(str(path), engine=engine)
# os.PathLike
read_parquet(pathlib.Path(path), engine=engine)
# file-like object
buf = BytesIO()
df.to_parquet(buf, engine=engine)
buf.seek(0)
read_parquet(buf, engine=engine)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is not needed

Comment on lines +257 to +261
bad_path_types = [
[str(path)], # list
(str(path),), # tuple
b"raw-bytes", # bytes
]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing the list case is sufficient

raise TypeError(
f"read_parquet expected str/os.PathLike or file-like object, "
f"got {type(path).__name__} type"
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably be done using get_handle in the _get_path_or_handle function.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, thanks for the review!

I looked into _get_path_or_handle more closely.

get_handle is only invoked when we already know path_or_handle is a string and not a directory. For invalid inputs like a list, we never reach that branch as they get passed through stringify_path unchanged.

Also, _get_path_or_handle is only used in PyArrowImpl.read, not in FastParquetImpl.read. So if we only rely on _get_path_or_handle to validate input, validation coverage would be asymmetric across engines.

So, I propose to validate the path type in read_parquet before engine dispatch. Alternatively, I can factor a small _validate_parquet_path_arg(path) helper and call it at the top of both PyArrowImpl.read and FastParquetImpl.read

Let me know which placement you prefer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BUG: read_parquet fails when passing a list of remote files (gcs)

2 participants