Skip to content

Add support for Pyarrow filesystem specific properties #2251

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 21 commits into
base: main
Choose a base branch
from

Conversation

geruh
Copy link
Contributor

@geruh geruh commented Jul 25, 2025

closes #2247

Rationale for this change

Add support for PyArrow filesystem specific properties allowing users to pass properties directly to the underlying PyArrow filesystem implementations (e.g. S3FileSystem, GcsFileSystem, AzureFileSystem, etc.) without requiring mapping support in PyIceberg.

Since Pyarrow is a compiled implementation, it won't soft ignore extra parameters making this functionality pretty tedious. BUt now depending on the file systems the properties can be passed through to Pyarrow and interpreted there directly.

Supported filesystems:

Docs:
Screenshot 2025-08-04 at 4 16 08 PM

Are these changes tested?

Added tests and tested locally with S3

Are there any user-facing changes?

Yes new capabilities, updated docs

@kevinjqliu
Copy link
Contributor

Thank you for the PR!

This is a great way to unblock end users from configuring the different file systems. Instead of allowlisting each property individually, for each file system implementation constructor, this will allow us to pass args to the file system constructor.

We should verify that the implementation does not conflict with existing properties. For example, s3.access-key-id can now also be passed in as access_key in S3FileSystem but we should prioritize s3.access-key-id
https://github.com/apache/iceberg/blob/571056929091f1e62412500045f71e5ba6ea00ad/aws/src/main/java/org/apache/iceberg/aws/s3/S3FileIOProperties.java#L249

@@ -83,3 +84,46 @@ def get_header_properties(
) -> Properties:
header_prefix_len = len(HEADER_PREFIX)
return {key[header_prefix_len:]: value for key, value in properties.items() if key.startswith(HEADER_PREFIX)}


def properties_with_prefix(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kevinjqliu
Copy link
Contributor

@geruh and i chatted offline about this PR and aligned on this approach.

I love that we have so many unit/integration tests, makes me feel confident about this refactor

@kevinjqliu kevinjqliu requested review from mccormickt12 and Fokko and removed request for mccormickt12 August 4, 2025 22:17
Copy link
Contributor Author

@geruh geruh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I approve

@Fokko
Copy link
Contributor

Fokko commented Aug 5, 2025

Sorry for jumping in so late here, I was out touching grass.

While I see what we're trying to achieve here, and provide flexibility to the user, I think it comes at a cost. First of all, the most obivous one; it creates multiple paths to set the same configuration, which might be confusing for new users. That said, the intended of allow-listing the configuration is an effort to consolidate the configuration accross implementations, but also across FileIO's. For example, s3.connect-timeout and s3.request-timeout is picked up both by fsspec and PyArrow. This avoids having to go though the documentation of each of the underlying implementation, or in the case of PyArrow, through the source code :D

I think the main question here is, to what extend we want to try to hide the implementation behind the FileIO, or do we want to expose the implementation details to the end-users. I'm leaning towards the first, but curious what others' think.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[feature request] improve file io configs
4 participants