Fix filesystem #2291

Open

mccormickt12 wants to merge 4 commits into main

Conversation

mccormickt12

Rationale for this change

For HDFS it's common to get the scheme and netloc from config and have paths carry just the path portion of the URI. Add environment variables to support this case.

example
tmccormi@ltx1-hcl14866 [ ~/python ]$ export DEFAULT_SCHEME=hdfs
tmccormi@ltx1-hcl14866 [ ~/python ]$ export DEFAULT_NETLOC=ltx1-yugioh-cluster01.linkfs.prod-ltx1.atd.prod.linkedin.com:9000

Are these changes tested?

Tested in a test environment at LinkedIn and with unit tests.

Are there any user-facing changes?

No user-facing changes by default. If these env variables are set and a file path doesn't have a scheme/netloc, the specified defaults are used.
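As a rough, hypothetical illustration (not the PR's actual code), this is the effect the two variables are meant to have on a bare path, mirroring the os.getenv defaults shown in the diff snippet below; the netloc value is a placeholder:

import os
from urllib.parse import urlparse

# Hypothetical values; in the PR these come from DEFAULT_SCHEME / DEFAULT_NETLOC.
os.environ["DEFAULT_SCHEME"] = "hdfs"
os.environ["DEFAULT_NETLOC"] = "namenode.example.com:9000"

location = "/jobs/openhouse/my_db/my_table/metadata.json"  # no scheme or netloc

uri = urlparse(location)
scheme = uri.scheme or os.getenv("DEFAULT_SCHEME", "file")
netloc = uri.netloc or os.getenv("DEFAULT_NETLOC", "")
print(scheme, netloc, uri.path)
# -> hdfs namenode.example.com:9000 /jobs/openhouse/my_db/my_table/metadata.json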

return uri.scheme, uri.netloc, uri.path

# Load defaults from environment
default_scheme = os.getenv("DEFAULT_SCHEME", "file")

Can we use the central config instead of direct usage of env variables? (pyiceberg/utils/config.py)

This would enable configuration via file OR env variables, which is how most other configs are documented and exposed to catalog construction.


cbb330 commented Aug 6, 2025

Thanks Tom! Overall excited for this PR because I've had to hack around this with an overridden PyArrowFileIO, e.g.

from urllib.parse import urlparse

from typing_extensions import override

from pyiceberg.io import HDFS_HOST, HDFS_PORT
from pyiceberg.io.pyarrow import PyArrowFile, PyArrowFileIO


class HDFSFileIO(PyArrowFileIO):
    """Simple PyArrowFileIO that defaults paths without a scheme to HDFS."""

    @override
    def new_input(self, location: str) -> PyArrowFile:
        """Rewrite scheme-less absolute paths to hdfs:// before delegating."""
        if not urlparse(location).scheme and location.startswith('/'):
            hdfs_host = self.properties.get(HDFS_HOST, 'localhost')
            hdfs_port = self.properties.get(HDFS_PORT, '9000')
            location = f'hdfs://{hdfs_host}:{hdfs_port}{location}'
        return super().new_input(location)
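A hypothetical way this workaround gets wired up (host/port values are placeholders, and actually opening the file still requires a reachable HDFS cluster with the Hadoop client libraries available):

from pyiceberg.io import HDFS_HOST, HDFS_PORT

# Placeholder host/port; HDFS_HOST and HDFS_PORT are pyiceberg's property keys.
io = HDFSFileIO(properties={HDFS_HOST: "namenode.example.com", HDFS_PORT: "9000"})

# "/jobs/..." has no scheme, so new_input rewrites it to
# "hdfs://namenode.example.com:9000/jobs/..." before delegating to PyArrowFileIO.
input_file = io.new_input("/jobs/openhouse/my_db/my_table/metadata.json")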

@kevinjqliu (Contributor) left a comment

Thanks for the PR! I added a comment about passing the properties and structuring the HDFS-specific code.

Comment on lines +401 to +403
default_scheme = config.get_str("default-scheme") or "file"
default_netloc = config.get_str("default-netloc") or ""


Nit: I think it's better to pass these in through the properties field:
https://py.iceberg.apache.org/configuration/#hdfs

We can get the env variable and then pass it into the properties.
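A rough sketch of that suggestion (not part of the PR): read the env vars once at construction time and feed them to the FileIO through its properties, using the hdfs.host / hdfs.port keys documented at the link above; the env var names are the ones this PR introduces:

import os

from pyiceberg.io import HDFS_HOST, HDFS_PORT
from pyiceberg.io.pyarrow import PyArrowFileIO

properties = {}
if "DEFAULT_NETLOC" in os.environ:
    # Split "host:port" into the two documented HDFS properties.
    host, _, port = os.environ["DEFAULT_NETLOC"].partition(":")
    properties[HDFS_HOST] = host
    properties[HDFS_PORT] = port or "9000"

file_io = PyArrowFileIO(properties=properties)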

Comment on lines +404 to +415
# Apply logic
scheme = uri.scheme or default_scheme
netloc = uri.netloc or default_netloc

if scheme in ("hdfs", "viewfs"):
    return scheme, netloc, uri.path
else:
    # (previous behaviour: return uri.scheme, uri.netloc, f"{uri.netloc}{uri.path}")
    # For non-HDFS URIs, include netloc in the path if present
    path = uri.path if uri.scheme else os.path.abspath(location)
    if netloc and not path.startswith(netloc):
        path = f"{netloc}{path}"
    return scheme, netloc, path
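To make the branches above easier to follow, here is a small stand-alone replica of that logic (hypothetical, for illustration only; the default scheme/netloc values are placeholders):

import os
from urllib.parse import urlparse

def parse_location(location, default_scheme="hdfs", default_netloc="namenode.example.com:9000"):
    uri = urlparse(location)
    scheme = uri.scheme or default_scheme
    netloc = uri.netloc or default_netloc
    if scheme in ("hdfs", "viewfs"):
        # HDFS/ViewFS: keep the netloc separate, the path stays a plain absolute path
        return scheme, netloc, uri.path
    # Other filesystems keep the netloc in the path, as before
    path = uri.path if uri.scheme else os.path.abspath(location)
    if netloc and not path.startswith(netloc):
        path = f"{netloc}{path}"
    return scheme, netloc, path

assert parse_location("/jobs/db/table") == ("hdfs", "namenode.example.com:9000", "/jobs/db/table")
assert parse_location("s3://bucket/key") == ("s3", "bucket", "bucket/key")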
@kevinjqliu (Contributor) commented Aug 7, 2025

I actually really want to get rid of this if {scheme} logic here.

Is there a way to refactor these changes down into _initialize_hdfs_fs, so we can keep the HDFS logic in the same place?

@mccormickt12 (Author)

I don't see a nice way to do this, since the path used in the PyArrowFile is actually different in the different cases. I tried to see if we could use the same path with the netloc in it for HDFS, but it doesn't seem to work:
#2291 (comment)

@mccormickt12 (Author)

This shows that setting the netloc on filesystem creation and also having it in the path (as is done for the other fs types) doesn't work for HDFS:

>>> hdfs = fs.HadoopFileSystem(host='ltx1-yugioh-cluster01.linkfs.prod-ltx1.atd.prod.linkedin.com', port=9000)
25/08/07 17:21:12 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>> table_base = "/jobs/openhouse/cutover_zdt_testing_db/cutover_zdt_testing_table_partitioned_one-f814050d-6416-4fa8-ae85-c63ac74b4567"
>>> long_table_base = "ltx1-yugioh-cluster01.linkfs.prod-ltx1.atd.prod.linkedin.com/jobs/openhouse/cutover_zdt_testing_db/cutover_zdt_testing_table_partitioned_one-f814050d-6416-4fa8-ae85-c63ac74b4567"
>>> hdfs.get_file_info(fs.FileSelector(table_base))
25/08/07 17:22:00 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
[<FileInfo for '/jobs/openhouse/cutover_zdt_testing_db/cutover_zdt_testing_table_partitioned_one-f814050d-6416-4fa8-ae85-c63ac74b4567/00000-3ec53886-ceae-46f2-a926-050afb7f95b9.metadata.json': type=FileType.File, size=2900>, <FileInfo for '/jobs/openhouse/cutover_zdt_testing_db/cutover_zdt_testing_table_partitioned_one-f814050d-6416-4fa8-ae85-c63ac74b4567/00001-fc1f6c92-0449-4deb-8908-097db5f6589a.metadata.json': type=FileType.File, size=4366>, <FileInfo for '/jobs/openhouse/cutover_zdt_testing_db/cutover_zdt_testing_table_partitioned_one-f814050d-6416-4fa8-ae85-c63ac74b4567/data': type=FileType.Directory>, <FileInfo for '/jobs/openhouse/cutover_zdt_testing_db/cutover_zdt_testing_table_partitioned_one-f814050d-6416-4fa8-ae85-c63ac74b4567/metadata': type=FileType.Directory>]
>>> hdfs.get_file_info(fs.FileSelector(long_table_base))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/_fs.pyx", line 582, in pyarrow._fs.FileSystem.get_file_info
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
FileNotFoundError: [Errno 2] HDFS list directory failed. Detail: [errno 2] No such file or directory

@kevinjqliu (Contributor) commented Aug 10, 2025

What is the proper way to address an absolute path in HadoopFileSystem? Your example shows that /path/to/file works but {host}/path/to/file does not. Should {host}/path/to/file also work?

I'm trying to see what the requirements are here. I only found examples with hdfs://.

Also, I'm curious whether HadoopFileSystem.from_uri will work for long_table_base.
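Not verified here, but a hypothetical way to poke at that last question (requires a Hadoop client environment; the host is the one from the transcript above):

from pyarrow import fs

# from_uri parses scheme/host/port out of a full hdfs:// URI; whether the bare
# "host/path" form of long_table_base is accepted is exactly what would need checking.
hdfs = fs.HadoopFileSystem.from_uri(
    "hdfs://ltx1-yugioh-cluster01.linkfs.prod-ltx1.atd.prod.linkedin.com:9000"
)
table_base = "/jobs/openhouse/cutover_zdt_testing_db/cutover_zdt_testing_table_partitioned_one-f814050d-6416-4fa8-ae85-c63ac74b4567"
print(hdfs.get_file_info(fs.FileSelector(table_base)))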
