Fix filesystem #2291

Open

mccormickt12 wants to merge 4 commits into main

Conversation

mccormickt12

Rationale for this change

For HDFS it's common to get the scheme and netloc from config and have paths carry just the path portion of the URI. Add environment variables to support this case.

example
tmccormi@ltx1-hcl14866 [ ~/python ]$ export DEFAULT_SCHEME=hdfs
tmccormi@ltx1-hcl14866 [ ~/python ]$ export DEFAULT_NETLOC=ltx1-yugioh-cluster01.linkfs.prod-ltx1.atd.prod.linkedin.com:9000

Are these changes tested?

Tested in a test environment at LinkedIn and with unit tests.

Are there any user-facing changes?

No user-facing changes by default. If these env variables are set and a file path doesn't have a scheme/netloc, the specified defaults are used.
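As a rough, hypothetical illustration (not the PR's actual code), this is the effect the two variables are meant to have on a bare path, mirroring the os.getenv defaults shown in the diff snippet below; the netloc value is a placeholder:

import os
from urllib.parse import urlparse

# Hypothetical values; in the PR these come from DEFAULT_SCHEME / DEFAULT_NETLOC.
os.environ["DEFAULT_SCHEME"] = "hdfs"
os.environ["DEFAULT_NETLOC"] = "namenode.example.com:9000"

location = "/jobs/openhouse/my_db/my_table/metadata.json"  # no scheme or netloc

uri = urlparse(location)
scheme = uri.scheme or os.getenv("DEFAULT_SCHEME", "file")
netloc = uri.netloc or os.getenv("DEFAULT_NETLOC", "")
print(scheme, netloc, uri.path)
# -> hdfs namenode.example.com:9000 /jobs/openhouse/my_db/my_table/metadata.json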

return uri.scheme, uri.netloc, uri.path

# Load defaults from environment
default_scheme = os.getenv("DEFAULT_SCHEME", "file")

Can we use the central config instead of direct usage of env variables? (pyiceberg/utils/config.py)

This would enable configuration via file OR env variables, which is how most other configs are documented and exposed to catalog construction.


cbb330 commented Aug 6, 2025

Thanks Tom! Overall excited for this PR because I've had to hack around this with an overridden PyArrowFileIO, e.g.

from urllib.parse import urlparse

from typing_extensions import override

from pyiceberg.io import HDFS_HOST, HDFS_PORT
from pyiceberg.io.pyarrow import PyArrowFile, PyArrowFileIO


class HDFSFileIO(PyArrowFileIO):
    """Simple PyArrowFileIO that defaults paths without a scheme to HDFS."""

    @override
    def new_input(self, location: str) -> PyArrowFile:
        """Rewrite scheme-less absolute paths to hdfs:// before delegating."""
        if not urlparse(location).scheme and location.startswith('/'):
            hdfs_host = self.properties.get(HDFS_HOST, 'localhost')
            hdfs_port = self.properties.get(HDFS_PORT, '9000')
            location = f'hdfs://{hdfs_host}:{hdfs_port}{location}'
        return super().new_input(location)
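A hypothetical way this workaround gets wired up (host/port values are placeholders, and actually opening the file still requires a reachable HDFS cluster with the Hadoop client libraries available):

from pyiceberg.io import HDFS_HOST, HDFS_PORT

# Placeholder host/port; HDFS_HOST and HDFS_PORT are pyiceberg's property keys.
io = HDFSFileIO(properties={HDFS_HOST: "namenode.example.com", HDFS_PORT: "9000"})

# "/jobs/..." has no scheme, so new_input rewrites it to
# "hdfs://namenode.example.com:9000/jobs/..." before delegating to PyArrowFileIO.
input_file = io.new_input("/jobs/openhouse/my_db/my_table/metadata.json")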

@kevinjqliu (Contributor) left a comment

Thanks for the PR! I added a comment about passing the properties and structuring the HDFS-specific code.

Comment on lines +401 to +403
default_scheme = config.get_str("default-scheme") or "file"
default_netloc = config.get_str("default-netloc") or ""


Nit: I think it's better to pass these in through the properties field:
https://py.iceberg.apache.org/configuration/#hdfs

We can get the env variable and then pass it into the properties.
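A rough sketch of that suggestion (not part of the PR): read the env vars once at construction time and feed them to the FileIO through its properties, using the hdfs.host / hdfs.port keys documented at the link above; the env var names are the ones this PR introduces:

import os

from pyiceberg.io import HDFS_HOST, HDFS_PORT
from pyiceberg.io.pyarrow import PyArrowFileIO

properties = {}
if "DEFAULT_NETLOC" in os.environ:
    # Split "host:port" into the two documented HDFS properties.
    host, _, port = os.environ["DEFAULT_NETLOC"].partition(":")
    properties[HDFS_HOST] = host
    properties[HDFS_PORT] = port or "9000"

file_io = PyArrowFileIO(properties=properties)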

Comment on lines +404 to +415
# Apply logic
scheme = uri.scheme or default_scheme
netloc = uri.netloc or default_netloc

if scheme in ("hdfs", "viewfs"):
    return scheme, netloc, uri.path
else:
    # (previous behaviour: return uri.scheme, uri.netloc, f"{uri.netloc}{uri.path}")
    # For non-HDFS URIs, include netloc in the path if present
    path = uri.path if uri.scheme else os.path.abspath(location)
    if netloc and not path.startswith(netloc):
        path = f"{netloc}{path}"
    return scheme, netloc, path
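To make the branches above easier to follow, here is a small stand-alone replica of that logic (hypothetical, for illustration only; the default scheme/netloc values are placeholders):

import os
from urllib.parse import urlparse

def parse_location(location, default_scheme="hdfs", default_netloc="namenode.example.com:9000"):
    uri = urlparse(location)
    scheme = uri.scheme or default_scheme
    netloc = uri.netloc or default_netloc
    if scheme in ("hdfs", "viewfs"):
        # HDFS/ViewFS: keep the netloc separate, the path stays a plain absolute path
        return scheme, netloc, uri.path
    # Other filesystems keep the netloc in the path, as before
    path = uri.path if uri.scheme else os.path.abspath(location)
    if netloc and not path.startswith(netloc):
        path = f"{netloc}{path}"
    return scheme, netloc, path

assert parse_location("/jobs/db/table") == ("hdfs", "namenode.example.com:9000", "/jobs/db/table")
assert parse_location("s3://bucket/key") == ("s3", "bucket", "bucket/key")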
@kevinjqliu (Contributor) commented Aug 7, 2025

I actually really want to get rid of this if {scheme} logic here.

Is there a way to refactor these changes down into _initialize_hdfs_fs, so we can keep the HDFS logic in the same place?

@mccormickt12 (Author)

I don't see a nice way to do this, since the path used in the PyArrowFile is actually different in the different cases. I tried to see if we could use the same path with the netloc in it for HDFS, but it doesn't seem to work:
#2291 (comment)

@mccormickt12 (Author)

This shows that setting the netloc on filesystem creation and also having it in the path (as is done for the other fs types) doesn't work for HDFS:

>>> hdfs = fs.HadoopFileSystem(host='ltx1-yugioh-cluster01.linkfs.prod-ltx1.atd.prod.linkedin.com', port=9000)
25/08/07 17:21:12 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>> table_base = "/jobs/openhouse/cutover_zdt_testing_db/cutover_zdt_testing_table_partitioned_one-f814050d-6416-4fa8-ae85-c63ac74b4567"
>>> long_table_base = "ltx1-yugioh-cluster01.linkfs.prod-ltx1.atd.prod.linkedin.com/jobs/openhouse/cutover_zdt_testing_db/cutover_zdt_testing_table_partitioned_one-f814050d-6416-4fa8-ae85-c63ac74b4567"
>>> hdfs.get_file_info(fs.FileSelector(table_base))
25/08/07 17:22:00 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
[<FileInfo for '/jobs/openhouse/cutover_zdt_testing_db/cutover_zdt_testing_table_partitioned_one-f814050d-6416-4fa8-ae85-c63ac74b4567/00000-3ec53886-ceae-46f2-a926-050afb7f95b9.metadata.json': type=FileType.File, size=2900>, <FileInfo for '/jobs/openhouse/cutover_zdt_testing_db/cutover_zdt_testing_table_partitioned_one-f814050d-6416-4fa8-ae85-c63ac74b4567/00001-fc1f6c92-0449-4deb-8908-097db5f6589a.metadata.json': type=FileType.File, size=4366>, <FileInfo for '/jobs/openhouse/cutover_zdt_testing_db/cutover_zdt_testing_table_partitioned_one-f814050d-6416-4fa8-ae85-c63ac74b4567/data': type=FileType.Directory>, <FileInfo for '/jobs/openhouse/cutover_zdt_testing_db/cutover_zdt_testing_table_partitioned_one-f814050d-6416-4fa8-ae85-c63ac74b4567/metadata': type=FileType.Directory>]
>>> hdfs.get_file_info(fs.FileSelector(long_table_base))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/_fs.pyx", line 582, in pyarrow._fs.FileSystem.get_file_info
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
FileNotFoundError: [Errno 2] HDFS list directory failed. Detail: [errno 2] No such file or directory

@kevinjqliu (Contributor) commented Aug 10, 2025

What is the proper way to address an absolute path in HadoopFileSystem? Your example shows that /path/to/file works but {host}/path/to/file does not. Should {host}/path/to/file also work?

I'm trying to see what the requirements are here. I only found examples with hdfs://.

Also, I'm curious whether HadoopFileSystem.from_uri will work for long_table_base.
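Not verified here, but a hypothetical way to poke at that last question (requires a Hadoop client environment; the host is the one from the transcript above):

from pyarrow import fs

# from_uri parses scheme/host/port out of a full hdfs:// URI; whether the bare
# "host/path" form of long_table_base is accepted is exactly what would need checking.
hdfs = fs.HadoopFileSystem.from_uri(
    "hdfs://ltx1-yugioh-cluster01.linkfs.prod-ltx1.atd.prod.linkedin.com:9000"
)
table_base = "/jobs/openhouse/cutover_zdt_testing_db/cutover_zdt_testing_table_partitioned_one-f814050d-6416-4fa8-ae85-c63ac74b4567"
print(hdfs.get_file_info(fs.FileSelector(table_base)))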
