@OneSizeFitsQuorum
Contributor

@OneSizeFitsQuorum OneSizeFitsQuorum commented Nov 5, 2025

Make sure the value obtained from getattr(f1, "size", None) ends up as an int, so that we can set the size properly here.

Allow 'size' parameter to be a callable that returns an int.
@OneSizeFitsQuorum OneSizeFitsQuorum changed the title Update size parameter to accept callable Update size parameter to accept callable for callback.set_size() Nov 5, 2025
@martindurant
Member

This is maybe OK, but I didn't understand the original issue.

Certainly, this PR should add documentation about how to use this change, and a test showing it in practice.

@OneSizeFitsQuorum
Contributor Author

Thanks a lot for reviewing this! Maybe we could change this line from callback.set_size(getattr(f1, "size", None)) to callback.set_size(getattr(f1, "size", None)()) here,
or add a wrapper function to ensure that the parameter passed to set_size is an int; then we would not need to update set_size(). What's your opinion?
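
The wrapper option mentioned above could look roughly like the sketch below. `resolve_size` is a hypothetical helper name, not part of fsspec, and the sketch assumes that a callable `size` takes no arguments; the three small classes exist only to exercise it.

```python
def resolve_size(obj):
    """Return obj.size, calling it first if it is callable; None if absent.

    Hypothetical helper for illustration only; not part of fsspec.
    """
    size = getattr(obj, "size", None)
    if callable(size):
        # Assumption: a callable size takes no arguments.
        size = size()
    return size


class AttrFile:
    """size is a plain attribute."""

    size = 10


class MethodFile:
    """size is a zero-argument method."""

    def size(self):
        return 20


class NoSize:
    """No size at all."""
```

With this in place the call site would pass `resolve_size(f1)` to `set_size` and the callback itself would stay unchanged.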

@OneSizeFitsQuorum
Contributor Author

@martindurant Hi, could you please review this again?

@martindurant
Member

I guess this is fine, but I still don't know why size would be a function: can you make a test case that hits this extra line?

@OneSizeFitsQuorum
Contributor Author

@martindurant HadoopFileSystem inherits from ArrowFSWrapper, and ArrowFSWrapper inherits from AbstractFileSystem. However, AbstractFileSystem does not have a size field, only a size function. Therefore, when you use getattr(f, "size", None) to retrieve it, you get a callable function rather than a field. You need to invoke the function to obtain the corresponding size value.

@martindurant
Member

Yes, agreed that size is a method on filesystems, .size(path)->int; I think the .size you are after is an attribute on a file object, though - no? e.g., https://github.com/fsspec/filesystem_spec/blob/master/fsspec/spec.py#L1920 .
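
To make the distinction concrete, here is a small sketch with two stand-in classes (hypothetical, not the real fsspec classes): one exposes size as a method taking a path, like a filesystem, and the other as a plain attribute, like a buffered file object.

```python
class FakeFileSystem:
    """Stand-in for a filesystem: size is a method that takes a path."""

    def __init__(self, sizes):
        self._sizes = sizes

    def size(self, path):
        return self._sizes[path]


class FakeBufferedFile:
    """Stand-in for a file object: size is a plain int attribute."""

    def __init__(self, size):
        self.size = size


fs = FakeFileSystem({"/data/a.txt": 5})
f = FakeBufferedFile(5)

fs_size = getattr(fs, "size", None)   # bound method; must be called with a path
file_size = getattr(f, "size", None)  # already an int

print(callable(fs_size), fs_size("/data/a.txt"))
print(callable(file_size), file_size)
```

The same getattr call therefore produces two different kinds of value depending on which object it is applied to.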

Signed-off-by: OneSizeFitsQuorum <[email protected]>
@OneSizeFitsQuorum
Contributor Author

@martindurant Yes! AbstractBufferedFile has a size attribute, but HadoopFileSystem inherits from ArrowFSWrapper, and ArrowFSWrapper inherits from AbstractFileSystem. So HadoopFileSystem only has a size method rather than an attribute. That's why I encountered this error and submitted this PR.

fs.get_file(remote_dir + "/test_file.txt", str(local_file))
with open(local_file, "rb") as f:
    assert f.read() == data
with pytest.raises(OSError, match="only valid on seekable files"):
Member

@martindurant martindurant Nov 17, 2025

Why can't we get() the file just because it's not seekable? Isn't that exactly what the code before this PR was doing?

Contributor Author

Because we call size() inside get_file().

Member

But we can read a stream until it's done. We don't want that workflow to become unusable.
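
As a sketch of the workflow referred to here, a stream of unknown length can still be copied chunk by chunk until read() returns an empty bytes object. The `copy_stream` helper and the tiny chunk size are illustrative assumptions, not fsspec code.

```python
import io


def copy_stream(src, dst, chunk_size=4):
    """Copy src to dst without knowing the total size in advance."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # empty bytes means end of stream
            break
        dst.write(chunk)
        total += len(chunk)
    return total


src = io.BytesIO(b"hello world")
dst = io.BytesIO()
copied = copy_stream(src, dst)
```

Nothing in the loop needs a size up front, which is why a failed size lookup should not by itself prevent the transfer.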

@martindurant
Member

OK, finally I understand your problem. Our open methods return fsspec.implementations.arrow.ArrowFile. These defer to arrow for various methods, including size. And the arrow definition of size (https://arrow.apache.org/docs/python/generated/pyarrow.NativeFile.html#pyarrow.NativeFile.size) is a method, not an attribute.

So what we actually need to do is, rather than change the callback mechanism, make size a property:

    @property
    def size(self):
        return self.stream.size()

and actually this gives a chance to intercept too - are there cases where we already know the size rather than having to calculate it again?
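
A self-contained sketch of that idea, using stand-in classes rather than the real ArrowFile and pyarrow.NativeFile: the wrapper exposes the underlying stream's size() method as a size property, so getattr(f, "size", None) yields an int instead of a bound method.

```python
class FakeNativeFile:
    """Stand-in for a stream whose size is a method."""

    def __init__(self, data):
        self._data = data

    def size(self):
        return len(self._data)


class WrappedFile:
    """Stand-in wrapper that turns the stream's size() into a property."""

    def __init__(self, stream):
        self.stream = stream

    @property
    def size(self):
        return self.stream.size()


f = WrappedFile(FakeNativeFile(b"abcdef"))
size = getattr(f, "size", None)  # an int, no call needed
```

Because the property body is ordinary code, it is also the natural place to return an already-known size instead of asking the stream again.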

@OneSizeFitsQuorum
Contributor Author

@martindurant
Apologies for the unclear wording earlier, which cost you extra time. Fortunately, you now fully understand my point, and I appreciate your patience!

So, if I understand correctly, you're suggesting that we should submit a PR to pyarrow to add a size attribute? However, there are two concerns:

  • This would introduce a variable and a method with the same name in pyarrow, which is generally not best practice and could lead to confusion.
  • I'm unsure if other fsspec implementations also use size as a method. If that’s the case, making a more generalized change in fsspec seems like the most cost-effective solution, and it would also improve the adaptability of fsspec.

What’s your opinion on this?

@martindurant
Member

No, there is no need to change arrow - the class is here: https://github.com/fsspec/filesystem_spec/blob/master/fsspec/implementations/arrow.py#L230

Signed-off-by: OneSizeFitsQuorum <[email protected]>
Signed-off-by: OneSizeFitsQuorum <[email protected]>
@OneSizeFitsQuorum OneSizeFitsQuorum changed the title Update size parameter to accept callable for callback.set_size() Fix get_file failed when enable child callback to regard size as a attribute Nov 19, 2025
@OneSizeFitsQuorum OneSizeFitsQuorum changed the title Fix get_file failed when enable child callback to regard size as a attribute Fix get_file failed when enabling child callback to regard size as a attribute Nov 19, 2025
Signed-off-by: OneSizeFitsQuorum <[email protected]>
Signed-off-by: OneSizeFitsQuorum <[email protected]>
Signed-off-by: OneSizeFitsQuorum <[email protected]>
@OneSizeFitsQuorum
Contributor Author

OneSizeFitsQuorum commented Nov 19, 2025

@martindurant Yes, you've found the most elegant solution! I found that for the get function, once the callback passed is not the default callback but rather a TqdmCallback with some actual behavior, it triggers an error. After making size a property, I don't see any more issues.
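
The following sketch illustrates why a no-op callback can hide the problem while a callback that actually uses the size exposes it. Both classes are hypothetical stand-ins; this is not how TqdmCallback is implemented, only an assumption about the general failure mode.

```python
class NoOpCallback:
    """Stand-in for a default callback that ignores the size."""

    def set_size(self, size):
        pass


class PercentCallback:
    """Stand-in for a progress callback that does arithmetic on size."""

    def set_size(self, size):
        self.size = size

    def percent(self, done):
        return 100 * done / self.size


def size_method():
    """Plays the role of a size that is a method rather than an int."""
    return 8


noop = NoOpCallback()
noop.set_size(size_method)  # silently accepted, nothing ever uses it

cb = PercentCallback()
cb.set_size(size_method)  # stored, but it is a function, not an int
try:
    cb.percent(4)
    failed = False
except TypeError:
    failed = True  # dividing by a function raises TypeError

cb.set_size(size_method())  # passing the int works
```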

BTW, the remaining CI failure seems unrelated to this PR: I commented out all my changes in this commit, and the CI still fails.

Member

@martindurant martindurant left a comment

Can you please remove "size" from the list at L226?
