-
Notifications
You must be signed in to change notification settings - Fork 416
Fix get_file failed when enabling child callback to regard size as a attribute #1944
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Allow 'size' parameter to be a callable that returns an int.
|
This is maybe OK, but I didn't understand the original issue. Certainly, this PR should add documentation about how to use this change, and a test showing it in practice. |
|
Thanks a lot for reviewing this! Maybe we can change this line from |
|
@martindurant hi, could you please review this again~ |
|
I guess this is fine, but I still don't know why |
|
@martindurant HadoopFileSystem inherits from ArrowFSWrapper, and ArrowFSWrapper inherits from AbstractFileSystem. However, AbstractFileSystem does not have a size field, only a size function. Therefore, when you use getattr(f, "size", None) to retrieve it, you get a callable function rather than a field. You need to invoke the function to obtain the corresponding size value. |
|
Yes, agreed that size is a method on filesystems, .size(path)->int; I think the .size you are after is an attribute on a file object, though - no? e.g., https://github.com/fsspec/filesystem_spec/blob/master/fsspec/spec.py#L1920 . |
7c63e02 to
607cacb
Compare
Signed-off-by: OneSizeFitsQuorum <[email protected]>
607cacb to
29e2ae8
Compare
|
@martindurant Yes! AbstractBufferedFile has a size attribute, but HadoopFileSystem inherits from ArrowFSWrapper, and ArrowFSWrapper inherits from AbstractFileSystem. So HadoopFileSystem only has a size method rather than an attribute. That's why I encounter this error and submit this pr~ |
Signed-off-by: OneSizeFitsQuorum <[email protected]>
| fs.get_file(remote_dir + "/test_file.txt", str(local_file)) | ||
| with open(local_file, "rb") as f: | ||
| assert f.read() == data | ||
| with pytest.raises(OSError, match="only valid on seekable files"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't we get() the file just because it's not seekable? Isn't that exactly what the code before this PR was doing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because we call size() inside the get_file()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But we can read a stream until it's done. We don't want that workflow to become unusable.
|
OK, finally I understand your problem. Our open methods return fsspec.implementations.arrow.ArrowFile . These defer to arrow for various methods including size. And the arrow definition of size ( https://arrow.apache.org/docs/python/generated/pyarrow.NativeFile.html#pyarrow.NativeFile.size ) is a method not an attribute. So what we actually need to do is, rather than change the callback mechanism, make size a property: and actually this gives a chance to intercept too - are there cases where we already know the size rather than having to calculate it again? |
|
@martindurant So, if I understand correctly, you're suggesting that we should submit a PR to pyarrow to add a size attribute? However, there are two concerns:
What’s your opinion on this? |
|
No, there is no need to change arrow - the class is here: https://github.com/fsspec/filesystem_spec/blob/master/fsspec/implementations/arrow.py#L230 |
Signed-off-by: OneSizeFitsQuorum <[email protected]> Signed-off-by: OneSizeFitsQuorum <[email protected]>
0b43c35 to
0aa6da5
Compare
Signed-off-by: OneSizeFitsQuorum <[email protected]>
Signed-off-by: OneSizeFitsQuorum <[email protected]>
Signed-off-by: OneSizeFitsQuorum <[email protected]>
|
@martindurant Yes, you've found the most elegant solution! I found that for the get function, once the callback passed is not the default default lookback but rather a tqdmcallback with some actual behavior, it triggers an error. After setting size as a property, I don't see any more issues BTW, the left ci failure seems irrelevant with this pr, as I comment out all my changes in this commit, but the ci still fails |
martindurant
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please remove "size" from the list at L226

Allow
getattr(f1, "size", None)returns an int. So that we can set size properly here