-
Notifications
You must be signed in to change notification settings - Fork 1k
add fread file connection support #7422
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
Generated via commit 6f4d90f Download link for the artifact containing the test results: ↓ atime-results.zip
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #7422 +/- ##
========================================
Coverage 99.13% 99.13%
========================================
Files 85 85
Lines 16618 16720 +102
========================================
+ Hits 16474 16576 +102
Misses 144 144 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Looking forward to reviewing this. I was wondering who was going to be
the first brave soul to start using the connections API after they
"officially" became experimental in r88181-r88223 (April-May 2025). Not
counting the even braver souls who've been working around the non-API
checks, 'curl' and 'iotools' are the only CRAN packages that openly
link to these entry points (as they have been doing for a long time).
|
|
I guess there are some more cool kids on the CRAN block using |
aitap
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be good after fixing the potential resource leak caused by errors in R_ReadConnection.
R/fread.R
Outdated
|
|
||
| needs_reopen = FALSE | ||
| if (con_open) { | ||
| binary_modes = c("rb", "r+b", "wb", "w+b", "ab", "a+b") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use of binary modes disables both newline conversions and encoding conversions. Would a user want to use file(encoding="...") to decode a file not in UTF-8 or native encoding?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could not make file(encoding=...) work with the example in #6148 and certain text modes, hence, I left it as a follow up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apparently, decoding the data requires calling the fgetc connection method, not the read method. No other method calls iconv to decode data.
spillConnectionToFile could choose between the two implementations depending on whether con->text is TRUE. If the connection is not yet open but con->encname is specified, it could perform the readLines trick for decoding to UTF-8 instead of the native encoding.
src/freadR.c
Outdated
| size_t nrows_seen = 0; | ||
|
|
||
| while (true) { | ||
| size_t nread = R_ReadConnection(con, buffer, chunk_size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some read methods can signal R errors (e.g. bzfile, zstdfile for corrupted files). We can move the loop into a function to be called by R_ExecWithCleanup or use more static variables with deferred cleanup in case of a failed call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice find! We should probably do this in fread too right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly, yes. There may be even a clever way to implement that without having to teach fread.c about R API.
| STOP(_("spillConnectionToFile: failed to allocate buffer")); // # nocov | ||
| } | ||
|
|
||
| size_t total_read = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An R connection may produce more than 4 gigabytes of data even on a 32-bit system, because R enables Large File Support. fread had no use for it because it relies on the file fitting in the address space to work, so I think that for a large file, fwrite below would fail before it could overflow total_read. So it should be safe to use size_t here.
Co-authored-by: aitap <[email protected]>
Co-authored-by: aitap <[email protected]>
Co-authored-by: aitap <[email protected]>
Wrap the helper functions too. Avoid double negatives.
Otherwise truncation occurs silently, possibly setting the limit to something like 100.
CHAR() could in theory return Latin-1 or UTF-8 text. translateChar() checks the encoding bits and only converts if needed, releasing the memory upon return from the .Call().

Closes #561
Also closes #4329 with a nice workaround by just wrapping the file to read with
file()and use the connection interface.Spills to file since this seems to cover almost all cases. Also respects
nrowsparameter, so for peeking it does not need to spill the whole file.We can't magically make file connections faster since we do not have random access like with
mmap.For the benchmark below, note that my
tempdiralready points to my SSD (therefore we cant see a big difference) and I dont have a HDD on this PC.Extending this to
1e8rows instead of1e7and verbose also shows that half of the time spent is for spilling to disk (for large files).