ben-schwen (Member) commented Nov 8, 2025

Closes #561
Also closes #4329, with a nice workaround: just wrap the file to read in file() and use the connection interface.

Spills the connection to a temporary file, since this seems to cover almost all cases. It also respects the nrows parameter, so peeking at a file does not require spilling the whole file.

We can't magically make file connections faster since we do not have random access like with mmap.
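
To make the spill-and-peek idea concrete, here is a minimal sketch in plain C. It is not the code from this PR: `spill_stream`, `max_lines` and the use of stdio `FILE*` streams are assumptions for illustration, whereas the PR reads from an R connection. The sketch copies a stream in fixed-size chunks and stops as soon as the requested number of lines has been seen.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, for illustration only.
 * Copies `in` to `out` in chunks of `chunk_size` bytes.
 * If max_lines > 0, copying stops right after the max_lines-th newline,
 * so a caller that only wants a few rows never spills the whole input.
 * Returns the number of bytes written, or (size_t)-1 on failure.
 * Note: fread/fwrite below are the C stdio functions, not data.table's. */
size_t spill_stream(FILE *in, FILE *out, size_t chunk_size, size_t max_lines)
{
    char *buffer = malloc(chunk_size);
    if (!buffer) return (size_t)-1;

    size_t total = 0, lines_seen = 0;
    for (;;) {
        size_t nread = fread(buffer, 1, chunk_size, in);
        if (nread == 0) break;

        size_t keep = nread;  /* bytes of this chunk to write */
        if (max_lines > 0) {
            const char *p = buffer, *end = buffer + nread;
            while (p < end && (p = memchr(p, '\n', (size_t)(end - p))) != NULL) {
                p++;  /* step past the newline */
                if (++lines_seen == max_lines) {
                    keep = (size_t)(p - buffer);
                    break;
                }
            }
        }

        if (fwrite(buffer, 1, keep, out) != keep) {
            free(buffer);
            return (size_t)-1;
        }
        total += keep;
        if (max_lines > 0 && lines_seen >= max_lines) break;
    }

    int failed = ferror(in);
    free(buffer);
    return failed ? (size_t)-1 : total;
}
```

A caller that wants a header plus n data rows would pass max_lines = n + 1 in this sketch. Embedded newlines inside quoted fields are not handled here.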

For the benchmark below, note that my tempdir already points to my SSD (so we can't see a big difference), and I don't have an HDD on this PC.

[Figure: fread_con_vs_readtable benchmark plot]

Extending this to 1e8 rows instead of 1e7 and setting verbose=TRUE also shows that, for large files, half of the time is spent spilling to disk.

Read 100000000 rows x 4 columns from 3.916GiB (4205299167 bytes) file in 00:02.151 wall clock time
[12] Finalizing the datatable
  Type counts:
         1 : int32     '7'
         2 : float64   '9'
         1 : string    'E'
=============================
   2.158s ( 50%) Spill connection to tempfile (3.916GiB)
   0.000s (  0%) Memory map 3.916GiB file
   0.002s (  0%) sep=',' ncol=4 and header detection
   0.000s (  0%) Column type detection using 10049 sample rows
   0.413s ( 10%) Allocation of 109975410 rows x 4 cols (2.868GiB) of which 100000000 ( 91%) rows used
   1.736s ( 40%) Reading 4010 chunks (0 swept) of 1.000MiB (each chunk 24937 rows) using 10 threads
   +    0.575s ( 13%) Parse to row-major thread buffers (grown 0 times)
   +    0.759s ( 18%) Transpose
   +    0.402s (  9%) Waiting
   0.000s (  0%) Rereading 0 columns due to out-of-sample type exceptions
   4.309s        Total

library(data.table)
library(atime)
set.seed(123)
N = 1e7
test_df <- data.frame(
    a = sample(1:1000, N, replace=TRUE),
    b = rnorm(N),
    c = sample(letters, N, replace=TRUE),
    d = runif(N)
)
f = tempfile(fileext = '.csv')
fwrite(test_df, f)

Nseq = 10^seq(2, log10(N), .25)
read = atime(N = Nseq, seconds.limit=1,
    fread_con = fread(file(f), nrows = N),
    fread_con_RAM = fread(file(f), nrows = N, tmpdir = "/dev/shm"),
    readtable = read.table(file(f), header=TRUE, sep=',', nrows = N),
    fread = fread(f, nrows = N)
)

plot(read)

github-actions bot commented Nov 8, 2025

  • HEAD=fread_connections slower P<0.001 for setDT improved in #5427
    Comparison Plot

Generated via commit 6f4d90f

Download link for the artifact containing the test results: ↓ atime-results.zip

Task Duration
R setup and installing dependencies 2 minutes and 48 seconds
Installing different package versions 1 minutes and 6 seconds
Running and plotting the test cases 2 minutes and 56 seconds

codecov bot commented Nov 8, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.13%. Comparing base (df7fa80) to head (6f4d90f).
⚠️ Report is 9 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff            @@
##           master    #7422    +/-   ##
========================================
  Coverage   99.13%   99.13%            
========================================
  Files          85       85            
  Lines       16618    16720   +102     
========================================
+ Hits        16474    16576   +102     
  Misses        144      144            

aitap (Contributor) commented Nov 8, 2025 via email

ben-schwen (Member, Author)

I guess there are some more cool kids on the CRAN block using R_GetConnection (although not too many)

aitap (Contributor) left a comment

Should be good after fixing the potential resource leak caused by errors in R_ReadConnection.

R/fread.R Outdated

needs_reopen = FALSE
if (con_open) {
binary_modes = c("rb", "r+b", "wb", "w+b", "ab", "a+b")
Contributor

Use of binary modes disables both newline conversions and encoding conversions. Would a user want to use file(encoding="...") to decode a file not in UTF-8 or native encoding?

Member Author

I could not make file(encoding=...) work with the example in #6148 and certain text modes, so I left it as a follow-up.

Contributor

Apparently, decoding the data requires calling the fgetc connection method, not the read method. No other method calls iconv to decode data.

spillConnectionToFile could choose between the two implementations depending on whether con->text is TRUE. If the connection is not yet open but con->encname is specified, it could perform the readLines trick for decoding to UTF-8 instead of the native encoding.

src/freadR.c Outdated
size_t nrows_seen = 0;

while (true) {
size_t nread = R_ReadConnection(con, buffer, chunk_size);
Contributor

Some read methods can signal R errors (e.g. bzfile, zstdfile for corrupted files). We can move the loop into a function to be called by R_ExecWithCleanup or use more static variables with deferred cleanup in case of a failed call.

Member Author

Nice find! We should probably do this in fread too, right?

Contributor

Possibly, yes. There may even be a clever way to implement that without having to teach fread.c about the R API.
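
For readers following this thread, the "static variables with deferred cleanup" option can be sketched in plain C as below. This is an illustration under loud assumptions, not the PR's code: `longjmp` stands in for an R error raised inside a connection's read method, and every name (`spill_cleanup`, `spill_with_deferred_cleanup`, `failing_reader`, `finite_reader`, `spill_resources_held`) is hypothetical. The R_ExecWithCleanup alternative is not shown.

```c
#include <setjmp.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Resources live in file-scope statics so they are still reachable
 * after a non-local exit skips the rest of the function. */
static char *spill_buffer = NULL;
static FILE *spill_file = NULL;

/* Release anything currently held. Safe to call repeatedly. */
void spill_cleanup(void)
{
    free(spill_buffer);
    spill_buffer = NULL;
    if (spill_file) {
        fclose(spill_file);
        spill_file = NULL;
    }
}

/* Reports whether a previous call left resources behind. */
int spill_resources_held(void)
{
    return spill_buffer != NULL || spill_file != NULL;
}

/* Hypothetical reader type: may jump away instead of returning. */
typedef size_t (*read_fn)(char *buf, size_t n, void *ctx);

int spill_with_deferred_cleanup(read_fn reader, void *ctx, size_t chunk_size)
{
    spill_cleanup();  /* deferred cleanup after an earlier failed call */

    spill_buffer = malloc(chunk_size);
    spill_file = tmpfile();
    if (!spill_buffer || !spill_file) {
        spill_cleanup();
        return -1;
    }

    size_t nread;
    while ((nread = reader(spill_buffer, chunk_size, ctx)) > 0) {
        if (fwrite(spill_buffer, 1, nread, spill_file) != nread) {
            spill_cleanup();
            return -1;
        }
    }

    spill_cleanup();  /* normal path: release immediately */
    return 0;
}

/* Stand-in for a read method that signals an error: never returns. */
size_t failing_reader(char *buf, size_t n, void *ctx)
{
    (void)buf;
    (void)n;
    longjmp(*(jmp_buf *)ctx, 1);
}

/* Stand-in for a healthy read method: yields *ctx bytes in total. */
size_t finite_reader(char *buf, size_t n, void *ctx)
{
    size_t *remaining = ctx;
    size_t k = *remaining < n ? *remaining : n;
    memset(buf, 'x', k);
    *remaining -= k;
    return k;
}
```

The trade-off in this sketch is that the buffer and file stay allocated after a failed call until the next call (or an explicit `spill_cleanup()`), whereas a cleanup callback would release them at the point of failure.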

STOP(_("spillConnectionToFile: failed to allocate buffer")); // # nocov
}

size_t total_read = 0;
Contributor

An R connection may produce more than 4 gigabytes of data even on a 32-bit system, because R enables Large File Support. fread had no use for it because it relies on the file fitting in the address space to work, so I think that for a large file, fwrite below would fail before it could overflow total_read. So it should be safe to use size_t here.
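
The comment above concludes that size_t is safe here because the write would fail first. If an explicit guard were wanted anyway, a checked accumulation could look like the following sketch; `add_bytes_checked` is a hypothetical helper and is not part of the PR.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: add nread to *total unless the sum would wrap
 * around SIZE_MAX. On overflow, *total is left unchanged and false is
 * returned so the caller can report an error instead of a bogus size. */
bool add_bytes_checked(size_t *total, size_t nread)
{
    if (nread > SIZE_MAX - *total) return false;
    *total += nread;
    return true;
}
```

On a 32-bit build SIZE_MAX is about 4 GiB, which is the case the comment is concerned with; on 64-bit builds the guard would effectively never trigger.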

ben-schwen and others added 8 commits November 11, 2025 23:37
Wrap the helper functions too. Avoid double negatives.
Otherwise truncation occurs silently, possibly setting the limit to
something like 100.
CHAR() could in theory return Latin-1 or UTF-8 text. translateChar()
checks the encoding bits and only converts if needed, releasing the
memory upon return from the .Call().

Development

Successfully merging this pull request may close these issues.

fread tries to map memory for the entire file when using nrows
[R-Forge #4931] Support file connections for fread

4 participants