add fread file connection support #7422

ben-schwen · 2025-11-08T15:20:16Z

Closes #561
Also closes #4329, with a nice workaround: wrap the file to read in file() and use the connection interface.

Spills the connection to a temporary file, since this seems to cover almost all cases. Also respects the nrows parameter, so peeking does not need to spill the whole file.
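To illustrate the spill-and-peek idea, here is a minimal standalone C sketch. It is not the code from this PR: `spill_stream`, `chunk_size` and `nrow_limit` are hypothetical names, and a plain `FILE *` stands in for the R connection (the `fread` called here is C's stdio function, not data.table's). It copies the input in chunks and stops once enough newlines have been seen, so a small row limit does not force the whole input to be copied.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Copy `in` to `out` in chunks of `chunk_size` bytes.
 * If nrow_limit > 0, stop at the nrow_limit-th newline; only the bytes
 * up to and including that newline are written.
 * Returns the number of bytes written, or (size_t)-1 on failure. */
size_t spill_stream(FILE *in, FILE *out, size_t chunk_size, size_t nrow_limit)
{
  assert(chunk_size > 0);
  char *buffer = malloc(chunk_size);
  if (!buffer) return (size_t)-1;

  size_t total = 0, rows_seen = 0;
  int done = 0;
  while (!done) {
    size_t nread = fread(buffer, 1, chunk_size, in);
    if (nread == 0) break;                /* end of input or read error */
    size_t nwrite = nread;
    if (nrow_limit > 0) {
      const char *p = buffer, *end = buffer + nread;
      /* Count newlines in this chunk until the limit is reached. */
      while (p < end && (p = memchr(p, '\n', (size_t)(end - p))) != NULL) {
        ++p;                              /* step past the newline */
        if (++rows_seen == nrow_limit) {
          nwrite = (size_t)(p - buffer);  /* cut the chunk at this line end */
          done = 1;
          break;
        }
      }
    }
    if (fwrite(buffer, 1, nwrite, out) != nwrite) {
      free(buffer);
      return (size_t)-1;
    }
    total += nwrite;
  }
  if (ferror(in)) total = (size_t)-1;
  free(buffer);
  return total;
}
```

A real implementation would also have to allow for the header line, skipped lines and quoted fields containing newlines; this sketch leaves all of that to the caller.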

We can't magically make file connections faster since we do not have random access like with mmap.

For the benchmark below, note that my tempdir already points to my SSD (so we can't see a big difference), and I don't have an HDD on this PC.

Extending this to 1e8 rows instead of 1e7 with verbose=TRUE also shows that, for large files, half of the time is spent spilling to disk.

Read 100000000 rows x 4 columns from 3.916GiB (4205299167 bytes) file in 00:02.151 wall clock time
[12] Finalizing the datatable
  Type counts:
         1 : int32     '7'
         2 : float64   '9'
         1 : string    'E'
=============================
   2.158s ( 50%) Spill connection to tempfile (3.916GiB)
   0.000s (  0%) Memory map 3.916GiB file
   0.002s (  0%) sep=',' ncol=4 and header detection
   0.000s (  0%) Column type detection using 10049 sample rows
   0.413s ( 10%) Allocation of 109975410 rows x 4 cols (2.868GiB) of which 100000000 ( 91%) rows used
   1.736s ( 40%) Reading 4010 chunks (0 swept) of 1.000MiB (each chunk 24937 rows) using 10 threads
   +    0.575s ( 13%) Parse to row-major thread buffers (grown 0 times)
   +    0.759s ( 18%) Transpose
   +    0.402s (  9%) Waiting
   0.000s (  0%) Rereading 0 columns due to out-of-sample type exceptions
   4.309s        Total

library(data.table)
library(atime)
set.seed(123)
N = 1e7
test_df <- data.frame(
    a = sample(1:1000, N, replace=TRUE),
    b = rnorm(N),
    c = sample(letters, N, replace=TRUE),
    d = runif(N)
)
f = tempfile(fileext = '.csv')
fwrite(test_df, f)

Nseq = 10^seq(2, log10(N), .25)
read = atime(N = Nseq, seconds.limit=1,
    fread_con = fread(file(f), nrows = N),
    fread_con_RAM = fread(file(f), nrows = N, tmpdir = "/dev/shm"),
    readtable = read.table(file(f), header=TRUE, sep=',', nrows = N),
    fread = fread(f, nrows = N)
)

plot(read)

github-actions · 2025-11-08T15:38:47Z

HEAD=fread_connections slower P<0.001 for setDT improved in #5427

Generated via commit 6f4d90f

Download link for the artifact containing the test results: ↓ atime-results.zip

Task	Duration
R setup and installing dependencies	2 minutes and 48 seconds
Installing different package versions	1 minutes and 6 seconds
Running and plotting the test cases	2 minutes and 56 seconds

codecov · 2025-11-08T15:58:07Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.13%. Comparing base (df7fa80) to head (6f4d90f).
⚠️ Report is 9 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff            @@
##           master    #7422    +/-   ##
========================================
  Coverage   99.13%   99.13%            
========================================
  Files          85       85            
  Lines       16618    16720   +102     
========================================
+ Hits        16474    16576   +102     
  Misses        144      144

aitap · 2025-11-08T16:48:33Z

Looking forward to reviewing this. I was wondering who was going to be the first brave soul to start using the connections API after they "officially" became experimental in r88181-r88223 (April-May 2025). Not counting the even braver souls who've been working around the non-API checks, 'curl' and 'iotools' are the only CRAN packages that openly link to these entry points (as they have been doing for a long time).

ben-schwen · 2025-11-08T17:38:23Z

I guess there are some more cool kids on the CRAN block using R_GetConnection (although not too many)

aitap

Should be good after fixing the potential resource leak caused by errors in R_ReadConnection.

R/fread.R

src/freadR.c

aitap · 2025-11-09T21:32:45Z

R/fread.R

+
+    needs_reopen = FALSE
+    if (con_open) {
+      binary_modes = c("rb", "r+b", "wb", "w+b", "ab", "a+b")


Use of binary modes disables both newline conversions and encoding conversions. Would a user want to use file(encoding="...") to decode a file not in UTF-8 or native encoding?

ben-schwen · 2025-11-10

I could not make file(encoding=...) work with the example in #6148 and certain text modes, so I left it as a follow-up.

aitap · 2025-11-13

Apparently, decoding the data requires calling the fgetc connection method, not the read method. No other method calls iconv to decode data.

spillConnectionToFile could choose between the two implementations depending on whether con->text is TRUE. If the connection is not yet open but con->encname is specified, it could perform the readLines trick for decoding to UTF-8 instead of the native encoding.

aitap · 2025-11-09T21:45:18Z

src/freadR.c

+  size_t nrows_seen = 0;
+
+  while (true) {
+    size_t nread = R_ReadConnection(con, buffer, chunk_size);


Some read methods can signal R errors (e.g. bzfile, zstdfile for corrupted files). We can move the loop into a function to be called by R_ExecWithCleanup or use more static variables with deferred cleanup in case of a failed call.

ben-schwen · 2025-11-10

Nice find! We should probably do this in fread too, right?

aitap · 2025-11-13

Possibly, yes. There may even be a clever way to implement that without having to teach fread.c about the R API.
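The leak discussed in this thread arises because an R error leaves the C function through a non-local jump, skipping any `free()` or `fclose()` placed after the read loop. The following standalone sketch imitates that situation with `setjmp`/`longjmp` and no R headers. Every name in it (`exec_with_cleanup`, `spill_state`, `failing_read`, and so on) is hypothetical; `exec_with_cleanup` only mimics the general shape of R_ExecWithCleanup and returns a flag, whereas a real R error would continue to propagate after the cleanup has run.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

jmp_buf error_target;   /* stands in for R's error-handling context */

typedef size_t (*read_fn)(void *buf, size_t n);

typedef struct {
  read_fn read;    /* stand-in for the connection's read method */
  char *buffer;
  FILE *out;
  int cleaned;
} spill_state;

/* A read method that "signals an error" by jumping out of the loop. */
size_t failing_read(void *buf, size_t n)
{
  (void)buf; (void)n;
  longjmp(error_target, 1);
}

/* A read method that reports end of input immediately. */
size_t empty_read(void *buf, size_t n)
{
  (void)buf; (void)n;
  return 0;
}

/* The loop body: acquires resources, then reads until end of input. */
void spill_body(void *data)
{
  spill_state *st = data;
  st->buffer = malloc(4096);
  st->out = tmpfile();
  if (!st->buffer || !st->out) return;
  size_t nread;
  while ((nread = st->read(st->buffer, 4096)) > 0)
    fwrite(st->buffer, 1, nread, st->out);
}

/* Releases whatever spill_body managed to acquire. */
void spill_cleanup(void *data)
{
  spill_state *st = data;
  free(st->buffer);
  st->buffer = NULL;
  if (st->out) { fclose(st->out); st->out = NULL; }
  st->cleaned = 1;
}

/* Run body, then always run cleanup. Returns 1 if body jumped out. */
int exec_with_cleanup(void (*body)(void *), void *data,
                      void (*cleanup)(void *), void *cleandata)
{
  assert(body != NULL && cleanup != NULL);
  volatile int failed = 0;
  if (setjmp(error_target) == 0)
    body(data);
  else
    failed = 1;       /* reached via longjmp from inside body */
  cleanup(cleandata);
  return failed;
}
```

Keeping the resources in a struct that both callbacks receive is what lets the cleanup see a buffer or file handle that was allocated just before the jump.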

aitap · 2025-11-09T22:00:50Z

src/freadR.c

+    STOP(_("spillConnectionToFile: failed to allocate buffer")); // # nocov
+  }
+
+  size_t total_read = 0;


An R connection may produce more than 4 gigabytes of data even on a 32-bit system, because R enables Large File Support. fread had no use for it because it relies on the file fitting in the address space to work, so I think that for a large file, fwrite below would fail before it could overflow total_read. So it should be safe to use size_t here.
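The comment above argues that no guard is needed for total_read. Purely for illustration, this is what an explicit overflow check on a size_t byte counter could look like; `add_bytes_checked` is a hypothetical helper and not something the review asks for.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Add nread to *total unless the sum would exceed SIZE_MAX.
 * On overflow, *total is left unchanged and false is returned. */
bool add_bytes_checked(size_t *total, size_t nread)
{
  assert(total != NULL);
  if (nread > SIZE_MAX - *total) return false;  /* sum would wrap around */
  *total += nread;
  return true;
}
```

The comparison is written as a subtraction from SIZE_MAX because computing the sum first would already have wrapped around.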

ben-schwen added 2 commits November 8, 2025 16:09

add fread connection support

00dc7bb

fix testnum

78bce0e

ben-schwen requested review from MichaelChirico and aitap as code owners November 8, 2025 15:20

make linterse happy

0afd468

make linters even more happy

fa79c8c

remove read bytes %d since this can overflow

9590e22

ben-schwen added 3 commits November 8, 2025 17:54

add coverage

58b3386

be fully experimental API compliant

995d2dc

more coverage

3866b6d

update error message for nrow and mmap

8294c6f

aitap reviewed Nov 9, 2025

MichaelChirico reviewed Nov 9, 2025

R/fread.R (outdated, resolved)

ben-schwen and others added 11 commits November 10, 2025 13:13

add wording changes

1b7cec7

Co-authored-by: aitap <[email protected]>

add connections guard

9b3c387

Co-authored-by: aitap <[email protected]>

add strerrors

3da8943

Co-authored-by: aitap <[email protected]>

add errno lib

f6f9ed3

add reopen_connection generic

5a98e62

close con on exit

d76c3a5

adjust doc

d520cd4

update conncection info

c3f7cf6

reopen connection

4235a5c

change modes

e37b0ee

update docs

2bcfc6c

ben-schwen commented Nov 10, 2025

R/fread.R (outdated, resolved)

add nocov

2e67cc2

use R_ExecWithCleanup to clean up on errors

4383ae2

MichaelChirico reviewed Nov 11, 2025

inst/tests/tests.Rraw (resolved)

MichaelChirico reviewed Nov 11, 2025

man/reopen_connection.Rd (outdated, resolved)

ben-schwen added 3 commits November 11, 2025 22:22

add test for consuming before fread

5e91780

use factory pattern

5182c0c

add aliases for S3 methods

441c557

MichaelChirico reviewed Nov 11, 2025

man/connection_opener.Rd (outdated, resolved)

ben-schwen and others added 8 commits November 11, 2025 23:37

Merge branch 'master' into fread_connections

63fa168

capture print in test

a609fda

fix namespace

daabbb7

rename to binary_reopener

1a98f38

More #ifdef wrapping for connections API

5eef830

Wrap the helper functions too. Avoid double negatives.

R_FINITE will always be true for a size_t argument

0c6eff5

Fail when nrow_limit exceeds SIZE_MAX

236bc5c

Otherwise truncation occurs silently, possibly setting the limit to something like 100.

Use translateChar() for native encoding string

6f4d90f

CHAR() could in theory return Latin-1 or UTF-8 text. translateChar() checks the encoding bits and only converts if needed, releasing the memory upon return from the .Call().
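The "Fail when nrow_limit exceeds SIZE_MAX" commit above concerns converting a row limit that arrives as a double into a size_t. A minimal sketch of such a range check follows, assuming the limit is passed as a plain double; `nrow_limit_to_size` is a hypothetical name, and how the PR treats an infinite limit is not shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Convert a row limit held in a double to size_t.
 * Returns false for NaN, negative values, infinity and anything too
 * large for size_t, instead of converting an out-of-range value. */
bool nrow_limit_to_size(double limit, size_t *out)
{
  assert(out != NULL);
  /* (double)SIZE_MAX may round up to a power of two that itself does
   * not fit in size_t, so the upper bound is exclusive. Writing the
   * test in negated form also rejects NaN. */
  if (!(limit >= 0.0 && limit < (double)SIZE_MAX)) return false;
  *out = (size_t)limit;   /* fractional part is discarded */
  return true;
}
```

With this check, a caller can raise an error for an oversized limit rather than continuing with whatever value the conversion happened to produce.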
