feat: Add support for reading whole text files to `read_text` by plotor · Pull Request #6354 · Eventual-Inc/Daft

plotor · 2026-03-06T07:04:17Z

Changes Made

Add a whole_text option to the read_text API so that a file's entire contents can be read as a single row. This matters in scenarios such as inference, where a text file may hold a complete prompt and should not be split line by line.
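To make the difference between the two modes concrete, here is a minimal, std-only sketch. It is a simplified model, not Daft's implementation: the helper names `rows_line_mode` and `rows_whole_text` are hypothetical, and the blank-content handling simply mirrors the `skip_blank_lines` check shown in the review diff.

```rust
/// Hypothetical helper modelling line-oriented mode: one row per line.
fn rows_line_mode(content: &str, skip_blank_lines: bool) -> Vec<String> {
    content
        .lines()
        // Drop whitespace-only lines when skip_blank_lines is set.
        .filter(|line| !(skip_blank_lines && line.trim().is_empty()))
        .map(str::to_owned)
        .collect()
}

/// Hypothetical helper modelling whole-text mode: at most one row per file.
fn rows_whole_text(content: &str, skip_blank_lines: bool) -> Vec<String> {
    // In this model a file whose entire content is blank produces no row.
    if skip_blank_lines && content.trim().is_empty() {
        return Vec::new();
    }
    vec![content.to_owned()]
}
```

For a multi-line prompt file, `rows_line_mode` would split the prompt into several rows, while `rows_whole_text` keeps it intact as one value, which is the behaviour the inference use case needs.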

Related Issues

greptile-apps · 2026-03-06T09:07:49Z

Greptile Summary

This PR adds a whole_text parameter to read_text that, when True, reads each file as a single row in the resulting DataFrame instead of splitting on newlines. The change is well-structured: a new early-return path in stream_text delegates to a dedicated read_into_whole_text_stream function, and the option is threaded cleanly through all relevant layers (TextConvertOptions, TextSourceConfig, the pyo3 constructor, and the Python API).

Key findings:

- Limit pushdown not respected in whole-text mode: read_into_whole_text_stream never checks convert_options.limit. In the line-oriented path, the remaining counter prevents excess reads; no equivalent guard exists for the whole-text path. If a scan task is assigned limit = Some(0) (all rows already satisfied upstream), the implementation will still read and yield the file, producing incorrect results.
- _chunk_size effect undocumented for whole_text=True: The chunk size parameter is silently ignored in whole-text mode; a brief doc note would prevent user confusion.
- Test coverage: Good new tests for empty files, globs, gzip, and path columns; a limit-pushdown test for whole_text=True would strengthen coverage of this edge case.

Confidence Score: 3/5

- The feature works correctly for the primary use case, but missing limit-pushdown validation in the whole-text path can produce incorrect row counts under certain query plans.
- The implementation is clean and well-tested for common scenarios. However, the logic gap around convert_options.limit in read_into_whole_text_stream is a real correctness issue: when a scan task is assigned limit=0 (all rows already satisfied upstream), the whole-text path will still yield a row, while the line-mode path would correctly yield nothing. This is a departure from the existing pattern and could silently produce incorrect query results under limit pushdown optimization, which is rare in practice but represents an oversight rather than a deliberate design decision.
- src/daft-text/src/read.rs: the read_into_whole_text_stream function needs a limit == Some(0) early-exit guard to match the behaviour of read_into_line_chunk_stream.
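The row-count problem can be illustrated with a small std-only model. This is a sketch under stated assumptions: `whole_text_rows` and `scan_with_limit` are hypothetical names, and the way the remaining limit is handed to each task is an assumption made for illustration, not a description of Daft's scheduler.

```rust
/// Simplified model of one whole-text scan task: each file is exactly one row.
/// `guarded` toggles the `limit == Some(0)` early exit (hypothetical flag).
fn whole_text_rows(file: &str, limit: Option<usize>, guarded: bool) -> Vec<String> {
    if guarded && limit == Some(0) {
        return Vec::new();
    }
    vec![file.to_owned()]
}

/// Hypothetical driver: passes each task the number of rows the query still needs.
fn scan_with_limit(files: &[&str], limit: usize, guarded: bool) -> Vec<String> {
    let mut remaining = limit;
    let mut out = Vec::new();
    for file in files {
        let rows = whole_text_rows(file, Some(remaining), guarded);
        // saturating_sub keeps the counter at zero if a task over-delivers.
        remaining = remaining.saturating_sub(rows.len());
        out.extend(rows);
    }
    out
}
```

With three files and a limit of 2, the guarded variant returns two rows, while the unguarded variant returns all three because the third task ignores its `Some(0)` limit.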

Last reviewed commit: ecaddb9

greptile-apps · 2026-03-06T09:07:52Z

src/daft-text/src/read.rs

+    Ok(try_stream! {
+        let mut content = String::new();
+        reader.read_to_string(&mut content).await?;
+
+        // Apply skip_blank_lines if needed (for whole file, this means skip if entire content is blank)
+        if convert_options.skip_blank_lines && content.trim().is_empty() {
+            return;
+        }
+
+        yield content;
+    })


Limit pushdown not respected in whole_text mode

convert_options.limit is completely ignored inside read_into_whole_text_stream. In the existing line-oriented path (read_into_line_chunk_stream), the limit is enforced via a remaining counter that short-circuits the loop when it reaches zero. Here, if a scan task is given limit = Some(0) (i.e., the overall query limit is already satisfied by prior scan tasks), this function will still read the entire file and yield one row — producing incorrect results.

A minimal guard at the top of the try_stream! block would address this:

Suggested change

-    Ok(try_stream! {
-        let mut content = String::new();
-        reader.read_to_string(&mut content).await?;
-
-        // Apply skip_blank_lines if needed (for whole file, this means skip if entire content is blank)
-        if convert_options.skip_blank_lines && content.trim().is_empty() {
-            return;
-        }
-
-        yield content;
-    })
+    Ok(try_stream! {
+        // Respect limit pushdown: in whole-text mode each file is exactly one row.
+        if convert_options.limit == Some(0) {
+            return;
+        }
+
+        let mut content = String::new();
+        reader.read_to_string(&mut content).await?;
+
+        // Apply skip_blank_lines if needed (for whole file, this means skip if entire content is blank)
+        if convert_options.skip_blank_lines && content.trim().is_empty() {
+            return;
+        }
+
+        yield content;
+    })
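One consequence of placing the guard before the read is that a task with a satisfied limit never touches the file. The synchronous sketch below shows this with `std::io::Read` in place of the async reader; `ConvertOptions` and `read_whole_text` are hypothetical stand-ins for illustration, not the types or functions in the PR.

```rust
use std::io::{self, Read};

/// Hypothetical stand-in for the conversion options referenced in the review.
struct ConvertOptions {
    limit: Option<usize>,
    skip_blank_lines: bool,
}

/// Synchronous sketch of a whole-text read.
/// Returns Ok(None) when no row should be produced for this file.
fn read_whole_text<R: Read>(mut reader: R, opts: &ConvertOptions) -> io::Result<Option<String>> {
    // Guard first, so nothing is read once the limit is already satisfied.
    if opts.limit == Some(0) {
        return Ok(None);
    }

    let mut content = String::new();
    reader.read_to_string(&mut content)?;

    // A file whose entire content is blank yields no row when skipping blanks.
    if opts.skip_blank_lines && content.trim().is_empty() {
        return Ok(None);
    }

    Ok(Some(content))
}
```

Passing a `std::io::Cursor` as the reader makes the effect observable: with `limit: Some(0)` the cursor position stays at 0, whereas with `limit: None` the full content comes back as a single value.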

daft/io/_text.py

Signed-off-by: plotor <zhenchao.wang@hotmail.com>

plotor · 2026-03-06T10:47:36Z

This is a supplementary implementation for #6111 that adds a whole_text parameter to control whether the entire text is loaded as a single row. Please review when you have time. Thanks, @desmondcheongzx

github-actions bot added the feat label Mar 6, 2026

plotor force-pushed the zhenchao-read-text branch from 3abc79b to ecaddb9 on March 6, 2026 08:54

plotor marked this pull request as ready for review March 6, 2026 08:57

plotor requested a review from a team as a code owner March 6, 2026 08:57

greptile-apps bot reviewed Mar 6, 2026


feat: Add support for reading whole text files to read_text

977676c

Signed-off-by: plotor <zhenchao.wang@hotmail.com>

plotor force-pushed the zhenchao-read-text branch from ecaddb9 to 977676c on March 6, 2026 09:52

plotor wants to merge 1 commit into Eventual-Inc:main from plotor:zhenchao-read-text
