
feat: Add support for reading whole text files to read_text#6354

Open
plotor wants to merge 1 commit into Eventual-Inc:main from plotor:zhenchao-read-text

plotor (Collaborator) commented Mar 6, 2026

Changes Made

Add a whole_text option to the read_text API that reads each file's entire contents as a single row. This suits cases such as inference, where a text file may hold a complete prompt and should not be split line by line.

Related Issues

github-actions bot added the feat label Mar 6, 2026
plotor force-pushed the zhenchao-read-text branch from 3abc79b to ecaddb9 on March 6, 2026 08:54
plotor marked this pull request as ready for review March 6, 2026 08:57
plotor requested a review from a team as a code owner March 6, 2026 08:57
greptile-apps bot (Contributor) commented Mar 6, 2026

Greptile Summary

This PR adds a whole_text parameter to read_text that, when True, reads each file as a single row in the resulting DataFrame instead of splitting on newlines. The change is well-structured: a new early-return path in stream_text delegates to a dedicated read_into_whole_text_stream function, and the option is threaded cleanly through all relevant layers (TextConvertOptions, TextSourceConfig, the pyo3 constructor, and the Python API).
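As a rough illustration of the two modes described above, the following sketch models how one file's contents map to rows. The function name `rows_from_file` is hypothetical and not part of the PR, and the sketch assumes line mode strips line terminators the way `str::lines` does.

```rust
/// Hypothetical model of the two read modes; not the PR's actual code.
/// Returns the rows that one file's contents would contribute.
fn rows_from_file(content: &str, whole_text: bool) -> Vec<String> {
    if whole_text {
        // Whole-text mode: the entire file becomes exactly one row,
        // embedded newlines included.
        vec![content.to_string()]
    } else {
        // Line mode: one row per line, terminators stripped (assumption).
        content.lines().map(String::from).collect()
    }
}
```

Under this model, a two-line prompt file contributes two rows in line mode and a single row containing both lines when `whole_text` is true.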

Key findings:

  • Limit pushdown not respected in whole-text mode: read_into_whole_text_stream never checks convert_options.limit. In the line-oriented path, the remaining counter prevents excess reads; no equivalent guard exists for the whole-text path. If a scan task is assigned limit = Some(0) (all rows already satisfied upstream), the implementation will still read and yield the file — producing incorrect results.

  • _chunk_size effect undocumented for whole_text=True: The chunk size parameter is silently ignored in whole-text mode; a brief doc note would prevent user confusion.

  • Test coverage: Good new tests for empty files, globs, gzip, and path columns; a limit-pushdown test for whole_text=True would strengthen coverage of this edge case.
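The remaining-counter behaviour mentioned in the first finding can be sketched roughly as follows. This is a synchronous, simplified model under stated assumptions: `read_lines_with_limit` is a hypothetical name, and the real read_into_line_chunk_stream is not shown in this thread, so its details may differ.

```rust
/// Hypothetical, synchronous model of a limit-aware line reader.
/// `limit == None` means "no limit".
fn read_lines_with_limit(content: &str, limit: Option<usize>) -> Vec<String> {
    let mut remaining = limit;
    let mut rows = Vec::new();
    for line in content.lines() {
        // Stop as soon as the limit is exhausted; with Some(0) this
        // breaks before the first line is emitted.
        if remaining == Some(0) {
            break;
        }
        rows.push(line.to_string());
        // Count down only when a limit is in effect.
        if let Some(n) = remaining.as_mut() {
            *n -= 1;
        }
    }
    rows
}
```

With `limit = Some(0)` this model yields no rows at all, which is the behaviour the whole-text path would need to match.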

Confidence Score: 3/5

  • The feature works correctly for the primary use case, but missing limit-pushdown validation in the whole-text path can produce incorrect row counts under certain query plans.
  • The implementation is clean and well-tested for common scenarios. However, the gap around convert_options.limit in read_into_whole_text_stream is a real correctness issue: when a scan task is assigned limit=0 (all rows already satisfied upstream), the whole-text path will still yield a row, while the line-mode path would correctly yield nothing. This departs from the existing pattern and could silently produce incorrect query results under limit pushdown optimization. It is likely rare in practice, and it looks like an oversight rather than a deliberate design decision.
  • src/daft-text/src/read.rs — the read_into_whole_text_stream function needs a limit == Some(0) early-exit guard to match the behaviour of read_into_line_chunk_stream.

Last reviewed commit: ecaddb9

Comment on lines +108 to +118
Ok(try_stream! {
    let mut content = String::new();
    reader.read_to_string(&mut content).await?;

    // Apply skip_blank_lines if needed (for whole file, this means skip if entire content is blank)
    if convert_options.skip_blank_lines && content.trim().is_empty() {
        return;
    }

    yield content;
})

Limit pushdown not respected in whole_text mode

convert_options.limit is completely ignored inside read_into_whole_text_stream. In the existing line-oriented path (read_into_line_chunk_stream), the limit is enforced via a remaining counter that short-circuits the loop when it reaches zero. Here, if a scan task is given limit = Some(0) (i.e., the overall query limit is already satisfied by prior scan tasks), this function will still read the entire file and yield one row — producing incorrect results.

A minimal guard at the top of the try_stream! block would address this:

Suggested change
Ok(try_stream! {
    let mut content = String::new();
    reader.read_to_string(&mut content).await?;
    // Apply skip_blank_lines if needed (for whole file, this means skip if entire content is blank)
    if convert_options.skip_blank_lines && content.trim().is_empty() {
        return;
    }
    yield content;
})

Ok(try_stream! {
    // Respect limit pushdown: in whole-text mode each file is exactly one row.
    if convert_options.limit == Some(0) {
        return;
    }
    let mut content = String::new();
    reader.read_to_string(&mut content).await?;
    // Apply skip_blank_lines if needed (for whole file, this means skip if entire content is blank)
    if convert_options.skip_blank_lines && content.trim().is_empty() {
        return;
    }
    yield content;
})
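To make the effect of the suggested guard easier to check in isolation, here is a minimal blocking stand-in that uses only the standard library. It is a sketch, not the PR's code: `read_whole_text` is a hypothetical name, the async `try_stream!` block is replaced by a plain function over `std::io::Read`, and "yields one row or none" is modeled as `Option<String>`.

```rust
use std::io::{self, Read};

/// Hypothetical synchronous stand-in for the guarded whole-text path.
/// Returns `Ok(None)` when the file contributes no row.
fn read_whole_text<R: Read>(
    mut reader: R,
    limit: Option<usize>,
    skip_blank_lines: bool,
) -> io::Result<Option<String>> {
    // Guard first: with a limit of zero, return before reading anything.
    if limit == Some(0) {
        return Ok(None);
    }
    let mut content = String::new();
    reader.read_to_string(&mut content)?;
    // In whole-text mode, "blank" applies to the file as a whole.
    if skip_blank_lines && content.trim().is_empty() {
        return Ok(None);
    }
    Ok(Some(content))
}
```

For example, `read_whole_text("a\nb".as_bytes(), Some(0), false)` returns `Ok(None)`, while the same call with `Some(1)` returns the full contents as one row. Placing the guard before the read also avoids fetching a file whose row would be discarded.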

Signed-off-by: plotor <zhenchao.wang@hotmail.com>
plotor force-pushed the zhenchao-read-text branch from ecaddb9 to 977676c on March 6, 2026 09:52
plotor (Collaborator, Author) commented Mar 6, 2026

This is a supplementary implementation for #6111, adding a whole_text parameter that controls whether the entire text is loaded as a single row. Please take a look when you have time. Thanks, @desmondcheongzx
