fix: handle multi-line CSV fields in paged get_as_file()#203

Open
carlkesselman wants to merge 1 commit into master from paging-bug

Conversation

@carlkesselman
Contributor

Summary

  • Fix infinite paging loop in get_as_file() when CSV data contains multi-line quoted fields (RFC 4180)
  • Add _read_last_csv_record() helper that reads back the last complete record from the destination file using Python's csv module
  • Add test suite for the fix covering multi-line fields, large fields, edge cases

Problem

The paged CSV download determines the @after() cursor by parsing the last raw line of each page's bytes. When fields contain embedded newlines (e.g., OCR text with grid data), that "last line" is a fragment inside an RFC 4180 quoted value, not a complete record. The RID extracted from the fragment is invalid (e.g., whitespace plus a quote character) and sorts before all real RIDs, so every subsequent page re-fetches the entire table.

Impact: In one case, 121 records were duplicated 6,814 times, producing an 824K-row, 2 GB CSV file from a 2,130-row table.
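
The failure mode can be reproduced in a few lines; the column names here are illustrative, not the actual table schema:

```python
import csv, io

# A page of CSV data whose last field contains an embedded newline
# inside an RFC 4180 quoted field.
page = 'RID,Notes\r\n1-ABC,"line one\nline two"\r\n'

# Naive approach: treat the last raw line as the last record.
last_raw_line = page.rstrip("\r\n").split("\n")[-1]
print(repr(last_raw_line))   # 'line two"' -- a fragment, not a record

# Correct approach: let the csv module join the quoted field.
rows = list(csv.DictReader(io.StringIO(page)))
print(rows[-1]["RID"])       # 1-ABC -- the real last RID
```

Feeding the fragment's first column to @after() is what yields the invalid cursor described above.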

Fix

After writing each page to disk, read back the last complete CSV record from the file using csv.DictReader, which correctly handles multi-line quoted fields. The helper reads the file backward in chunks from the end, so it avoids loading the entire file into memory.
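
A minimal sketch of the read-back step, assuming UTF-8 CSV output. This forward-scanning variant trades the PR's chunked reverse read for simplicity: it re-reads the file on each page rather than seeking backward, but it still holds only one row in memory and gives the same answer:

```python
import csv

def read_last_csv_record(path):
    """Return the last complete CSV record in *path* as a dict,
    or None if the file is empty or header-only."""
    last = None
    # csv.DictReader joins multi-line quoted fields (RFC 4180)
    # into single records, so 'last' is always a complete row.
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            last = row
    return last
```

The PR's `_read_last_csv_record()` instead seeks backward from the end of the file in chunks, so each page costs roughly O(chunk size) rather than O(file size); the forward scan above is the simplest correct baseline to compare it against.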

Test plan

  • Unit tests for _read_last_csv_record with single-line records
  • Unit tests with multi-line quoted fields (the bug scenario)
  • Unit tests with large multi-line fields (~100KB per record)
  • Edge cases: empty file, header-only, single row, commas and quotes in fields
  • Integration test with real ERMrest catalog containing multi-line CSV data
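
A few of these cases can be sketched as plain assertions; `last_record` below is a stand-in for the PR's `_read_last_csv_record()` helper (any RFC 4180-aware parse gives the same answer), not its actual implementation:

```python
import csv, io

# Stand-in for _read_last_csv_record(): return the last complete
# record of a CSV string, or None if there are no data rows.
def last_record(text):
    last = None
    for row in csv.DictReader(io.StringIO(text)):
        last = row
    return last

# Header-only file: no data records.
assert last_record("RID,Text\r\n") is None

# Empty file: DictReader finds no header and yields nothing.
assert last_record("") is None

# Multi-line quoted field (the bug scenario): the last complete
# record is 1-B, even though the last raw line of the file is a
# fragment of its Text field.
page = 'RID,Text\r\n1-A,plain\r\n1-B,"grid 1\ngrid 2"\r\n'
assert last_record(page)["RID"] == "1-B"
```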

🤖 Generated with Claude Code

The paged CSV download in get_as_file() determined the @after() cursor
by parsing the last raw byte line of each page as a CSV record. When
fields contain embedded newlines (RFC 4180 quoted fields), the "last
line" is a fragment inside a quoted value, not a complete record.

This produced an invalid RID for the cursor (e.g., whitespace + quote
character) that sorts before all real RIDs, causing every subsequent
page to re-fetch all records — an infinite loop. In one case, 121
records were duplicated 6,814 times producing an 824K-row, 2 GB file.

Fix: after writing a page, read back the last complete CSV record from
the destination file using Python's csv module, which handles multi-line
quoted fields correctly. Uses a chunked reverse-read strategy to avoid
loading the entire file into memory.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
