fix: handle multi-line CSV fields in paged get_as_file()#203

Open
carlkesselman wants to merge 1 commit into master from paging-bug

Conversation

@carlkesselman
Contributor

Summary

  • Fix infinite paging loop in get_as_file() when CSV data contains multi-line quoted fields (RFC 4180)
  • Add _read_last_csv_record() helper that reads back the last complete record from the destination file using Python's csv module
  • Add test suite for the fix covering multi-line fields, large fields, edge cases

Problem

The paged CSV download determines the @after() cursor by parsing the last raw line of each page's bytes. When fields contain embedded newlines (e.g., OCR text with grid data), that "last line" is a fragment inside an RFC 4180 quoted value, not a complete record. The RID extracted from the fragment is invalid (e.g., whitespace plus a quote character) and sorts before all real RIDs, so every subsequent page re-fetches the entire table.

Impact: In one case, 121 records were duplicated 6,814 times, producing an 824K-row, 2 GB CSV file from a 2,130-row table.
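
The failure mode can be reproduced in a few lines; the column names here are illustrative, not the actual table schema:

```python
import csv, io

# A page of CSV data whose last field contains an embedded newline
# inside an RFC 4180 quoted field.
page = 'RID,Notes\r\n1-ABC,"line one\nline two"\r\n'

# Naive approach: treat the last raw line as the last record.
last_raw_line = page.rstrip("\r\n").split("\n")[-1]
print(repr(last_raw_line))   # 'line two"' -- a fragment, not a record

# Correct approach: let the csv module join the quoted field.
rows = list(csv.DictReader(io.StringIO(page)))
print(rows[-1]["RID"])       # 1-ABC -- the real last RID
```

Feeding the fragment's first column to @after() is what yields the invalid cursor described above.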

Fix

After writing each page to disk, read back the last complete CSV record from the file using csv.DictReader, which correctly handles multi-line quoted fields. The helper reads the file backward in chunks from the end, so it avoids loading the entire file into memory.
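
A minimal sketch of the read-back step, assuming UTF-8 CSV output. This forward-scanning variant trades the PR's chunked reverse read for simplicity: it re-reads the file on each page rather than seeking backward, but it still holds only one row in memory and gives the same answer:

```python
import csv

def read_last_csv_record(path):
    """Return the last complete CSV record in *path* as a dict,
    or None if the file is empty or header-only."""
    last = None
    # csv.DictReader joins multi-line quoted fields (RFC 4180)
    # into single records, so 'last' is always a complete row.
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            last = row
    return last
```

The PR's `_read_last_csv_record()` instead seeks backward from the end of the file in chunks, so each page costs roughly O(chunk size) rather than O(file size); the forward scan above is the simplest correct baseline to compare it against.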

Test plan

  • Unit tests for _read_last_csv_record with single-line records
  • Unit tests with multi-line quoted fields (the bug scenario)
  • Unit tests with large multi-line fields (~100KB per record)
  • Edge cases: empty file, header-only, single row, commas and quotes in fields
  • Integration test with real ERMrest catalog containing multi-line CSV data
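
A few of these cases can be sketched as plain assertions; `last_record` below is a stand-in for the PR's `_read_last_csv_record()` helper (any RFC 4180-aware parse gives the same answer), not its actual implementation:

```python
import csv, io

# Stand-in for _read_last_csv_record(): return the last complete
# record of a CSV string, or None if there are no data rows.
def last_record(text):
    last = None
    for row in csv.DictReader(io.StringIO(text)):
        last = row
    return last

# Header-only file: no data records.
assert last_record("RID,Text\r\n") is None

# Empty file: DictReader finds no header and yields nothing.
assert last_record("") is None

# Multi-line quoted field (the bug scenario): the last complete
# record is 1-B, even though the last raw line of the file is a
# fragment of its Text field.
page = 'RID,Text\r\n1-A,plain\r\n1-B,"grid 1\ngrid 2"\r\n'
assert last_record(page)["RID"] == "1-B"
```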

🤖 Generated with Claude Code

The paged CSV download in get_as_file() determined the @after() cursor
by parsing the last raw byte line of each page as a CSV record. When
fields contain embedded newlines (RFC 4180 quoted fields), the "last
line" is a fragment inside a quoted value, not a complete record.

This produced an invalid RID for the cursor (e.g., whitespace + quote
character) that sorts before all real RIDs, causing every subsequent
page to re-fetch all records — an infinite loop. In one case, 121
records were duplicated 6,814 times producing an 824K-row, 2 GB file.

Fix: after writing a page, read back the last complete CSV record from
the destination file using Python's csv module, which handles multi-line
quoted fields correctly. Uses a chunked reverse-read strategy to avoid
loading the entire file into memory.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
