
fix(entity_file): count CSV rows with csv.reader so multiline fields are correct#72

Open
gkorland wants to merge 2 commits into master from fix/csv-count-multiline-aware

Conversation

@gkorland

Summary

EntityFile.count_entities used sum(1 for line in self.infile), which counts physical file lines rather than CSV rows. Any field containing an embedded newline (legal under RFC 4180 and emitted by Excel and other tools for multiline cells) inflated the count, causing the progress bar to under-report progress and leaving "X remaining" stale at completion.
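To see the miscount concretely, here is a minimal standalone sketch of the csv module's behaviour (illustration only, not the loader's code):

```python
import csv
import io

# One header plus two logical rows; the first row's "bio" field
# contains an embedded newline inside a quoted cell (RFC 4180).
data = 'id,bio\n1,"line one\nline two"\n2,plain\n'

# Counting physical lines (the old approach) over-counts:
physical_lines = sum(1 for _ in io.StringIO(data))  # 4

# Counting rows yielded by csv.reader is multiline-aware:
csv_rows = sum(1 for _ in csv.reader(io.StringIO(data)))  # 3

print(physical_lines, csv_rows)
```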

Changes

  • falkordb_bulk_loader/entity_file.py: count_entities now constructs a temporary csv.reader with the same delimiter / quoting / escapechar / skipinitialspace settings as the live reader and counts the rows it yields. Behaviour for single-line CSVs is unchanged.

Testing

uv run flake8 falkordb_bulk_loader passes with no warnings.

Memory / Performance Impact

The fix adds a second pass through the file via csv.reader (vs. raw line iteration). This is marginally slower for files with no embedded newlines, but the count runs only once per file at startup.

Related Issues

From the comprehensive code-review report (BUG-12).

@coderabbitai

coderabbitai Bot commented Apr 26, 2026

Warning

Rate limit exceeded

@gkorland has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 42 minutes and 49 seconds before requesting another review.

To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fd54bde1-025e-47fd-8118-d88867efd454

📥 Commits

Reviewing files that changed from the base of the PR and between 3954856 and 90fd457.

📒 Files selected for processing (4)
  • falkordb_bulk_loader/config.py
  • falkordb_bulk_loader/entity_file.py
  • test/test_config.py
  • test/test_label.py

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov

codecov Bot commented Apr 26, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.52%. Comparing base (3954856) to head (90fd457).

Additional details and impacted files
@@            Coverage Diff             @@
##           master      #72      +/-   ##
==========================================
+ Coverage   89.50%   89.52%   +0.01%     
==========================================
  Files           9        9              
  Lines         610      611       +1     
==========================================
+ Hits          546      547       +1     
  Misses         64       64              

☔ View full report in Codecov by Sentry.

@gkorland gkorland marked this pull request as ready for review April 26, 2026 17:54
@gkorland
Author

Copilot Code Review

⚠️ Critical issue found:

The new count_entities() method uses csv.reader with the default config settings (quoting=csv.QUOTE_NONE), which prevents proper parsing of RFC 4180 compliant multiline CSV fields. This causes:

  • Multiline quoted fields to be counted as multiple rows
  • Entity count to be incorrect
  • The same issue exists in the main reader

Suggested fix: Ensure count_entities() uses a quoting mode that properly handles quoted fields (e.g., csv.QUOTE_MINIMAL), or verify that Config defaults are appropriate for RFC 4180 CSV parsing.
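The difference is easy to demonstrate with the csv module alone (a standalone sketch, independent of the project's Config class):

```python
import csv
import io

data = 'id,bio\n1,"multi\nline"\n'  # header plus one logical row

# QUOTE_NONE treats the quote characters literally, so the embedded
# newline splits the record into two bogus rows (3 rows total).
none_rows = list(csv.reader(io.StringIO(data), quoting=csv.QUOTE_NONE))

# QUOTE_MINIMAL (the csv module's default) keeps the quoted field
# together, yielding the header plus one data row (2 rows total).
minimal_rows = list(csv.reader(io.StringIO(data), quoting=csv.QUOTE_MINIMAL))

print(len(none_rows), len(minimal_rows))
```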


Copilot AI left a comment


Pull request overview

Fixes CSV entity row counting so progress reporting is correct when CSV fields contain embedded newlines (RFC 4180-compliant multiline cells).

Changes:

  • Update EntityFile.count_entities() to count logical CSV rows using csv.reader instead of counting physical file lines.
  • Align the counting reader’s parsing configuration with the main CSV reader (delimiter/quoting/escapechar/skipinitialspace).


Comment thread falkordb_bulk_loader/entity_file.py Outdated
Comment on lines +214 to +225
# Use the same csv.reader configuration as the main reader so that
# rows containing embedded newlines (legal under RFC 4180) are
# counted as a single row rather than as N file lines.
counting_reader = csv.reader(
    self.infile,
    delimiter=self.config.separator,
    skipinitialspace=True,
    quoting=self.config.quoting,
    escapechar=self.config.escapechar,
)
self.entities_count = sum(1 for _ in counting_reader)

Copilot AI Apr 29, 2026


This change fixes row-counting for embedded newlines, but there’s no test covering the new behavior. Please add a unit test that writes a CSV with a quoted multiline field (e.g., a cell containing a newline) and asserts entities_count matches the number of CSV rows (excluding the header), so regressions in counting/progress bar length are caught.

Author


Fixed in the follow-up commit (90fd457). A new test test_count_entities_multiline_field in test/test_label.py writes a CSV containing a quoted field with an embedded newline, then asserts label.entities_count == 2 (the number of logical CSV rows, not the 4 physical lines). All CI checks pass.
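For reference, the shape of such a test, sketched here against a standalone counting helper rather than the real Label class so it runs on its own (count_csv_rows is a stand-in, not the repo's API):

```python
import csv
import tempfile

def count_csv_rows(path, delimiter=","):
    """Count logical CSV data rows, excluding the header."""
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter=delimiter, quoting=csv.QUOTE_MINIMAL)
        next(reader, None)  # skip the header row
        return sum(1 for _ in reader)

def test_count_multiline_field():
    # 4 physical lines, but only 2 logical data rows after the header.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".csv", delete=False, newline=""
    ) as f:
        f.write('id,notes\n1,"first\nsecond"\n2,plain\n')
        path = f.name
    assert count_csv_rows(path) == 2

test_count_multiline_field()
```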

Comment thread falkordb_bulk_loader/entity_file.py Outdated
Comment on lines 225 to 227
self.entities_count = sum(1 for _ in counting_reader)
# seek back so the live reader starts at the beginning
self.infile.seek(0)

Copilot AI Apr 29, 2026


count_entities() rewinds the shared file handle with seek(0), but the instance’s self.reader was already created and consumed the header earlier. Reusing the same csv.reader after seeking means the header is effectively read twice (once in convert_header(), again in next(self.reader)), which makes self.reader.line_num inaccurate for later error messages. Consider counting via a separate file handle (or rewinding back to the post-header position and avoiding the extra header read / recreating the reader) so the live reader’s state and line numbers stay consistent.

Author


Fixed in 90fd457. count_entities() now opens a separate io.open() handle on self.infile.name, skips the header inside that handle, counts the remaining rows, and closes it — self.reader and its line_num are never touched. The redundant next(self.reader) # Skip the header row. in __init__ was also removed, since convert_header() already advances the reader past the header.
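As a sketch of that shape (attribute names self.infile and self.config follow the PR description; the real method in entity_file.py may differ in details):

```python
import csv
import io

def count_entities(self):
    """Count logical CSV data rows without touching the live reader."""
    # A separate handle leaves self.reader's position and line_num intact.
    with io.open(self.infile.name, "rt", newline="") as f:
        counting_reader = csv.reader(
            f,
            delimiter=self.config.separator,
            skipinitialspace=True,
            quoting=self.config.quoting,
            escapechar=self.config.escapechar,
        )
        next(counting_reader, None)  # skip the header inside this handle
        self.entities_count = sum(1 for _ in counting_reader)
    return self.entities_count
```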

gkorland and others added 2 commits April 29, 2026 09:37
…are correct

`count_entities` previously used `sum(1 for line in self.infile)` which
counts file lines, not CSV rows. Any field containing an embedded
newline (legal under RFC 4180 and emitted by tools like Excel for
multiline cells) inflated the count, causing the progress bar to under-
report progress and leaving stale "X remaining" indicators when the
true row count was already reached.

Use a temporary `csv.reader` configured the same way as the live reader
so logically-single rows containing embedded newlines are counted once.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… quoting to QUOTE_MINIMAL

- count_entities() now opens a fresh file handle rather than reusing
  self.infile, so self.reader.line_num is never disturbed by the
  counting pass and error messages report the correct source line.
- Remove the redundant next(self.reader) from __init__ that existed
  only to re-skip the header after the old seek(0); convert_header()
  already positions the reader past the header.
- Change Config default quoting from csv.QUOTE_NONE (3) to
  csv.QUOTE_MINIMAL (0) so that RFC 4180 quoted multiline fields are
  handled correctly by both the live reader and the counting reader.
- Update test_config.py to reflect the new default.
- Add test_count_entities_multiline_field to test_label.py, which
  writes a CSV with an embedded-newline field and asserts
  entities_count == 2 (not the 4 physical lines).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@gkorland gkorland force-pushed the fix/csv-count-multiline-aware branch from 458f747 to 90fd457 Compare April 29, 2026 06:38
@gkorland
Author

Agent Review Summary

All three review items have been addressed in commit 90fd457 (rebased onto master, force-pushed with --force-with-lease). CI: ✅ all 9 checks green.

Issues found & fixed

  1. (Copilot review) No unit test for multiline-field counting. Fix: added test_count_entities_multiline_field in test/test_label.py, which writes a CSV with an embedded-newline field and asserts entities_count == 2 (not 4 physical lines).
  2. (Copilot review) self.reader.line_num became inaccurate because count_entities() shared self.infile and called seek(0), disturbing the live reader's state. Fix: count_entities() now opens a separate io.open(self.infile.name) handle, skips the header there, counts the rows, and closes it; self.reader is never touched. The now-redundant next(self.reader) in __init__ was also removed.
  3. (@gkorland) The Config default quoting=3 (csv.QUOTE_NONE) prevented RFC 4180 multiline quoted fields from being parsed correctly by both the live reader and the counting reader. Fix: changed the default to csv.QUOTE_MINIMAL (0) in config.py and updated test_config.py accordingly.

Files changed

  • falkordb_bulk_loader/config.py — import csv; default quoting=csv.QUOTE_MINIMAL
  • falkordb_bulk_loader/entity_file.py — separate file handle in count_entities(); remove stale next(self.reader) from __init__
  • test/test_config.py — update default-quoting assertion
  • test/test_label.py — add multiline-field counting test

