fix(entity_file): count CSV rows with csv.reader so multiline fields are correct #72
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##           master      #72      +/-   ##
==========================================
+ Coverage   89.50%   89.52%   +0.01%
==========================================
  Files           9        9
  Lines         610      611       +1
==========================================
+ Hits          546      547       +1
  Misses         64       64
Copilot Code Review
The new
Suggested fix: Ensure
Pull request overview
Fixes CSV entity row counting so progress reporting is correct when CSV fields contain embedded newlines (RFC 4180-compliant multiline cells).
Changes:
- Update `EntityFile.count_entities()` to count logical CSV rows using `csv.reader` instead of counting physical file lines.
- Align the counting reader's parsing configuration with the main CSV reader (delimiter/quoting/escapechar/skipinitialspace).
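The difference between physical lines and logical rows can be seen in a minimal, self-contained sketch. The sample data and variable names below are illustrative assumptions, not taken from the repository; only `csv.reader` and `io.StringIO` from the standard library are used.

```python
import csv
import io

# Hypothetical sample: a header plus two logical rows. The quoted "notes"
# cell of the first row contains an embedded newline, so the text spans
# four physical lines in total.
data = 'id,notes\n1,"first line\nsecond line"\n2,plain\n'

# Old approach: iterate the raw text and count physical lines.
physical_lines = sum(1 for _ in io.StringIO(data))

# New approach: let csv.reader assemble logical rows, then skip the header.
reader = csv.reader(io.StringIO(data, newline=""))
next(reader)  # skip the header row
logical_rows = sum(1 for _ in reader)

print(physical_lines, logical_rows)
```

Counting raw lines overstates the number of entities whenever a quoted cell spans more than one line, which is the discrepancy this PR targets.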
    # Use the same csv.reader configuration as the main reader so that
    # rows containing embedded newlines (legal under RFC 4180) are
    # counted as a single row rather than as N file lines.
    self.entities_count = 0
    self.entities_count = sum(1 for line in self.infile)
    # seek back
    counting_reader = csv.reader(
        self.infile,
        delimiter=self.config.separator,
        skipinitialspace=True,
        quoting=self.config.quoting,
        escapechar=self.config.escapechar,
    )
    self.entities_count = sum(1 for _ in counting_reader)
This change fixes row-counting for embedded newlines, but there’s no test covering the new behavior. Please add a unit test that writes a CSV with a quoted multiline field (e.g., a cell containing a newline) and asserts entities_count matches the number of CSV rows (excluding the header), so regressions in counting/progress bar length are caught.
Fixed in the follow-up commit (90fd457). A new test test_count_entities_multiline_field in test/test_label.py writes a CSV containing a quoted field with an embedded newline, then asserts label.entities_count == 2 (the number of logical CSV rows, not the 4 physical lines). All CI checks pass.
    self.entities_count = sum(1 for _ in counting_reader)
    # seek back so the live reader starts at the beginning
    self.infile.seek(0)
count_entities() rewinds the shared file handle with seek(0), but the instance’s self.reader was already created and consumed the header earlier. Reusing the same csv.reader after seeking means the header is effectively read twice (once in convert_header(), again in next(self.reader)), which makes self.reader.line_num inaccurate for later error messages. Consider counting via a separate file handle (or rewinding back to the post-header position and avoiding the extra header read / recreating the reader) so the live reader’s state and line numbers stay consistent.
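The `line_num` drift described in this comment can be reproduced in isolation. The following is a small sketch using an in-memory file; the sample data is hypothetical and the snippet does not use the loader's own classes.

```python
import csv
import io

# Hypothetical three-line CSV held in memory instead of a real file.
infile = io.StringIO("id,name\n1,a\n2,b\n", newline="")
reader = csv.reader(infile)

header = next(reader)        # first read of the header
infile.seek(0)               # rewind the shared handle, as the counting pass did
header_again = next(reader)  # the same reader yields the header a second time
first_row = next(reader)     # physical line 2 of the file

# csv.reader keeps its own counter and does not know about the seek,
# so it reports one line more than the actual position in the file.
print(first_row, reader.line_num)
```

Because the reader's counter only ever increases, any error message built from `line_num` after the rewind would point one line too far down the file.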
Fixed in 90fd457. count_entities() now opens a separate io.open() handle on self.infile.name, skips the header inside that handle, counts the remaining rows, and closes it — self.reader and its line_num are never touched. The redundant next(self.reader) # Skip the header row. in __init__ was also removed, since convert_header() already advances the reader past the header.
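A rough sketch of the separate-handle approach described in this reply is shown below. The function name `count_rows`, its parameters, and the temporary file are assumptions made for illustration; the actual `count_entities()` method reads its settings from `self.config` and may differ in detail.

```python
import csv
import io
import os
import tempfile


def count_rows(path, delimiter=",", quoting=csv.QUOTE_MINIMAL, escapechar=None):
    """Count logical CSV rows, excluding the header, on a fresh file handle.

    Hypothetical stand-in for the counting pass: the caller's own reader
    and file position are never touched.
    """
    with io.open(path, "rt", newline="") as handle:
        reader = csv.reader(
            handle,
            delimiter=delimiter,
            skipinitialspace=True,
            quoting=quoting,
            escapechar=escapechar,
        )
        next(reader, None)  # skip the header; tolerate an empty file
        return sum(1 for _ in reader)


# Usage: a header plus two logical rows spread over four physical lines.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, newline=""
) as tmp:
    tmp.write('id,notes\n1,"first line\nsecond line"\n2,plain\n')
    path = tmp.name

try:
    count = count_rows(path)
finally:
    os.remove(path)

print(count)
```

Opening a second handle costs one extra `open()` per file, but it keeps the counting pass fully independent of the live reader's state.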
…are correct

`count_entities` previously used `sum(1 for line in self.infile)` which counts file lines, not CSV rows. Any field containing an embedded newline (legal under RFC 4180 and emitted by tools like Excel for multiline cells) inflated the count, causing the progress bar to under-report progress and leaving stale "X remaining" indicators when the true row count was already reached.

Use a temporary `csv.reader` configured the same way as the live reader so logically-single rows containing embedded newlines are counted once.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… quoting to QUOTE_MINIMAL

- count_entities() now opens a fresh file handle rather than reusing self.infile, so self.reader.line_num is never disturbed by the counting pass and error messages report the correct source line.
- Remove the redundant next(self.reader) from __init__ that existed only to re-skip the header after the old seek(0); convert_header() already positions the reader past the header.
- Change Config default quoting from csv.QUOTE_NONE (3) to csv.QUOTE_MINIMAL (0) so that RFC 4180 quoted multiline fields are handled correctly by both the live reader and the counting reader.
- Update test_config.py to reflect the new default.
- Add test_count_entities_multiline_field to test_label.py, which writes a CSV with an embedded-newline field and asserts entities_count == 2 (not the 4 physical lines).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
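Why the quoting default matters for the count can be illustrated with the standard library alone. In this sketch the sample data and the helper `rows_with_quoting` are hypothetical; the `csv.QUOTE_NONE` and `csv.QUOTE_MINIMAL` constants are the real ones referenced in the commit message.

```python
import csv
import io

# Hypothetical sample: header plus two logical rows, one with a quoted
# multiline cell.
data = 'id,notes\n1,"first line\nsecond line"\n2,plain\n'


def rows_with_quoting(quoting):
    """Count rows after the header under the given quoting mode."""
    reader = csv.reader(io.StringIO(data, newline=""), quoting=quoting)
    next(reader)  # skip the header row
    return sum(1 for _ in reader)


# QUOTE_MINIMAL treats the quoted cell as one field, newline included.
rows_minimal = rows_with_quoting(csv.QUOTE_MINIMAL)

# QUOTE_NONE gives quote characters no special meaning, so the embedded
# newline ends the record and the multiline row is split in two.
rows_none = rows_with_quoting(csv.QUOTE_NONE)

print(rows_minimal, rows_none)
```

With `QUOTE_NONE`, switching from line counting to `csv.reader` alone would not change the result for this data, which is consistent with the commit changing the default at the same time.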
458f747 to 90fd457
Agent Review Summary
All three review items have been addressed in commit 90fd457 (rebased onto master, force-pushed with
Issues found & fixed
Files changed
Summary
`EntityFile.count_entities` used `sum(1 for line in self.infile)`, which counts physical file lines rather than CSV rows. Any field containing an embedded newline (legal under RFC 4180 and emitted by Excel and other tools for multiline cells) inflated the count, causing the progress bar to under-report progress and leaving "X remaining" stale at completion.

Changes
`falkordb_bulk_loader/entity_file.py`: `count_entities` now constructs a temporary `csv.reader` with the same `delimiter` / `quoting` / `escapechar` / `skipinitialspace` settings as the live reader and counts the rows it yields. Behaviour for single-line CSVs is unchanged.

Testing
`uv run flake8 falkordb_bulk_loader` clean.

Memory / Performance Impact
A second pass through the file via `csv.reader` (vs. raw line iteration): marginally slower for files with no embedded newlines, but only run once per file at startup.

Related Issues
From the comprehensive code-review report (BUG-12).