
fix(entity_file): count CSV rows with csv.reader so multiline fields are correct#72

Open
gkorland wants to merge 2 commits into master from fix/csv-count-multiline-aware

Conversation

@gkorland

Summary

EntityFile.count_entities used sum(1 for line in self.infile), which counts physical file lines rather than CSV rows. Any field containing an embedded newline (legal under RFC 4180 and emitted by Excel and other tools for multiline cells) inflated the count, causing the progress bar to under-report progress and leaving "X remaining" stale at completion.
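To see the miscount concretely, here is a minimal standalone sketch of the csv module's behaviour (illustration only, not the loader's code):

```python
import csv
import io

# One header plus two logical rows; the first row's "bio" field
# contains an embedded newline inside a quoted cell (RFC 4180).
data = 'id,bio\n1,"line one\nline two"\n2,plain\n'

# Counting physical lines (the old approach) over-counts:
physical_lines = sum(1 for _ in io.StringIO(data))  # 4

# Counting rows yielded by csv.reader is multiline-aware:
csv_rows = sum(1 for _ in csv.reader(io.StringIO(data)))  # 3

print(physical_lines, csv_rows)
```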

Changes

  • falkordb_bulk_loader/entity_file.py: count_entities now constructs a temporary csv.reader with the same delimiter / quoting / escapechar / skipinitialspace settings as the live reader and counts the rows it yields. Behaviour for single-line CSVs is unchanged.

Testing

uv run flake8 falkordb_bulk_loader passes with no warnings.

Memory / Performance Impact

The fix adds a second pass through the file via csv.reader (vs. raw line iteration). This is marginally slower for files with no embedded newlines, but the count runs only once per file at startup.

Related Issues

From the comprehensive code-review report (BUG-12).

@coderabbitai

coderabbitai Bot commented Apr 26, 2026

Warning

Rate limit exceeded

@gkorland has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 42 minutes and 49 seconds before requesting another review.

To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fd54bde1-025e-47fd-8118-d88867efd454

📥 Commits

Reviewing files that changed from the base of the PR and between 3954856 and 90fd457.

📒 Files selected for processing (4)
  • falkordb_bulk_loader/config.py
  • falkordb_bulk_loader/entity_file.py
  • test/test_config.py
  • test/test_label.py

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov

codecov Bot commented Apr 26, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.52%. Comparing base (3954856) to head (90fd457).

Additional details and impacted files
@@            Coverage Diff             @@
##           master      #72      +/-   ##
==========================================
+ Coverage   89.50%   89.52%   +0.01%     
==========================================
  Files           9        9              
  Lines         610      611       +1     
==========================================
+ Hits          546      547       +1     
  Misses         64       64              

☔ View full report in Codecov by Sentry.

@gkorland gkorland marked this pull request as ready for review April 26, 2026 17:54
@gkorland
Author

Copilot Code Review

⚠️ Critical issue found:

The new count_entities() method uses csv.reader with the default config settings (quoting=csv.QUOTE_NONE), which prevents proper parsing of RFC 4180 compliant multiline CSV fields. This causes:

  • Multiline quoted fields to be counted as multiple rows
  • Entity count to be incorrect
  • The same issue exists in the main reader

Suggested fix: Ensure count_entities() uses a quoting mode that properly handles quoted fields (e.g., csv.QUOTE_MINIMAL), or verify that Config defaults are appropriate for RFC 4180 CSV parsing.
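The difference is easy to demonstrate with the csv module alone (a standalone sketch, independent of the project's Config class):

```python
import csv
import io

data = 'id,bio\n1,"multi\nline"\n'  # header plus one logical row

# QUOTE_NONE treats the quote characters literally, so the embedded
# newline splits the record into two bogus rows (3 rows total).
none_rows = list(csv.reader(io.StringIO(data), quoting=csv.QUOTE_NONE))

# QUOTE_MINIMAL (the csv module's default) keeps the quoted field
# together, yielding the header plus one data row (2 rows total).
minimal_rows = list(csv.reader(io.StringIO(data), quoting=csv.QUOTE_MINIMAL))

print(len(none_rows), len(minimal_rows))
```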


Copilot AI left a comment


Pull request overview

Fixes CSV entity row counting so progress reporting is correct when CSV fields contain embedded newlines (RFC 4180-compliant multiline cells).

Changes:

  • Update EntityFile.count_entities() to count logical CSV rows using csv.reader instead of counting physical file lines.
  • Align the counting reader’s parsing configuration with the main CSV reader (delimiter/quoting/escapechar/skipinitialspace).


Comment thread falkordb_bulk_loader/entity_file.py Outdated
Comment on lines +214 to +225
# Use the same csv.reader configuration as the main reader so that
# rows containing embedded newlines (legal under RFC 4180) are
# counted as a single row rather than as N file lines.
counting_reader = csv.reader(
    self.infile,
    delimiter=self.config.separator,
    skipinitialspace=True,
    quoting=self.config.quoting,
    escapechar=self.config.escapechar,
)
self.entities_count = sum(1 for _ in counting_reader)

Copilot AI Apr 29, 2026


This change fixes row-counting for embedded newlines, but there’s no test covering the new behavior. Please add a unit test that writes a CSV with a quoted multiline field (e.g., a cell containing a newline) and asserts entities_count matches the number of CSV rows (excluding the header), so regressions in counting/progress bar length are caught.

Author


Fixed in the follow-up commit (90fd457). A new test test_count_entities_multiline_field in test/test_label.py writes a CSV containing a quoted field with an embedded newline, then asserts label.entities_count == 2 (the number of logical CSV rows, not the 4 physical lines). All CI checks pass.
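For reference, the shape of such a test, sketched here against a standalone counting helper rather than the real Label class so it runs on its own (count_csv_rows is a stand-in, not the repo's API):

```python
import csv
import tempfile

def count_csv_rows(path, delimiter=","):
    """Count logical CSV data rows, excluding the header."""
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter=delimiter, quoting=csv.QUOTE_MINIMAL)
        next(reader, None)  # skip the header row
        return sum(1 for _ in reader)

def test_count_multiline_field():
    # 4 physical lines, but only 2 logical data rows after the header.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".csv", delete=False, newline=""
    ) as f:
        f.write('id,notes\n1,"first\nsecond"\n2,plain\n')
        path = f.name
    assert count_csv_rows(path) == 2

test_count_multiline_field()
```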

Comment thread falkordb_bulk_loader/entity_file.py Outdated
Comment on lines 225 to 227
self.entities_count = sum(1 for _ in counting_reader)
# seek back so the live reader starts at the beginning
self.infile.seek(0)

Copilot AI Apr 29, 2026


count_entities() rewinds the shared file handle with seek(0), but the instance’s self.reader was already created and consumed the header earlier. Reusing the same csv.reader after seeking means the header is effectively read twice (once in convert_header(), again in next(self.reader)), which makes self.reader.line_num inaccurate for later error messages. Consider counting via a separate file handle (or rewinding back to the post-header position and avoiding the extra header read / recreating the reader) so the live reader’s state and line numbers stay consistent.

Author


Fixed in 90fd457. count_entities() now opens a separate io.open() handle on self.infile.name, skips the header inside that handle, counts the remaining rows, and closes it — self.reader and its line_num are never touched. The redundant next(self.reader) # Skip the header row. in __init__ was also removed, since convert_header() already advances the reader past the header.
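As a sketch of that shape (attribute names self.infile and self.config follow the PR description; the real method in entity_file.py may differ in details):

```python
import csv
import io

def count_entities(self):
    """Count logical CSV data rows without touching the live reader."""
    # A separate handle leaves self.reader's position and line_num intact.
    with io.open(self.infile.name, "rt", newline="") as f:
        counting_reader = csv.reader(
            f,
            delimiter=self.config.separator,
            skipinitialspace=True,
            quoting=self.config.quoting,
            escapechar=self.config.escapechar,
        )
        next(counting_reader, None)  # skip the header inside this handle
        self.entities_count = sum(1 for _ in counting_reader)
    return self.entities_count
```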

gkorland and others added 2 commits April 29, 2026 09:37
…are correct

`count_entities` previously used `sum(1 for line in self.infile)` which
counts file lines, not CSV rows. Any field containing an embedded
newline (legal under RFC 4180 and emitted by tools like Excel for
multiline cells) inflated the count, causing the progress bar to under-
report progress and leaving stale "X remaining" indicators when the
true row count was already reached.

Use a temporary `csv.reader` configured the same way as the live reader
so logically-single rows containing embedded newlines are counted once.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… quoting to QUOTE_MINIMAL

- count_entities() now opens a fresh file handle rather than reusing
  self.infile, so self.reader.line_num is never disturbed by the
  counting pass and error messages report the correct source line.
- Remove the redundant next(self.reader) from __init__ that existed
  only to re-skip the header after the old seek(0); convert_header()
  already positions the reader past the header.
- Change Config default quoting from csv.QUOTE_NONE (3) to
  csv.QUOTE_MINIMAL (0) so that RFC 4180 quoted multiline fields are
  handled correctly by both the live reader and the counting reader.
- Update test_config.py to reflect the new default.
- Add test_count_entities_multiline_field to test_label.py, which
  writes a CSV with an embedded-newline field and asserts
  entities_count == 2 (not the 4 physical lines).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@gkorland gkorland force-pushed the fix/csv-count-multiline-aware branch from 458f747 to 90fd457 Compare April 29, 2026 06:38
@gkorland
Author

Agent Review Summary

All three review items have been addressed in commit 90fd457 (rebased onto master, force-pushed with --force-with-lease). CI: ✅ all 9 checks green.

Issues found & fixed

  1. (Copilot review) No unit test for multiline-field counting. Fix: added test_count_entities_multiline_field in test/test_label.py, which writes a CSV with an embedded-newline field and asserts entities_count == 2 (not 4 physical lines).
  2. (Copilot review) self.reader.line_num became inaccurate because count_entities() shared self.infile and called seek(0), disturbing the live reader's state. Fix: count_entities() now opens a separate io.open(self.infile.name) handle, skips the header there, counts the rows, and closes it; self.reader is never touched. The now-redundant next(self.reader) in __init__ was also removed.
  3. (@gkorland) The Config default quoting=3 (csv.QUOTE_NONE) prevented RFC 4180 multiline quoted fields from being parsed correctly by both the live reader and the counting reader. Fix: changed the default to csv.QUOTE_MINIMAL (0) in config.py and updated test_config.py accordingly.

Files changed

  • falkordb_bulk_loader/config.py — import csv; default quoting=csv.QUOTE_MINIMAL
  • falkordb_bulk_loader/entity_file.py — separate file handle in count_entities(); remove stale next(self.reader) from __init__
  • test/test_config.py — update default-quoting assertion
  • test/test_label.py — add multiline-field counting test

