Add .iterator() to unbounded querysets#7940 by acwhite211 · Pull Request #8019 · specify/specify7

acwhite211 · 2026-04-23T21:53:23Z

Fixes #7864
Contributed by @foozleface
Based on the #7940 PR with additional fixes

Approximately 80 callsites in the backend use .all() without .iterator(), causing Django to cache all results in the QuerySet internal result cache. For large tables this means loading entire result sets into memory at once. This PR adds .iterator(chunk_size=2000) to 9 high-impact paths where unbounded querysets are iterated.
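To make the memory difference concrete, here is a minimal sketch of the general idea behind chunked iteration. It is not Django's internals and not code from this PR. It uses the standard-library `sqlite3` module; the `prep` table, its `count` column, and all three helper functions are hypothetical and exist only for illustration. `fetchall()` stands in for a fully cached result set, and `fetchmany(chunk_size)` stands in for reading rows in batches of 2000.

```python
import sqlite3


def make_db(n_rows):
    """Create an in-memory table with n_rows rows (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE prep (id INTEGER PRIMARY KEY, count INTEGER)")
    conn.executemany(
        "INSERT INTO prep (count) VALUES (?)",
        [(i % 5,) for i in range(n_rows)],
    )
    return conn


def total_eager(conn):
    """Load every row into one list before summing."""
    rows = conn.execute("SELECT count FROM prep").fetchall()
    return sum(row[0] for row in rows)


def total_chunked(conn, chunk_size=2000):
    """Sum the same column while holding at most chunk_size rows at a time."""
    cursor = conn.execute("SELECT count FROM prep")
    total = 0
    largest_batch = 0
    while True:
        batch = cursor.fetchmany(chunk_size)
        if not batch:
            break
        largest_batch = max(largest_batch, len(batch))
        total += sum(row[0] for row in batch)
    return total, largest_batch
```

With 5000 rows, both functions compute the same total, but the chunked version never holds more than 2000 rows in a Python list at once.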

Implementation

Add .iterator(chunk_size=2000) to COG prep consolidation queries in cog_preps.py (2 callsites)
Add .iterator(chunk_size=2000) to role policy serialization in permissions/views.py
Add .iterator(chunk_size=2000) to deaccession total calculation in calculated_fields.py
Add .iterator(chunk_size=2000) to dependent to-many serialization in api/serializers.py
Add .iterator(chunk_size=2000) to tree definition rank loading in trees/views.py
Fix batch edit date-part field name handling to use date-part-aware lookup keys, preventing mislabeled column headers for temporal fields
Fix workbench upload to not overwrite explicit createdbyagent values in upload_table.py
Add tests verifying .iterator() is called on the target querysets

Testing instructions

Run batch edit with date fields (catalogedDate Full Date, catalogedDate Month, etc.) and verify column headers are correct
Run a workbench upload that explicitly sets createdbyagent and verify the value is preserved
Run the test suite (or just make sure the github action passed): python manage.py test specifyweb.specify.tests.test_queryset_iterators
Optional: Monitor memory usage during large query exports or COG prep operations

Summary by CodeRabbit

New Features
- Added export cache utilities to improve export performance and data structure handling.
Bug Fixes
- Fixed date-part field naming in batch edit to ensure temporal fields are properly recognized and localized.
Tests
- Added tests to verify proper memory-efficient query iteration across critical operations.

emenslin

Run batch edit with date fields (catalogedDate Full Date, catalogedDate Month, etc.) and verify column headers are correct
Run a workbench upload that explicitly sets createdbyagent and verify the value is preserved

I was only able to test the first two, but it looks good!

Iwantexpresso

Run batch edit with date fields (catalogedDate Full Date, catalogedDate Month, etc.) and verify column headers are correct
Run a workbench upload that explicitly sets createdbyagent and verify the value is preserved
Run the test suite: python manage.py test specifyweb.specify.tests.test_queryset_iterators
Monitor memory usage during large query exports or COG prep operations

Tests seem to have passed with no issues!

For memory usage, I used the Firefox profiler to track memory between the beginning of an export and shortly after the completion notification is received.

CSV export profiler:

Permalink to Firefox profiler capture:
https://profiler.firefox.com/public/9cyn3dgavxdx06758t5pet17rs8t706b3xwskk0/calltree/?globalTrackOrder=0w3&thread=3&v=16

Here's a screenshot of the memory usage (orange graphs) to give a rough idea of the memory usage itself.

KML export profiler:

Permalink to Firefox profiler capture:
https://profiler.firefox.com/public/w1x3a4e7k3wkncx8c285t5bkg9kz62a3150ny4r/calltree/?globalTrackOrder=0w4&implementation=js&thread=0&v=16

Screenshot of the KML export results.

Nothing looked out of the ordinary. There are some spikes between the export's beginning and finalization, but memory usage either stays constant after that or decreases after a brief period.

bhumikaguptaa

Run batch edit with date fields (catalogedDate Full Date, catalogedDate Month, etc.) and verify column headers are correct
Run a workbench upload that explicitly sets createdbyagent and verify the value is preserved
Optional: Monitor memory usage during large query exports or COG prep operations

For the last instruction, I just used a large dataset for Batch Edit and export, and no server or gateway errors popped up. Everything else looks good.

CarolineDenis · 2026-05-19T10:51:58Z

@CodeRabbit full review

coderabbitai · 2026-05-19T10:52:04Z

✅ Actions performed

Full review triggered.

coderabbitai · 2026-05-19T10:52:14Z

📝 Walkthrough

This PR reduces memory usage for large QuerySet operations by adopting Django's iterator(chunk_size=2000) across multiple services, introduces new export cache utilities with built-in chunked iteration, adds temporal field naming support to batch edit localization, and validates iterator adoption through source-code inspection tests.

Changes

QuerySet Memory and Field Handling Optimization

Layer / File(s)	Summary
Export Cache Utilities with Iterator Batching `specifyweb/backend/export/cache.py`	New module provides `_field_to_cache_entry`, `_build_single_cache`, and `build_cache_tables` to serialize export extensions and fields into cache dictionaries. All functions internally use `iterator(chunk_size=2000)` when processing extension mappings and building per-extension field arrays.
Queryset Iterator Adoption Across Services `specifyweb/backend/export/extract_query.py`, `specifyweb/backend/interactions/cog_preps.py`, `specifyweb/backend/permissions/views.py`, `specifyweb/backend/trees/views.py`, `specifyweb/specify/api/calculated_fields.py`, `specifyweb/specify/api/serializers.py`, `specifyweb/specify/management/commands/print_tree.py`	Seven modules adopt `iterator(chunk_size=2000)` to stream large result sets in batches instead of materializing full QuerySets in memory. Affects field iteration in export XML construction, child ID lookups in COG recursion, policy aggregation in role serialization, rank fetching in tree information retrieval, preparation counting in deaccession calculations, many-to-many object serialization, and taxon traversal in tree printing.
Batch Edit Date-Part-Aware Field Naming `specifyweb/backend/stored_queries/batch_edit.py`	Introduces `_get_date_part_field_name` helper to compute date-part-suffixed field keys (e.g., `field_name__Month`). Updates `_lookup_in_fields` to use suffixed names for localization label lookups, enhances `naive_field_format` to append visible date-part labels to temporal field captions, and aligns `run_batch_edit_query` localization dumps with the new naming convention.
Iterator Usage Validation Tests `specifyweb/specify/tests/test_queryset_iterators.py`	New test class `TestIteratorUsageInSource` inspects source code of targeted functions across export, permissions, trees, serializers, and management commands. Uses `_get_source` helper to extract and dedent function implementations, then asserts that critical QuerySet iteration points include `iterator(chunk_size=2000)` to enforce memory-safe practices.
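The walkthrough does not show `_get_source`, so the following is only a guess at its shape, assuming it wraps `inspect.getsource` and `textwrap.dedent` from the standard library. The name `get_source` and the choice of `json.JSONEncoder.default` as a target are illustrative stand-ins, not code from the PR.

```python
import inspect
import json
import textwrap


def get_source(func):
    """Return the source of func with common leading indentation removed."""
    return textwrap.dedent(inspect.getsource(func))


# A method defined inside a class body is indented in its file, which is
# why the dedent step matters before any further processing.
source = get_source(json.JSONEncoder.default)

# The PR's tests then do a plain substring check on the returned text.
has_iterator = ".iterator(chunk_size=2000)" in source
```

Here the target is a standard-library method that has nothing to do with querysets, so the substring check is false. In the PR, the same check is applied to the nine target functions listed in the tests.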

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Testing Instructions	⚠️ Warning	Testing instructions incomplete. Batch_edit date-part and upload createdbyagent changes lack automated tests and documented manual testing procedures.	Add tests for date-part and createdbyagent changes. Document manual testing steps in test docstrings or TESTING guide with expected outcomes and verification procedures.

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and concisely captures the main change: adding .iterator() calls to unbounded querysets for memory optimization, directly addressing issue `#7940`.
Linked Issues check	✅ Passed	All code changes directly implement the requirements from issue `#7864`: .iterator(chunk_size=2000) added to nine high-impact querysets [`#7864`], batch edit date-part field handling fixed [`#7864`], and tests added to verify iterator usage [`#7864`].
Out of Scope Changes check	✅ Passed	The addition of batch_edit date-part field name handling and export cache module are in-scope enhancements directly supporting the iterator optimization goals for correctness and streaming.
Automatic Tests	✅ Passed	PR includes necessary automatic tests: 9 test methods verifying .iterator() usage for all iterator changes, with each test corresponding to actual implementations.


coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

specifyweb/backend/export/cache.py (1)
28-31: 💤 Low value

Consider removing redundant .all() call.

The .all() call on line 30 is redundant since RelatedManager supports .iterator() directly. While harmless, removing it simplifies the code.
♻️ Proposed simplification
         'fields': [
             _field_to_cache_entry(field)
-            for field in fields.all().iterator(chunk_size=2000)
+            for field in fields.iterator(chunk_size=2000)
         ],
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@specifyweb/backend/export/cache.py` around lines 28 - 31, The list
comprehension calling _field_to_cache_entry over
fields.all().iterator(chunk_size=2000) should drop the redundant .all() since
RelatedManager supports .iterator() directly; change
fields.all().iterator(chunk_size=2000) to fields.iterator(chunk_size=2000)
(keeping _field_to_cache_entry and the chunk_size unchanged) to simplify the
code.
specifyweb/backend/trees/views.py (1)
550-554: 💤 Low value

Review the necessity of iterator() with chunk_size=2000 for tree definition items.

Tree definition items (ranks) are small: Taxon trees have 36, while others range from 4–9 items. The chunk_size=2000 parameter is oversized for these datasets. More importantly, the list comprehension on line 553 immediately materializes all ranks into memory regardless, limiting the iterator's benefit. Consider simplifying to a direct query without iterator() if memory impact is negligible.
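The point about immediate materialization can be illustrated with a pure-Python stand-in. This is a sketch under stated assumptions: `fetch_in_chunks` is a hypothetical generator that imitates batched fetching, and the 36 rows come from the Taxon figure quoted in the comment above. No Django code is involved.

```python
def fetch_in_chunks(total_rows, chunk_size, fetch_log):
    """Yield row ids one at a time, fetching them chunk_size at a time."""
    for start in range(0, total_rows, chunk_size):
        chunk = list(range(start, min(start + chunk_size, total_rows)))
        fetch_log.append(len(chunk))  # record how many rows each fetch returned
        yield from chunk


fetch_log = []

# The list comprehension consumes the whole generator, so every row ends up
# in memory regardless of how the rows were fetched.
ranks = [{"rankid": row} for row in fetch_in_chunks(36, 2000, fetch_log)]
```

With 36 rows and a chunk size of 2000, there is a single fetch of 36 rows and the resulting list holds all 36 entries, which is the same end state as iterating the queryset directly.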
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@specifyweb/backend/trees/views.py` around lines 550 - 554, The current code
uses definition.treedefitems.order_by('rankid').iterator(chunk_size=2000) but
immediately materializes all items with [obj_to_data(rank) for rank in ranks],
so the large chunk_size and iterator() are unnecessary; replace the iterator
call with a direct ordered queryset (definition.treedefitems.order_by('rankid'))
or a simple list(...) if you want eager evaluation, and then keep the existing
list comprehension over that queryset to produce ranks — this removes the
oversized chunk_size and simplifies the logic while preserving behavior.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@specifyweb/specify/tests/test_queryset_iterators.py`:
- Around line 24-112: The tests currently inspect source strings for
".iterator(chunk_size=2000)" which is brittle; replace each source-inspection
test for the functions/methods (to_many_to_data, calculate_totals_deaccession,
Command.handle, extract_query, build_cache_tables, _build_single_cache,
get_cog_consolidated_preps, serialize_role, get_all_tree_information) with
behavior-based assertions that QuerySet.iterator is actually invoked with
chunk_size=2000: run the target function and use unittest.mock.patch (or
patch.object) to spy on django.db.models.query.QuerySet.iterator with wraps=True
and assert it was called with chunk_size=2000 for the specific queryset(s) the
function uses (or alternatively use Django's assertNumQueries / inspect
connection.queries to verify batched iteration), ensuring you patch at the right
import target so the iterator call inside each target function is observed.

📥 Commits

Reviewing files that changed from the base of the PR and between b50a592 and f7ebbe4.

📒 Files selected for processing (10)

specifyweb/backend/export/cache.py
specifyweb/backend/export/extract_query.py
specifyweb/backend/interactions/cog_preps.py
specifyweb/backend/permissions/views.py
specifyweb/backend/stored_queries/batch_edit.py
specifyweb/backend/trees/views.py
specifyweb/specify/api/calculated_fields.py
specifyweb/specify/api/serializers.py
specifyweb/specify/management/commands/print_tree.py
specifyweb/specify/tests/test_queryset_iterators.py

coderabbitai · 2026-05-19T10:57:10Z

+    def test_serializers_to_many_uses_iterator(self):
+        """to_many_to_data should use .iterator() when serializing dependent collections."""
+        from specifyweb.specify.api.serializers import to_many_to_data
+        source = _get_source(to_many_to_data)
+        self.assertIn(
+            '.iterator(chunk_size=2000)',
+            source,
+            "to_many_to_data should use .iterator(chunk_size=2000) on objs.all()"
+        )
+
+    def test_calculated_fields_deaccession_uses_iterator(self):
+        """calculate_totals_deaccession should use .iterator() on the filter queryset."""
+        from specifyweb.specify.api.calculated_fields import calculate_totals_deaccession
+        source = _get_source(calculate_totals_deaccession)
+        self.assertIn(
+            '.iterator(chunk_size=2000)',
+            source,
+            "calculate_totals_deaccession should use .iterator(chunk_size=2000)"
+        )
+
+    def test_print_tree_taxon_uses_iterator(self):
+        """print_tree management command should use .iterator() on Taxon.objects.all()."""
+        from specifyweb.specify.management.commands.print_tree import Command
+        source = _get_source(Command.handle)
+        self.assertIn(
+            '.iterator(chunk_size=2000)',
+            source,
+            "print_tree should use .iterator(chunk_size=2000) on Taxon.objects.all()"
+        )
+
+    def test_export_extract_query_uses_iterator(self):
+        """extract_query should use .iterator() on query.fields.all()."""
+        from specifyweb.backend.export.extract_query import extract_query
+        source = _get_source(extract_query)
+        self.assertIn(
+            '.iterator(chunk_size=2000)',
+            source,
+            "extract_query should use .iterator(chunk_size=2000) on query.fields.all()"
+        )
+
+    def test_export_cache_build_uses_iterator(self):
+        """build_cache_tables should use .iterator() on extensions.all()."""
+        from specifyweb.backend.export.cache import build_cache_tables
+        source = _get_source(build_cache_tables)
+        self.assertIn(
+            '.iterator(chunk_size=2000)',
+            source,
+            "build_cache_tables should use .iterator(chunk_size=2000) on extensions.all()"
+        )
+
+    def test_export_cache_fields_uses_iterator(self):
+        """_build_single_cache should use .iterator() on fields.all()."""
+        from specifyweb.backend.export.cache import _build_single_cache
+        source = _get_source(_build_single_cache)
+        self.assertIn(
+            '.iterator(chunk_size=2000)',
+            source,
+            "_build_single_cache should use .iterator(chunk_size=2000) on fields.all()"
+        )
+
+    def test_cog_preps_child_cogs_uses_iterator(self):
+        """get_cog_consolidated_preps should use .iterator() on child COG queries."""
+        from specifyweb.backend.interactions.cog_preps import get_cog_consolidated_preps
+        source = _get_source(get_cog_consolidated_preps)
+        self.assertIn(
+            '.iterator(chunk_size=2000)',
+            source,
+            "get_cog_consolidated_preps should use .iterator(chunk_size=2000)"
+        )
+
+    def test_permissions_serialize_role_uses_iterator(self):
+        """serialize_role should use .iterator() on role.policies.all()."""
+        from specifyweb.backend.permissions.views import serialize_role
+        source = _get_source(serialize_role)
+        self.assertIn(
+            '.iterator(chunk_size=2000)',
+            source,
+            "serialize_role should use .iterator(chunk_size=2000) on role.policies.all()"
+        )
+
+    def test_tree_views_ranks_uses_iterator(self):
+        """get_all_tree_information should use .iterator() on treedefitems."""
+        from specifyweb.backend.trees.views import get_all_tree_information
+        source = _get_source(get_all_tree_information)
+        self.assertIn(
+            '.iterator(chunk_size=2000)',
+            source,
+            "get_all_tree_information should use .iterator(chunk_size=2000) on ranks"
+        )


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Source inspection tests are brittle and will break with refactoring.

All nine test methods use string matching on source code to verify .iterator(chunk_size=2000) usage. This approach has significant maintenance risks:

Refactoring breaks tests: If the implementation extracts .iterator() calls into a helper function, decorator, or mixin, these tests fail even though the behavior remains correct.

False positive risk: String matching can incorrectly pass if .iterator(chunk_size=2000) appears in comments, string literals, or unrelated context.

No behavioral verification: Tests don't confirm that memory is actually reduced or that the iterator is invoked at runtime.
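The false-positive risk is easy to reproduce. In the sketch below, `SOURCE` is a made-up function whose only mention of the target string is inside a comment, so the substring check passes even though nothing calls `.iterator()`. The `ast`-based check is not something the review proposes; it is included only as a contrast, and it is still static analysis, so the refactoring concern above applies to it as well.

```python
import ast

NEEDLE = ".iterator(chunk_size=2000)"

# Hypothetical function source: the needle appears only inside a comment.
SOURCE = '''
def load_rows(queryset):
    # TODO: switch to .iterator(chunk_size=2000) once profiling is done
    return list(queryset.all())
'''


def mentions_iterator(source):
    """Substring check, in the style of the tests in this PR."""
    return NEEDLE in source


def calls_iterator(source, chunk_size=2000):
    """Return True only if the parsed code calls .iterator(chunk_size=<chunk_size>)."""
    for node in ast.walk(ast.parse(source)):
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)):
            continue
        if node.func.attr != "iterator":
            continue
        for keyword in node.keywords:
            value = keyword.value
            if (keyword.arg == "chunk_size"
                    and isinstance(value, ast.Constant)
                    and value.value == chunk_size):
                return True
    return False
```

For `SOURCE`, `mentions_iterator` returns `True` while `calls_iterator` returns `False`, because comments never reach the syntax tree.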

Consider behavior-based alternatives that verify actual functionality:

Mock QuerySet.iterator() and assert it's called with chunk_size=2000 on the target querysets.

Use Django's assertNumQueries or inspect connection.queries to verify batched query behavior.

Add integration tests with large datasets that measure memory usage.

Use unittest.mock.patch with wraps=True to spy on iterator calls without breaking behavior.

If source inspection is retained as a temporary migration guard, document the brittleness and plan to migrate to behavior-based tests once the .iterator() adoption is stable.
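A behavior-based check along these lines might look like the sketch below. It is an assumption-laden illustration, not the PR's tests: `FakeQuerySet` and `sum_counts` are hypothetical stand-ins so the example runs without Django. One detail worth noting is that in `unittest.mock`, `wraps` takes the object to delegate to rather than a boolean, so this sketch instead uses `autospec=True` with `side_effect` set to the original method, which keeps `self` in the recorded call.

```python
from unittest.mock import patch


class FakeQuerySet:
    """Hypothetical stand-in for a Django QuerySet."""

    def __init__(self, rows):
        self._rows = list(rows)

    def iterator(self, chunk_size=None):
        return iter(self._rows)


def sum_counts(queryset):
    """Hypothetical function under test; it should stream its rows."""
    return sum(queryset.iterator(chunk_size=2000))


def run_spy_check():
    queryset = FakeQuerySet([1, 2, 3])
    original = FakeQuerySet.iterator
    # autospec=True makes the mock bind like a real method, and side_effect
    # forwards every call to the original so behavior is unchanged.
    with patch.object(
        FakeQuerySet, "iterator", autospec=True, side_effect=original
    ) as spy:
        result = sum_counts(queryset)
    # Fails if iterator() was not called exactly once with chunk_size=2000.
    spy.assert_called_once_with(queryset, chunk_size=2000)
    return result
```

Unlike the substring check, this keeps passing if the `.iterator()` call moves into a helper, as long as it is still invoked at runtime with the expected chunk size.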


foozleface and others added 3 commits April 7, 2026 21:39

perf: add .iterator() to unbounded querysets to reduce memory usage (#…

ad5e3cf

…7864)

Merge branch 'main' into cas/perf-iterators-7864

634307c

Add export cache module to stream large querysets

f7ebbe4

github-project-automation Bot added this to General Tester Board Apr 23, 2026

github-project-automation Bot moved this to 📋Back Log in General Tester Board Apr 23, 2026

acwhite211 added this to the 7.12.2 milestone Apr 23, 2026

acwhite211 requested review from a team April 23, 2026 22:11

emenslin approved these changes Apr 28, 2026

emenslin requested a review from a team April 28, 2026 18:10

Iwantexpresso approved these changes Apr 28, 2026

bhumikaguptaa approved these changes May 6, 2026

coderabbitai Bot requested changes May 19, 2026

github-project-automation Bot moved this from 📋Back Log to Dev Attention Needed in General Tester Board May 19, 2026

acwhite211 wants to merge 3 commits into main from issue-7864

6 participants
