Add .iterator() to unbounded querysets #7940 #8019

Open
acwhite211 wants to merge 3 commits into main from issue-7864

Conversation

@acwhite211
Member

@acwhite211 acwhite211 commented Apr 23, 2026

Fixes #7864
Contributed by @foozleface
Based on the #7940 PR with additional fixes

Approximately 80 callsites in the backend use .all() without .iterator(), causing Django to cache all results in the QuerySet internal result cache. For large tables this means loading entire result sets into memory at once. This PR adds .iterator(chunk_size=2000) to 9 high-impact paths where unbounded querysets are iterated.
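
The difference between caching a full result set and streaming it in chunks can be illustrated without Django. The sketch below is only an analogy built on the standard-library sqlite3 module, not Django's implementation: `fetchall()` stands in for a fully cached QuerySet, and a `fetchmany()` loop stands in for `.iterator(chunk_size=2000)`. The table `prep`, the column `qty`, and the helper `iter_rows` are hypothetical names chosen for illustration.

```python
import sqlite3
from typing import Iterator, Tuple


def iter_rows(cursor: sqlite3.Cursor, chunk_size: int = 2000) -> Iterator[Tuple]:
    """Yield rows one at a time while fetching them from the cursor in chunks."""
    while True:
        chunk = cursor.fetchmany(chunk_size)
        if not chunk:
            return
        yield from chunk


# Hypothetical in-memory table standing in for a large backend table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prep (id INTEGER PRIMARY KEY, qty INTEGER)")
conn.executemany("INSERT INTO prep (qty) VALUES (?)", [(i % 5,) for i in range(10_000)])

# Eager: every row is held in one list before any processing starts.
all_rows = conn.execute("SELECT id, qty FROM prep").fetchall()
eager_total = sum(qty for _id, qty in all_rows)

# Streamed: only one chunk of rows is fetched and held at a time.
cursor = conn.execute("SELECT id, qty FROM prep")
streamed_total = sum(qty for _id, qty in iter_rows(cursor, chunk_size=2000))
```

Both totals are the same; only the number of rows held in memory at once differs.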

Implementation

  • Add .iterator(chunk_size=2000) to COG prep consolidation queries in cog_preps.py (2 callsites)
  • Add .iterator(chunk_size=2000) to role policy serialization in permissions/views.py
  • Add .iterator(chunk_size=2000) to deaccession total calculation in calculated_fields.py
  • Add .iterator(chunk_size=2000) to dependent to-many serialization in api/serializers.py
  • Add .iterator(chunk_size=2000) to tree definition rank loading in trees/views.py
  • Fix batch edit date-part field name handling to use date-part-aware lookup keys, preventing mislabeled column headers for temporal fields
  • Fix workbench upload to not overwrite explicit createdbyagent values in upload_table.py
  • Add tests verifying .iterator() is called on the target querysets

Testing instructions

  • Run batch edit with date fields (catalogedDate Full Date, catalogedDate Month, etc.) and verify column headers are correct
  • Run a workbench upload that explicitly sets createdbyagent and verify the value is preserved
  • Run the test suite (or just make sure the GitHub Action passed): python manage.py test specifyweb.specify.tests.test_queryset_iterators
  • Optional: Monitor memory usage during large query exports or COG prep operations
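
For the optional memory check, one rough way to compare eager and streamed iteration on the server side is Python's built-in tracemalloc module. This is a minimal sketch under stated assumptions: `fake_rows`, `eager_sum`, and `streamed_sum` are hypothetical stand-ins, it measures only Python-level allocations, and it does not touch the Specify codebase or a real database.

```python
import tracemalloc
from typing import Callable, Dict, Iterator


def fake_rows(n: int) -> Iterator[Dict[str, int]]:
    """Hypothetical stand-in for a large result set."""
    for i in range(n):
        yield {"id": i, "value": i * 2}


def peak_bytes(work: Callable[[], int]) -> int:
    """Return the peak traced allocation (in bytes) while running work()."""
    tracemalloc.start()
    try:
        work()
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def eager_sum() -> int:
    rows = list(fake_rows(100_000))  # materialize every row first
    return sum(row["value"] for row in rows)


def streamed_sum() -> int:
    # Rows are consumed one at a time and never stored together.
    return sum(row["value"] for row in fake_rows(100_000))


eager_peak = peak_bytes(eager_sum)
streamed_peak = peak_bytes(streamed_sum)
print(f"eager peak: {eager_peak} bytes, streamed peak: {streamed_peak} bytes")
```

The same `peak_bytes` wrapper could in principle be placed around a real export call in a Django shell, though that has not been tried here.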

Summary by CodeRabbit

  • New Features

    • Added export cache utilities to improve export performance and data structure handling.
  • Bug Fixes

    • Fixed date-part field naming in batch edit to ensure temporal fields are properly recognized and localized.
  • Tests

    • Added tests to verify proper memory-efficient query iteration across critical operations.

Review Change Stack

@github-project-automation github-project-automation Bot moved this to 📋Back Log in General Tester Board Apr 23, 2026
@acwhite211 acwhite211 added this to the 7.12.2 milestone Apr 23, 2026
@acwhite211 acwhite211 requested review from a team April 23, 2026 22:11
Collaborator

@emenslin emenslin left a comment

  • Run batch edit with date fields (catalogedDate Full Date, catalogedDate Month, etc.) and verify column headers are correct
  • Run a workbench upload that explicitly sets createdbyagent and verify the value is preserved

I was only able to test the first two, but it looks good!

@emenslin emenslin requested a review from a team April 28, 2026 18:10
Contributor

@Iwantexpresso Iwantexpresso left a comment

  • Run batch edit with date fields (catalogedDate Full Date, catalogedDate Month, etc.) and verify column headers are correct
  • Run a workbench upload that explicitly sets createdbyagent and verify the value is preserved
  • Run the test suite: python manage.py test specifyweb.specify.tests.test_queryset_iterators
  • Monitor memory usage during large query exports or COG prep operations

Tests seem to have passed with no issues!

Image

For memory usage, I used the Firefox profiler to track memory between the beginning of an export and shortly after the completion notification is received.

CSV export profiler:

Permalink to Firefox profiler capture:
https://profiler.firefox.com/public/9cyn3dgavxdx06758t5pet17rs8t706b3xwskk0/calltree/?globalTrackOrder=0w3&thread=3&v=16

Here's a screenshot (orange graphs) to give a rough idea of the memory usage:

Image

KML export profiler:

Permalink to Firefox profiler capture:
https://profiler.firefox.com/public/w1x3a4e7k3wkncx8c285t5bkg9kz62a3150ny4r/calltree/?globalTrackOrder=0w4&implementation=js&thread=0&v=16

Screenshot of the KML export results:

Image

Nothing looked out of the ordinary. There are some spikes between the export's start and its finalization, but memory usage either stays constant after that or decreases after a brief period.

Collaborator

@bhumikaguptaa bhumikaguptaa left a comment

  • Run batch edit with date fields (catalogedDate Full Date, catalogedDate Month, etc.) and verify column headers are correct
  • Run a workbench upload that explicitly sets createdbyagent and verify the value is preserved
  • Optional: Monitor memory usage during large query exports or COG prep operations

For the last instruction, I just used a large dataset for Batch Edit and export, and no server or gateway errors popped up. Everything else looks good.

@CarolineDenis
Contributor

@CodeRabbit full review

@coderabbitai

coderabbitai Bot commented May 19, 2026

✅ Actions performed

Full review triggered.

@coderabbitai

coderabbitai Bot commented May 19, 2026

Walkthrough

This PR reduces memory usage for large QuerySet operations by adopting Django's iterator(chunk_size=2000) across multiple services, introduces new export cache utilities with built-in chunked iteration, adds temporal field naming support to batch edit localization, and validates iterator adoption through source-code inspection tests.

Changes

QuerySet Memory and Field Handling Optimization

| Layer / File(s) | Summary |
|---|---|
| Export Cache Utilities with Iterator Batching: specifyweb/backend/export/cache.py | New module provides _field_to_cache_entry, _build_single_cache, and build_cache_tables to serialize export extensions and fields into cache dictionaries. All functions internally use iterator(chunk_size=2000) when processing extension mappings and building per-extension field arrays. |
| Queryset Iterator Adoption Across Services: specifyweb/backend/export/extract_query.py, specifyweb/backend/interactions/cog_preps.py, specifyweb/backend/permissions/views.py, specifyweb/backend/trees/views.py, specifyweb/specify/api/calculated_fields.py, specifyweb/specify/api/serializers.py, specifyweb/specify/management/commands/print_tree.py | Seven modules adopt iterator(chunk_size=2000) to stream large result sets in batches instead of materializing full QuerySets in memory. Affects field iteration in export XML construction, child ID lookups in COG recursion, policy aggregation in role serialization, rank fetching in tree information retrieval, preparation counting in deaccession calculations, many-to-many object serialization, and taxon traversal in tree printing. |
| Batch Edit Date-Part-Aware Field Naming: specifyweb/backend/stored_queries/batch_edit.py | Introduces _get_date_part_field_name helper to compute date-part-suffixed field keys (e.g., field_name__Month). Updates _lookup_in_fields to use suffixed names for localization label lookups, enhances naive_field_format to append visible date-part labels to temporal field captions, and aligns run_batch_edit_query localization dumps with the new naming convention. |
| Iterator Usage Validation Tests: specifyweb/specify/tests/test_queryset_iterators.py | New test class TestIteratorUsageInSource inspects source code of targeted functions across export, permissions, trees, serializers, and management commands. Uses _get_source helper to extract and dedent function implementations, then asserts that critical QuerySet iteration points include iterator(chunk_size=2000) to enforce memory-safe practices. |
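
The date-part-suffixed keys mentioned in the batch edit entry (e.g., field_name__Month) can be pictured roughly as follows. This is a hypothetical sketch, not the actual `_get_date_part_field_name` implementation: the helper names `date_part_key` and `caption_for`, the `labels` dictionary, and the fallback behavior are all assumptions made for illustration.

```python
from typing import Dict, Optional


def date_part_key(field_name: str, date_part: Optional[str]) -> str:
    """Hypothetical helper: build a lookup key that distinguishes date parts."""
    if not date_part:
        return field_name
    return f"{field_name}__{date_part}"


# Hypothetical localization labels keyed by date-part-aware names.
labels: Dict[str, str] = {
    "catalogedDate": "Cataloged Date",
    "catalogedDate__Month": "Cataloged Date (Month)",
    "catalogedDate__Year": "Cataloged Date (Year)",
}


def caption_for(field_name: str, date_part: Optional[str] = None) -> str:
    """Look up a caption, falling back to the plain field's label (assumed behavior)."""
    key = date_part_key(field_name, date_part)
    return labels.get(key, labels.get(field_name, field_name))
```

With distinct keys per date part, a Month column and a Year column on the same field no longer resolve to the same header.
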
🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Testing Instructions | ⚠️ Warning | Testing instructions incomplete. Batch_edit date-part and upload createdbyagent changes lack automated tests and documented manual testing procedures. | Add tests for date-part and createdbyagent changes. Document manual testing steps in test docstrings or TESTING guide with expected outcomes and verification procedures. |
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely captures the main change: adding .iterator() calls to unbounded querysets for memory optimization, directly addressing issue #7940. |
| Linked Issues check | ✅ Passed | All code changes directly implement the requirements from issue #7864: .iterator(chunk_size=2000) added to nine high-impact querysets [#7864], batch edit date-part field handling fixed [#7864], and tests added to verify iterator usage [#7864]. |
| Out of Scope Changes check | ✅ Passed | The addition of batch_edit date-part field name handling and export cache module are in-scope enhancements directly supporting the iterator optimization goals for correctness and streaming. |
| Automatic Tests | ✅ Passed | PR includes necessary automatic tests: 9 test methods verifying .iterator() usage for all iterator changes, with each test corresponding to actual implementations. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
specifyweb/backend/export/cache.py (1)

28-31: 💤 Low value

Consider removing redundant .all() call.

The .all() call on line 30 is redundant since RelatedManager supports .iterator() directly. While harmless, removing it simplifies the code.

♻️ Proposed simplification
         'fields': [
             _field_to_cache_entry(field)
-            for field in fields.all().iterator(chunk_size=2000)
+            for field in fields.iterator(chunk_size=2000)
         ],
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@specifyweb/backend/export/cache.py` around lines 28 - 31, The list
comprehension calling _field_to_cache_entry over
fields.all().iterator(chunk_size=2000) should drop the redundant .all() since
RelatedManager supports .iterator() directly; change
fields.all().iterator(chunk_size=2000) to fields.iterator(chunk_size=2000)
(keeping _field_to_cache_entry and the chunk_size unchanged) to simplify the
code.
specifyweb/backend/trees/views.py (1)

550-554: 💤 Low value

Review the necessity of iterator() with chunk_size=2000 for tree definition items.

Tree definition items (ranks) are small: Taxon trees have 36, while others range from 4–9 items. The chunk_size=2000 parameter is oversized for these datasets. More importantly, the list comprehension on line 553 immediately materializes all ranks into memory regardless, limiting the iterator's benefit. Consider simplifying to a direct query without iterator() if memory impact is negligible.
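
The point about the list comprehension can be shown with plain Python. In this sketch `iter_ranks` and `obj_to_data_stub` are hypothetical stand-ins for the chunked queryset iterator and for obj_to_data; no Django code is involved.

```python
from typing import Dict, Iterator, List


def iter_ranks(count: int) -> Iterator[Dict[str, int]]:
    """Hypothetical stand-in for a lazily iterated treedefitems queryset."""
    for rank_id in range(0, count * 10, 10):
        yield {"rankid": rank_id}


def obj_to_data_stub(rank: Dict[str, int]) -> Dict[str, int]:
    """Hypothetical stand-in for obj_to_data."""
    return {"rankid": rank["rankid"]}


lazy = iter_ranks(36)
# The comprehension drains the generator and keeps every result in one list,
# so all 36 serialized ranks are in memory together regardless of chunking.
ranks: List[Dict[str, int]] = [obj_to_data_stub(rank) for rank in lazy]
remaining = list(lazy)  # empty: the generator is already exhausted
```

In Django, iterator() would still avoid populating the QuerySet's own result cache with model instances, but for a few dozen rows that saving is likely negligible.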

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@specifyweb/backend/trees/views.py` around lines 550 - 554, The current code
uses definition.treedefitems.order_by('rankid').iterator(chunk_size=2000) but
immediately materializes all items with [obj_to_data(rank) for rank in ranks],
so the large chunk_size and iterator() are unnecessary; replace the iterator
call with a direct ordered queryset (definition.treedefitems.order_by('rankid'))
or a simple list(...) if you want eager evaluation, and then keep the existing
list comprehension over that queryset to produce ranks — this removes the
oversized chunk_size and simplifies the logic while preserving behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@specifyweb/specify/tests/test_queryset_iterators.py`:
- Around line 24-112: The tests currently inspect source strings for
".iterator(chunk_size=2000)" which is brittle; replace each source-inspection
test for the functions/methods (to_many_to_data, calculate_totals_deaccession,
Command.handle, extract_query, build_cache_tables, _build_single_cache,
get_cog_consolidated_preps, serialize_role, get_all_tree_information) with
behavior-based assertions that QuerySet.iterator is actually invoked with
chunk_size=2000: run the target function and use unittest.mock.patch (or
patch.object) to spy on django.db.models.query.QuerySet.iterator with wraps=True
and assert it was called with chunk_size=2000 for the specific queryset(s) the
function uses (or alternatively use Django's assertNumQueries / inspect
connection.queries to verify batched iteration), ensuring you patch at the right
import target so the iterator call inside each target function is observed.

---

Nitpick comments:
In `@specifyweb/backend/export/cache.py`:
- Around line 28-31: The list comprehension calling _field_to_cache_entry over
fields.all().iterator(chunk_size=2000) should drop the redundant .all() since
RelatedManager supports .iterator() directly; change
fields.all().iterator(chunk_size=2000) to fields.iterator(chunk_size=2000)
(keeping _field_to_cache_entry and the chunk_size unchanged) to simplify the
code.

In `@specifyweb/backend/trees/views.py`:
- Around line 550-554: The current code uses
definition.treedefitems.order_by('rankid').iterator(chunk_size=2000) but
immediately materializes all items with [obj_to_data(rank) for rank in ranks],
so the large chunk_size and iterator() are unnecessary; replace the iterator
call with a direct ordered queryset (definition.treedefitems.order_by('rankid'))
or a simple list(...) if you want eager evaluation, and then keep the existing
list comprehension over that queryset to produce ranks — this removes the
oversized chunk_size and simplifies the logic while preserving behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: bafc7378-5fe6-4073-97df-581972c966d7

📥 Commits

Reviewing files that changed from the base of the PR and between b50a592 and f7ebbe4.

📒 Files selected for processing (10)
  • specifyweb/backend/export/cache.py
  • specifyweb/backend/export/extract_query.py
  • specifyweb/backend/interactions/cog_preps.py
  • specifyweb/backend/permissions/views.py
  • specifyweb/backend/stored_queries/batch_edit.py
  • specifyweb/backend/trees/views.py
  • specifyweb/specify/api/calculated_fields.py
  • specifyweb/specify/api/serializers.py
  • specifyweb/specify/management/commands/print_tree.py
  • specifyweb/specify/tests/test_queryset_iterators.py

Comment on lines +24 to +112
    def test_serializers_to_many_uses_iterator(self):
        """to_many_to_data should use .iterator() when serializing dependent collections."""
        from specifyweb.specify.api.serializers import to_many_to_data
        source = _get_source(to_many_to_data)
        self.assertIn(
            '.iterator(chunk_size=2000)',
            source,
            "to_many_to_data should use .iterator(chunk_size=2000) on objs.all()"
        )

    def test_calculated_fields_deaccession_uses_iterator(self):
        """calculate_totals_deaccession should use .iterator() on the filter queryset."""
        from specifyweb.specify.api.calculated_fields import calculate_totals_deaccession
        source = _get_source(calculate_totals_deaccession)
        self.assertIn(
            '.iterator(chunk_size=2000)',
            source,
            "calculate_totals_deaccession should use .iterator(chunk_size=2000)"
        )

    def test_print_tree_taxon_uses_iterator(self):
        """print_tree management command should use .iterator() on Taxon.objects.all()."""
        from specifyweb.specify.management.commands.print_tree import Command
        source = _get_source(Command.handle)
        self.assertIn(
            '.iterator(chunk_size=2000)',
            source,
            "print_tree should use .iterator(chunk_size=2000) on Taxon.objects.all()"
        )

    def test_export_extract_query_uses_iterator(self):
        """extract_query should use .iterator() on query.fields.all()."""
        from specifyweb.backend.export.extract_query import extract_query
        source = _get_source(extract_query)
        self.assertIn(
            '.iterator(chunk_size=2000)',
            source,
            "extract_query should use .iterator(chunk_size=2000) on query.fields.all()"
        )

    def test_export_cache_build_uses_iterator(self):
        """build_cache_tables should use .iterator() on extensions.all()."""
        from specifyweb.backend.export.cache import build_cache_tables
        source = _get_source(build_cache_tables)
        self.assertIn(
            '.iterator(chunk_size=2000)',
            source,
            "build_cache_tables should use .iterator(chunk_size=2000) on extensions.all()"
        )

    def test_export_cache_fields_uses_iterator(self):
        """_build_single_cache should use .iterator() on fields.all()."""
        from specifyweb.backend.export.cache import _build_single_cache
        source = _get_source(_build_single_cache)
        self.assertIn(
            '.iterator(chunk_size=2000)',
            source,
            "_build_single_cache should use .iterator(chunk_size=2000) on fields.all()"
        )

    def test_cog_preps_child_cogs_uses_iterator(self):
        """get_cog_consolidated_preps should use .iterator() on child COG queries."""
        from specifyweb.backend.interactions.cog_preps import get_cog_consolidated_preps
        source = _get_source(get_cog_consolidated_preps)
        self.assertIn(
            '.iterator(chunk_size=2000)',
            source,
            "get_cog_consolidated_preps should use .iterator(chunk_size=2000)"
        )

    def test_permissions_serialize_role_uses_iterator(self):
        """serialize_role should use .iterator() on role.policies.all()."""
        from specifyweb.backend.permissions.views import serialize_role
        source = _get_source(serialize_role)
        self.assertIn(
            '.iterator(chunk_size=2000)',
            source,
            "serialize_role should use .iterator(chunk_size=2000) on role.policies.all()"
        )

    def test_tree_views_ranks_uses_iterator(self):
        """get_all_tree_information should use .iterator() on treedefitems."""
        from specifyweb.backend.trees.views import get_all_tree_information
        source = _get_source(get_all_tree_information)
        self.assertIn(
            '.iterator(chunk_size=2000)',
            source,
            "get_all_tree_information should use .iterator(chunk_size=2000) on ranks"
        )

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Source inspection tests are brittle and will break with refactoring.

All nine test methods use string matching on source code to verify .iterator(chunk_size=2000) usage. This approach has significant maintenance risks:

  • Refactoring breaks tests: If the implementation extracts .iterator() calls into a helper function, decorator, or mixin, these tests fail even though the behavior remains correct.
  • False positive risk: String matching can incorrectly pass if .iterator(chunk_size=2000) appears in comments, string literals, or unrelated context.
  • No behavioral verification: Tests don't confirm that memory is actually reduced or that the iterator is invoked at runtime.
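
The false-positive risk is easy to reproduce. In the sketch below, a plain substring check passes on a function that only mentions the call in a comment, while a stricter check based on the standard-library ast module does not. The sample sources and the helpers `substring_check` and `ast_check` are hypothetical, and the ast variant is offered only as an illustration; it is still source inspection and does not verify runtime behavior.

```python
import ast

SOURCE_WITH_COMMENT_ONLY = '''
def serialize(objs):
    # TODO: switch to objs.all().iterator(chunk_size=2000)
    return [o for o in objs.all()]
'''

SOURCE_WITH_CALL = '''
def serialize(objs):
    return [o for o in objs.all().iterator(chunk_size=2000)]
'''


def substring_check(source: str) -> bool:
    """Mimics the assertIn-style check: matches text anywhere, even in comments."""
    return ".iterator(chunk_size=2000)" in source


def ast_check(source: str) -> bool:
    """True only if the code contains an actual call .iterator(chunk_size=2000)."""
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "iterator"
        ):
            for kw in node.keywords:
                if (
                    kw.arg == "chunk_size"
                    and isinstance(kw.value, ast.Constant)
                    and kw.value.value == 2000
                ):
                    return True
    return False
```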

Consider behavior-based alternatives that verify actual functionality:

  • Mock QuerySet.iterator() and assert it's called with chunk_size=2000 on the target querysets.
  • Use Django's assertNumQueries or inspect connection.queries to verify batched query behavior.
  • Add integration tests with large datasets that measure memory usage.
  • Use unittest.mock.patch.object to spy on iterator calls without changing behavior, for example with autospec=True and the original method passed as side_effect.

If source inspection is retained as a temporary migration guard, document the brittleness and plan to migrate to behavior-based tests once the .iterator() adoption is stable.
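
A behavior-based check along these lines might look like the following sketch. It uses a hypothetical `FakeQuerySet` class and a hypothetical `total_count` function instead of Django models, so it runs on its own; with real code the patch target would presumably be django.db.models.query.QuerySet.iterator, which has not been tried against this codebase. The spy is built with `patch.object(..., autospec=True, side_effect=original)` so that the original method still runs and the instance is passed through.

```python
from typing import Iterator, List
from unittest import mock


class FakeQuerySet:
    """Hypothetical stand-in for a Django QuerySet; only iterator() is modeled."""

    def __init__(self, rows: List[int]) -> None:
        self._rows = rows

    def iterator(self, chunk_size: int = 2000) -> Iterator[int]:
        return iter(self._rows)


def total_count(queryset: FakeQuerySet) -> int:
    """Hypothetical function under test: streams rows instead of caching them."""
    return sum(queryset.iterator(chunk_size=2000))


original_iterator = FakeQuerySet.iterator
with mock.patch.object(
    FakeQuerySet, "iterator", autospec=True, side_effect=original_iterator
) as spy:
    result = total_count(FakeQuerySet([1, 2, 3]))

# The real method still ran (the result is correct) and the call was recorded.
spy.assert_called_once()
recorded_kwargs = spy.call_args.kwargs
```

Unlike a source-string match, this keeps passing if the iterator call is moved into a helper, as long as it is still made at runtime with the expected chunk size.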

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@specifyweb/specify/tests/test_queryset_iterators.py` around lines 24 - 112,
The tests currently inspect source strings for ".iterator(chunk_size=2000)"
which is brittle; replace each source-inspection test for the functions/methods
(to_many_to_data, calculate_totals_deaccession, Command.handle, extract_query,
build_cache_tables, _build_single_cache, get_cog_consolidated_preps,
serialize_role, get_all_tree_information) with behavior-based assertions that
QuerySet.iterator is actually invoked with chunk_size=2000: run the target
function and use unittest.mock.patch (or patch.object) to spy on
django.db.models.query.QuerySet.iterator (wrapping the original method) and assert it was
called with chunk_size=2000 for the specific queryset(s) the function uses (or
alternatively use Django's assertNumQueries / inspect connection.queries to
verify batched iteration), ensuring you patch at the right import target so the
iterator call inside each target function is observed.

@github-project-automation github-project-automation Bot moved this from 📋Back Log to Dev Attention Needed in General Tester Board May 19, 2026

Labels

None yet

Projects

Status: Dev Attention Needed

Development

Successfully merging this pull request may close these issues.

Use iterators to evaluate large Django QuerySets

6 participants