Add .iterator() to unbounded querysets#7940#8019
Conversation
emenslin
left a comment
There was a problem hiding this comment.
- Run batch edit with date fields (catalogedDate Full Date, catalogedDate Month, etc.) and verify column headers are correct
- Run a workbench upload that explicitly sets createdbyagent and verify the value is preserved
I was only able to test the first two, but it looks good!
There was a problem hiding this comment.
- Run batch edit with date fields (catalogedDate Full Date, catalogedDate Month, etc.) and verify column headers are correct
- Run a workbench upload that explicitly sets createdbyagent and verify the value is preserved
- Run the test suite:
python manage.py test specifyweb.specify.tests.test_queryset_iterators - Monitor memory usage during large query exports or COG prep operations
tests seem to have passed with no issues!
for memory usage I used the Firefox profiler to keep track of the memory usage catering between the begging of an export and a little after the notification of the finalization is received
CSV export profiler:
Permalink to firefox profiler capture:
https://profiler.firefox.com/public/9cyn3dgavxdx06758t5pet17rs8t706b3xwskk0/calltree/?globalTrackOrder=0w3&thread=3&v=16
here's a screenshot of the memory usage (orange graphs ) to give a slight idea of the memory usage itself
KML export profiler:
Permalink to firefox profiler capture
https://profiler.firefox.com/public/w1x3a4e7k3wkncx8c285t5bkg9kz62a3150ny4r/calltree/?globalTrackOrder=0w4&implementation=js&thread=0&v=16
SC of results of kml export
Nothing looked out of the ordinary, there is some spikes in between the exports beginning and finalization but the memory usage either stays constant after that or decreases after a brief period
bhumikaguptaa
left a comment
There was a problem hiding this comment.
- Run batch edit with date fields (catalogedDate Full Date, catalogedDate Month, etc.) and verify column headers are correct
- Run a workbench upload that explicitly sets createdbyagent and verify the value is preserved
- Optional: Monitor memory usage during large query exports or COG prep operations
For the last instruction i just used large dataset to Batch Edit and export and there were no server or gateway error pop up. Everything else looks good.
|
@CodeRabbit full review |
✅ Actions performedFull review triggered. |
📝 WalkthroughWalkthroughThis PR reduces memory usage for large QuerySet operations by adopting Django's ChangesQuerySet Memory and Field Handling Optimization
🚥 Pre-merge checks | ✅ 5 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
specifyweb/backend/export/cache.py (1)
28-31: 💤 Low valueConsider removing redundant
.all()call.The
.all()call on line 30 is redundant sinceRelatedManagersupports.iterator()directly. While harmless, removing it simplifies the code.♻️ Proposed simplification
'fields': [ _field_to_cache_entry(field) - for field in fields.all().iterator(chunk_size=2000) + for field in fields.iterator(chunk_size=2000) ],🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@specifyweb/backend/export/cache.py` around lines 28 - 31, The list comprehension calling _field_to_cache_entry over fields.all().iterator(chunk_size=2000) should drop the redundant .all() since RelatedManager supports .iterator() directly; change fields.all().iterator(chunk_size=2000) to fields.iterator(chunk_size=2000) (keeping _field_to_cache_entry and the chunk_size unchanged) to simplify the code.specifyweb/backend/trees/views.py (1)
550-554: 💤 Low valueReview the necessity of
iterator()withchunk_size=2000for tree definition items.Tree definition items (ranks) are small: Taxon trees have 36, while others range from 4–9 items. The
chunk_size=2000parameter is oversized for these datasets. More importantly, the list comprehension on line 553 immediately materializes all ranks into memory regardless, limiting the iterator's benefit. Consider simplifying to a direct query withoutiterator()if memory impact is negligible.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@specifyweb/backend/trees/views.py` around lines 550 - 554, The current code uses definition.treedefitems.order_by('rankid').iterator(chunk_size=2000) but immediately materializes all items with [obj_to_data(rank) for rank in ranks], so the large chunk_size and iterator() are unnecessary; replace the iterator call with a direct ordered queryset (definition.treedefitems.order_by('rankid')) or a simple list(...) if you want eager evaluation, and then keep the existing list comprehension over that queryset to produce ranks — this removes the oversized chunk_size and simplifies the logic while preserving behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@specifyweb/specify/tests/test_queryset_iterators.py`:
- Around line 24-112: The tests currently inspect source strings for
".iterator(chunk_size=2000)" which is brittle; replace each source-inspection
test for the functions/methods (to_many_to_data, calculate_totals_deaccession,
Command.handle, extract_query, build_cache_tables, _build_single_cache,
get_cog_consolidated_preps, serialize_role, get_all_tree_information) with
behavior-based assertions that QuerySet.iterator is actually invoked with
chunk_size=2000: run the target function and use unittest.mock.patch (or
patch.object) to spy on django.db.models.query.QuerySet.iterator with wraps=True
and assert it was called with chunk_size=2000 for the specific queryset(s) the
function uses (or alternatively use Django's assertNumQueries / inspect
connection.queries to verify batched iteration), ensuring you patch at the right
import target so the iterator call inside each target function is observed.
---
Nitpick comments:
In `@specifyweb/backend/export/cache.py`:
- Around line 28-31: The list comprehension calling _field_to_cache_entry over
fields.all().iterator(chunk_size=2000) should drop the redundant .all() since
RelatedManager supports .iterator() directly; change
fields.all().iterator(chunk_size=2000) to fields.iterator(chunk_size=2000)
(keeping _field_to_cache_entry and the chunk_size unchanged) to simplify the
code.
In `@specifyweb/backend/trees/views.py`:
- Around line 550-554: The current code uses
definition.treedefitems.order_by('rankid').iterator(chunk_size=2000) but
immediately materializes all items with [obj_to_data(rank) for rank in ranks],
so the large chunk_size and iterator() are unnecessary; replace the iterator
call with a direct ordered queryset (definition.treedefitems.order_by('rankid'))
or a simple list(...) if you want eager evaluation, and then keep the existing
list comprehension over that queryset to produce ranks — this removes the
oversized chunk_size and simplifies the logic while preserving behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: bafc7378-5fe6-4073-97df-581972c966d7
📒 Files selected for processing (10)
specifyweb/backend/export/cache.pyspecifyweb/backend/export/extract_query.pyspecifyweb/backend/interactions/cog_preps.pyspecifyweb/backend/permissions/views.pyspecifyweb/backend/stored_queries/batch_edit.pyspecifyweb/backend/trees/views.pyspecifyweb/specify/api/calculated_fields.pyspecifyweb/specify/api/serializers.pyspecifyweb/specify/management/commands/print_tree.pyspecifyweb/specify/tests/test_queryset_iterators.py
| def test_serializers_to_many_uses_iterator(self): | ||
| """to_many_to_data should use .iterator() when serializing dependent collections.""" | ||
| from specifyweb.specify.api.serializers import to_many_to_data | ||
| source = _get_source(to_many_to_data) | ||
| self.assertIn( | ||
| '.iterator(chunk_size=2000)', | ||
| source, | ||
| "to_many_to_data should use .iterator(chunk_size=2000) on objs.all()" | ||
| ) | ||
|
|
||
| def test_calculated_fields_deaccession_uses_iterator(self): | ||
| """calculate_totals_deaccession should use .iterator() on the filter queryset.""" | ||
| from specifyweb.specify.api.calculated_fields import calculate_totals_deaccession | ||
| source = _get_source(calculate_totals_deaccession) | ||
| self.assertIn( | ||
| '.iterator(chunk_size=2000)', | ||
| source, | ||
| "calculate_totals_deaccession should use .iterator(chunk_size=2000)" | ||
| ) | ||
|
|
||
| def test_print_tree_taxon_uses_iterator(self): | ||
| """print_tree management command should use .iterator() on Taxon.objects.all().""" | ||
| from specifyweb.specify.management.commands.print_tree import Command | ||
| source = _get_source(Command.handle) | ||
| self.assertIn( | ||
| '.iterator(chunk_size=2000)', | ||
| source, | ||
| "print_tree should use .iterator(chunk_size=2000) on Taxon.objects.all()" | ||
| ) | ||
|
|
||
| def test_export_extract_query_uses_iterator(self): | ||
| """extract_query should use .iterator() on query.fields.all().""" | ||
| from specifyweb.backend.export.extract_query import extract_query | ||
| source = _get_source(extract_query) | ||
| self.assertIn( | ||
| '.iterator(chunk_size=2000)', | ||
| source, | ||
| "extract_query should use .iterator(chunk_size=2000) on query.fields.all()" | ||
| ) | ||
|
|
||
| def test_export_cache_build_uses_iterator(self): | ||
| """build_cache_tables should use .iterator() on extensions.all().""" | ||
| from specifyweb.backend.export.cache import build_cache_tables | ||
| source = _get_source(build_cache_tables) | ||
| self.assertIn( | ||
| '.iterator(chunk_size=2000)', | ||
| source, | ||
| "build_cache_tables should use .iterator(chunk_size=2000) on extensions.all()" | ||
| ) | ||
|
|
||
| def test_export_cache_fields_uses_iterator(self): | ||
| """_build_single_cache should use .iterator() on fields.all().""" | ||
| from specifyweb.backend.export.cache import _build_single_cache | ||
| source = _get_source(_build_single_cache) | ||
| self.assertIn( | ||
| '.iterator(chunk_size=2000)', | ||
| source, | ||
| "_build_single_cache should use .iterator(chunk_size=2000) on fields.all()" | ||
| ) | ||
|
|
||
| def test_cog_preps_child_cogs_uses_iterator(self): | ||
| """get_cog_consolidated_preps should use .iterator() on child COG queries.""" | ||
| from specifyweb.backend.interactions.cog_preps import get_cog_consolidated_preps | ||
| source = _get_source(get_cog_consolidated_preps) | ||
| self.assertIn( | ||
| '.iterator(chunk_size=2000)', | ||
| source, | ||
| "get_cog_consolidated_preps should use .iterator(chunk_size=2000)" | ||
| ) | ||
|
|
||
| def test_permissions_serialize_role_uses_iterator(self): | ||
| """serialize_role should use .iterator() on role.policies.all().""" | ||
| from specifyweb.backend.permissions.views import serialize_role | ||
| source = _get_source(serialize_role) | ||
| self.assertIn( | ||
| '.iterator(chunk_size=2000)', | ||
| source, | ||
| "serialize_role should use .iterator(chunk_size=2000) on role.policies.all()" | ||
| ) | ||
|
|
||
| def test_tree_views_ranks_uses_iterator(self): | ||
| """get_all_tree_information should use .iterator() on treedefitems.""" | ||
| from specifyweb.backend.trees.views import get_all_tree_information | ||
| source = _get_source(get_all_tree_information) | ||
| self.assertIn( | ||
| '.iterator(chunk_size=2000)', | ||
| source, | ||
| "get_all_tree_information should use .iterator(chunk_size=2000) on ranks" | ||
| ) |
There was a problem hiding this comment.
Source inspection tests are brittle and will break with refactoring.
All nine test methods use string matching on source code to verify .iterator(chunk_size=2000) usage. This approach has significant maintenance risks:
- Refactoring breaks tests: If the implementation extracts
.iterator()calls into a helper function, decorator, or mixin, these tests fail even though the behavior remains correct. - False positive risk: String matching can incorrectly pass if
.iterator(chunk_size=2000)appears in comments, string literals, or unrelated context. - No behavioral verification: Tests don't confirm that memory is actually reduced or that the iterator is invoked at runtime.
Consider behavior-based alternatives that verify actual functionality:
- Mock
QuerySet.iterator()and assert it's called withchunk_size=2000on the target querysets. - Use Django's
assertNumQueriesor inspectconnection.queriesto verify batched query behavior. - Add integration tests with large datasets that measure memory usage.
- Use
unittest.mock.patchwithwraps=Trueto spy on iterator calls without breaking behavior.
If source inspection is retained as a temporary migration guard, document the brittleness and plan to migrate to behavior-based tests once the .iterator() adoption is stable.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@specifyweb/specify/tests/test_queryset_iterators.py` around lines 24 - 112,
The tests currently inspect source strings for ".iterator(chunk_size=2000)"
which is brittle; replace each source-inspection test for the functions/methods
(to_many_to_data, calculate_totals_deaccession, Command.handle, extract_query,
build_cache_tables, _build_single_cache, get_cog_consolidated_preps,
serialize_role, get_all_tree_information) with behavior-based assertions that
QuerySet.iterator is actually invoked with chunk_size=2000: run the target
function and use unittest.mock.patch (or patch.object) to spy on
django.db.models.query.QuerySet.iterator with wraps=True and assert it was
called with chunk_size=2000 for the specific queryset(s) the function uses (or
alternatively use Django's assertNumQueries / inspect connection.queries to
verify batched iteration), ensuring you patch at the right import target so the
iterator call inside each target function is observed.
Fixes #7864
Contributed by @foozleface
Based on the #7940 PR with additional fixes
Approximately 80 callsites in the backend use
.all()without.iterator(), causing Django to cache all results in the QuerySet internal result cache. For large tables this means loading entire result sets into memory at once. This PR adds.iterator(chunk_size=2000)to 9 high-impact paths where unbounded querysets are iterated.Implementation
.iterator(chunk_size=2000)to COG prep consolidation queries incog_preps.py(2 callsites).iterator(chunk_size=2000)to role policy serialization inpermissions/views.py.iterator(chunk_size=2000)to deaccession total calculation incalculated_fields.py.iterator(chunk_size=2000)to dependent to-many serialization inapi/serializers.py.iterator(chunk_size=2000)to tree definition rank loading intrees/views.pycreatedbyagentvalues inupload_table.py.iterator()is called on the target querysetsTesting instructions
python manage.py test specifyweb.specify.tests.test_queryset_iteratorsSummary by CodeRabbit
New Features
Bug Fixes
Tests