Conversation

@anna-parker (Contributor) commented Dec 19, 2025

partially resolves #5755


Previously, processing mpox took ~6 min per 1k sequences; with the increased chunk size this drops to ~30 s per 1k sequences.

State on staging before this change:

11:38:28    DEBUG (get_ena_submission_list.py:  93) - Successfully processed 26000 entries.
11:38:29    DEBUG (get_ena_submission_list.py:  93) - Successfully processed 27000 entries.
11:38:32    DEBUG (get_ena_submission_list.py:  93) - Successfully processed 28000 entries.
11:38:32     INFO (get_ena_submission_list.py: 223) - Getting released sequences for organism: mpox
11:38:32     INFO (get_ena_submission_list.py: 226) - Starting to stream released entries. Filtering for submission...
11:38:32    DEBUG (get_ena_submission_list.py:  85) - Querying ENA db for latest version of submissions
11:38:32    DEBUG (get_ena_submission_list.py:  89) - Starting processing of data from Loculus backend
11:38:32     INFO (     call_loculus.py: 160) - Fetching released data from https://backend-staging.pathoplexus.org/mpox/get-released-data with request id 5b18dbbc-e061-4140-8316-f6822d482b27
11:38:34    DEBUG (get_ena_submission_list.py:  93) - Successfully processed 0 entries.
11:44:03    DEBUG (get_ena_submission_list.py:  93) - Successfully processed 1000 entries.

With this change (chunk size 65536):

14:03:12    DEBUG (get_ena_submission_list.py:  93) - Successfully processed 26000 entries.
14:03:13    DEBUG (get_ena_submission_list.py:  93) - Successfully processed 27000 entries.
14:03:14    DEBUG (get_ena_submission_list.py:  93) - Successfully processed 28000 entries.
14:03:14     INFO (get_ena_submission_list.py: 223) - Getting released sequences for organism: mpox
14:03:14     INFO (get_ena_submission_list.py: 226) - Starting to stream released entries. Filtering for submission...
14:03:14    DEBUG (get_ena_submission_list.py:  85) - Querying ENA db for latest version of submissions
14:03:14    DEBUG (get_ena_submission_list.py:  89) - Starting processing of data from Loculus backend
14:03:14     INFO (     call_loculus.py: 160) - Fetching released data from https://backend-staging.pathoplexus.org/mpox/get-released-data with request id 551799c9-d794-411a-90cb-412b07fbe8ff
14:03:15    DEBUG (get_ena_submission_list.py:  93) - Successfully processed 0 entries.
14:03:40    DEBUG (get_ena_submission_list.py:  93) - Successfully processed 1000 entries.
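
For readers outside the codebase, here is a minimal sketch of the pattern described above (streaming NDJSON with requests, parsing lines with orjson, larger chunk size). The function name, timeout, and logging details are illustrative assumptions, not the repo's actual code:

import logging

import orjson
import requests

logger = logging.getLogger(__name__)

def stream_released_entries(url: str, chunk_size: int = 65536):
    """Stream an NDJSON response, parsing each line with orjson."""
    processed_entries = 0
    with requests.get(url, stream=True, timeout=600) as response:
        response.raise_for_status()
        # iter_lines() without decode_unicode=True yields bytes, which
        # orjson.loads() accepts directly; a larger chunk_size means
        # fewer socket reads per parsed line.
        for line in response.iter_lines(chunk_size=chunk_size):
            if not line:  # skip empty lines
                continue
            yield orjson.loads(line)
            processed_entries += 1
            if processed_entries % 1000 == 0:
                logger.debug("Successfully processed %s entries.", processed_entries)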

PR Checklist

  • All necessary documentation has been adapted.
  • The implemented feature is covered by appropriate, automated tests.
  • Any manual testing that has been done is documented (i.e. what exactly was tested?)

🚀 Preview: https://use-orjson.loculus.org

@anna-parker anna-parker added the preview Triggers a deployment to argocd label Dec 19, 2025
@anna-parker anna-parker marked this pull request as ready for review December 19, 2025 08:41
@anna-parker anna-parker changed the title from "feat(deposition): optimize cronjob" to "feat(deposition): read data with orjson instead of json to increase cronjob performance and add better logging" Dec 19, 2025
claude bot (Contributor) commented Dec 19, 2025

Code Review - PR #5765: feat(deposition): optimize cronjob

Summary

This PR optimizes the ENA submission cronjob by replacing the standard json library with orjson for parsing NDJSON responses, and adds progress logging. Overall, this is a good performance optimization with a clean implementation.


✅ Strengths

  1. Performance Improvement: Using orjson for parsing is a well-known optimization - it's significantly faster than the standard library's json module, especially for large datasets.

  2. Correct Implementation: The change from iter_lines(decode_unicode=True) to iter_lines() (returning bytes) is the right approach when using orjson, which expects bytes input.

  3. Proper Empty Line Handling: The added check (if not line: continue) skips empty lines, which is good defensive programming.

  4. Progress Logging: The addition of logging every 1000 entries in get_ena_submission_list.py:89-93 is helpful for monitoring long-running operations.


🔍 Issues & Suggestions

1. Missing Dependency Version (Minor)

# ena-submission/environment.yml:36
- orjson

Issue: The orjson dependency lacks a version pin, while other dependencies in the file are pinned (e.g., orjsonl, deepdiff).

Recommendation: Pin to a specific version for reproducibility:

- orjson=3.10.12  # or latest stable version

According to CLAUDE.md: "Conda dependencies in environment.yml files are not automatically updated by dependabot." This makes version pinning even more important.


2. Inconsistent JSON Library Usage (Code Quality)

# ena-submission/src/ena_deposition/call_loculus.py:110
data = json.dumps(external_metadata)

Issue: The file still imports and uses the standard json library for serialization in submit_external_metadata(), while using orjson for deserialization in fetch_released_entries().

Impact: Minor - serialization happens less frequently than parsing, so performance impact is limited. However, it's inconsistent.

Recommendation: For consistency, consider replacing json.dumps() with orjson.dumps() and decode the result:

data = orjson.dumps(external_metadata).decode('utf-8')

If json.dumps() is kept, the import json at line 1 should remain (which it currently does).
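
A related option, if full consistency is adopted: orjson.dumps() returns bytes, and requests accepts a bytes request body directly, so the decode step can be skipped entirely. A sketch with a hypothetical endpoint and illustrative data:

import orjson
import requests

external_metadata = {"accession": "A1", "insdcAccessionBase": "XYZ0001"}
response = requests.post(
    "https://backend.example.org/submit-external-metadata",  # hypothetical URL
    data=orjson.dumps(external_metadata),  # bytes body, no .decode("utf-8") needed
    headers={"Content-Type": "application/json"},
)
response.raise_for_status()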


3. Missing Test Coverage (Testing)

Issue: No tests were added or modified to verify:

  • The orjson parsing works correctly with real NDJSON data
  • The progress logging functions as expected
  • Empty lines are handled correctly

Recommendation:

  • Add a test case in test_get_ena_submission_list.py that verifies the progress logging occurs
  • Consider adding a test that includes empty lines in the NDJSON response to verify they're skipped correctly
  • The existing test using fake_fetch_released_entries() may need verification that it still works with byte streams
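
For example, a self-contained test along these lines would cover the empty-line case (parse_ndjson_lines is a hypothetical stand-in for the actual parsing loop, not the repo's API):

import orjson

def parse_ndjson_lines(lines):
    """Parse an iterable of NDJSON byte lines, skipping empty ones."""
    return [orjson.loads(line) for line in lines if line]

def test_empty_lines_are_skipped():
    raw = [b'{"accession": "A1"}', b"", b'{"accession": "A2"}', b""]
    entries = parse_ndjson_lines(raw)
    assert [e["accession"] for e in entries] == ["A1", "A2"]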

4. Log Level Choice (Minor)

# ena-submission/scripts/get_ena_submission_list.py:92-93
if processed_entries % 1000 == 0:
    logger.debug(f"Successfully processed {processed_entries} entries.")

Issue: Using logger.debug() means this won't appear in production unless debug logging is enabled, which may reduce its usefulness for monitoring.

Recommendation: Consider using logger.info() instead, since this is operational information that's valuable for monitoring cronjob progress. The 1000-entry threshold already prevents spam.


🔒 Security

✅ No security concerns identified. The change is purely a performance optimization in parsing logic.


⚡ Performance

Expected improvement: orjson is typically 2-3x faster than the standard library for parsing JSON, with lower memory usage. The logs in the PR description bear this out: the first 1000 released entries took ~5.5 minutes to process before the change and ~25 seconds after, though per the PR description much of that gain comes from the larger chunk size rather than the parser alone.
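
A quick way to sanity-check the parser speedup locally (illustrative payload, not the project's data):

import json
import timeit

import orjson

payload = orjson.dumps({"accession": "A1", "metadata": {"values": list(range(100))}})

# json.loads() accepts UTF-8 bytes since Python 3.6, so both parse the same input.
json_time = timeit.timeit(lambda: json.loads(payload), number=100_000)
orjson_time = timeit.timeit(lambda: orjson.loads(payload), number=100_000)
print(f"json:   {json_time:.2f} s")
print(f"orjson: {orjson_time:.2f} s")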


📋 Checklist Status

From the PR description:

  • Documentation - Not applicable for internal optimization
  • Tests - Missing (see Issue 3 above)
  • Manual testing - ✅ Documented in PR description

🎯 Recommendations Summary

Must Fix:

  • Add version pin for orjson dependency

Should Fix:

  • Add test coverage for the changes
  • Consider changing log level from debug to info for progress logging

Nice to Have:

  • Use orjson.dumps() for consistency throughout the file

Final Assessment

This is a solid performance optimization with minimal risk. The code changes are correct and well-implemented. With the addition of dependency version pinning and test coverage, this would be ready to merge.

Recommendation: Approve with minor changes requested.

@anna-parker anna-parker changed the title from "feat(deposition): read data with orjson instead of json to increase cronjob performance and add better logging" to "feat(deposition): read data with orjsonl instead of json to increase cronjob performance and add better logging" Dec 19, 2025
@anna-parker anna-parker changed the title from "feat(deposition): read data with orjsonl instead of json to increase cronjob performance and add better logging" to "feat(deposition): read data with orjson instead of json to increase cronjob performance and add better logging" Dec 19, 2025
@anna-parker anna-parker changed the title from "feat(deposition): read data with orjson instead of json to increase cronjob performance and add better logging" to "feat(deposition): read data with orjson instead of json, add better logging, increase chunk size" Dec 19, 2025
@anna-parker anna-parker merged commit ba854a7 into main Dec 19, 2025
38 checks passed
@anna-parker anna-parker deleted the use_orjson branch December 19, 2025 14:07

Development

Successfully merging this pull request may close these issues.

ENA deposition cronjob repeatedly getting stuck