Conversation

@anna-parker (Contributor) commented Dec 19, 2025

partially resolves #5755


Previously, processing mpox took ~6 min per 1k sequences; with the increased chunk size this drops to ~30 s per 1k sequences.

State on staging before this change:

11:38:28    DEBUG (get_ena_submission_list.py:  93) - Successfully processed 26000 entries.
11:38:29    DEBUG (get_ena_submission_list.py:  93) - Successfully processed 27000 entries.
11:38:32    DEBUG (get_ena_submission_list.py:  93) - Successfully processed 28000 entries.
11:38:32     INFO (get_ena_submission_list.py: 223) - Getting released sequences for organism: mpox
11:38:32     INFO (get_ena_submission_list.py: 226) - Starting to stream released entries. Filtering for submission...
11:38:32    DEBUG (get_ena_submission_list.py:  85) - Querying ENA db for latest version of submissions
11:38:32    DEBUG (get_ena_submission_list.py:  89) - Starting processing of data from Loculus backend
11:38:32     INFO (     call_loculus.py: 160) - Fetching released data from https://backend-staging.pathoplexus.org/mpox/get-released-data with request id 5b18dbbc-e061-4140-8316-f6822d482b27
11:38:34    DEBUG (get_ena_submission_list.py:  93) - Successfully processed 0 entries.
11:44:03    DEBUG (get_ena_submission_list.py:  93) - Successfully processed 1000 entries.

With this change (chunk size 65536):

14:03:12    DEBUG (get_ena_submission_list.py:  93) - Successfully processed 26000 entries.
14:03:13    DEBUG (get_ena_submission_list.py:  93) - Successfully processed 27000 entries.
14:03:14    DEBUG (get_ena_submission_list.py:  93) - Successfully processed 28000 entries.
14:03:14     INFO (get_ena_submission_list.py: 223) - Getting released sequences for organism: mpox
14:03:14     INFO (get_ena_submission_list.py: 226) - Starting to stream released entries. Filtering for submission...
14:03:14    DEBUG (get_ena_submission_list.py:  85) - Querying ENA db for latest version of submissions
14:03:14    DEBUG (get_ena_submission_list.py:  89) - Starting processing of data from Loculus backend
14:03:14     INFO (     call_loculus.py: 160) - Fetching released data from https://backend-staging.pathoplexus.org/mpox/get-released-data with request id 551799c9-d794-411a-90cb-412b07fbe8ff
14:03:15    DEBUG (get_ena_submission_list.py:  93) - Successfully processed 0 entries.
14:03:40    DEBUG (get_ena_submission_list.py:  93) - Successfully processed 1000 entries.
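
For readers outside the codebase, here is a minimal sketch of the pattern described above (streaming NDJSON with requests, parsing lines with orjson, larger chunk size). The function name, timeout, and logging details are illustrative assumptions, not the repo's actual code:

import logging

import orjson
import requests

logger = logging.getLogger(__name__)

def stream_released_entries(url: str, chunk_size: int = 65536):
    """Stream an NDJSON response, parsing each line with orjson."""
    processed_entries = 0
    with requests.get(url, stream=True, timeout=600) as response:
        response.raise_for_status()
        # iter_lines() without decode_unicode=True yields bytes, which
        # orjson.loads() accepts directly; a larger chunk_size means
        # fewer socket reads per parsed line.
        for line in response.iter_lines(chunk_size=chunk_size):
            if not line:  # skip empty lines
                continue
            yield orjson.loads(line)
            processed_entries += 1
            if processed_entries % 1000 == 0:
                logger.debug("Successfully processed %s entries.", processed_entries)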

PR Checklist

  • All necessary documentation has been adapted.
  • The implemented feature is covered by appropriate, automated tests.
  • Any manual testing that has been done is documented (i.e. what exactly was tested?)

🚀 Preview: https://use-orjson.loculus.org

@anna-parker anna-parker added the preview Triggers a deployment to argocd label Dec 19, 2025
@anna-parker anna-parker marked this pull request as ready for review December 19, 2025 08:41
@anna-parker anna-parker changed the title from "feat(deposition): optimize cronjob" to "feat(deposition): read data with orjson instead of json to increase cronjob performance and add better logging" Dec 19, 2025
claude bot (Contributor) commented Dec 19, 2025

Code Review - PR #5765: feat(deposition): optimize cronjob

Summary

This PR optimizes the ENA submission cronjob by replacing the standard json library with orjson for parsing NDJSON responses, and adds progress logging. Overall, this is a good performance optimization with a clean implementation.


✅ Strengths

  1. Performance Improvement: Using orjson for parsing is a well-known optimization - it's significantly faster than the standard library's json module, especially for large datasets.

  2. Correct Implementation: The change from iter_lines(decode_unicode=True) to iter_lines() (returning bytes) is the right approach when using orjson, which expects bytes input.

  3. Proper Empty Line Handling: The added check (if not line: continue) skips empty lines, which is good defensive programming.

  4. Progress Logging: The addition of logging every 1000 entries in get_ena_submission_list.py:89-93 is helpful for monitoring long-running operations.


🔍 Issues & Suggestions

1. Missing Dependency Version (Minor)

# ena-submission/environment.yml:36
- orjson

Issue: The orjson dependency lacks a version pin, while other dependencies in the file are pinned (e.g., orjsonl, deepdiff).

Recommendation: Pin to a specific version for reproducibility:

- orjson=3.10.12  # or latest stable version

According to CLAUDE.md: "Conda dependencies in environment.yml files are not automatically updated by dependabot." This makes version pinning even more important.


2. Inconsistent JSON Library Usage (Code Quality)

# ena-submission/src/ena_deposition/call_loculus.py:110
data = json.dumps(external_metadata)

Issue: The file still imports and uses the standard json library for serialization in submit_external_metadata(), while using orjson for deserialization in fetch_released_entries().

Impact: Minor - serialization happens less frequently than parsing, so performance impact is limited. However, it's inconsistent.

Recommendation: For consistency, consider replacing json.dumps() with orjson.dumps() and decode the result:

data = orjson.dumps(external_metadata).decode('utf-8')

If json.dumps() is kept, the import json at line 1 should remain (which it currently does).
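
A related option, if full consistency is adopted: orjson.dumps() returns bytes, and requests accepts a bytes request body directly, so the decode step can be skipped entirely. A sketch with a hypothetical endpoint and illustrative data:

import orjson
import requests

external_metadata = {"accession": "A1", "insdcAccessionBase": "XYZ0001"}
response = requests.post(
    "https://backend.example.org/submit-external-metadata",  # hypothetical URL
    data=orjson.dumps(external_metadata),  # bytes body, no .decode("utf-8") needed
    headers={"Content-Type": "application/json"},
)
response.raise_for_status()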


3. Missing Test Coverage (Testing)

Issue: No tests were added or modified to verify:

  • The orjson parsing works correctly with real NDJSON data
  • The progress logging functions as expected
  • Empty lines are handled correctly

Recommendation:

  • Add a test case in test_get_ena_submission_list.py that verifies the progress logging occurs
  • Consider adding a test that includes empty lines in the NDJSON response to verify they're skipped correctly
  • The existing test using fake_fetch_released_entries() may need verification that it still works with byte streams
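
For example, a self-contained test along these lines would cover the empty-line case (parse_ndjson_lines is a hypothetical stand-in for the actual parsing loop, not the repo's API):

import orjson

def parse_ndjson_lines(lines):
    """Parse an iterable of NDJSON byte lines, skipping empty ones."""
    return [orjson.loads(line) for line in lines if line]

def test_empty_lines_are_skipped():
    raw = [b'{"accession": "A1"}', b"", b'{"accession": "A2"}', b""]
    entries = parse_ndjson_lines(raw)
    assert [e["accession"] for e in entries] == ["A1", "A2"]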

4. Log Level Choice (Minor)

# ena-submission/scripts/get_ena_submission_list.py:92-93
if processed_entries % 1000 == 0:
    logger.debug(f"Successfully processed {processed_entries} entries.")

Issue: Using logger.debug() means this won't appear in production unless debug logging is enabled, which may reduce its usefulness for monitoring.

Recommendation: Consider using logger.info() instead, since this is operational information that's valuable for monitoring cronjob progress. The 1000-entry threshold already prevents spam.


🔒 Security

✅ No security concerns identified. The change is purely a performance optimization in parsing logic.


⚡ Performance

Expected improvement: orjson is typically 2-3x faster than the standard library for parsing JSON, with lower memory usage. The logs in the PR description bear this out: the first 1000 released entries took ~5.5 minutes to process before the change and ~25 seconds after, though per the PR description much of that gain comes from the larger chunk size rather than the parser alone.
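
A quick way to sanity-check the parser speedup locally (illustrative payload, not the project's data):

import json
import timeit

import orjson

payload = orjson.dumps({"accession": "A1", "metadata": {"values": list(range(100))}})

# json.loads() accepts UTF-8 bytes since Python 3.6, so both parse the same input.
json_time = timeit.timeit(lambda: json.loads(payload), number=100_000)
orjson_time = timeit.timeit(lambda: orjson.loads(payload), number=100_000)
print(f"json:   {json_time:.2f} s")
print(f"orjson: {orjson_time:.2f} s")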


📋 Checklist Status

From the PR description:

  • Documentation - Not applicable for internal optimization
  • Tests - Missing (see Issue 3 above)
  • Manual testing - ✅ Documented in PR description

🎯 Recommendations Summary

Must Fix:

  • Add version pin for orjson dependency

Should Fix:

  • Add test coverage for the changes
  • Consider changing log level from debug to info for progress logging

Nice to Have:

  • Use orjson.dumps() for consistency throughout the file

Final Assessment

This is a solid performance optimization with minimal risk. The code changes are correct and well-implemented. With the addition of dependency version pinning and test coverage, this would be ready to merge.

Recommendation: Approve with minor changes requested.

@anna-parker anna-parker changed the title from "feat(deposition): read data with orjson instead of json to increase cronjob performance and add better logging" to "feat(deposition): read data with orjsonl instead of json to increase cronjob performance and add better logging" Dec 19, 2025
@anna-parker anna-parker changed the title from "feat(deposition): read data with orjsonl instead of json to increase cronjob performance and add better logging" to "feat(deposition): read data with orjson instead of json to increase cronjob performance and add better logging" Dec 19, 2025
@anna-parker anna-parker changed the title from "feat(deposition): read data with orjson instead of json to increase cronjob performance and add better logging" to "feat(deposition): read data with orjson instead of json, add better logging, increase chunk size" Dec 19, 2025
@anna-parker anna-parker merged commit ba854a7 into main Dec 19, 2025
38 checks passed
@anna-parker anna-parker deleted the use_orjson branch December 19, 2025 14:07

Development

Successfully merging this pull request may close these issues.

ENA deposition cronjob repeatedly getting stuck