-
Notifications
You must be signed in to change notification settings - Fork 9
feat(deposition): read data with orjson instead of json, add better logging, increase chunk size #5765
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Code Review - PR #5765: feat(deposition): optimize cronjobSummaryThis PR optimizes the ENA submission cronjob by replacing the standard ✅ Strengths
🔍 Issues & Suggestions1. Missing Dependency Version (Minor)# ena-submission/environment.yml:36
- orjsonIssue: The Recommendation: Pin to a specific version for reproducibility: - orjson=3.10.12 # or latest stable versionAccording to 2. Inconsistent JSON Library Usage (Code Quality)# ena-submission/src/ena_deposition/call_loculus.py:110
data = json.dumps(external_metadata)Issue: The file still imports and uses the standard Impact: Minor - serialization happens less frequently than parsing, so performance impact is limited. However, it's inconsistent. Recommendation: For consistency, consider replacing data = orjson.dumps(external_metadata).decode('utf-8')If 3. Missing Test Coverage (Testing)Issue: No tests were added or modified to verify:
Recommendation:
4. Log Level Choice (Minor)# ena-submission/scripts/get_ena_submission_list.py:92-93
if processed_entries % 1000 == 0:
logger.debug(f"Successfully processed {processed_entries} entries.")Issue: Using Recommendation: Consider using 🔒 Security✅ No security concerns identified. The change is purely a performance optimization in parsing logic. ⚡ Performance✅ Expected improvement: 📋 Checklist StatusFrom the PR description:
🎯 Recommendations SummaryMust Fix:
Should Fix:
Nice to Have:
Final AssessmentThis is a solid performance optimization with minimal risk. The code changes are correct and well-implemented. With the addition of dependency version pinning and test coverage, this would be ready to merge. Recommendation: Approve with minor changes requested. |
partially resolves #5755
Screenshot
Previous time for processing mpox was ~6min per 1k sequences, now with increased chunk size it goes down to ~30sec per 1k sequences.
Present state on staging
Changes: with chunk size 65536:
PR Checklist
🚀 Preview: https://use-orjson.loculus.org