-
Notifications
You must be signed in to change notification settings - Fork 1
Description
Summary
A thorough code review of the repository uncovered several correctness bugs, security concerns, duplicated/dead code with divergent data, and significant test coverage gaps. Findings are grouped by severity below.
Critical
1. Orbitrap Exploris 240 accession conflict
ms_analyzer.py:104 maps Exploris 240 to MS:1003094, but sdrf/ontology.py:19 maps it to MS:1003096. The PSI-MS ontology lists MS:1003094 as "Orbitrap Exploris 240" — ontology.py has the wrong accession.
Any SDRF touched by ontology.py for Exploris 240 datasets would get the wrong CV term.
Fix: Change ontology.py:19 to "Orbitrap Exploris 240": "MS:1003094".
2. Path traversal allows arbitrary file write/delete
downloader_base.py:66, pride/downloader.py:98, massive/downloader.py:145, utils/temp_manager.py:56, core/refiner.py:226
SDRF comment[data file] values are used directly as filenames for download and cleanup. A value like ../../etc/important could escape download_dir and then be deleted during cleanup.
Fix: Sanitize to basenames (or strict relative subpaths), reject absolute/parent traversal paths, and enforce resolved_path.is_relative_to(download_dir.resolve()) before any write or delete.
3. _detect_fragmentation_cv uses str(spectrum) — unreliable
ms_analyzer.py:696-705 converts the entire pyteomics spectrum dict to a string and checks for substring "HCD" or "CID". This is fragile because:
"CID"could appear in base64-encoded peak data or file paths."collision-induced dissociation"matches both CID and HCD (HCD is "beam-type collision-induced dissociation").
This is the non-Thermo fallback path (Bruker, SCIEX, Waters data).
Fix: Parse the activation/dissociation CV terms from the pyteomics spectrum dict properly instead of string matching.
High
4. Analyzer failures are swallowed and counted as successful
ms_analyzer.py:267, core/refiner.py:202-203
MSAnalyzer.analyze() catches all exceptions and returns a result with error in raw_stats, but refiner.py still appends it to analysis_results and increments files_processed. Corrupted or empty detection results silently influence refinements.
Fix: Return an explicit success flag from analyze() (or raise on failure), and only append results / increment counters for successful analyses.
5. Collision energy comparison treats NCE and eV as equivalent
sdrf/comparer.py:551-563
The comparer considers "30 NCE" and "30 eV" a match if the numeric part is the same. NCE (Normalized Collision Energy) and eV (electron-volts) are different physical quantities — conflating them hides real metadata mismatches.
Fix: Only auto-match when units also match. If units differ, mark as review/mismatch.
6. Dead sdrf/ontology.py module with divergent data
ontology.py defines INSTRUMENT_TERMS, DISSOCIATION_TERMS, ACQUISITION_TERMS and lookup functions (get_instrument_accession, get_dissociation_accession, get_acquisition_accession). None of these functions are imported or called anywhere in the codebase. Meanwhile, config.py and ms_analyzer.py define overlapping ontology data independently.
This duplication caused the Exploris 240 accession bug above.
Fix: Delete sdrf/ontology.py or consolidate all ontology data into config.py.
7. Dead PRIDE_API_BASE in config.py (v2 vs v3)
# config.py — never imported
PRIDE_API_BASE = "https://www.ebi.ac.uk/pride/ws/archive/v2"
# pride/downloader.py — actually used
PRIDE_API_BASE = "https://www.ebi.ac.uk/pride/ws/archive/v3"The config.py constant references the deprecated v2 API and is unused.
Fix: Remove PRIDE_API_BASE from config.py.
8. No tests for PTM detector or MS analyzer (~1,700 LOC)
analyzer/ptm_detector.py (712 LOC) and analyzer/ms_analyzer.py (1055 LOC) have zero unit tests. These are the most complex modules. The recent commit adding fixed/variable modification type classification has no accompanying tests.
9. pyopenms not declared in project dependencies
README lists pyopenms as required. ptm_detector.py and ms_analyzer.py import it. But it is absent from pyproject.toml [project.dependencies]. Users who pip install techsdrf get silent feature degradation (PTM detection skipped, header parsing fails).
Fix: Add pyopenms to dependencies (or as an optional extra with explicit feature-gating documentation).
Medium
10. mass_accuracy="??" generates invalid stats keys
ms_analyzer.py:635-684
In _parse_filter_string, if the detector prefix is unrecognized, mass_accuracy stays as "??". This flows into _update_fragmentation_stats which constructs keys like n_ms2_??_HCD_spectra — keys that were never initialized — causing a KeyError.
Fix: Guard against unknown accuracy before counter updates; skip or map to a fallback.
11. Duplicated _param_to_column mapping in two files
sdrf/writer.py:200-216 and report/generator.py:468-482 define the same parameter-to-column mapping independently. The writer version supports scan_window_lower, scan_window_upper, and isolation_window_width; the report version does not.
Fix: Extract a single shared mapping to config.py (alongside SDRF_COLUMNS) and import in both places.
12. Blind binary replacement in CruxEstimator._prepare_mzml
tolerance_estimator.py:570:
outfile.write(line.replace(b"1003145", b"1000615"))Replaces every occurrence of 1003145 in the mzML, including base64 spectrum data or offset fields if the byte sequence happens to appear.
Fix: Restrict replacement to lines containing accession= or cvRef=.
13. timeout=None on PRIDE HTTP download
pride/downloader.py:64:
with session.get(http_url, stream=True, timeout=None) as r:No timeout means the download hangs indefinitely if the server stops responding mid-transfer.
Fix: timeout=(30, None) — 30s connect timeout, unlimited read timeout.
14. PRIDE metadata fetched per-file instead of cached
pride/downloader.py:41, 108
_fetch_file_metadata calls the full file-list API for every single file download instead of caching the result. For multi-file datasets this is avoidable latency and API load.
Fix: Fetch metadata once per accession and cache the filename→metadata map.
15. mzML file loaded twice (pyopenms + pyteomics) + again for PTM detection
ms_analyzer.py loads each mzML via oms.MzMLFile().load() for header parsing, then again via pyteomics.mzml.read() for spectrum iteration. The PTM detector also fully loads the file a third time via pyopenms. For large mzML files this triples I/O and memory usage.
Fix: Use on-disk iterators (OnDiscMSExperiment/streaming) and share parsed data across analyzers.
16. Unpinned git dependency
pyproject.toml:33:
"runassessor @ git+https://github.com/RunAssessor/RunAssessor.git"
Pins to HEAD of main. A breaking change upstream silently breaks installs.
Fix: Pin to a specific tag or commit hash.
17. No CI/CD pipeline
No GitHub Actions, no pre-commit hooks. For a tool that modifies scientific metadata, automated linting and testing would prevent regressions.
Fix: Add a .github/workflows/test.yml with pytest, black --check, isort --check, and mypy.
18. Integration tests use wrong constructor kwarg
test_integration_pxd010364.py passes pride_accession=... but SDRFRefiner.__init__ expects accession=. These tests would fail if actually executed.
Fix: Update integration tests to use accession=.
19. Download integrity not verified
PRIDE downloads disable checksum verification (checksum_check=False). HTTP downloads only check "file exists." For MassIVE, there's no integrity check beyond "file is non-empty." A corrupted download silently produces wrong analysis results.
Fix: Enable checksum validation and verify content-length/hash for HTTP downloads; fail and retry on mismatch.
Low / Minor
20. _read_spectra_basic file handle not in with block
ms_analyzer.py:598-616 opens a file outside the try block. Use a with statement.
21. CE unit guessing based on numeric range
ms_analyzer.py:539: unit = "NCE" if 15 <= v <= 45 else "eV" — 30 eV is valid for QTOF, 30 NCE is valid for Orbitrap. Should use instrument category for disambiguation.
22. Hardcoded HR/LR in fallback fragmentation detection
Both _read_spectra_basic and _detect_fragmentation_cv hardcode HR_HCD for HCD and LR_IT_CID for CID without knowing the actual mass analyzer resolution. Incorrect for instruments like timsTOF where CID is high-resolution.
23. Bug in additional/opt-params/opt_params.py:264
fragment_max_mz is assigned from fragment_min_mz — likely a copy-paste error that narrows the search range incorrectly.
Fix: Assign from the fragment_max_mz key.
Test coverage summary
| Module | LOC | Tests |
|---|---|---|
analyzer/ptm_detector.py |
712 | None |
analyzer/ms_analyzer.py |
1055 | None |
converter/converter.py |
222 | None |
report/generator.py |
682 | None |
pride/downloader.py |
168 | Filename resolution only |
massive/downloader.py |
199 | None |
cli.py |
312 | None |
sdrf/reader.py |
320 | Good |
sdrf/writer.py |
287 | Good |
sdrf/comparer.py |
733 | Good |
report/confidence.py |
170 | Good |
tolerance_estimator.py |
821 | Good (mocked) |