Skip to content

Code review: correctness bugs, data inconsistencies, and test gaps #2

@timosachsenberg

Description

@timosachsenberg

Summary

A thorough code review of the repository uncovered several correctness bugs, security concerns, duplicated/dead code with divergent data, and significant test coverage gaps. Findings are grouped by severity below.


Critical

1. Orbitrap Exploris 240 accession conflict

ms_analyzer.py:104 maps Exploris 240 to MS:1003094, but sdrf/ontology.py:19 maps it to MS:1003096. The PSI-MS ontology lists MS:1003094 as "Orbitrap Exploris 240" — ontology.py has the wrong accession.

Any SDRF touched by ontology.py for Exploris 240 datasets would get the wrong CV term.

Fix: Change ontology.py:19 to "Orbitrap Exploris 240": "MS:1003094".

2. Path traversal allows arbitrary file write/delete

downloader_base.py:66, pride/downloader.py:98, massive/downloader.py:145, utils/temp_manager.py:56, core/refiner.py:226

SDRF comment[data file] values are used directly as filenames for download and cleanup. A value like ../../etc/important could escape download_dir and then be deleted during cleanup.

Fix: Sanitize to basenames (or strict relative subpaths), reject absolute/parent traversal paths, and enforce resolved_path.is_relative_to(download_dir.resolve()) before any write or delete.

3. _detect_fragmentation_cv uses str(spectrum) — unreliable

ms_analyzer.py:696-705 converts the entire pyteomics spectrum dict to a string and checks for substring "HCD" or "CID". This is fragile because:

  • "CID" could appear in base64-encoded peak data or file paths.
  • "collision-induced dissociation" matches both CID and HCD (HCD is "beam-type collision-induced dissociation").

This is the non-Thermo fallback path (Bruker, SCIEX, Waters data).

Fix: Parse the activation/dissociation CV terms from the pyteomics spectrum dict properly instead of string matching.


High

4. Analyzer failures are swallowed and counted as successful

ms_analyzer.py:267, core/refiner.py:202-203

MSAnalyzer.analyze() catches all exceptions and returns a result with error in raw_stats, but refiner.py still appends it to analysis_results and increments files_processed. Corrupted or empty detection results silently influence refinements.

Fix: Return an explicit success flag from analyze() (or raise on failure), and only append results / increment counters for successful analyses.

5. Collision energy comparison treats NCE and eV as equivalent

sdrf/comparer.py:551-563

The comparer considers "30 NCE" and "30 eV" a match if the numeric part is the same. NCE (Normalized Collision Energy) and eV (electron-volts) are different physical quantities — conflating them hides real metadata mismatches.

Fix: Only auto-match when units also match. If units differ, mark as review/mismatch.

6. Dead sdrf/ontology.py module with divergent data

ontology.py defines INSTRUMENT_TERMS, DISSOCIATION_TERMS, ACQUISITION_TERMS and lookup functions (get_instrument_accession, get_dissociation_accession, get_acquisition_accession). None of these functions are imported or called anywhere in the codebase. Meanwhile, config.py and ms_analyzer.py define overlapping ontology data independently.

This duplication caused the Exploris 240 accession bug above.

Fix: Delete sdrf/ontology.py or consolidate all ontology data into config.py.

7. Dead PRIDE_API_BASE in config.py (v2 vs v3)

# config.py — never imported
PRIDE_API_BASE = "https://www.ebi.ac.uk/pride/ws/archive/v2"

# pride/downloader.py — actually used
PRIDE_API_BASE = "https://www.ebi.ac.uk/pride/ws/archive/v3"

The config.py constant references the deprecated v2 API and is unused.

Fix: Remove PRIDE_API_BASE from config.py.

8. No tests for PTM detector or MS analyzer (~1,700 LOC)

analyzer/ptm_detector.py (712 LOC) and analyzer/ms_analyzer.py (1055 LOC) have zero unit tests. These are the most complex modules. The recent commit adding fixed/variable modification type classification has no accompanying tests.

9. pyopenms not declared in project dependencies

README lists pyopenms as required. ptm_detector.py and ms_analyzer.py import it. But it is absent from pyproject.toml [project.dependencies]. Users who pip install techsdrf get silent feature degradation (PTM detection skipped, header parsing fails).

Fix: Add pyopenms to dependencies (or as an optional extra with explicit feature-gating documentation).


Medium

10. mass_accuracy="??" generates invalid stats keys

ms_analyzer.py:635-684

In _parse_filter_string, if the detector prefix is unrecognized, mass_accuracy stays as "??". This flows into _update_fragmentation_stats which constructs keys like n_ms2_??_HCD_spectra — keys that were never initialized — causing a KeyError.

Fix: Guard against unknown accuracy before counter updates; skip or map to a fallback.

11. Duplicated _param_to_column mapping in two files

sdrf/writer.py:200-216 and report/generator.py:468-482 define the same parameter-to-column mapping independently. The writer version supports scan_window_lower, scan_window_upper, and isolation_window_width; the report version does not.

Fix: Extract a single shared mapping to config.py (alongside SDRF_COLUMNS) and import in both places.

12. Blind binary replacement in CruxEstimator._prepare_mzml

tolerance_estimator.py:570:

outfile.write(line.replace(b"1003145", b"1000615"))

Replaces every occurrence of 1003145 in the mzML, including base64 spectrum data or offset fields if the byte sequence happens to appear.

Fix: Restrict replacement to lines containing accession= or cvRef=.

13. timeout=None on PRIDE HTTP download

pride/downloader.py:64:

with session.get(http_url, stream=True, timeout=None) as r:

No timeout means the download hangs indefinitely if the server stops responding mid-transfer.

Fix: timeout=(30, None) — 30s connect timeout, unlimited read timeout.

14. PRIDE metadata fetched per-file instead of cached

pride/downloader.py:41, 108

_fetch_file_metadata calls the full file-list API for every single file download instead of caching the result. For multi-file datasets this is avoidable latency and API load.

Fix: Fetch metadata once per accession and cache the filename→metadata map.

15. mzML file loaded twice (pyopenms + pyteomics) + again for PTM detection

ms_analyzer.py loads each mzML via oms.MzMLFile().load() for header parsing, then again via pyteomics.mzml.read() for spectrum iteration. The PTM detector also fully loads the file a third time via pyopenms. For large mzML files this triples I/O and memory usage.

Fix: Use on-disk iterators (OnDiscMSExperiment/streaming) and share parsed data across analyzers.

16. Unpinned git dependency

pyproject.toml:33:

"runassessor @ git+https://github.com/RunAssessor/RunAssessor.git"

Pins to HEAD of main. A breaking change upstream silently breaks installs.

Fix: Pin to a specific tag or commit hash.

17. No CI/CD pipeline

No GitHub Actions, no pre-commit hooks. For a tool that modifies scientific metadata, automated linting and testing would prevent regressions.

Fix: Add a .github/workflows/test.yml with pytest, black --check, isort --check, and mypy.

18. Integration tests use wrong constructor kwarg

test_integration_pxd010364.py passes pride_accession=... but SDRFRefiner.__init__ expects accession=. These tests would fail if actually executed.

Fix: Update integration tests to use accession=.

19. Download integrity not verified

PRIDE downloads disable checksum verification (checksum_check=False). HTTP downloads only check "file exists." For MassIVE, there's no integrity check beyond "file is non-empty." A corrupted download silently produces wrong analysis results.

Fix: Enable checksum validation and verify content-length/hash for HTTP downloads; fail and retry on mismatch.


Low / Minor

20. _read_spectra_basic file handle not in with block

ms_analyzer.py:598-616 opens a file outside the try block. Use a with statement.

21. CE unit guessing based on numeric range

ms_analyzer.py:539: unit = "NCE" if 15 <= v <= 45 else "eV" — 30 eV is valid for QTOF, 30 NCE is valid for Orbitrap. Should use instrument category for disambiguation.

22. Hardcoded HR/LR in fallback fragmentation detection

Both _read_spectra_basic and _detect_fragmentation_cv hardcode HR_HCD for HCD and LR_IT_CID for CID without knowing the actual mass analyzer resolution. Incorrect for instruments like timsTOF where CID is high-resolution.

23. Bug in additional/opt-params/opt_params.py:264

fragment_max_mz is assigned from fragment_min_mz — likely a copy-paste error that narrows the search range incorrectly.

Fix: Assign from the fragment_max_mz key.


Test coverage summary

Module LOC Tests
analyzer/ptm_detector.py 712 None
analyzer/ms_analyzer.py 1055 None
converter/converter.py 222 None
report/generator.py 682 None
pride/downloader.py 168 Filename resolution only
massive/downloader.py 199 None
cli.py 312 None
sdrf/reader.py 320 Good
sdrf/writer.py 287 Good
sdrf/comparer.py 733 Good
report/confidence.py 170 Good
tolerance_estimator.py 821 Good (mocked)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions