Code review: correctness bugs, data inconsistencies, and test gaps

## Summary

A thorough code review of the repository uncovered several correctness bugs, security concerns, duplicated/dead code with divergent data, and significant test coverage gaps. Findings are grouped by severity below.

---

## Critical

### 1. Orbitrap Exploris 240 accession conflict

`ms_analyzer.py:104` maps Exploris 240 to `MS:1003094`, but `sdrf/ontology.py:19` maps it to `MS:1003096`. The PSI-MS ontology lists `MS:1003094` as "Orbitrap Exploris 240" — **`ontology.py` has the wrong accession**.

Any SDRF touched by ontology.py for Exploris 240 datasets would get the wrong CV term.

**Fix:** Change `ontology.py:19` to `"Orbitrap Exploris 240": "MS:1003094"`.

### 2. Path traversal allows arbitrary file write/delete

`downloader_base.py:66`, `pride/downloader.py:98`, `massive/downloader.py:145`, `utils/temp_manager.py:56`, `core/refiner.py:226`

SDRF `comment[data file]` values are used directly as filenames for download and cleanup. A value like `../../etc/important` could escape `download_dir` and then be deleted during cleanup.

**Fix:** Sanitize to basenames (or strict relative subpaths), reject absolute/parent traversal paths, and enforce `resolved_path.is_relative_to(download_dir.resolve())` before any write or delete.

### 3. `_detect_fragmentation_cv` uses `str(spectrum)` — unreliable

`ms_analyzer.py:696-705` converts the entire pyteomics spectrum dict to a string and checks for substring `"HCD"` or `"CID"`. This is fragile because:
- `"CID"` could appear in base64-encoded peak data or file paths.
- `"collision-induced dissociation"` matches **both** CID and HCD (HCD is "beam-type collision-induced dissociation").

This is the non-Thermo fallback path (Bruker, SCIEX, Waters data).

**Fix:** Parse the activation/dissociation CV terms from the pyteomics spectrum dict properly instead of string matching.

---

## High

### 4. Analyzer failures are swallowed and counted as successful

`ms_analyzer.py:267`, `core/refiner.py:202-203`

`MSAnalyzer.analyze()` catches all exceptions and returns a result with `error` in `raw_stats`, but `refiner.py` still appends it to `analysis_results` and increments `files_processed`. Corrupted or empty detection results silently influence refinements.

**Fix:** Return an explicit success flag from `analyze()` (or raise on failure), and only append results / increment counters for successful analyses.

### 5. Collision energy comparison treats NCE and eV as equivalent

`sdrf/comparer.py:551-563`

The comparer considers "30 NCE" and "30 eV" a match if the numeric part is the same. NCE (Normalized Collision Energy) and eV (electron-volts) are different physical quantities — conflating them hides real metadata mismatches.

**Fix:** Only auto-match when units also match. If units differ, mark as review/mismatch.

### 6. Dead `sdrf/ontology.py` module with divergent data

`ontology.py` defines `INSTRUMENT_TERMS`, `DISSOCIATION_TERMS`, `ACQUISITION_TERMS` and lookup functions (`get_instrument_accession`, `get_dissociation_accession`, `get_acquisition_accession`). **None of these functions are imported or called anywhere in the codebase.** Meanwhile, `config.py` and `ms_analyzer.py` define overlapping ontology data independently.

This duplication caused the Exploris 240 accession bug above.

**Fix:** Delete `sdrf/ontology.py` or consolidate all ontology data into `config.py`.

### 7. Dead `PRIDE_API_BASE` in `config.py` (v2 vs v3)

```python
# config.py — never imported
PRIDE_API_BASE = "https://www.ebi.ac.uk/pride/ws/archive/v2"

# pride/downloader.py — actually used
PRIDE_API_BASE = "https://www.ebi.ac.uk/pride/ws/archive/v3"
```

The `config.py` constant references the deprecated v2 API and is unused.

**Fix:** Remove `PRIDE_API_BASE` from `config.py`.

### 8. No tests for PTM detector or MS analyzer (~1,700 LOC)

`analyzer/ptm_detector.py` (712 LOC) and `analyzer/ms_analyzer.py` (1055 LOC) have **zero unit tests**. These are the most complex modules. The recent commit adding fixed/variable modification type classification has no accompanying tests.

### 9. `pyopenms` not declared in project dependencies

README lists pyopenms as required. `ptm_detector.py` and `ms_analyzer.py` import it. But it is absent from `pyproject.toml` `[project.dependencies]`. Users who `pip install techsdrf` get silent feature degradation (PTM detection skipped, header parsing fails).

**Fix:** Add `pyopenms` to dependencies (or as an optional extra with explicit feature-gating documentation).

---

## Medium

### 10. `mass_accuracy="??"` generates invalid stats keys

`ms_analyzer.py:635-684`

In `_parse_filter_string`, if the detector prefix is unrecognized, `mass_accuracy` stays as `"??"`. This flows into `_update_fragmentation_stats` which constructs keys like `n_ms2_??_HCD_spectra` — keys that were never initialized — causing a `KeyError`.

**Fix:** Guard against unknown accuracy before counter updates; skip or map to a fallback.

### 11. Duplicated `_param_to_column` mapping in two files

`sdrf/writer.py:200-216` and `report/generator.py:468-482` define the same parameter-to-column mapping independently. The writer version supports `scan_window_lower`, `scan_window_upper`, and `isolation_window_width`; the report version does not.

**Fix:** Extract a single shared mapping to `config.py` (alongside `SDRF_COLUMNS`) and import in both places.

### 12. Blind binary replacement in `CruxEstimator._prepare_mzml`

`tolerance_estimator.py:570`:
```python
outfile.write(line.replace(b"1003145", b"1000615"))
```

Replaces *every* occurrence of `1003145` in the mzML, including base64 spectrum data or offset fields if the byte sequence happens to appear.

**Fix:** Restrict replacement to lines containing `accession=` or `cvRef=`.

### 13. `timeout=None` on PRIDE HTTP download

`pride/downloader.py:64`:
```python
with session.get(http_url, stream=True, timeout=None) as r:
```

No timeout means the download hangs indefinitely if the server stops responding mid-transfer.

**Fix:** `timeout=(30, None)` — 30s connect timeout, unlimited read timeout.

### 14. PRIDE metadata fetched per-file instead of cached

`pride/downloader.py:41, 108`

`_fetch_file_metadata` calls the full file-list API for every single file download instead of caching the result. For multi-file datasets this is avoidable latency and API load.

**Fix:** Fetch metadata once per accession and cache the filename→metadata map.

### 15. mzML file loaded twice (pyopenms + pyteomics) + again for PTM detection

`ms_analyzer.py` loads each mzML via `oms.MzMLFile().load()` for header parsing, then again via `pyteomics.mzml.read()` for spectrum iteration. The PTM detector also fully loads the file a third time via pyopenms. For large mzML files this triples I/O and memory usage.

**Fix:** Use on-disk iterators (`OnDiscMSExperiment`/streaming) and share parsed data across analyzers.

### 16. Unpinned git dependency

`pyproject.toml:33`:
```
"runassessor @ git+https://github.com/RunAssessor/RunAssessor.git"
```

Pins to HEAD of main. A breaking change upstream silently breaks installs.

**Fix:** Pin to a specific tag or commit hash.

### 17. No CI/CD pipeline

No GitHub Actions, no pre-commit hooks. For a tool that modifies scientific metadata, automated linting and testing would prevent regressions.

**Fix:** Add a `.github/workflows/test.yml` with pytest, black --check, isort --check, and mypy.

### 18. Integration tests use wrong constructor kwarg

`test_integration_pxd010364.py` passes `pride_accession=...` but `SDRFRefiner.__init__` expects `accession=`. These tests would fail if actually executed.

**Fix:** Update integration tests to use `accession=`.

### 19. Download integrity not verified

PRIDE downloads disable checksum verification (`checksum_check=False`). HTTP downloads only check "file exists." For MassIVE, there's no integrity check beyond "file is non-empty." A corrupted download silently produces wrong analysis results.

**Fix:** Enable checksum validation and verify content-length/hash for HTTP downloads; fail and retry on mismatch.

---

## Low / Minor

### 20. `_read_spectra_basic` file handle not in `with` block

`ms_analyzer.py:598-616` opens a file outside the `try` block. Use a `with` statement.

### 21. CE unit guessing based on numeric range

`ms_analyzer.py:539`: `unit = "NCE" if 15 <= v <= 45 else "eV"` — 30 eV is valid for QTOF, 30 NCE is valid for Orbitrap. Should use instrument category for disambiguation.

### 22. Hardcoded HR/LR in fallback fragmentation detection

Both `_read_spectra_basic` and `_detect_fragmentation_cv` hardcode `HR_HCD` for HCD and `LR_IT_CID` for CID without knowing the actual mass analyzer resolution. Incorrect for instruments like timsTOF where CID is high-resolution.

### 23. Bug in `additional/opt-params/opt_params.py:264`

`fragment_max_mz` is assigned from `fragment_min_mz` — likely a copy-paste error that narrows the search range incorrectly.

**Fix:** Assign from the `fragment_max_mz` key.

---

## Test coverage summary

| Module | LOC | Tests |
|--------|-----|-------|
| `analyzer/ptm_detector.py` | 712 | **None** |
| `analyzer/ms_analyzer.py` | 1055 | **None** |
| `converter/converter.py` | 222 | **None** |
| `report/generator.py` | 682 | **None** |
| `pride/downloader.py` | 168 | Filename resolution only |
| `massive/downloader.py` | 199 | **None** |
| `cli.py` | 312 | **None** |
| `sdrf/reader.py` | 320 | Good |
| `sdrf/writer.py` | 287 | Good |
| `sdrf/comparer.py` | 733 | Good |
| `report/confidence.py` | 170 | Good |
| `tolerance_estimator.py` | 821 | Good (mocked) |

Module	LOC	Tests
`analyzer/ptm_detector.py`	712	None
`analyzer/ms_analyzer.py`	1055	None
`converter/converter.py`	222	None
`report/generator.py`	682	None
`pride/downloader.py`	168	Filename resolution only
`massive/downloader.py`	199	None
`cli.py`	312	None
`sdrf/reader.py`	320	Good
`sdrf/writer.py`	287	Good
`sdrf/comparer.py`	733	Good
`report/confidence.py`	170	Good
`tolerance_estimator.py`	821	Good (mocked)

Code review: correctness bugs, data inconsistencies, and test gaps #2

Description

Summary

Critical

1. Orbitrap Exploris 240 accession conflict

2. Path traversal allows arbitrary file write/delete

3. _detect_fragmentation_cv uses str(spectrum) — unreliable

High

4. Analyzer failures are swallowed and counted as successful

5. Collision energy comparison treats NCE and eV as equivalent

6. Dead sdrf/ontology.py module with divergent data

7. Dead PRIDE_API_BASE in config.py (v2 vs v3)

8. No tests for PTM detector or MS analyzer (~1,700 LOC)

9. pyopenms not declared in project dependencies

Medium

10. mass_accuracy="??" generates invalid stats keys

11. Duplicated _param_to_column mapping in two files

12. Blind binary replacement in CruxEstimator._prepare_mzml

13. timeout=None on PRIDE HTTP download

14. PRIDE metadata fetched per-file instead of cached

15. mzML file loaded twice (pyopenms + pyteomics) + again for PTM detection

16. Unpinned git dependency

17. No CI/CD pipeline

18. Integration tests use wrong constructor kwarg

19. Download integrity not verified

Low / Minor

20. _read_spectra_basic file handle not in with block

21. CE unit guessing based on numeric range

22. Hardcoded HR/LR in fallback fragmentation detection

23. Bug in additional/opt-params/opt_params.py:264

Test coverage summary

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

3. `_detect_fragmentation_cv` uses `str(spectrum)` — unreliable

6. Dead `sdrf/ontology.py` module with divergent data

7. Dead `PRIDE_API_BASE` in `config.py` (v2 vs v3)

9. `pyopenms` not declared in project dependencies

10. `mass_accuracy="??"` generates invalid stats keys

11. Duplicated `_param_to_column` mapping in two files

12. Blind binary replacement in `CruxEstimator._prepare_mzml`

13. `timeout=None` on PRIDE HTTP download

20. `_read_spectra_basic` file handle not in `with` block

23. Bug in `additional/opt-params/opt_params.py:264`