fix(amplicon_covs): iterate amplicons in genomic order, not label order (#198) by werner291 · Pull Request #199 · RIVM-bioinformatics/ViroConstrictor

werner291 · 2026-05-02T15:25:59Z

Hello RIVM, I'm Werner. I'm trying to engage with the bioinformatics open source community and saw your open issue #198, so I figured I'd give it a shot.

Full disclosure: my background is in informatics, not biology. As I read the issue, though, it reduces to a data-wrangling bug, which felt approachable. Please correct me if I have misunderstood what the function is meant to do.

Closes #198.

The bug

_calculate_amplicon_start_end walks amplicons sorted by their numeric label (count) and uses idx - 1 / idx + 1 to find the previous and next amplicon for boundary computation. This implicitly assumes label order matches the order amplicons appear on the reference. The BED file in #198 violates that (a pair labeled 1 sits downstream of one labeled 2), the neighbour lookup returns the wrong pair, and the validator fires.

The fix

Iterate in genomic order (sorted by minimum primer start) directly, so idx - 1 and idx + 1 mean the right thing without further changes. The dataframe row order is set by the label-keyed df.loc[...] writes inside the loop, so it stays label-sorted regardless of iteration order, and the downstream join with _create_amplicon_names_list (also label-sorted) keeps aligning.

Behavioural changes

On BED files where label order matches genomic order: no change. Iteration order, computed boundaries, and output dataframe are identical.
On BED files where label order does not match genomic order (the Differently sorted primers in .bed file crashes amplicon cov calculation #198 case): the function now produces correct boundaries instead of raising ValueError.
On BED files with a mislabelled primer: iteration order is now derived from primer positions rather than from labels, so a mislabelled primer that shifts an amplicon's min(start) enough can also reshuffle which amplicons count as neighbours. Both the previous and the new code produce incorrect output on mislabelled inputs; the failure modes just differ.

Tests

Added test_calculate_amplicon_start_end_non_genomic_numbering: a 2-amplicon BED with labels reversed relative to genomic position. On main it raises with the exact ValueError from the issue. With the fix it passes.
Added test_calculate_amplicon_start_end_three_amplicons_scrambled: a 3-amplicon BED where the middle amplicon exercises both previous- and next-neighbour lookups in genomic order.
The existing 11 tests in test_amplicon_covs.py still pass, including the normal-order case.
I was not able to run the e2e suite locally (data download issues setting up the environment). Validation against the full pipeline relies on your CI.

Things I noticed in passing

Pre-existing characteristics of the function, neither introduced nor changed by this PR:

The function ignores the BED's chromosome column; correctness for multi-contig inputs relies on filter_bed_input.py filtering per RefID upstream.
Ties on minimum primer start fall through to pandas' stable sort.

AI assistance disclosure

Prepared with help from an LLM (Claude). I picked the issue, walked the code, ran the tests, and am accountable for the result.

_calculate_amplicon_start_end walked amplicons sorted by their numeric label (count) and used idx-1 / idx+1 to find the previous and next amplicon for boundary computation, implicitly assuming label order matches the order amplicons appear on the reference. On BED files where that does not hold, the neighbour lookup returned the wrong pair, the boundary positions crossed over, and the validator raised a ValueError. Iterate in genomic order (sorted by minimum primer start) so that neighbour lookups refer to the correct amplicons. The dataframe row order is set by label-keyed df.loc writes inside the loop, so it stays label-sorted regardless of iteration order, preserving alignment with _create_amplicon_names_list downstream. Adds two regression tests: a 2-amplicon BED with reversed labels (the exact RIVM-bioinformatics#198 scenario) and a 3-amplicon scrambled BED that exercises the middle-of-list neighbour lookups. Closes RIVM-bioinformatics#198.

sonarqubecloud · 2026-05-02T15:26:39Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(amplicon_covs): iterate amplicons in genomic order, not label order (#198)#199

fix(amplicon_covs): iterate amplicons in genomic order, not label order (#198)#199
werner291 wants to merge 1 commit intoRIVM-bioinformatics:mainfrom
werner291:fix/198-genomic-amplicon-order

werner291 commented May 2, 2026

Uh oh!

sonarqubecloud Bot commented May 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

werner291 commented May 2, 2026

The bug

The fix

Behavioural changes

Tests

Things I noticed in passing

AI assistance disclosure

Uh oh!

sonarqubecloud Bot commented May 2, 2026

Quality Gate passed

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant