Skip to content

fix(amplicon_covs): iterate amplicons in genomic order, not label order (#198)#199

Open
werner291 wants to merge 1 commit intoRIVM-bioinformatics:mainfrom
werner291:fix/198-genomic-amplicon-order
Open

fix(amplicon_covs): iterate amplicons in genomic order, not label order (#198)#199
werner291 wants to merge 1 commit intoRIVM-bioinformatics:mainfrom
werner291:fix/198-genomic-amplicon-order

Conversation

@werner291
Copy link
Copy Markdown

Hello RIVM, I'm Werner. I'm trying to engage with the bioinformatics open source community and saw your open issue #198, so I figured I'd give it a shot.

Full disclosure: my background is in informatics, not biology. As I read the issue, though, it reduces to a data-wrangling bug, which felt approachable. Please correct me if I have misunderstood what the function is meant to do.

Closes #198.

The bug

_calculate_amplicon_start_end walks amplicons sorted by their numeric label (count) and uses idx - 1 / idx + 1 to find the previous and next amplicon for boundary computation. This implicitly assumes label order matches the order amplicons appear on the reference. The BED file in #198 violates that (a pair labeled 1 sits downstream of one labeled 2), the neighbour lookup returns the wrong pair, and the validator fires.

The fix

Iterate in genomic order (sorted by minimum primer start) directly, so idx - 1 and idx + 1 mean the right thing without further changes. The dataframe row order is set by the label-keyed df.loc[...] writes inside the loop, so it stays label-sorted regardless of iteration order, and the downstream join with _create_amplicon_names_list (also label-sorted) keeps aligning.

Behavioural changes

  • On BED files where label order matches genomic order: no change. Iteration order, computed boundaries, and output dataframe are identical.
  • On BED files where label order does not match genomic order (the Differently sorted primers in .bed file crashes amplicon cov calculation #198 case): the function now produces correct boundaries instead of raising ValueError.
  • On BED files with a mislabelled primer: iteration order is now derived from primer positions rather than from labels, so a mislabelled primer that shifts an amplicon's min(start) enough can also reshuffle which amplicons count as neighbours. Both the previous and the new code produce incorrect output on mislabelled inputs; the failure modes just differ.

Tests

  • Added test_calculate_amplicon_start_end_non_genomic_numbering: a 2-amplicon BED with labels reversed relative to genomic position. On main it raises with the exact ValueError from the issue. With the fix it passes.
  • Added test_calculate_amplicon_start_end_three_amplicons_scrambled: a 3-amplicon BED where the middle amplicon exercises both previous- and next-neighbour lookups in genomic order.
  • The existing 11 tests in test_amplicon_covs.py still pass, including the normal-order case.
  • I was not able to run the e2e suite locally (data download issues setting up the environment). Validation against the full pipeline relies on your CI.

Things I noticed in passing

Pre-existing characteristics of the function, neither introduced nor changed by this PR:

  • The function ignores the BED's chromosome column; correctness for multi-contig inputs relies on filter_bed_input.py filtering per RefID upstream.
  • Ties on minimum primer start fall through to pandas' stable sort.

AI assistance disclosure

Prepared with help from an LLM (Claude). I picked the issue, walked the code, ran the tests, and am accountable for the result.

_calculate_amplicon_start_end walked amplicons sorted by their numeric
label (count) and used idx-1 / idx+1 to find the previous and next
amplicon for boundary computation, implicitly assuming label order
matches the order amplicons appear on the reference. On BED files where
that does not hold, the neighbour lookup returned the wrong pair, the
boundary positions crossed over, and the validator raised a ValueError.

Iterate in genomic order (sorted by minimum primer start) so that
neighbour lookups refer to the correct amplicons. The dataframe row
order is set by label-keyed df.loc writes inside the loop, so it stays
label-sorted regardless of iteration order, preserving alignment with
_create_amplicon_names_list downstream.

Adds two regression tests: a 2-amplicon BED with reversed labels (the
exact RIVM-bioinformatics#198 scenario) and a 3-amplicon scrambled BED that exercises the
middle-of-list neighbour lookups.

Closes RIVM-bioinformatics#198.
@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud Bot commented May 2, 2026

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Differently sorted primers in .bed file crashes amplicon cov calculation

1 participant