fix(amplicon_covs): iterate amplicons in genomic order, not label order (#198)#199
Open
werner291 wants to merge 1 commit intoRIVM-bioinformatics:mainfrom
Open
fix(amplicon_covs): iterate amplicons in genomic order, not label order (#198)#199werner291 wants to merge 1 commit intoRIVM-bioinformatics:mainfrom
werner291 wants to merge 1 commit intoRIVM-bioinformatics:mainfrom
Conversation
_calculate_amplicon_start_end walked amplicons sorted by their numeric label (count) and used idx-1 / idx+1 to find the previous and next amplicon for boundary computation, implicitly assuming label order matches the order amplicons appear on the reference. On BED files where that does not hold, the neighbour lookup returned the wrong pair, the boundary positions crossed over, and the validator raised a ValueError. Iterate in genomic order (sorted by minimum primer start) so that neighbour lookups refer to the correct amplicons. The dataframe row order is set by label-keyed df.loc writes inside the loop, so it stays label-sorted regardless of iteration order, preserving alignment with _create_amplicon_names_list downstream. Adds two regression tests: a 2-amplicon BED with reversed labels (the exact RIVM-bioinformatics#198 scenario) and a 3-amplicon scrambled BED that exercises the middle-of-list neighbour lookups. Closes RIVM-bioinformatics#198.
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.



Hello RIVM, I'm Werner. I'm trying to engage with the bioinformatics open source community and saw your open issue #198, so I figured I'd give it a shot.
Full disclosure: my background is in informatics, not biology. As I read the issue, though, it reduces to a data-wrangling bug, which felt approachable. Please correct me if I have misunderstood what the function is meant to do.
Closes #198.
The bug
_calculate_amplicon_start_endwalks amplicons sorted by their numeric label (count) and usesidx - 1/idx + 1to find the previous and next amplicon for boundary computation. This implicitly assumes label order matches the order amplicons appear on the reference. The BED file in #198 violates that (a pair labeled1sits downstream of one labeled2), the neighbour lookup returns the wrong pair, and the validator fires.The fix
Iterate in genomic order (sorted by minimum primer start) directly, so
idx - 1andidx + 1mean the right thing without further changes. The dataframe row order is set by the label-keyeddf.loc[...]writes inside the loop, so it stays label-sorted regardless of iteration order, and the downstream join with_create_amplicon_names_list(also label-sorted) keeps aligning.Behavioural changes
ValueError.min(start)enough can also reshuffle which amplicons count as neighbours. Both the previous and the new code produce incorrect output on mislabelled inputs; the failure modes just differ.Tests
test_calculate_amplicon_start_end_non_genomic_numbering: a 2-amplicon BED with labels reversed relative to genomic position. Onmainit raises with the exactValueErrorfrom the issue. With the fix it passes.test_calculate_amplicon_start_end_three_amplicons_scrambled: a 3-amplicon BED where the middle amplicon exercises both previous- and next-neighbour lookups in genomic order.test_amplicon_covs.pystill pass, including the normal-order case.Things I noticed in passing
Pre-existing characteristics of the function, neither introduced nor changed by this PR:
filter_bed_input.pyfiltering perRefIDupstream.AI assistance disclosure
Prepared with help from an LLM (Claude). I picked the issue, walked the code, ran the tests, and am accountable for the result.