fix: Compiler replicate contamination fix #279
Open
saarantras wants to merge 11 commits into
+example of fix
for more information, see https://pre-commit.ci
Pull request overview
This PR targets a correctness issue in the MPRAnalyze compiler script (workflow/scripts/count/mpranalyze_compiler.py) where count vectors are flattened in the wrong order, causing DNA/RNA counts to be assigned to the wrong biological replicates. It also adds a minimal repro harness under test_compiler/ to exercise the compiler on a small input.
Changes:
- Fix count vector construction to use column-major flattening (`order='F'`) when building per-oligo count tables.
- Add a small runnable test harness (`test_compiler/test.py`) plus a minimal input TSV to reproduce/validate the expected ordering.
Reviewed changes
Copilot reviewed 3 out of 11 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `workflow/scripts/count/mpranalyze_compiler.py` | Changes how per-label count matrices are flattened into output vectors to avoid replicate scrambling. |
| `test_compiler/test.py` | Adds a small script to run the compiler on a minimal example (currently needs assertions to be a real regression check). |
| `test_compiler/minimal_test_input.tsv` | Provides a minimal input dataset designed to detect replicate/barcode mis-ordering. |
Comment on lines 81 to 85
     def generateCountOutput(data,columns):
    -    counts = pd.DataFrame(list(data.groupby('label').apply(lambda x: x.values.flatten()))).fillna(0).astype(np.int64)
    +    counts = pd.DataFrame(list(data.groupby('label').apply(lambda x: x.values.flatten(order='F')))).fillna(0).astype(np.int64)
         counts.columns = columns
         counts['seq_id'] = data.index.unique()
         counts = counts[(['seq_id'] + list(columns))]
This improves the tests substantively, as described. Not actually required for the fix, but a nice addition.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
test: using general pytests for python scripts
…x into compiler_fix
This PR fixes a small error in `mpranalyze_compiler.py` which causes counts to get swapped between biological replicates. This violates the statistical independence of replicates and reduces power, and the defect gets worse with higher inter-replicate variability.

This is similar to this PR for MPRAflow, in that it is also a change to the MPRAnalyze compiler script that fixes an issue which mixes data between replicates, and correcting this bug will also improve power. It differs in that the compiler in snakeflow is structured quite differently, and the underlying logic error is different. (I have formatted both PRs and the tests in the same way for clarity.) Unlike the MPRAflow bug (which required a trailing-NA condition to manifest), this misalignment affects every oligo regardless of input. This bug has the potential to change the conclusions of studies which have used this code.
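At a technical level, this bug is caused by line 82, which flattens each per-label block of counts with `x.values.flatten()`.
This defaults to row-major order, when it should be column-major. I modify it to `x.values.flatten(order='F')`.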
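The difference between the two flattening orders can be seen on a toy array. This is a minimal sketch, not the compiler's actual data: the 2x2 `block` and its layout (rows as barcodes, columns as replicates) are assumptions made for illustration, and the values follow the convention that the first digit encodes the replicate.

```python
import numpy as np

# Hypothetical per-oligo block (layout assumed for illustration):
# rows are barcodes, columns are replicates.
# The first digit of each count encodes the replicate it came from.
block = np.array([
    [10, 20],  # barcode 1: replicate 1, replicate 2
    [11, 21],  # barcode 2: replicate 1, replicate 2
])

# Default order='C' walks the array row by row, interleaving replicates.
row_major = block.flatten()

# order='F' walks the array column by column, keeping each replicate together.
col_major = block.flatten(order='F')

print(row_major.tolist())  # [10, 20, 11, 21]
print(col_major.tolist())  # [10, 11, 20, 21]
```

If the output columns are laid out replicate by replicate, only the column-major vector puts both replicate-1 counts in the replicate-1 columns.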
Example of behavior before fix
Here is an example input:
We've made the first digit the same number as the replicate; two-digit numbers correspond to DNA and three-digit numbers to RNA, so it's easy to see where the numbers are going.
rna_annot.tsv.gz:
dna_annot.tsv.gz:
As you can see, the first number is the replicate.
rna_counts.tsv.gz:
As you can see, the replicates are scrambled. `RNA_X_1_2` is replicate 1, but has 200 and 202, which belong with replicate 2.
dna_counts.tsv.gz:
It's a similar problem for DNA: `DNA_X_2_2` has 11, etc.
Example of behavior after fix
rna_counts.tsv.gz:
dna_counts.tsv.gz:
Now the replicates line up properly.
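The digit-encoding convention used in this example (first digit equals the replicate) lends itself to an automated check. The sketch below is one possible shape for such an assertion, not the test shipped in this PR: the helper names `replicate_of` and `misplaced_counts`, the assumed column pattern `TYPE_CONDITION_REPLICATE_BARCODE`, and the sample rows are all hypothetical.

```python
import re

def replicate_of(column):
    """Parse the replicate index from a name like 'RNA_X_1_2'.

    Assumes the pattern TYPE_CONDITION_REPLICATE_BARCODE (an assumption,
    inferred from the column names quoted in this PR description).
    """
    match = re.fullmatch(r"(?:DNA|RNA)_[^_]+_(\d+)_(\d+)", column)
    if match is None:
        raise ValueError(f"unexpected column name: {column}")
    return int(match.group(1))

def misplaced_counts(row):
    """Return (column, value) pairs whose leading digit disagrees with the column's replicate."""
    return [
        (col, val)
        for col, val in row.items()
        if val != 0 and int(str(val)[0]) != replicate_of(col)
    ]

# Hypothetical rows, not the PR's actual test data.
scrambled = {"RNA_X_1_1": 100, "RNA_X_1_2": 200, "RNA_X_2_1": 101, "RNA_X_2_2": 201}
aligned = {"RNA_X_1_1": 100, "RNA_X_1_2": 101, "RNA_X_2_1": 200, "RNA_X_2_2": 201}

print(misplaced_counts(scrambled))  # [('RNA_X_1_2', 200), ('RNA_X_2_1', 101)]
print(misplaced_counts(aligned))    # []
```

Zero counts are skipped because, under this assumption, they come from the `fillna(0)` padding and carry no replicate digit.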