fix: Compiler replicate contamination fix by saarantras · Pull Request #279 · kircherlab/MPRAsnakeflow

saarantras · 2026-06-03T20:59:41Z

This PR fixes a small error in mpranalyze_compiler.py that swaps counts between biological replicates. This violates the statistical independence of the replicates and reduces power, and the effect is worse when inter-replicate variability is higher.

This is similar to this PR for MPRAflow: it is also a change to the MPRAnalyze compiler script, it fixes an issue that mixes data between replicates, and correcting it will also improve power. It differs in that the compiler in MPRAsnakeflow has a quite different flow, and the underlying logic error is different. (I have formatted both PRs and their tests in the same way for clarity.) Unlike the MPRAflow bug (which required a trailing-NA condition to manifest), this misalignment affects every oligo regardless of input. This bug has the potential to change the conclusions of studies that have used this code.

At a technical level, this bug is caused by line 82:

counts = pd.DataFrame(list(data.groupby('label').apply(lambda x: x.values.flatten()))).fillna(0).astype(np.int64)

flatten() defaults to row-major order (order='C'), but column-major order is needed here. I changed the line to:

counts = pd.DataFrame(list(data.groupby('label').apply(lambda x: x.values.flatten(order='F')))).fillna(0).astype(np.int64)
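The difference between the two flattening orders can be seen on a toy matrix. The sketch below assumes that each per-label block has one row per barcode and one column per replicate, and that the output columns are ordered by replicate first and barcode second. The variable names are illustrative and are not taken from the script.

```python
import numpy as np

# Hypothetical per-oligo block: rows are barcodes, columns are replicates 1..3.
block = np.array([
    [100, 200, 300],  # barcode 1
    [101, 201, 301],  # barcode 2
])

# Default (row-major, order='C'): walks along each row, so replicates interleave.
row_major = block.flatten()

# Column-major (order='F'): walks down each column, so both barcodes of
# replicate 1 come first, then replicate 2, then replicate 3.
col_major = block.flatten(order="F")

print(row_major.tolist())  # [100, 200, 300, 101, 201, 301]
print(col_major.tolist())  # [100, 101, 200, 201, 300, 301]
```

Only the column-major result matches an output header that lists both barcodes of replicate 1 before moving on to replicate 2.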

Example of behavior before fix

Here is an example input:

label	Sequence	Barcode	DNA(condition X, replicate 1)	DNA(condition X, replicate 2)	DNA(condition X, replicate 3)	RNA(condition X, replicate 1)	RNA(condition X, replicate 2)	RNA(condition X, replicate 3)
oligoA	A	BC0	10	20	30	100	200	300
oligoA	A	BC1	11	21	31	101	201	301
oligoB	B	BC2	12	22	32	102	202	302
oligoB	B	BC3	13	23	33	103	203	303

The first digit of each count is its replicate number; two-digit counts are DNA and three-digit counts are RNA. This makes it easy to see where each number ends up.

rna_annot.tsv.gz:

sample	type	condition	replicate	barcode
RNA_X_1_1	RNA	X	1	1
RNA_X_1_2	RNA	X	1	2
RNA_X_2_1	RNA	X	2	1
RNA_X_2_2	RNA	X	2	2
RNA_X_3_1	RNA	X	3	1
RNA_X_3_2	RNA	X	3	2

dna_annot.tsv.gz:

sample	type	condition	replicate	barcode
DNA_X_1_1	DNA	X	1	1
DNA_X_1_2	DNA	X	1	2
DNA_X_2_1	DNA	X	2	1
DNA_X_2_2	DNA	X	2	2
DNA_X_3_1	DNA	X	3	1
DNA_X_3_2	DNA	X	3	2

In each sample name, the first number is the replicate and the second is the barcode.
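The sample names in these annotation tables appear to follow the pattern type_condition_replicate_barcode, with the replicate varying slowest. A minimal sketch that reproduces the sample column under that assumption (sample_names is a hypothetical helper, not a function from the pipeline):

```python
from itertools import product


def sample_names(assay, condition, n_replicates, n_barcodes):
    """Hypothetical helper: names ordered by replicate first, barcode second."""
    return [
        f"{assay}_{condition}_{rep}_{bc}"
        for rep, bc in product(range(1, n_replicates + 1), range(1, n_barcodes + 1))
    ]


names = sample_names("RNA", "X", n_replicates=3, n_barcodes=2)
print(names)
# ['RNA_X_1_1', 'RNA_X_1_2', 'RNA_X_2_1', 'RNA_X_2_2', 'RNA_X_3_1', 'RNA_X_3_2']
```

Because the barcode index changes fastest in the header, the count vector for each oligo has to be laid out the same way.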

rna_counts.tsv.gz:

seq_id	RNA_X_1_1	RNA_X_1_2	RNA_X_2_1	RNA_X_2_2	RNA_X_3_1	RNA_X_3_2
oligoA	100	200	300	101	201	301
oligoB	102	202	302	103	203	303

As you can see, the replicates are scrambled. RNA_X_1_2 is replicate 1, but it has 200 and 202, which belong to replicate 2.

dna_counts.tsv.gz:

seq_id	DNA_X_1_1	DNA_X_1_2	DNA_X_2_1	DNA_X_2_2	DNA_X_3_1	DNA_X_3_2
oligoA	10	20	30	11	21	31
oligoB	12	22	32	13	23	33

The DNA counts have the same problem: DNA_X_2_2 has 11, which belongs to replicate 1, and so on.

Example of behavior after fix

rna_counts.tsv.gz:

seq_id	RNA_X_1_1	RNA_X_1_2	RNA_X_2_1	RNA_X_2_2	RNA_X_3_1	RNA_X_3_2
oligoA	100	101	200	201	300	301
oligoB	102	103	202	203	302	303

dna_counts.tsv.gz:

seq_id	DNA_X_1_1	DNA_X_1_2	DNA_X_2_1	DNA_X_2_2	DNA_X_3_1	DNA_X_3_2
oligoA	10	11	20	21	30	31
oligoB	12	13	22	23	32	33

Now the replicates line up properly.


Copilot

Pull request overview

This PR targets a correctness issue in the MPRAnalyze compiler script (workflow/scripts/count/mpranalyze_compiler.py) where count vectors are flattened in the wrong order, causing DNA/RNA counts to be assigned to the wrong biological replicates. It also adds a minimal repro harness under test_compiler/ to exercise the compiler on a small input.

Changes:

Fix count vector construction to use column-major flattening (order='F') when building per-oligo count tables.
Add a small runnable test harness (test_compiler/test.py) plus a minimal input TSV to reproduce/validate the expected ordering.

Reviewed changes

Copilot reviewed 3 out of 11 changed files in this pull request and generated 2 comments.

File	Description
`workflow/scripts/count/mpranalyze_compiler.py`	Changes how per-label count matrices are flattened into output vectors to avoid replicate scrambling.
`test_compiler/test.py`	Adds a small script to run the compiler on a minimal example (currently needs assertions to be a real regression check).
`test_compiler/minimal_test_input.tsv`	Provides a minimal input dataset designed to detect replicate/barcode mis-ordering.

    def generateCountOutput(data,columns):
-        counts = pd.DataFrame(list(data.groupby('label').apply(lambda x: x.values.flatten()))).fillna(0).astype(np.int64)
+        counts = pd.DataFrame(list(data.groupby('label').apply(lambda x: x.values.flatten(order='F')))).fillna(0).astype(np.int64)
        counts.columns = columns
        counts['seq_id'] = data.index.unique()
        counts = counts[(['seq_id'] + list(columns))]
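The overview above notes that the harness needs assertions to act as a regression check. One possible shape for such a check, using the digit encoding from the minimal example, is sketched below. generate_count_output is a hypothetical stand-in that mirrors the patched line; it is not the script's actual function, and the in-memory frame stands in for the TSV input.

```python
import numpy as np
import pandas as pd


def generate_count_output(data, columns):
    """Sketch of the patched logic: one row per label, column-major flattening."""
    labels, rows = [], []
    for label, group in data.groupby(level="label", sort=False):
        labels.append(label)
        rows.append(group.to_numpy().flatten(order="F"))
    counts = pd.DataFrame(rows, columns=columns).fillna(0).astype(np.int64)
    counts.insert(0, "seq_id", labels)
    return counts


# RNA part of the minimal example: rows are barcodes, columns are replicates.
rna = pd.DataFrame(
    {"rep1": [100, 101, 102, 103], "rep2": [200, 201, 202, 203], "rep3": [300, 301, 302, 303]},
    index=pd.Index(["oligoA", "oligoA", "oligoB", "oligoB"], name="label"),
)
columns = [f"RNA_X_{rep}_{bc}" for rep in (1, 2, 3) for bc in (1, 2)]
counts = generate_count_output(rna, columns)

# Regression check: the leading digit of every RNA count encodes its replicate,
# so it must equal the replicate number embedded in the column name.
for col in columns:
    replicate = col.split("_")[2]
    assert all(str(value)[0] == replicate for value in counts[col]), col
```

With the row-major flatten, the assertion fails on RNA_X_1_2, which is the column called out in the "before" example.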



saarantras and others added 3 commits June 3, 2026 16:15

example of compiler bug

ef2f9da

+fix for compiler replicate confounding

f061335

+example of fix

ci: auto fixes from pre-commit hooks

a1115ec

for more information, see https://pre-commit.ci

saarantras changed the title ~~Compiler replicate contamination fix~~ fix: Compiler replicate contamination fix Jun 3, 2026

visze requested a review from Copilot June 4, 2026 16:18

Copilot started reviewing on behalf of visze June 4, 2026 16:18

Copilot AI reviewed Jun 4, 2026


saarantras and others added 8 commits June 5, 2026 19:24

Potential fix for pull request finding

6b7ffca

This improves the tests substantively, as described. Not actually required for the fix, but a nice addition. Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

ci: auto fixes from pre-commit hooks

2a93422

for more information, see https://pre-commit.ci

tests: using general pytests for python scripts

68f49a3

Merge pull request #1 from visze/pr/saarantras/279

195506e

test: using general pytests for python scripts

quick additional test, making sure ragged ends dont goof

2d86b82

ci: auto fixes from pre-commit hooks

580f840

for more information, see https://pre-commit.ci

+switched type to prevent overflow

708bd4a

Merge branch 'compiler_fix' of github.com:saarantras/MPRAsnakeflow_fi…

b360d67

…x into compiler_fix

saarantras wants to merge 11 commits into kircherlab:master from saarantras:compiler_fix

3 participants
