fix: Compiler replicate contamination fix#279

Open
saarantras wants to merge 11 commits into kircherlab:master from saarantras:compiler_fix

Conversation

@saarantras


This PR fixes a small error in mpranalyze_compiler.py that causes counts to be swapped between biological replicates. The swap violates the statistical independence of the replicates and reduces power, and the damage grows with inter-replicate variability.

This is similar to this PR for MPRAflow: it is also a change to the MPRAnalyze compiler script, it fixes an issue that mixes data between replicates, and correcting it will also improve power. It differs in that the compiler in snakeflow has a quite different flow, and the underlying logic error is different. (I have formatted both PRs and the tests in the same way for clarity.) Unlike the MPRAflow bug (which required a trailing-NA condition to manifest), this misalignment affects every oligo regardless of input. This bug has the potential to change the conclusions of studies that have used this code.

At a technical level, this bug is caused by line 82:

counts = pd.DataFrame(list(data.groupby('label').apply(lambda x: x.values.flatten()))).fillna(0).astype(np.int64)

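flatten() defaults to row-major (C) order, which walks each barcode row across all replicates before moving to the next barcode. The output columns are ordered replicate first and barcode second, so the flattening should be column-major. I change the line to: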

counts = pd.DataFrame(list(data.groupby('label').apply(lambda x: x.values.flatten(order='F')))).fillna(0).astype(np.int64)
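To make the ordering difference concrete, here is a minimal NumPy sketch using the oligoA DNA counts from the example in this PR. The array layout (barcodes as rows, replicates as columns) and the names `oligo_a_dna`, `before`, and `after` are assumptions made for illustration; they are not taken from the compiler script.

```python
import numpy as np

# Assumed layout: one row per barcode (BC0, BC1), one column per replicate (1, 2, 3).
oligo_a_dna = np.array([[10, 20, 30],
                        [11, 21, 31]])

# Default order='C' walks row by row: all replicates of BC0, then all of BC1.
row_major = oligo_a_dna.flatten()

# order='F' walks column by column: all barcodes of replicate 1, then replicate 2, ...
col_major = oligo_a_dna.flatten(order='F')

# Output columns are named replicate first, barcode second.
columns = [f"DNA_X_{rep}_{bc}" for rep in (1, 2, 3) for bc in (1, 2)]

before = dict(zip(columns, row_major.tolist()))
after = dict(zip(columns, col_major.tolist()))

print(before)  # DNA_X_2_2 receives 11, a replicate 1 count
print(after)   # DNA_X_2_2 receives 21, a replicate 2 count
```

Only the `order` argument differs between the two calls, which is why the fix is a one-line change.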

Example of behavior before fix

Here is an example input:

| label | Sequence | Barcode | DNA(condition X, replicate 1) | DNA(condition X, replicate 2) | DNA(condition X, replicate 3) | RNA(condition X, replicate 1) | RNA(condition X, replicate 2) | RNA(condition X, replicate 3) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| oligoA | A | BC0 | 10 | 20 | 30 | 100 | 200 | 300 |
| oligoA | A | BC1 | 11 | 21 | 31 | 101 | 201 | 301 |
| oligoB | B | BC2 | 12 | 22 | 32 | 102 | 202 | 302 |
| oligoB | B | BC3 | 13 | 23 | 33 | 103 | 203 | 303 |

The first digit of each count is the replicate number, and the last digit is the barcode number (BC0 to BC3). Two-digit numbers are DNA counts and three-digit numbers are RNA counts, so it is easy to see where each number ends up.

rna_annot.tsv.gz:

sample type condition replicate barcode
RNA_X_1_1 RNA X 1 1
RNA_X_1_2 RNA X 1 2
RNA_X_2_1 RNA X 2 1
RNA_X_2_2 RNA X 2 2
RNA_X_3_1 RNA X 3 1
RNA_X_3_2 RNA X 3 2

dna_annot.tsv.gz:

sample type condition replicate barcode
DNA_X_1_1 DNA X 1 1
DNA_X_1_2 DNA X 1 2
DNA_X_2_1 DNA X 2 1
DNA_X_2_2 DNA X 2 2
DNA_X_3_1 DNA X 3 1
DNA_X_3_2 DNA X 3 2

In each sample name, the first number is the replicate and the second is the barcode.

rna_counts.tsv.gz:

seq_id RNA_X_1_1 RNA_X_1_2 RNA_X_2_1 RNA_X_2_2 RNA_X_3_1 RNA_X_3_2
oligoA 100 200 300 101 201 301
oligoB 102 202 302 103 203 303

As you can see, the replicates are scrambled. RNA_X_1_2 is replicate 1, but it has 200 and 202, which belong to replicate 2.

dna_counts.tsv.gz:

seq_id DNA_X_1_1 DNA_X_1_2 DNA_X_2_1 DNA_X_2_2 DNA_X_3_1 DNA_X_3_2
oligoA 10 20 30 11 21 31
oligoB 12 22 32 13 23 33

The DNA counts have the same problem: DNA_X_2_2 is replicate 2, but it has 11 and 13, which belong to replicate 1.

Example of behavior after fix

rna_counts.tsv.gz:

seq_id RNA_X_1_1 RNA_X_1_2 RNA_X_2_1 RNA_X_2_2 RNA_X_3_1 RNA_X_3_2
oligoA 100 101 200 201 300 301
oligoB 102 103 202 203 302 303

dna_counts.tsv.gz:

seq_id DNA_X_1_1 DNA_X_1_2 DNA_X_2_1 DNA_X_2_2 DNA_X_3_1 DNA_X_3_2
oligoA 10 11 20 21 30 31
oligoB 12 13 22 23 32 33

Now the replicates line up properly.
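Because the example encodes the replicate in the first digit of every count, the alignment can also be checked mechanically. The sketch below is a hypothetical regression check, not the test shipped in this PR: `flatten_counts` is a simplified stand-in for the compiler's flattening step (it assumes every oligo has the same number of barcodes), and `replicate_mismatches` and the `rep1`..`rep3` column names are invented for illustration.

```python
import pandas as pd

# Toy RNA input from the example: one row per barcode, one column per replicate.
rna = pd.DataFrame(
    [[100, 200, 300], [101, 201, 301], [102, 202, 302], [103, 203, 303]],
    index=pd.Index(["oligoA", "oligoA", "oligoB", "oligoB"], name="label"),
    columns=["rep1", "rep2", "rep3"],
)
sample_columns = [f"RNA_X_{rep}_{bc}" for rep in (1, 2, 3) for bc in (1, 2)]


def flatten_counts(data, columns, order):
    """Hypothetical stand-in: flatten each oligo's barcode x replicate block."""
    rows = {
        label: group.to_numpy().flatten(order=order)
        for label, group in data.groupby(level="label", sort=False)
    }
    return pd.DataFrame.from_dict(rows, orient="index", columns=columns)


def replicate_mismatches(counts):
    """List (oligo, column, value) where the value's first digit is not the column's replicate."""
    bad = []
    for col in counts.columns:
        replicate = col.split("_")[2]  # e.g. "RNA_X_2_1" -> "2"
        for label, value in counts[col].items():
            if str(value)[0] != replicate:
                bad.append((label, col, int(value)))
    return bad


fixed = flatten_counts(rna, sample_columns, order="F")
buggy = flatten_counts(rna, sample_columns, order="C")

print(replicate_mismatches(fixed))  # empty list when replicates line up
print(replicate_mismatches(buggy))  # lists the misplaced counts under row-major order
```

A check like this only works on synthetic inputs built with the digit encoding above; it is meant as a pattern for an assertion-based test, not as validation of real count data.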

@saarantras saarantras changed the title from "Compiler replicate contamination fix" to "fix: Compiler replicate contamination fix" Jun 3, 2026
@visze visze requested a review from Copilot June 4, 2026 16:18

Copilot AI left a comment


Pull request overview

This PR targets a correctness issue in the MPRAnalyze compiler script (workflow/scripts/count/mpranalyze_compiler.py) where count vectors are flattened in the wrong order, causing DNA/RNA counts to be assigned to the wrong biological replicates. It also adds a minimal repro harness under test_compiler/ to exercise the compiler on a small input.

Changes:

  • Fix count vector construction to use column-major flattening (order='F') when building per-oligo count tables.
  • Add a small runnable test harness (test_compiler/test.py) plus a minimal input TSV to reproduce/validate the expected ordering.

Reviewed changes

Copilot reviewed 3 out of 11 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| workflow/scripts/count/mpranalyze_compiler.py | Changes how per-label count matrices are flattened into output vectors to avoid replicate scrambling. |
| test_compiler/test.py | Adds a small script to run the compiler on a minimal example (currently needs assertions to be a real regression check). |
| test_compiler/minimal_test_input.tsv | Provides a minimal input dataset designed to detect replicate/barcode mis-ordering. |

Comment on lines 81 to 85
 def generateCountOutput(data,columns):
-    counts = pd.DataFrame(list(data.groupby('label').apply(lambda x: x.values.flatten()))).fillna(0).astype(np.int64)
+    counts = pd.DataFrame(list(data.groupby('label').apply(lambda x: x.values.flatten(order='F')))).fillna(0).astype(np.int64)
     counts.columns = columns
     counts['seq_id'] = data.index.unique()
     counts = counts[(['seq_id'] + list(columns))]
Comment thread test_compiler/test.py Outdated
saarantras and others added 8 commits June 5, 2026 19:24
This improves the tests substantively, as described. Not actually required for the fix, but a nice addition.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
test: using general pytests for python scripts