Commit 6857e59
feat(qc): recombination: A: Weighted Threshold
## Recombination Detection: Strategy A - Weighted Threshold
### Scientific Motivation
Recombinant viruses inherit genetic material from multiple parental lineages,
resulting in a distinctive pattern of private mutations when compared to a
single reference sequence. The key insight is that different mutation types
carry different diagnostic weight:

- Unlabeled mutations: Novel mutations not associated with known lineages
- Labeled mutations: Mutations characteristic of specific viral lineages
- Reversions: Positions where the sample reverts to the ancestral state

Reversions are particularly significant because they often indicate that a
genomic region derives from a different parental lineage than the reference.
When a recombinant inherits a segment from a lineage closer to the ancestral
sequence, positions that mutated in the reference lineage will appear as
reversions. By weighting reversions more heavily, this strategy increases
sensitivity to this recombination signature.
### Mechanism
The weighted threshold strategy calculates a weighted sum of private mutation
counts by type:

    weightedCount = unlabeled * weightUnlabeled
                  + labeled * weightLabeled
                  + reversions * weightReversion

The strategy then measures how far weightedCount exceeds the configured
threshold. At or below the threshold the excess, and therefore the score,
is zero:

    excess = max(0, weightedCount - threshold)
    strategyScore = excess * weight / threshold

The final QC score is:

    strategyScore * scoreWeight

The default per-type weights (weightUnlabeled = 1.0, weightLabeled = 1.0,
weightReversion = 2.0) reflect that reversions are more diagnostic of
recombination than other private mutations. Dataset maintainers can tune
these weights based on the evolutionary dynamics of their specific pathogen.
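
The formulas above can be sketched in Rust as follows. This is a minimal illustration, not the code in qc_rule_recombinants.rs: the struct and function names (`PrivateMutationCounts`, `WeightedThresholdParams`, `weighted_count`, `strategy_score`, `qc_score`) are hypothetical, and the sketch assumes that threshold is strictly positive because the score divides by it.

```rust
/// Private mutation counts for one sample (hypothetical struct for this sketch).
#[derive(Debug, Clone, Copy)]
struct PrivateMutationCounts {
    unlabeled: usize,
    labeled: usize,
    reversions: usize,
}

/// Parameters named after the weightedThreshold configuration keys.
/// Assumption: `threshold` is strictly positive, since the score divides by it.
#[derive(Debug, Clone, Copy)]
struct WeightedThresholdParams {
    threshold: f64,
    weight: f64,
    weight_unlabeled: f64,
    weight_labeled: f64,
    weight_reversion: f64,
}

/// weightedCount: each mutation type contributes its count times its weight.
fn weighted_count(c: &PrivateMutationCounts, p: &WeightedThresholdParams) -> f64 {
    c.unlabeled as f64 * p.weight_unlabeled
        + c.labeled as f64 * p.weight_labeled
        + c.reversions as f64 * p.weight_reversion
}

/// strategyScore: the excess over the threshold, scaled by weight / threshold.
/// At or below the threshold the excess is clamped to zero.
fn strategy_score(c: &PrivateMutationCounts, p: &WeightedThresholdParams) -> f64 {
    let excess = (weighted_count(c, p) - p.threshold).max(0.0);
    excess * p.weight / p.threshold
}

/// Final QC score: strategyScore multiplied by the rule-level scoreWeight.
fn qc_score(c: &PrivateMutationCounts, p: &WeightedThresholdParams, score_weight: f64) -> f64 {
    strategy_score(c, p) * score_weight
}
```

As a worked example with threshold 20, weight 1.0, per-type weights (1.0, 1.0, 2.0) and scoreWeight 100: a sample with 10 unlabeled mutations, 4 labeled mutations and 8 reversions has weightedCount = 10 + 4 + 16 = 30, excess = 10, strategyScore = 10 * 1.0 / 20 = 0.5, and a final QC score of 50.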
### Configuration
In pathogen.json, configure under qc.recombinants:

    "qc": {
      "recombinants": {
        "enabled": true,
        "scoreWeight": 100,
        "weightedThreshold": {
          "enabled": true,
          "threshold": 20,
          "weight": 1.0,
          "weightUnlabeled": 1.0,
          "weightLabeled": 1.0,
          "weightReversion": 2.0
        }
      }
    }

Parameters:

- enabled: Master switch for recombinant detection
- scoreWeight: Multiplier for the rule's contribution to the overall QC score
- weightedThreshold.enabled: Enables this specific strategy
- weightedThreshold.threshold: Weighted count above which a sample is flagged
  as a potential recombinant
- weightedThreshold.weight: Scaling factor applied to the excess over the
  threshold
- weightedThreshold.weightUnlabeled, weightLabeled, weightReversion:
  Per-mutation-type weights used in the weighted count
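
To make the shape of the weightedThreshold block concrete, here is a hedged Rust sketch of a matching config struct with a sanity check. It is not the struct defined in qc_config.rs: the name `WeightedThresholdConfig`, the `validate` method and its rules are assumptions made for illustration. The default values are copied from the example configuration above. Rejecting negative weights is a choice of this sketch, not something the commit states.

```rust
/// Hypothetical mirror of the `weightedThreshold` config block.
#[derive(Debug, Clone, PartialEq)]
struct WeightedThresholdConfig {
    enabled: bool,
    threshold: f64,
    weight: f64,
    weight_unlabeled: f64,
    weight_labeled: f64,
    weight_reversion: f64,
}

impl Default for WeightedThresholdConfig {
    /// Values taken from the example configuration in this commit message.
    fn default() -> Self {
        Self {
            enabled: true,
            threshold: 20.0,
            weight: 1.0,
            weight_unlabeled: 1.0,
            weight_labeled: 1.0,
            weight_reversion: 2.0,
        }
    }
}

impl WeightedThresholdConfig {
    /// Sanity checks assumed for this sketch:
    /// - threshold must be finite and > 0, because the score divides by it;
    /// - weights must be finite and non-negative (an assumption, see lead-in).
    fn validate(&self) -> Result<(), String> {
        if !self.threshold.is_finite() || self.threshold <= 0.0 {
            return Err(format!("threshold must be > 0, got {}", self.threshold));
        }
        let weights = [
            ("weight", self.weight),
            ("weightUnlabeled", self.weight_unlabeled),
            ("weightLabeled", self.weight_labeled),
            ("weightReversion", self.weight_reversion),
        ];
        for (name, value) in weights {
            if !value.is_finite() || value < 0.0 {
                return Err(format!("{name} must be finite and >= 0, got {value}"));
            }
        }
        Ok(())
    }
}
```

A check of this kind would catch a threshold of 0 before it reaches the division in the score formula.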
### Advantages
- Simple, interpretable algorithm with clear biological rationale
- Configurable weights allow tuning for different pathogens
- Low computational overhead - O(1) calculation from pre-computed counts
- Backward compatible with existing datasets via legacy fallback
- Reversion weighting captures a key recombination signature
### Limitations
- Does not consider spatial distribution of mutations along the genome
- Cannot distinguish recombination from convergent evolution
- Requires careful threshold tuning per pathogen to avoid false positives
- Single threshold may not suit all lineage combinations
- No direct evidence of breakpoint locations
### Comparison to Other Strategies
Strategy A is the simplest approach, using only mutation counts without
positional information. Other strategies extend detection capabilities:

- B (Spatial Uniformity): Uses coefficient of variation to detect clustered
  mutations - recombinants often show mutations concentrated in segments
  from the divergent parent
- C (Cluster Gaps): Analyzes gaps between SNP clusters to identify potential
  breakpoint regions
- D (Reversion Clustering): Looks for spatially clustered reversions, which
  indicate a segment from a more ancestral lineage
- E (Multi-Ancestor): Compares to multiple reference sequences to find
  regions matching different ancestors
- F (Label Switching): Detects when different genome regions show mutations
  characteristic of different lineages

Strategy A serves as a baseline detector. For comprehensive recombinant
identification, combine with spatial strategies (B, C, D) when available.
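
To illustrate what positional information adds over plain counts, the sketch below computes a coefficient of variation (standard deviation divided by mean) over the gaps between consecutive mutation positions. This commit does not specify how Strategy B defines its statistic, so treat the choice of gaps as the measured quantity, the use of the population standard deviation, and the function name `gap_coefficient_of_variation` as assumptions of this example.

```rust
/// Coefficient of variation of the gaps between consecutive mutation positions.
/// Assumption for this sketch: population standard deviation divided by mean.
/// Returns None when there are fewer than two gaps or the mean gap is zero.
fn gap_coefficient_of_variation(positions: &[usize]) -> Option<f64> {
    // Work on a sorted copy so the caller's slice is left untouched.
    let mut sorted = positions.to_vec();
    sorted.sort_unstable();

    // Distances between neighbouring positions along the genome.
    let gaps: Vec<f64> = sorted
        .windows(2)
        .map(|pair| (pair[1] - pair[0]) as f64)
        .collect();
    if gaps.len() < 2 {
        return None;
    }

    let n = gaps.len() as f64;
    let mean = gaps.iter().sum::<f64>() / n;
    if mean == 0.0 {
        return None;
    }

    let variance = gaps.iter().map(|g| (g - mean).powi(2)).sum::<f64>() / n;
    Some(variance.sqrt() / mean)
}
```

Evenly spaced positions such as 100, 200, 300, 400 give a value of 0, while positions clustered in one segment (for example 10, 12, 14, 5000, 5002) give a value well above 1. Two samples with identical mutation counts receive the same weighted count under Strategy A, yet can differ sharply on a positional statistic like this one.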
### Implementation Summary
Files modified:

- packages/nextclade/src/qc/qc_config.rs - Strategy config structs with
  defaults and backward compatibility for legacy mutationsThreshold
- packages/nextclade/src/qc/qc_rule_recombinants.rs - Main rule implementation
  with weighted count calculation and scoring
- packages/nextclade/src/qc/qc_recomb_utils.rs - Shared utilities for position
  clustering and spatial analysis (foundation for strategies B-D)
- packages/nextclade/src/qc/mod.rs - Module exports
- packages/nextclade/src/qc/qc_run.rs - Integration into QC pipeline
- packages/nextclade-web/src/helpers/formatQCRecombinants.ts - Web UI message
  formatting with weighted count details
- packages/nextclade-web/src/components/Results/* - UI integration
- packages/nextclade-schemas/*.json,yaml - Updated JSON schemas

Test dataset added:

- data/recomb/enpen/enterovirus/ev-d68/ - Enterovirus D68 dataset with
  recombinants rule enabled for testing the weighted threshold strategy

### Future Work
- Add unit tests for edge cases (zero mutations, extreme weights)
- Implement strategies B-F for comprehensive recombination detection
- Add visualization of mutation types in web UI results table
- Consider adaptive thresholds based on genome length and diversity
- Explore machine learning approaches trained on known recombinants

Co-Authored-By: Claude <noreply@anthropic.com>

1 parent 901e78d commit 6857e59
26 files changed: +27064 -7 lines changed