
Commit 9bcf7dc

feat(qc): recombination: F: Label Switching
## Recombination Detection: Strategy F - Label Switching

### Scientific Motivation

Recombination occurs when a virus incorporates genetic material from two or more parental lineages. Each lineage accumulates characteristic mutations over time - these "signature mutations" serve as molecular markers that distinguish lineages from one another.

When a recombination event occurs, different genomic regions inherit mutations from different parental lineages. This creates a distinctive pattern: the sequence carries mutations characteristic of lineage A in one region, and mutations characteristic of lineage B in another region.

The label switching strategy exploits this by leveraging the mutation label map (`nucMutLabelMap`) - a curated mapping of nucleotide positions to lineage labels. When private mutations are detected, they inherit labels from this map. In a non-recombinant sequence, most labeled mutations should belong to a single lineage (or closely related lineages). In a recombinant, mutations from different lineages cluster in different genomic regions, creating detectable "label switches" as you traverse the genome.

### Mechanism

The algorithm proceeds as follows:

1. **Label grouping**: Collect all labeled private substitutions from `PrivateNucMutations.labeled_substitutions`. Group them by their primary label (first label in the labels array), storing genomic positions for each label.
2. **Minimum labels check**: If fewer than `minLabels` distinct labels are present, return zero score (insufficient signal for recombination).
3. **Centroid calculation**: For each label, compute the centroid (mean position) of all mutations carrying that label. This represents the "center of mass" of each lineage's contribution.
4. **Switch counting**: Sort labels by their centroid position. The number of switches equals `numLabels - 1`, representing transitions between lineage-dominated regions as you traverse the genome from 5' to 3'.
5. **Scoring**: `score = numSwitches * weight`

### Configuration

Required in `pathogen.json`:

```json
{
  "mutLabels": {
    "nucMutLabelMap": {
      "A123T": ["Alpha"],
      "G456C": ["Beta"],
      ...
    }
  },
  "qc": {
    "recombinants": {
      "enabled": true,
      "scoreWeight": 100.0,
      "labelSwitching": {
        "enabled": true,
        "weight": 50.0,
        "minLabels": 2
      }
    }
  }
}
```

Parameters:

- `enabled`: Activate label switching detection
- `weight`: Score contribution per label switch (default: 50.0)
- `minLabels`: Minimum distinct labels required to trigger detection (default: 2)

### Advantages

- Leverages existing lineage annotation infrastructure (mutLabels)
- Biologically interpretable - directly identifies which lineages contributed to the recombinant
- Does not require spatial parameters or segment definitions
- Robust to mutation density variations across the genome
- Works with any pathogen that has curated lineage-defining mutations

### Limitations

- Requires a well-curated `nucMutLabelMap` with lineage-specific mutations
- Effectiveness depends on quality and completeness of label annotations
- Cannot detect recombination between unlabeled or identically-labeled lineages
- Uses only the first label when mutations have multiple labels
- Centroid-based ordering may miss complex recombination patterns with interleaved regions

### Comparison to Other Strategies

Unlike Strategy A (weighted threshold) which only counts mutations, label switching considers the identity and spatial distribution of labeled mutations. Unlike Strategy B (spatial uniformity) which measures general non-uniformity, this strategy specifically identifies which lineages contribute to different regions.

Choose label switching when:

- Your pathogen has well-characterized lineage-defining mutations
- You want to identify the parental lineages, not just detect recombination
- The labeled mutation set has good genome-wide coverage

Choose other strategies when:

- No mutation label map is available (A, B, C, D)
- Recombination involves unlabeled variants (A, B, C, D)
- Multiple ancestral references are available (E)

### Implementation Summary

Files modified:

- `packages/nextclade/src/qc/qc_config.rs` - Added QcRecombConfigLabelSwitching config struct
- `packages/nextclade/src/qc/qc_rule_recombinants.rs` - Implemented strategy_label_switching function
- `packages/nextclade/src/qc/qc_recomb_utils.rs` - Added shared utilities module
- `packages/nextclade/src/qc/qc_run.rs` - Integrated recombinants rule
- `packages/nextclade/src/qc/mod.rs` - Registered new modules
- `packages/nextclade-web/src/helpers/formatQCRecombinants.ts` - Added UI formatting
- `packages/nextclade-web/src/components/Results/ListOfQcIsuues.tsx` - Display integration
- `packages/nextclade-schemas/*.schema.{json,yaml}` - Updated JSON schemas

Test dataset:

- `data/recomb/enpen/enterovirus/ev-d68/` - EV-D68 dataset with label switching configuration enabled for testing

Unit tests added for:

- Disabled config returns None
- Empty labeled mutations returns None
- Single label below minLabels returns zero score
- Two labels returns one switch
- Three labels returns two switches
- Multiple labels per mutation uses first label only

### Future Work

- Support weighted label switches based on centroid separation distance
- Consider secondary labels for mutations with multiple lineage assignments
- Add visualization of label distribution across genome
- Integrate with tree-based lineage assignment for validation

Co-Authored-By: Claude <noreply@anthropic.com>
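
The five steps listed under "Mechanism" in the commit message can be sketched in a few lines of Rust. This is a minimal illustration only, not the code in `qc_rule_recombinants.rs`: the types `LabeledSub` and `LabelSwitchingConfig`, the function name `label_switching_score`, and the tuple return shape are hypothetical stand-ins chosen for the example. The behaviour follows the unit-test list in the message (disabled config or no labeled mutations gives `None`, fewer than `minLabels` labels gives a zero score, only the first label of each mutation is used).

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a labeled private substitution:
/// a genomic position plus the labels found in `nucMutLabelMap`.
struct LabeledSub {
    pos: usize,
    labels: Vec<String>,
}

/// Hypothetical mirror of the `labelSwitching` block in `pathogen.json`.
struct LabelSwitchingConfig {
    enabled: bool,
    weight: f64,
    min_labels: usize,
}

/// Returns `None` when the strategy is disabled or there are no labeled
/// substitutions; otherwise the per-label centroids and the score.
fn label_switching_score(
    subs: &[LabeledSub],
    cfg: &LabelSwitchingConfig,
) -> Option<(Vec<(String, f64)>, f64)> {
    if !cfg.enabled || subs.is_empty() {
        return None;
    }

    // Step 1: group positions by primary (first) label.
    let mut by_label: BTreeMap<&str, Vec<usize>> = BTreeMap::new();
    for sub in subs {
        if let Some(label) = sub.labels.first() {
            by_label.entry(label.as_str()).or_default().push(sub.pos);
        }
    }

    // Step 3: centroid (mean position) for each label.
    let mut centroids: Vec<(String, f64)> = by_label
        .iter()
        .map(|(label, positions)| {
            let sum: usize = positions.iter().sum();
            (label.to_string(), sum as f64 / positions.len() as f64)
        })
        .collect();

    // Step 2: too few distinct labels means no recombination signal.
    if centroids.len() < cfg.min_labels {
        return Some((centroids, 0.0));
    }

    // Step 4: order labels 5' to 3' by centroid; switches = numLabels - 1.
    centroids.sort_by(|a, b| a.1.total_cmp(&b.1));
    let num_switches = centroids.len().saturating_sub(1);

    // Step 5: score = numSwitches * weight.
    Some((centroids, num_switches as f64 * cfg.weight))
}
```

With `weight = 50.0` and `min_labels = 2`, substitutions labeled Alpha at positions 100 and 300 and Beta at position 5000 give centroids 200 and 5000, one switch, and a score of 50. Because the switch count is `numLabels - 1`, the sort affects only the reported order of lineages, not the score itself.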
1 parent 901e78d commit 9bcf7dc

26 files changed: +24189 -7 lines changed

.gitignore

Lines changed: 2 additions & 1 deletion
@@ -4,7 +4,8 @@
 /build/
 /data_dev*/
 /data_local*
-/data/
+/data/*
+!/data/recomb/
 /docs/build/
 /e2e/cli/snapshots/
 /e2e/cli/tmp/
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+## 2025-12-10T13:21:04Z
+
+- Update alignment parameters in pathogen.json:
+  - Fix gap extension penalty
+  - Enable reverse-complement handling
+- Recompute tree topology (ML tree rerun)
+- Regenerate mutation labels for all clades
+- Update reference example sequences
+
+## 2025-11-20T19:02:04Z
+
+Add citation information to README.md
+
+## 2025-11-19T20:40:14Z
+
+Initial release of an Enterovirus D68 dataset for lineage classification!
+
+Read more about Nextclade datasets in the documentation: https://docs.nextstrain.org/projects/nextclade/en/stable/user/datasets.html
Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
+# Enterovirus D68 dataset with reference Fermon
+
+| Key | Value |
+|----------------------|-----------------------------------------------------------------------|
+| authors | [Nadia Neuner-Jehle](https://eve-lab.org/people/nadia-neuner-jehle/), [Alejandra González-Sánchez](https://www.vallhebron.com/en/professionals/alejandra-gonzalez-sanchez), [Emma B. Hodcroft](https://eve-lab.org/people/emma-hodcroft/), [ENPEN](https://escv.eu/european-non-polio-enterovirus-network-enpen/) |
+| name | Enterovirus D68 |
+| reference | [AY426531.1](https://www.ncbi.nlm.nih.gov/nuccore/AY426531.1) |
+| workflow | https://github.com/enterovirus-phylo/nextclade_d68 |
+| path | `enpen/enterovirus/ev-d68` |
+| clade definitions | A–C (D) |
+
+## Citation
+
+If you use this dataset in your research, please cite:
+
+> Neuner-Jehle, N., González Sánchez, A., Hodcroft, E. B., & European Non-Polio Enterovirus Network (ENPEN). (2025). *enterovirus-phylo/nextclade_d68: Enterovirus D68 Nextclade Dataset v1.0.0* (v1.0.0--2025-11-18). Zenodo. https://doi.org/10.5281/zenodo.17642338
+
+[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.17642338.svg)](https://doi.org/10.5281/zenodo.17642338)
+
+## Scope of this dataset
+
+Based on full-genome sequences, this dataset uses the **Fermon reference sequence** ([AY426531.1](https://www.ncbi.nlm.nih.gov/nuccore/AY426531.1)), originally isolated in 1962. It serves as the basis for quality control, clade assignment, and mutation calling across global EV-D68 diversity.
+
+*Note: The Fermon reference differs substantially from currently circulating strains.* This is common for enterovirus datasets, in contrast to some other virus datasets (e.g., seasonal influenza), where the reference is updated more frequently to reflect recent lineages.
+
+To address this, the dataset is *rooted* on a Static Inferred Ancestor — a phylogenetically reconstructed ancestral sequence near the tree root. This provides a stable reference point that can be used, optionally, as an alternative for mutation calling.
+
+## Features
+
+This dataset supports:
+
+- Assignment of subgenotypes
+- Phylogenetic placement
+- Sequence quality control (QC)
+
+## Subgenogroups of Enterovirus D68
+
+Clade designations follow the global diversity of EV-D68: A (A1–A2/D), B (B1–B3), and C. The label "pre-ABC" indicates old, basal lineages that are no longer circulating. Sequences labeled "pre-ABC" or "unassigned" may indicate sequencing or assembly issues and should be assessed carefully.
+
+These designations are based on the phylogenetic structure and mutations, and are widely used in molecular epidemiology, similar to subgenotype systems for other enteroviruses. Unlike influenza (H1N1, H3N2) or SARS-CoV-2, there is no universal, standardized global lineage nomenclature for enteroviruses. Naming follows conventions from published studies and surveillance practices.
+
+## Reference types
+
+This dataset includes several reference points used in analyses:
+- *Reference:* RefSeq or similarly established reference sequence. Here Fermon.
+
+- *Parent:* The nearest ancestral node of a sample in the tree, used to infer branch-specific mutations.
+
+- *Clade founder:* The inferred ancestral node defining a clade (e.g., A2, B3). Mutations "since clade founder" describe changes that define that clade.
+
+- *Static Inferred Ancestor:* Reconstructed ancestral sequence inferred with an outgroup, representing the likely founder of EV-D68. Serves as a stable reference.
+
+- *Tree root:* Corresponds to the root of the tree, it may change in future updates as more data become available.
+
+All references use the coordinate system of the Fermon sequence.
+
+## Issues & Contact
+- For questions or suggestions, please [open an issue](https://github.com/enterovirus-phylo/nextclade_d68/issues) or email: eve-group[at]swisstph.ch
+
+## What is a Nextclade dataset?
+
+A Nextclade dataset includes the reference sequence, genome annotations, tree, clade definitions, and QC rules. Learn more in the [Nextclade documentation](https://docs.nextstrain.org/projects/nextclade/en/stable/user/datasets.html).
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+##gff-version 3
+#!gff-spec-version 1.21
+#!processor NCBI annotwriter
+##sequence-region AY426531.2 1 7367
+##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=42789
+# seqname source feature start end score strand frame attribute
+AY426531.1 Genbank region 1 7367 . + . ID=AY426531.1:1..7367;Dbxref=taxon:42789;country=USA;gb-acronym=EV-D68;gbkey=Src;mol_type=genomic RNA;note=prototype strain of Enterovirus 68;old-name=Enterovirus 68;strain=Fermon
+AY426531.1 Genbank CDS 733 939 . + . Name=VP4;gbkey=Prot;product=VP4;ID=id-AAR98503.1:1..69
+AY426531.1 Genbank CDS 940 1683 . + . Name=VP2;gbkey=Prot;product=VP2;ID=id-AAR98503.1:70..317
+AY426531.1 Genbank CDS 1684 2388 . + . Name=VP3;gbkey=Prot;product=VP3;ID=id-AAR98503.1:318..552
+AY426531.1 Genbank CDS 2389 3315 . + . Name=VP1;gbkey=Prot;product=VP1;ID=id-AAR98503.1:553..861
+AY426531.1 Genbank CDS 3316 3756 . + . Name=2A;gbkey=Prot;product=2A;ID=id-AAR98503.1:862..1008
+AY426531.1 Genbank CDS 3757 4053 . + . Name=2B;gbkey=Prot;product=2B;ID=id-AAR98503.1:1009..1107
+AY426531.1 Genbank CDS 4054 5043 . + . Name=2C;gbkey=Prot;product=2C;ID=id-AAR98503.1:1108..1437
+AY426531.1 Genbank CDS 5044 5310 . + . Name=3A;gbkey=Prot;product=3A;ID=id-AAR98503.1:1438..1526
+AY426531.1 Genbank CDS 5311 5376 . + . Name=3B;gbkey=Prot;product=3B;ID=id-AAR98503.1:1527..1548
+AY426531.1 Genbank CDS 5377 5925 . + . Name=3C;gbkey=Prot;product=3C;ID=id-AAR98503.1:1549..1731
+AY426531.1 Genbank CDS 5926 7296 . + . Name=3D;gbkey=Prot;product=3D;ID=id-AAR98503.1:1732..2188
