Commit 6857e59

feat(qc): recombination: A: Weighted Threshold
## Recombination Detection: Strategy A - Weighted Threshold

### Scientific Motivation

Recombinant viruses inherit genetic material from multiple parental lineages, resulting in a distinctive pattern of private mutations when compared to a single reference sequence. The key insight is that different mutation types carry different diagnostic weight:

- Unlabeled mutations: Novel mutations not associated with known lineages
- Labeled mutations: Mutations characteristic of specific viral lineages
- Reversions: Positions where the sample reverts to the ancestral state

Reversions are particularly significant because they often indicate that a genomic region derives from a different parental lineage than the reference. When a recombinant inherits a segment from a lineage closer to the ancestral sequence, positions that mutated in the reference lineage will appear as reversions. By weighting reversions more heavily, this strategy increases sensitivity to this recombination signature.

### Mechanism

The weighted threshold strategy calculates a weighted sum of private mutation counts by type:

    weightedCount = unlabeled * weightUnlabeled + labeled * weightLabeled + reversions * weightReversion

When weightedCount exceeds the configured threshold, the strategy computes:

    excess = max(0, weightedCount - threshold)
    strategyScore = excess * weight / threshold

The final QC score is:

    strategyScore * scoreWeight

Default weights (1.0, 1.0, 2.0) reflect that reversions are more diagnostic of recombination than other private mutations. Dataset maintainers can tune these weights based on the evolutionary dynamics of their specific pathogen.
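The scoring arithmetic described under Mechanism can be sketched in a few lines of Python. This is a minimal illustration, not the Rust implementation from this commit: the `WeightedThresholdConfig` class and both function names are hypothetical, the per-type weights use the stated defaults, and the `threshold` of 20 and `score_weight` of 100 are assumed example values.

```python
from dataclasses import dataclass


@dataclass
class WeightedThresholdConfig:
    # Hypothetical container; field names mirror the symbols in the formulas.
    threshold: float = 20.0        # assumed example value
    weight: float = 1.0            # scaling factor applied to the excess
    weight_unlabeled: float = 1.0  # default per-type weights (1.0, 1.0, 2.0)
    weight_labeled: float = 1.0
    weight_reversion: float = 2.0


def weighted_count(unlabeled, labeled, reversions, cfg):
    """Weighted sum of private mutation counts by type."""
    return (
        unlabeled * cfg.weight_unlabeled
        + labeled * cfg.weight_labeled
        + reversions * cfg.weight_reversion
    )


def qc_score(unlabeled, labeled, reversions, cfg, score_weight=100.0):
    """Final QC score: zero at or below the threshold, linear in the excess above it."""
    count = weighted_count(unlabeled, labeled, reversions, cfg)
    excess = max(0.0, count - cfg.threshold)
    strategy_score = excess * cfg.weight / cfg.threshold
    return strategy_score * score_weight


cfg = WeightedThresholdConfig()
print(qc_score(10, 5, 5, cfg))  # weighted count 25, excess 5
print(qc_score(5, 5, 2, cfg))   # weighted count 14, below threshold
```

Under these assumed values, a sample with 10 unlabeled mutations, 5 labeled mutations, and 5 reversions has a weighted count of 25, an excess of 5, a strategy score of 0.25, and a QC score of 25. A sample with a weighted count of 14 stays below the threshold and scores 0.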
### Configuration

In pathogen.json, configure under qc.recombinants:

    "qc": {
      "recombinants": {
        "enabled": true,
        "scoreWeight": 100,
        "weightedThreshold": {
          "enabled": true,
          "threshold": 20,
          "weight": 1.0,
          "weightUnlabeled": 1.0,
          "weightLabeled": 1.0,
          "weightReversion": 2.0
        }
      }
    }

Parameters:

- enabled: Master switch for recombinant detection
- scoreWeight: Multiplier for the rule's contribution to overall QC score
- weightedThreshold.enabled: Enable this specific strategy
- weightedThreshold.threshold: Weighted count above which recombinant is flagged
- weightedThreshold.weight: Scaling factor for excess calculation
- weightedThreshold.weight*: Per-mutation-type weights

### Advantages

- Simple, interpretable algorithm with clear biological rationale
- Configurable weights allow tuning for different pathogens
- Low computational overhead - O(1) calculation from pre-computed counts
- Backward compatible with existing datasets via legacy fallback
- Reversion weighting captures a key recombination signature

### Limitations

- Does not consider spatial distribution of mutations along the genome
- Cannot distinguish recombination from convergent evolution
- Requires careful threshold tuning per pathogen to avoid false positives
- Single threshold may not suit all lineage combinations
- No direct evidence of breakpoint locations

### Comparison to Other Strategies

Strategy A is the simplest approach, using only mutation counts without positional information.
Other strategies extend detection capabilities:

- B (Spatial Uniformity): Uses coefficient of variation to detect clustered mutations - recombinants often show mutations concentrated in segments from the divergent parent
- C (Cluster Gaps): Analyzes gaps between SNP clusters to identify potential breakpoint regions
- D (Reversion Clustering): Looks for spatially clustered reversions, which indicate a segment from a more ancestral lineage
- E (Multi-Ancestor): Compares to multiple reference sequences to find regions matching different ancestors
- F (Label Switching): Detects when different genome regions show mutations characteristic of different lineages

Strategy A serves as a baseline detector. For comprehensive recombinant identification, combine with spatial strategies (B, C, D) when available.

### Implementation Summary

Files modified:

- packages/nextclade/src/qc/qc_config.rs - Strategy config structs with defaults and backward compatibility for legacy mutationsThreshold
- packages/nextclade/src/qc/qc_rule_recombinants.rs - Main rule implementation with weighted count calculation and scoring
- packages/nextclade/src/qc/qc_recomb_utils.rs - Shared utilities for position clustering and spatial analysis (foundation for strategies B-D)
- packages/nextclade/src/qc/mod.rs - Module exports
- packages/nextclade/src/qc/qc_run.rs - Integration into QC pipeline
- packages/nextclade-web/src/helpers/formatQCRecombinants.ts - Web UI message formatting with weighted count details
- packages/nextclade-web/src/components/Results/* - UI integration
- packages/nextclade-schemas/*.json,yaml - Updated JSON schemas

Test dataset added:

- data/recomb/enpen/enterovirus/ev-d68/ - Enterovirus D68 dataset with recombinants rule enabled for testing the weighted threshold strategy

### Future Work

- Add unit tests for edge cases (zero mutations, extreme weights)
- Implement strategies B-F for comprehensive recombination detection
- Add visualization of mutation types in web UI results table
- Consider adaptive thresholds based on genome length and diversity
- Explore machine learning approaches trained on known recombinants

Co-Authored-By: Claude <noreply@anthropic.com>
1 parent 901e78d commit 6857e59

26 files changed, with 27064 additions and 7 deletions.

.gitignore

Lines changed: 2 additions & 1 deletion
@@ -4,7 +4,8 @@
 /build/
 /data_dev*/
 /data_local*
-/data/
+/data/*
+!data/recomb/
 /docs/build/
 /e2e/cli/snapshots/
 /e2e/cli/tmp/

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+## 2025-12-10T13:21:04Z
+
+- Update alignment parameters in pathogen.json:
+  - Fix gap extension penalty
+  - Enable reverse-complement handling
+- Recompute tree topology (ML tree rerun)
+- Regenerate mutation labels for all clades
+- Update reference example sequences
+
+## 2025-11-20T19:02:04Z
+
+Add citation information to README.md
+
+## 2025-11-19T20:40:14Z
+
+Initial release of an Enterovirus D68 dataset for lineage classification!
+
+Read more about Nextclade datasets in the documentation: https://docs.nextstrain.org/projects/nextclade/en/stable/user/datasets.html

Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
+# Enterovirus D68 dataset with reference Fermon
+
+| Key | Value |
+|----------------------|-----------------------------------------------------------------------|
+| authors | [Nadia Neuner-Jehle](https://eve-lab.org/people/nadia-neuner-jehle/), [Alejandra González-Sánchez](https://www.vallhebron.com/en/professionals/alejandra-gonzalez-sanchez), [Emma B. Hodcroft](https://eve-lab.org/people/emma-hodcroft/), [ENPEN](https://escv.eu/european-non-polio-enterovirus-network-enpen/) |
+| name | Enterovirus D68 |
+| reference | [AY426531.1](https://www.ncbi.nlm.nih.gov/nuccore/AY426531.1) |
+| workflow | https://github.com/enterovirus-phylo/nextclade_d68 |
+| path | `enpen/enterovirus/ev-d68` |
+| clade definitions | A–C (D) |
+
+## Citation
+
+If you use this dataset in your research, please cite:
+
+> Neuner-Jehle, N., González Sánchez, A., Hodcroft, E. B., & European Non-Polio Enterovirus Network (ENPEN). (2025). *enterovirus-phylo/nextclade_d68: Enterovirus D68 Nextclade Dataset v1.0.0* (v1.0.0--2025-11-18). Zenodo. https://doi.org/10.5281/zenodo.17642338
+
+[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.17642338.svg)](https://doi.org/10.5281/zenodo.17642338)
+
+## Scope of this dataset
+
+Based on full-genome sequences, this dataset uses the **Fermon reference sequence** ([AY426531.1](https://www.ncbi.nlm.nih.gov/nuccore/AY426531.1)), originally isolated in 1962. It serves as the basis for quality control, clade assignment, and mutation calling across global EV-D68 diversity.
+
+*Note: The Fermon reference differs substantially from currently circulating strains.* This is common for enterovirus datasets, in contrast to some other virus datasets (e.g., seasonal influenza), where the reference is updated more frequently to reflect recent lineages.
+
+To address this, the dataset is *rooted* on a Static Inferred Ancestor — a phylogenetically reconstructed ancestral sequence near the tree root. This provides a stable reference point that can be used, optionally, as an alternative for mutation calling.
+
+## Features
+
+This dataset supports:
+
+- Assignment of subgenotypes
+- Phylogenetic placement
+- Sequence quality control (QC)
+
+## Subgenogroups of Enterovirus D68
+
+Clade designations follow the global diversity of EV-D68: A (A1–A2/D), B (B1–B3), and C. The label "pre-ABC" indicates old, basal lineages that are no longer circulating. Sequences labeled "pre-ABC" or "unassigned" may indicate sequencing or assembly issues and should be assessed carefully.
+
+These designations are based on the phylogenetic structure and mutations, and are widely used in molecular epidemiology, similar to subgenotype systems for other enteroviruses. Unlike influenza (H1N1, H3N2) or SARS-CoV-2, there is no universal, standardized global lineage nomenclature for enteroviruses. Naming follows conventions from published studies and surveillance practices.
+
+## Reference types
+
+This dataset includes several reference points used in analyses:
+- *Reference:* RefSeq or similarly established reference sequence. Here Fermon.
+
+- *Parent:* The nearest ancestral node of a sample in the tree, used to infer branch-specific mutations.
+
+- *Clade founder:* The inferred ancestral node defining a clade (e.g., A2, B3). Mutations "since clade founder" describe changes that define that clade.
+
+- *Static Inferred Ancestor:* Reconstructed ancestral sequence inferred with an outgroup, representing the likely founder of EV-D68. Serves as a stable reference.
+
+- *Tree root:* Corresponds to the root of the tree, it may change in future updates as more data become available.
+
+All references use the coordinate system of the Fermon sequence.
+
+## Issues & Contact
+- For questions or suggestions, please [open an issue](https://github.com/enterovirus-phylo/nextclade_d68/issues) or email: eve-group[at]swisstph.ch
+
+## What is a Nextclade dataset?
+
+A Nextclade dataset includes the reference sequence, genome annotations, tree, clade definitions, and QC rules. Learn more in the [Nextclade documentation](https://docs.nextstrain.org/projects/nextclade/en/stable/user/datasets.html).

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+##gff-version 3
+#!gff-spec-version 1.21
+#!processor NCBI annotwriter
+##sequence-region AY426531.2 1 7367
+##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=42789
+# seqname source feature start end score strand frame attribute
+AY426531.1 Genbank region 1 7367 . + . ID=AY426531.1:1..7367;Dbxref=taxon:42789;country=USA;gb-acronym=EV-D68;gbkey=Src;mol_type=genomic RNA;note=prototype strain of Enterovirus 68;old-name=Enterovirus 68;strain=Fermon
+AY426531.1 Genbank CDS 733 939 . + . Name=VP4;gbkey=Prot;product=VP4;ID=id-AAR98503.1:1..69
+AY426531.1 Genbank CDS 940 1683 . + . Name=VP2;gbkey=Prot;product=VP2;ID=id-AAR98503.1:70..317
+AY426531.1 Genbank CDS 1684 2388 . + . Name=VP3;gbkey=Prot;product=VP3;ID=id-AAR98503.1:318..552
+AY426531.1 Genbank CDS 2389 3315 . + . Name=VP1;gbkey=Prot;product=VP1;ID=id-AAR98503.1:553..861
+AY426531.1 Genbank CDS 3316 3756 . + . Name=2A;gbkey=Prot;product=2A;ID=id-AAR98503.1:862..1008
+AY426531.1 Genbank CDS 3757 4053 . + . Name=2B;gbkey=Prot;product=2B;ID=id-AAR98503.1:1009..1107
+AY426531.1 Genbank CDS 4054 5043 . + . Name=2C;gbkey=Prot;product=2C;ID=id-AAR98503.1:1108..1437
+AY426531.1 Genbank CDS 5044 5310 . + . Name=3A;gbkey=Prot;product=3A;ID=id-AAR98503.1:1438..1526
+AY426531.1 Genbank CDS 5311 5376 . + . Name=3B;gbkey=Prot;product=3B;ID=id-AAR98503.1:1527..1548
+AY426531.1 Genbank CDS 5377 5925 . + . Name=3C;gbkey=Prot;product=3C;ID=id-AAR98503.1:1549..1731
+AY426531.1 Genbank CDS 5926 7296 . + . Name=3D;gbkey=Prot;product=3D;ID=id-AAR98503.1:1732..2188