Simulate ancestral alignments under the TKF and substitutions models#140
Simulate ancestral alignments under the TKF and substitutions models#140MattesMrzik wants to merge 53 commits intoacg-team:developfrom
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## develop #140 +/- ##
===========================================
+ Coverage 96.32% 97.06% +0.73%
===========================================
Files 32 49 +17
Lines 4355 6130 +1775
===========================================
+ Hits 4195 5950 +1755
- Misses 160 180 +20 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
- Replace local WILDCARD_CHAR constant (b'N') with the shared AMB_CHAR from crate::alphabets for consistency - Move TKF92MSASimulationResult struct definition after the link/fate enums so related types are grouped more logically - Add clarifying doc comments on Seqs type alias and the indel_model field
- Remove the local tkf92_fixed test helper and naive_merge utility, which duplicated indel cost logic already encapsulated in TKF92FixedIndelCostBuilder - Update the simulate test to build and invoke the cost via TKF92FixedIndelCostBuilder, keeping the assertion equivalent - Clean up unused imports (AncestralAlignment, h1, log_i1, log_n1, get_mapping_for_any_node)
- Add FragmentSampler trait with sample_fragment_length<R: Rng> -> (usize, f64), returning the sampled length and its log-probability - Implement FragmentSampler for TKF91IndelModel (always returns length 1, log-prob 0.0) - Implement FragmentSampler for TKF92IndelModel (geometric draw using r parameter) - Rename TKF92MSASimulator -> TKFMSASimulator<T: TKFModel + FragmentSampler, Q, R>, replacing the hardcoded TKF92IndelModel field with the generic T - Delegate sample_fragment_length on the simulator to the trait, eliminating all model-specific logic from the simulator itself - Rename simulate test to tkf92_simulate; add tkf91_simulate test that asserts every fragment has length 1 and logl matches TKF91IndelCostBuilder
- Replace root_residue_count accumulator with fragmentation.last().unwrap_or(&0) + current_link.length - Drop the root_residue_count: usize parameter from append_link_to_msa and links_to_msa - Boundary computation is now self-contained: each entry is the previous boundary plus the current fragment's length, which is correct because fragmentation is a monotonically increasing cumulative sequence
- Extract dispatch_immortal_children to handle immortal link traversal - Extract process_link to handle non-immortal link MSA writes - Extract apply_fate to apply a single branch fate (homolog/deletion/non-homolog) - Add explicit 'a lifetime annotations to process_link, apply_fate and dispatch_immortal_children so that references pushed onto tree_stack and insertions are tied to the lifetime of the source TKFLink data
- Rename TKF92MSASimulationResult to TKFMSASimulationResult<AA> - Make simulate_msa() return typed struct instead of tuple - Add msa_to_alignment_with_non_emitting_cols() method - Fix typos: didnt→didn't, comulative→cumulative, bc→because - Rename test tkf_homlog_probs to tkf_homolog_probs
- Move trait bounds into a clause for readability in
- it seems to be simpler to write simulation logl match in tests only
- Add SubstitutionSimulator and SubstitutionSimulatorBuilder in src/substitution_models/simulate.rs - Precompute transition matrices P(edge) once and reuse for all sites - Fixed alignment length provided via builder (set at construction) - Root sampled from model equilibrium freqs; traverse preorder to sample children - Export new module in src/substitution_models/mod.rs - Add unit tests for correctness and reproducibility
Replace EvoModel with SubstModel<Q> in substitution simulator - Make builder and simulator generic over Q: QMatrix and store SubstModel<Q> - Reimplement p(time) locally when precomputing P = exp(Q * t) to avoid EvoModel dependency - Access frequencies and alphabet via the underlying QMatrix - Update tests to construct SubstModel<JC69> (no API changes to tests) Files changed: - src/substitution_models/simulate.rs Rationale: keeps simulator tied to concrete substitution model representation and avoids using the dynamic EvoModel trait where unnecessary. This simplifies access to Q and its frequencies while preserving behavior and reproducibility.
Refactor generics in substitution simulator to use where clauses - Move generic bounds into clauses for builder, simulator and impl blocks - Improves readability by reducing line length - No functional changes File changed: - src/substitution_models/simulate.rs
- Replace JC69 with GTR in substitution simulator private tests - Provide concrete GTR frequencies and rate parameters for deterministic behavior - Keep existing tree! macro usage unchanged This makes tests use a realistic, parameterized DNA model rather than JC69, improving test coverage for general substitution models.
- Rename substitution simulator module file to - Update module export in to use - Preserve tests and functionality while switching filename to better reflect purpose
d4b2a1d to
1e31b3a
Compare
Add default method to trait and support methods in MASA. - Detect columns where all leaf maps are and remove them - Update leaf and ancestral mappings and sequences accordingly - Add helpers on MASA to efficiently update internal state - Note: update deferred (see issue acg-team#150)
- Add complex test `into_alignment_masa_to_msa` in `src/alignment/tests.rs
- max_insertion_length refers to number of fragments not chars
- comment why we use push_front for insertions
There was a problem hiding this comment.
Pull request overview
This PR adds end-to-end alignment simulation support by introducing an AlignmentSimulation trait and implementing simulators for (1) TKF91/TKF92 indel-only ancestral MSAs, (2) substitution-only ancestral MSAs, and (3) combined TKF indels + substitutions. It also adds utilities to clean up ancestral alignments and convert ancestral MSAs into leaf-only alignments.
Changes:
- Added TKF indel MSA simulation (ancestral homology-path alignments) and a combined TKF+substitution MSA simulator.
- Added a standalone substitution MSA simulator and exposed it via
substitution_modelsmodule exports. - Added ancestral-alignment utilities (
remove_extinct_columns,into_alignment) plus tests and minor docs/cleanup.
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| phylo/src/tree/mod.rs | Adds small docstrings for node id/index helpers used in tests/utilities. |
| phylo/src/tkf_model/tkf92.rs | Adds new(...) constructor and fragment-length sampling for TKF92 via geometric distribution. |
| phylo/src/tkf_model/tkf91.rs | Adds new(...) constructor and fragment-length sampling (always length 1) for TKF91. |
| phylo/src/tkf_model/sim_tkf_msa.rs | New: combined TKF indel + substitution simulator producing character-based MSAs. |
| phylo/src/tkf_model/sim_tkf_indel_msa.rs | New: TKF indel-only ancestral MSA simulator with fragmentation/log-likelihood tracking and tests. |
| phylo/src/tkf_model/reestimate/tests.rs | Removes debug println! noise from tests. |
| phylo/src/tkf_model/reestimate/mod.rs | Improves docs around extinct columns; expands an assertion message for debugging. |
| phylo/src/tkf_model/mod.rs | Exposes the new TKF simulation modules. |
| phylo/src/substitution_models/simulate_msa.rs | New: substitution-only ancestral MSA simulator and tests. |
| phylo/src/substitution_models/mod.rs | Exposes the new substitution simulator module. |
| phylo/src/parsimony_presence_absence/tests.rs | Updates test fixtures to use record_wo_desc and adjusts internal node IDs. |
| phylo/src/lib.rs | Adds a constant URL for issue reporting used in panic messages. |
| phylo/src/error.rs | Adds Error::AlignmentSimulation for simulator construction/validation failures. |
| phylo/src/asr/mod.rs | Improves intra-doc links in ASR trait docs. |
| phylo/src/alignment/tests.rs | Adds tests for remove_extinct_columns and into_alignment. |
| phylo/src/alignment/mod.rs | Adds extinct-column warnings, remove_extinct_columns, and a default into_alignment for ancestral alignments; introduces AlignmentSimulation trait. |
| phylo/Cargo.toml | Adds rand_distr dependency for geometric sampling. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| let prob = (1.0 - prob_of_success).powi(choice as i32) * prob_of_success; | ||
| *self.cumulative_logl.borrow_mut() += prob.ln(); |
There was a problem hiding this comment.
choice from Geometric::sample can exceed i32::MAX; casting to i32 for powi(choice as i32) can overflow and produce an incorrect probability/log-likelihood. Also computing prob in probability space can underflow for large choice. Prefer accumulating the log-probability directly (e.g., log_prob = (choice as f64)*ln(1-p) + ln(p)) and avoid the i32 cast entirely.
| let prob = (1.0 - prob_of_success).powi(choice as i32) * prob_of_success; | |
| *self.cumulative_logl.borrow_mut() += prob.ln(); | |
| let log_prob = | |
| (choice as f64) * (1.0 - prob_of_success).ln() + prob_of_success.ln(); | |
| *self.cumulative_logl.borrow_mut() += log_prob; |
| for (col_idx, survives) in surviving_cols.iter().enumerate() { | ||
| if !survives { | ||
| warn!( | ||
| "Column {} goes extinct in all leaf sequences. \ | ||
| Consider calling `remove_extinct_columns` on the alignment.", | ||
| col_idx | ||
| ); | ||
| } |
There was a problem hiding this comment.
This logs a warning once per extinct column. For large MSAs this can spam logs and become expensive. Consider aggregating (e.g., count extinct columns and warn once, or warn only for the first N columns) while still pointing users to remove_extinct_columns().
| for (col_idx, survives) in surviving_cols.iter().enumerate() { | |
| if !survives { | |
| warn!( | |
| "Column {} goes extinct in all leaf sequences. \ | |
| Consider calling `remove_extinct_columns` on the alignment.", | |
| col_idx | |
| ); | |
| } | |
| let extinct_col_count = surviving_cols.iter().filter(|survives| !**survives).count(); | |
| if extinct_col_count > 0 { | |
| warn!( | |
| "{} column(s) go extinct in all leaf sequences. \ | |
| Consider calling `remove_extinct_columns` on the alignment.", | |
| extinct_col_count | |
| ); |
| } | ||
| .unwrap_or_else(|e| { | ||
| panic!( | ||
| "Updating ancestral record failed. \ |
There was a problem hiding this comment.
The panic message says "Updating ancestral record failed" even when updating a leaf record (Leaf(_) => self.leaf_seqs.update_record(...)). Consider using a neutral message (e.g., "Updating record failed") so the context is accurate.
| "Updating ancestral record failed. \ | |
| "Updating record failed. \ |
| ) -> Self { | ||
| let indel_sim = | ||
| TKFIndelMSASimulator::new(indel_model, tree.clone(), rng.clone(), max_insertion_length); | ||
| let dummy_len = 1; | ||
| let subst_sim = SubstitutionSimulator::new(subst_model, tree, rng, dummy_len).unwrap(); | ||
| Self { | ||
| indel_sim, | ||
| subst_sim, | ||
| } |
There was a problem hiding this comment.
SubstitutionSimulator::new is force-unwrapped with a dummy alignment length. This hides construction errors and makes the length invariant unclear, especially since simulate_ancestral_alignment_with_length(aln_len) can be called with aln_len == 0 (indel simulation can yield 0 columns). Consider redesigning so the substitution simulator can be constructed without a placeholder length, and/or handle the 0-length case explicitly instead of relying on unwrap().
| ) -> Self { | |
| let indel_sim = | |
| TKFIndelMSASimulator::new(indel_model, tree.clone(), rng.clone(), max_insertion_length); | |
| let dummy_len = 1; | |
| let subst_sim = SubstitutionSimulator::new(subst_model, tree, rng, dummy_len).unwrap(); | |
| Self { | |
| indel_sim, | |
| subst_sim, | |
| } | |
| ) -> Result<Self, String> { | |
| let indel_sim = | |
| TKFIndelMSASimulator::new(indel_model, tree.clone(), rng.clone(), max_insertion_length); | |
| let dummy_len = 1; | |
| let subst_sim = SubstitutionSimulator::new(subst_model, tree, rng, dummy_len) | |
| .map_err(|err| { | |
| format!( | |
| "failed to construct SubstitutionSimulator with placeholder alignment length {}: {:?}", | |
| dummy_len, err | |
| ) | |
| })?; | |
| Ok(Self { | |
| indel_sim, | |
| subst_sim, | |
| }) |
| "No valid assignments found for block_id = {block_id}, due to -inf logl" | ||
| "No valid assignments found for block_id = {block_id}, due to -inf logl, \ | ||
| or no possible assignments or possible del_or_not or no max over previous. \ | ||
| Current alignemnt = \n{}, current tree = \n{}, current v2 = {}", |
There was a problem hiding this comment.
Typo in assertion message: "alignemnt" → "alignment".
| Current alignemnt = \n{}, current tree = \n{}, current v2 = {}", | |
| Current alignment = \n{}, current tree = \n{}, current v2 = {}", |
This PR implements an
AlignmentSimulationtrait which simulates an ancestral alignment and implements it for the TKF91 and TKF92 models and substitution models. It also introduces a utility for removing non surviving homology paths and ancestral alignment transformation into standard leaf-only alignments.Key Changes