Simulate ancestral alignments under the TKF and substitutions models #140

Open
MattesMrzik wants to merge 53 commits into acg-team:develop from MattesMrzik:simulate-tkf92-msa

Conversation

Collaborator

@MattesMrzik MattesMrzik commented Jan 23, 2026

This PR implements an AlignmentSimulation trait, which simulates an ancestral alignment, and implements it for the TKF91 and TKF92 indel models and for substitution models. It also introduces utilities for removing non-surviving homology paths and for transforming ancestral alignments into standard leaf-only alignments.
Key Changes

  • TKF Indel MSA Simulation
  • Full Evolutionary Simulation: Added TKFMSASimulator which combines the indel process with a substitution model to produce complete, character-based alignments.
  • Substitution Simulator: Introduced a standalone SubstitutionSimulator for generating alignments with substitutions only.
  • Ancestral Alignment Enhancements:
    • Added remove_extinct_columns to clean up alignments where ancestral characters do not survive to any leaves.
    • Implemented into_alignment to transform ancestral alignments (containing internal nodes) into standard alignments.
  • Tests: Added tests covering mutation constraints, compliance with Dollo’s constraint, and fragment capping logic.
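
To make the extinct-column cleanup described above more concrete, here is a minimal sketch. It does not use the crate's alignment types: the `Row` alias (one `Vec<Option<u8>>` per sequence, `None` for a gap) and the function names `surviving_columns` and `drop_extinct_columns` are assumptions made purely for illustration, and every row is assumed to have exactly `n_cols` entries.

```rust
/// One aligned row: `Some(residue)` or `None` for a gap (hypothetical representation).
type Row = Vec<Option<u8>>;

/// For every column, reports whether at least one leaf row holds a residue.
fn surviving_columns(leaf_rows: &[Row], n_cols: usize) -> Vec<bool> {
    (0..n_cols)
        .map(|col| leaf_rows.iter().any(|row| row[col].is_some()))
        .collect()
}

/// Drops every column that is a gap in all leaf rows, from leaf and ancestral rows alike.
/// Returns the number of columns that remain.
fn drop_extinct_columns(
    leaf_rows: &mut [Row],
    ancestral_rows: &mut [Row],
    n_cols: usize,
) -> usize {
    let keep = surviving_columns(leaf_rows, n_cols);
    for row in leaf_rows.iter_mut().chain(ancestral_rows.iter_mut()) {
        // `retain` visits elements in order, so a running index tracks the column.
        let mut col = 0;
        row.retain(|_| {
            let keep_this = keep[col];
            col += 1;
            keep_this
        });
    }
    keep.iter().filter(|k| **k).count()
}
```

A column whose only residues sit in ancestral rows is removed, because survival is judged on the leaf rows alone. The real implementation also has to update the leaf and ancestral mappings, which this sketch leaves out.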

codecov bot commented Jan 23, 2026

Codecov Report

❌ Patch coverage is 94.89796% with 30 lines in your changes missing coverage. Please review.
✅ Project coverage is 97.06%. Comparing base (527040d) to head (9e8e678).
⚠️ Report is 30 commits behind head on develop.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| phylo/src/substitution_models/simulate_msa.rs | 84.81% | 12 Missing ⚠️ |
| phylo/src/tkf_model/tkf91.rs | 33.33% | 6 Missing ⚠️ |
| phylo/src/alignment/mod.rs | 94.44% | 4 Missing ⚠️ |
| phylo/src/tkf_model/sim_tkf_indel_msa.rs | 98.89% | 4 Missing ⚠️ |
| phylo/src/tkf_model/sim_tkf_msa.rs | 92.15% | 4 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #140      +/-   ##
===========================================
+ Coverage    96.32%   97.06%   +0.73%     
===========================================
  Files           32       49      +17     
  Lines         4355     6130    +1775     
===========================================
+ Hits          4195     5950    +1755     
- Misses         160      180      +20     

- Replace local WILDCARD_CHAR constant (b'N') with the shared
  AMB_CHAR from crate::alphabets for consistency
- Move TKF92MSASimulationResult struct definition after the link/fate
  enums so related types are grouped more logically
- Add clarifying doc comments on Seqs type alias and the indel_model
  field
- Remove the local tkf92_fixed test helper and naive_merge utility,
  which duplicated indel cost logic already encapsulated in
  TKF92FixedIndelCostBuilder
- Update the simulate test to build and invoke the cost via
  TKF92FixedIndelCostBuilder, keeping the assertion equivalent
- Clean up unused imports (AncestralAlignment, h1, log_i1, log_n1,
  get_mapping_for_any_node)
- Add FragmentSampler trait with sample_fragment_length<R: Rng> -> (usize, f64),
  returning the sampled length and its log-probability
- Implement FragmentSampler for TKF91IndelModel (always returns length 1, log-prob 0.0)
- Implement FragmentSampler for TKF92IndelModel (geometric draw using r parameter)
- Rename TKF92MSASimulator -> TKFMSASimulator<T: TKFModel + FragmentSampler, Q, R>,
  replacing the hardcoded TKF92IndelModel field with the generic T
- Delegate sample_fragment_length on the simulator to the trait, eliminating
  all model-specific logic from the simulator itself
- Rename simulate test to tkf92_simulate; add tkf91_simulate test that asserts
  every fragment has length 1 and logl matches TKF91IndelCostBuilder
- Replace root_residue_count accumulator with fragmentation.last().unwrap_or(&0) + current_link.length
- Drop the root_residue_count: usize parameter from append_link_to_msa and links_to_msa
- Boundary computation is now self-contained: each entry is the previous boundary plus the current fragment's length, which is correct because fragmentation is a monotonically increasing cumulative sequence
- Extract dispatch_immortal_children to handle immortal link traversal
- Extract process_link to handle non-immortal link MSA writes
- Extract apply_fate to apply a single branch fate (homolog/deletion/non-homolog)
- Add explicit 'a lifetime annotations to process_link, apply_fate and
  dispatch_immortal_children so that references pushed onto tree_stack
  and insertions are tied to the lifetime of the source TKFLink data
- Rename TKF92MSASimulationResult to TKFMSASimulationResult<AA>
- Make simulate_msa() return typed struct instead of tuple
- Add msa_to_alignment_with_non_emitting_cols() method
- Fix typos: didnt→didn't, comulative→cumulative, bc→because
- Rename test tkf_homlog_probs to tkf_homolog_probs
- Move trait bounds into a where clause for readability in
- Check that the simulated logl matches in tests only, which seems simpler
- Add SubstitutionSimulator and SubstitutionSimulatorBuilder in src/substitution_models/simulate.rs
- Precompute transition matrices P(edge) once and reuse for all sites
- Fixed alignment length provided via builder (set at construction)
- Root sampled from model equilibrium freqs; traverse preorder to sample children
- Export new module in src/substitution_models/mod.rs
- Add unit tests for correctness and reproducibility
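
The substitution simulation steps listed above (per-edge transition matrices computed once, root drawn from the equilibrium frequencies, children sampled in preorder) can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: it swaps in JC69, whose P(t) has a closed form, instead of a general exp(Q * t); the tree is a hypothetical preorder list of `(parent index, branch length)` pairs; and `Lcg` is a toy deterministic generator standing in for a real RNG.

```rust
/// Toy 64-bit LCG so the sketch needs no external crates (an assumption, not the PR's RNG).
struct Lcg(u64);

impl Lcg {
    /// Returns a pseudo-random number in [0, 1).
    fn next_f64(&mut self) -> f64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// JC69 transition matrix in closed form; `t` is in expected substitutions per site.
fn jc69_p(t: f64) -> [[f64; 4]; 4] {
    let e = (-4.0 * t / 3.0).exp();
    let (same, diff) = (0.25 + 0.75 * e, 0.25 - 0.25 * e);
    let mut p = [[diff; 4]; 4];
    for i in 0..4 {
        p[i][i] = same;
    }
    p
}

/// Draws an index from a discrete distribution whose probabilities sum to 1.
fn sample_index(probs: &[f64; 4], rng: &mut Lcg) -> usize {
    let u = rng.next_f64();
    let mut acc = 0.0;
    for (i, p) in probs.iter().enumerate() {
        acc += p;
        if u < acc {
            return i;
        }
    }
    3 // guards against rounding when `acc` ends slightly below 1
}

/// Simulates `n_sites` characters (encoded 0..4) for every node.
/// `parents[i]` is `None` for the root and `Some((parent, branch_length))` otherwise;
/// nodes must be listed in preorder so that each parent precedes its children.
fn simulate(parents: &[Option<(usize, f64)>], n_sites: usize, rng: &mut Lcg) -> Vec<Vec<usize>> {
    // One transition matrix per edge, computed once and reused for all sites.
    let p_edge: Vec<Option<[[f64; 4]; 4]>> = parents
        .iter()
        .map(|edge| match edge {
            Some((_, t)) => Some(jc69_p(*t)),
            None => None,
        })
        .collect();
    let mut seqs: Vec<Vec<usize>> = Vec::with_capacity(parents.len());
    for (node, edge) in parents.iter().enumerate() {
        let seq: Vec<usize> = match (edge, &p_edge[node]) {
            // Child: sample each site conditional on the parent's character.
            (Some((parent, _)), Some(p)) => (0..n_sites)
                .map(|s| sample_index(&p[seqs[*parent][s]], rng))
                .collect(),
            // Root: sample from the equilibrium frequencies (uniform under JC69).
            _ => (0..n_sites).map(|_| sample_index(&[0.25; 4], rng)).collect(),
        };
        seqs.push(seq);
    }
    seqs
}
```

Calling `simulate` twice with generators built from the same seed yields identical sequences, which is the kind of reproducibility the unit tests mentioned above check for.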
Replace EvoModel with SubstModel<Q> in substitution simulator

- Make builder and simulator generic over Q: QMatrix and store SubstModel<Q>
- Reimplement p(time) locally when precomputing P = exp(Q * t) to avoid EvoModel dependency
- Access frequencies and alphabet via the underlying QMatrix
- Update tests to construct SubstModel<JC69> (no API changes to tests)

Files changed:
- src/substitution_models/simulate.rs

Rationale: keeps simulator tied to concrete substitution model representation and avoids using the dynamic EvoModel trait where unnecessary. This simplifies access to Q and its frequencies while preserving behavior and reproducibility.
Refactor generics in substitution simulator to use where clauses

- Move generic bounds into where clauses for builder, simulator and impl blocks
- Improves readability by reducing line length
- No functional changes

File changed:
- src/substitution_models/simulate.rs
- Replace JC69 with GTR in substitution simulator private tests
- Provide concrete GTR frequencies and rate parameters for deterministic behavior
- Keep existing tree! macro usage unchanged

This makes tests use a realistic, parameterized DNA model rather than JC69, improving test coverage for general substitution models.
- Rename substitution simulator module file to
- Update module export in  to use
- Preserve tests and functionality while switching filename to better reflect purpose
Add default method  to  trait and support methods in MASA.

- Detect columns where all leaf maps are  and remove them
- Update leaf and ancestral mappings and sequences accordingly
- Add  helpers on MASA to efficiently update internal state
- Note:  update deferred (see issue acg-team#150)
- Add complex test `into_alignment_masa_to_msa` in `src/alignment/tests.rs`
@MattesMrzik MattesMrzik marked this pull request as ready for review March 19, 2026 14:28
@MattesMrzik MattesMrzik changed the title Simulate ancestral alignment under the TKF92 indel model Simulate ancestral alignment under the TKF and substitutions models Mar 20, 2026
@MattesMrzik MattesMrzik changed the title Simulate ancestral alignment under the TKF and substitutions models Simulate ancestral alignments under the TKF and substitutions models Mar 20, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR adds end-to-end alignment simulation support by introducing an AlignmentSimulation trait and implementing simulators for (1) TKF91/TKF92 indel-only ancestral MSAs, (2) substitution-only ancestral MSAs, and (3) combined TKF indels + substitutions. It also adds utilities to clean up ancestral alignments and convert ancestral MSAs into leaf-only alignments.

Changes:

  • Added TKF indel MSA simulation (ancestral homology-path alignments) and a combined TKF+substitution MSA simulator.
  • Added a standalone substitution MSA simulator and exposed it via substitution_models module exports.
  • Added ancestral-alignment utilities (remove_extinct_columns, into_alignment) plus tests and minor docs/cleanup.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 5 comments.

Show a summary per file
| File | Description |
|---|---|
| phylo/src/tree/mod.rs | Adds small docstrings for node id/index helpers used in tests/utilities. |
| phylo/src/tkf_model/tkf92.rs | Adds new(...) constructor and fragment-length sampling for TKF92 via geometric distribution. |
| phylo/src/tkf_model/tkf91.rs | Adds new(...) constructor and fragment-length sampling (always length 1) for TKF91. |
| phylo/src/tkf_model/sim_tkf_msa.rs | New: combined TKF indel + substitution simulator producing character-based MSAs. |
| phylo/src/tkf_model/sim_tkf_indel_msa.rs | New: TKF indel-only ancestral MSA simulator with fragmentation/log-likelihood tracking and tests. |
| phylo/src/tkf_model/reestimate/tests.rs | Removes debug println! noise from tests. |
| phylo/src/tkf_model/reestimate/mod.rs | Improves docs around extinct columns; expands an assertion message for debugging. |
| phylo/src/tkf_model/mod.rs | Exposes the new TKF simulation modules. |
| phylo/src/substitution_models/simulate_msa.rs | New: substitution-only ancestral MSA simulator and tests. |
| phylo/src/substitution_models/mod.rs | Exposes the new substitution simulator module. |
| phylo/src/parsimony_presence_absence/tests.rs | Updates test fixtures to use record_wo_desc and adjusts internal node IDs. |
| phylo/src/lib.rs | Adds a constant URL for issue reporting used in panic messages. |
| phylo/src/error.rs | Adds Error::AlignmentSimulation for simulator construction/validation failures. |
| phylo/src/asr/mod.rs | Improves intra-doc links in ASR trait docs. |
| phylo/src/alignment/tests.rs | Adds tests for remove_extinct_columns and into_alignment. |
| phylo/src/alignment/mod.rs | Adds extinct-column warnings, remove_extinct_columns, and a default into_alignment for ancestral alignments; introduces AlignmentSimulation trait. |
| phylo/Cargo.toml | Adds rand_distr dependency for geometric sampling. |

Comment on lines +186 to +187
let prob = (1.0 - prob_of_success).powi(choice as i32) * prob_of_success;
*self.cumulative_logl.borrow_mut() += prob.ln();

Copilot AI Apr 2, 2026

choice from Geometric::sample can exceed i32::MAX; casting to i32 for powi(choice as i32) can overflow and produce an incorrect probability/log-likelihood. Also computing prob in probability space can underflow for large choice. Prefer accumulating the log-probability directly (e.g., log_prob = (choice as f64)*ln(1-p) + ln(p)) and avoid the i32 cast entirely.

Suggested change
- let prob = (1.0 - prob_of_success).powi(choice as i32) * prob_of_success;
- *self.cumulative_logl.borrow_mut() += prob.ln();
+ let log_prob =
+     (choice as f64) * (1.0 - prob_of_success).ln() + prob_of_success.ln();
+ *self.cumulative_logl.borrow_mut() += log_prob;
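
As a standalone illustration of the log-space computation suggested above, the following sketch evaluates the geometric log-probability ln((1 - p)^k * p) without an `i32` cast and without forming the probability itself. The function name `geometric_log_pmf` is hypothetical, `k` is assumed to count failures before the first success (matching the formula in the snippet under review), and the use of `ln_1p` plus the special case for `k == 0` are additions of this sketch rather than part of the suggestion.

```rust
/// Log-probability of `k` failures before the first success with success probability `p`,
/// i.e. ln((1 - p)^k * p), computed entirely in log space.
fn geometric_log_pmf(k: u64, p: f64) -> f64 {
    assert!(p > 0.0 && p <= 1.0, "p must lie in (0, 1]");
    if k == 0 {
        // Avoids 0 * ln(0) = NaN when p == 1.
        return p.ln();
    }
    // ln_1p(-p) equals ln(1 - p) and stays accurate for small p.
    (k as f64) * (-p).ln_1p() + p.ln()
}
```

Because `k` is only ever converted to `f64`, values above `i32::MAX` are handled, and the result stays finite where `(1.0 - p).powi(...)` would already have underflowed to zero.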

Copilot uses AI. Check for mistakes.
Comment on lines +134 to +141
for (col_idx, survives) in surviving_cols.iter().enumerate() {
    if !survives {
        warn!(
            "Column {} goes extinct in all leaf sequences. \
            Consider calling `remove_extinct_columns` on the alignment.",
            col_idx
        );
    }

Copilot AI Apr 2, 2026

This logs a warning once per extinct column. For large MSAs this can spam logs and become expensive. Consider aggregating (e.g., count extinct columns and warn once, or warn only for the first N columns) while still pointing users to remove_extinct_columns().

Suggested change
- for (col_idx, survives) in surviving_cols.iter().enumerate() {
-     if !survives {
-         warn!(
-             "Column {} goes extinct in all leaf sequences. \
-             Consider calling `remove_extinct_columns` on the alignment.",
-             col_idx
-         );
-     }
+ let extinct_col_count = surviving_cols.iter().filter(|survives| !**survives).count();
+ if extinct_col_count > 0 {
+     warn!(
+         "{} column(s) go extinct in all leaf sequences. \
+         Consider calling `remove_extinct_columns` on the alignment.",
+         extinct_col_count
+     );
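
A variation on the aggregation idea above, sketched as a pure function so it can be tested without a logger: it builds a single message that reports the count and lists only the first few extinct column indices. The name `extinct_columns_warning` and the `max_listed` parameter are hypothetical; a caller would pass the returned string to `warn!`.

```rust
/// Builds at most one warning message covering all extinct columns.
/// Returns `None` when every column survives in at least one leaf.
fn extinct_columns_warning(surviving_cols: &[bool], max_listed: usize) -> Option<String> {
    let extinct: Vec<usize> = surviving_cols
        .iter()
        .enumerate()
        .filter(|(_, survives)| !**survives)
        .map(|(idx, _)| idx)
        .collect();
    if extinct.is_empty() {
        return None;
    }
    // List only the first `max_listed` indices to keep the message short.
    let listed: Vec<String> = extinct
        .iter()
        .take(max_listed)
        .map(|idx| idx.to_string())
        .collect();
    let suffix = if extinct.len() > max_listed { ", ..." } else { "" };
    Some(format!(
        "{} column(s) go extinct in all leaf sequences (indices: {}{}). \
         Consider calling `remove_extinct_columns` on the alignment.",
        extinct.len(),
        listed.join(", "),
        suffix
    ))
}
```

This keeps the log output bounded regardless of alignment size while still telling the user where to start looking.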

}
.unwrap_or_else(|e| {
panic!(
"Updating ancestral record failed. \

Copilot AI Apr 2, 2026

The panic message says "Updating ancestral record failed" even when updating a leaf record (Leaf(_) => self.leaf_seqs.update_record(...)). Consider using a neutral message (e.g., "Updating record failed") so the context is accurate.

Suggested change
- "Updating ancestral record failed. \
+ "Updating record failed. \

Comment on lines +43 to +51
) -> Self {
    let indel_sim =
        TKFIndelMSASimulator::new(indel_model, tree.clone(), rng.clone(), max_insertion_length);
    let dummy_len = 1;
    let subst_sim = SubstitutionSimulator::new(subst_model, tree, rng, dummy_len).unwrap();
    Self {
        indel_sim,
        subst_sim,
    }

Copilot AI Apr 2, 2026

SubstitutionSimulator::new is force-unwrapped with a dummy alignment length. This hides construction errors and makes the length invariant unclear, especially since simulate_ancestral_alignment_with_length(aln_len) can be called with aln_len == 0 (indel simulation can yield 0 columns). Consider redesigning so the substitution simulator can be constructed without a placeholder length, and/or handle the 0-length case explicitly instead of relying on unwrap().

Suggested change
- ) -> Self {
-     let indel_sim =
-         TKFIndelMSASimulator::new(indel_model, tree.clone(), rng.clone(), max_insertion_length);
-     let dummy_len = 1;
-     let subst_sim = SubstitutionSimulator::new(subst_model, tree, rng, dummy_len).unwrap();
-     Self {
-         indel_sim,
-         subst_sim,
-     }
+ ) -> Result<Self, String> {
+     let indel_sim =
+         TKFIndelMSASimulator::new(indel_model, tree.clone(), rng.clone(), max_insertion_length);
+     let dummy_len = 1;
+     let subst_sim = SubstitutionSimulator::new(subst_model, tree, rng, dummy_len)
+         .map_err(|err| {
+             format!(
+                 "failed to construct SubstitutionSimulator with placeholder alignment length {}: {:?}",
+                 dummy_len, err
+             )
+         })?;
+     Ok(Self {
+         indel_sim,
+         subst_sim,
+     })
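
The redesign hinted at in the comment (no placeholder length, no `unwrap()`, explicit handling of zero columns) might look roughly like the sketch below. All types here are hypothetical stand-ins, not the crate's `SubstitutionSimulator` or `TKFMSASimulator`: `SubstSim` is validated at construction without any alignment length, the length is supplied per call, and the combined constructor propagates errors with `?`.

```rust
/// Hypothetical error type for simulator construction.
#[derive(Debug, PartialEq)]
enum SimError {
    InvalidModel(String),
}

/// Hypothetical stand-in for a substitution simulator that takes no length at construction.
struct SubstSim {
    rate: f64,
}

impl SubstSim {
    /// Validates the model parameters only; no alignment length is needed here.
    fn new(rate: f64) -> Result<Self, SimError> {
        if rate.is_finite() && rate > 0.0 {
            Ok(Self { rate })
        } else {
            Err(SimError::InvalidModel(format!(
                "rate must be positive and finite, got {rate}"
            )))
        }
    }

    /// The length is supplied per call; zero columns is a valid, empty result.
    fn simulate_with_length(&self, aln_len: usize) -> Vec<u8> {
        let _ = self.rate; // a real simulator would sample using the model here
        vec![b'A'; aln_len]
    }
}

/// Hypothetical combined simulator holding the substitution part.
struct CombinedSim {
    subst_sim: SubstSim,
}

impl CombinedSim {
    /// Propagates construction errors instead of unwrapping them.
    fn new(rate: f64) -> Result<Self, SimError> {
        Ok(Self {
            subst_sim: SubstSim::new(rate)?,
        })
    }
}
```

With this shape, an indel simulation that yields zero columns simply produces an empty substitution result instead of relying on a construction-time invariant.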

"No valid assignments found for block_id = {block_id}, due to -inf logl"
"No valid assignments found for block_id = {block_id}, due to -inf logl, \
or no possible assignments or possible del_or_not or no max over previous. \
Current alignemnt = \n{}, current tree = \n{}, current v2 = {}",

Copilot AI Apr 2, 2026

Typo in assertion message: "alignemnt" → "alignment".

Suggested change
- Current alignemnt = \n{}, current tree = \n{}, current v2 = {}",
+ Current alignment = \n{}, current tree = \n{}, current v2 = {}",
