-
Notifications
You must be signed in to change notification settings - Fork 2
Implementation of TKF92 #99
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
- i just moved the code from my own repo to here, in my repo i tested, but not yet here.
reassinged, perhaps this is numerical issue
- the msa that caused the error was that internal nodes had chars but the leaves not. this caused the x to be 1 during reassignment since it removed the ancestral chars, which is correct, but then intergrating is wrong, since it integrates over nothing (no chars are there in the whole msa column). added an assert to check for x != 1 - can run find_brute_force_max in parallel and in series, both with progress bar
why wasnt this noticed earlier?
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #99 +/- ##
===========================================
+ Coverage 96.32% 97.26% +0.94%
===========================================
Files 32 45 +13
Lines 4355 5671 +1316
===========================================
+ Hits 4195 5516 +1321
+ Misses 160 155 -5 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
- removed plot r - removed find fastas in benchmark dir - removed test that found a beta = 0 when very short branch
- clean up - model opti also for tkf91
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces support for TKF (Thorne-Kishino-Felsenstein) evolutionary models, specifically TKF91 and TKF92, which are models for sequence evolution with insertions and deletions. The implementation includes model building, likelihood calculations, parameter optimization, and integration with the existing topology optimization framework.
Key changes:
- Implementation of TKF91 and TKF92 indel models with likelihood computation
- Integration with substitution models to create combined TKF+substitution costs
- Support for model parameter optimization and tree topology optimization (NNI)
Reviewed Changes
Copilot reviewed 11 out of 12 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| phylo/src/tkf_model/mod.rs | Core implementation of TKF91/TKF92 models, builders, and likelihood calculations |
| phylo/src/tkf_model/tests.rs | Comprehensive test suite covering model functionality, likelihood calculations, and parameter validation |
| phylo/src/lib.rs | Module export for tkf_model |
| phylo/src/alignment/mod.rs | Added methods to support ancestral alignment updates needed by TKF models |
| phylo/src/optimisers/topo_optimiser.rs | Added compatibility traits for TKF models with NNI optimization |
| phylo/src/optimisers/spr_optimiser.rs | Added comment clarifying optimization logic |
| phylo/src/optimisers/model_optimiser_tests.rs | Added test for TKF model optimization |
| phylo/src/phylo_info/phyloinfo_builder.rs | Enhanced error messages and added logging for tree node ID setting |
| phylo/src/phylo_info/tests.rs | Updated test to match new error message format |
| phylo/data/tree_multiple.newick | Fixed formatting (restored newline) |
| phylo/benches/helpers.rs | Fixed typo in comment |
| phylo/Cargo.toml | Removed extra blank line |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
junniest
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is quite hefty to review, so I think we'll need another round.
One more thought: I think it might be worth at this point to split the module into at the very least main (mod.rs), tkf91_model and tkf92_model or something along these lines (and tkfindel_model?). It becomes quite difficult to figure out which model is which in a file this long.
- num enum for tkf params - extracted actions function to separate getting indel x and factor n - split the validate params into (lambda, mu) and (r) - some minor renames and comments
- use bit flags instead of option - cache some r values in tkf92 model struct
junniest
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have one major comment to the design at the moment. You have a lot of methods called get_something which to me indicates that they should only perform a computation and return the result. However, most if not all of these methods also set values in the temporary storage, meaning that they produce side effects.
My gut feeling is that methods with names like get should not produce side effects (set values) and methods with names like set should not return values, because otherwise it is very difficult to track where changes to the internal states are happening.
phylo/src/tkf_model/mod.rs
Outdated
|
|
||
| // TODO: link our paper once it is published. For now see original TKF92 paper: https://doi.org/10.1007/bf00163848 | ||
| #[derive(Clone, Debug)] | ||
| struct TKFIndelModelInfo { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this apply to all TKF-style models? If so, the name is confusing because it seems like it is just meant for the TKF indel model without the substitution models, rather than all of them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this model info only cares about the indel process, i feel like the name is fitting. if we have tkf cost with substitutions, it's just tkf indel cost (with tkf indel model info) and substitution cost (with its own model info). Also, for the TKF with substitutions we simply do
self.indel_cost.cost() + self.subst_cost.cost()So both the indel cost and subst cost have their own model info
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a comment:
/// This struct holds intermediate values for the computation of the log likelihood
/// of an ancestral alignment and tree under a TKF indel model, i.e., without substitutions.
/// The intermediate values are needed for re-alignment.
- renamed vars - added comments - simplified caching (its just all precomputed now) - fns no either get or set stuff, not both at the same time (see functional programming)
junniest
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some tiny nitpicks, but this looks great now! Well done! It is really well-decoupled now and it's a lot easier (at least for me) to understand what's going on.
| if let Some(anc_map) = self.ancestral_maps.get_mut(node_idx) { | ||
| *anc_map = map; | ||
| } else { | ||
| panic!("NodeIdx {node_idx} is not an internal node"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just wondering -- should this really panic? Or could you print a warning/error instead.
The reason why I'm asking is whether this would be something that breaks the reconstruction or could it just be safely ignored.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like panic is fitting since if this method is called, there is something wrong with the logic of the calling code. When people use update_ancestral_map and then have code where a leaf is passed, sth is flawed with their logic and its nice to point that out directly as one might miss the warning in the log.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fair, I agree with this explanation!
- split template indel subst param test template into two - changed validate_r - bitset for previous ecent deletion
Implementation of the TKF92 cost for given tree and Multiple Ancestral Sequence Alignment (MASA), ie an alignment that includes ancestral wildcard sequences.
Closes #58. Also implements TKF91 (since this is a special case of TKF92 when the parameter r = 0)
The functionality of fixing/re-estimating ancestral sequences after a tree move (see #115) will be addressed in another PR, which will include the impl of
TreeSearchCost