-
Notifications
You must be signed in to change notification settings - Fork 2
Stochastic Tree Builder Class #60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
Show all changes
113 commits
Select commit
Hold shift + click to select a range
8e1ad87
Tree builder trait unimplemented
kalxed 2dc20ec
NJBuilder class, no trait, public associated function implementation
kalxed fb70192
Mattes changes for deterministic tests passing
kalxed 077a47e
Format Rust code using rustfmt
github-actions[bot] 1573c0d
Added tests privately distance function still being debugged
kalxed f3026db
Merge branch 'feature/stochastic-tree-builder' of https://github.com/…
kalxed b08a0ad
Format Rust code using rustfmt
github-actions[bot] 9acd004
Merge branch 'develop' of https://github.com/acg-team/rust-phylo into…
kalxed 01f7a66
Custom distance function that aligns with bio impl, all but one test …
kalxed 758f8df
Format Rust code using rustfmt
github-actions[bot] 1f9170f
Options for both new() NJBuilder parameters, to work on adding stocha…
kalxed 4b79e95
Merge branch 'feature/stochastic-tree-builder' of https://github.com/…
kalxed 4d31305
Format Rust code using rustfmt
github-actions[bot] f631363
Used Into to circumvent calling Some() when constructing
kalxed 05bac27
Temperature not implemented, default for NJBuilder and PhyloInfo Tree…
kalxed d0bded4
Format Rust code using rustfmt
github-actions[bot] abaf2a9
Stochastic aspect implemented, temperature implemented, tests to be a…
kalxed 825500a
Merge branch 'develop' of https://github.com/acg-team/rust-phylo into…
kalxed 4a88ccd
Add tree from the original paper to check NJ
junniest 000b849
Reworked NJBuilder again with enum for temperature, and made it gener…
kalxed 4422ea9
Merge branch 'feature/stochastic-tree-builder' of https://github.com/…
kalxed 7ab93e1
Comment nj_uniform test before gloabl random implementation
kalxed 6c2ca69
Fixed nj_tree_original_paper test
kalxed fe5a4a9
Naming for NJTreeBuilder and trait funciton changed, pub new(), comme…
kalxed ee394fa
Compute delta tree length reworked, temperature_branch working theore…
kalxed 0395f26
Format Rust code using rustfmt
github-actions[bot] 0099b13
Merge branch 'develop' of https://github.com/acg-team/rust-phylo into…
kalxed 157fe16
Updated formatting with rest of comments, sample to be updated, and a…
kalxed 618e8c8
Merge branch 'develop' of https://github.com/acg-team/rust-phylo into…
kalxed 6739992
Softmax, Uniform, and build() tests to be added, also splitting curre…
kalxed e841a5d
Rename build_from_distances_w_rng
junniest 2ef4ef4
Update comment and logs in softmax strategy
junniest acfb986
Remove coverage from tests and unnecessary comment
junniest a973ba4
Remove unnecessary argmin impl
junniest 82a381a
Refactor softmax impl + avoid copying
junniest c69e931
Elide a lifetime to appease clippy
junniest 6d67888
Add basic tests for softmax_from_distances
junniest 784f69e
Move sampling from softmax to main build func
junniest 150c7f5
Rename private tests -> tests
junniest 0ce5b2f
Fix uniform weights in softmax
junniest 56bccce
Add tests for softmax with temp
junniest 1bce4bf
Aliases to make tests calls shorter
junniest d7013cc
Add derive Debug Clone PartialEq to Strategy Enum
junniest f389809
softmax_from_distances -> softmax_from_deltas
junniest 9bbf557
Fix temp values, 0.0 is uniform, 1.0 is softmax
junniest 5bcfb99
Check that temp values are set properly
junniest 3e08c53
Move lower_triangle_index out of struct + tests
junniest 2018638
Derive Clone for distance matrices
junniest db97160
Add softmax nj building tests
junniest 88c04c8
Rename Strategy::Deterministic -> ArgMax
junniest ebbe2c0
Add log statement, fix comments
junniest 2d761a3
Merge branch 'develop' into feature/stochastic-tree-builder
junniest 2ffa019
Format Rust code using rustfmt
github-actions[bot] 272b199
Remove unused use statement
junniest 6e7ddf9
Add comment to indicate delta_tree_length is Q
junniest 7361e01
Add doctring to TreeBuilder trait
junniest f4c5adb
Add missing semicolon
junniest 052f979
Fix and clarify docstrings in NJTreeBuilder
junniest 762b875
Tests for random sampling of 0.0/1.0 weights
junniest ffa1ffa
Remove redundant check from test and fix comment
junniest 2e4967d
Fix comment on AA example with DNA distance func
junniest 71c2d14
Clean up LevenshteinDNACorrected distance
junniest 6cc307a
Add LevenshteinProteinCorrected distance
junniest 9f577ba
Add sources for corrected distance formulas
junniest 4a006e0
Add regular Levenshtein edit distance
junniest 522f54e
Add tests for various distance measures
junniest 4eb19dc
Add public use for NJBuilder and TreeBuilder
junniest 98bd3e8
Add doctests and comments to NJTreeBuilder
junniest 9d0498b
Add `build_nj_tree` method to `PhyloInfoBuilder`
junniest 0c208a9
Remove unnecessary use statement
junniest e714d42
Fix comment for `PhyloInfoBuilder::build_nj_tree`
junniest 9f856bc
Add basic test for `build_nj_tree`
junniest 4eedb86
Remove unused `tree_builder` field
junniest 2a0e2ec
Minor fixes for `PhyloInfoBuilder`
junniest 3529f3b
Fix instantiation syntax for LDNACorr distance
junniest da9eb7a
Rename `delta_tree_length` to `compute_delta_...`
junniest f451ac7
Merge branch 'develop' into feature/stochastic-tree-builder
junniest f5f6eb6
Merge branch 'develop' into feature/stochastic-tree-builder
junniest e563d8a
Fix max_iter, precision tests for new start tree
junniest ab1e5e5
Rename q -> delta_lengths
junniest 602181a
Merge branch 'develop' into feature/stochastic-tree-builder
junniest 1689a65
Simplify NJBuilder::softmax
junniest 644866a
Add back argmin randomisation and simplify build
junniest a5263e2
Fix fake values for fake softmax test
junniest ecef9d4
Add comment for lower_triangle_index transform
junniest 66ee34e
Add comment to NJTreeBuilder::argmin
junniest 56dd99b
Simplify optimise_tree macros and add passing rng
junniest 15d2987
Add comment and test for FakeGenerator::shuffle
junniest 864f350
FakeRng to make pip_optimise_model_tree test pass
junniest c862a1c
Modify 2 topo tests to start with phyml tree
junniest b3ffad9
Fix delta_tree_len matrix length formula
junniest 5a1953a
Remove mut self from tree builder interface
junniest 381adbf
Merge branch 'feature/stochastic-tree-builder' of github.com:acg-team…
junniest 4a98a53
Remove unworking test
junniest c3a8882
Simplifying evolutionary distances
junniest d52cc5d
Fix comment to compute_delta_tree_length
junniest db9ea0e
Fix comment in test
junniest 0e34e69
Fix docstring for NJ tie-breaking
junniest 192cf2c
Remove unnecessary mut from doctests
junniest 66fbaf9
Remove rogue pring statement
junniest 54cf5fc
Fix info! statements on NJTreeBuilder::new
junniest 2be8cf9
Replace unwrap() with ? in production code
junniest 9e98cb8
Fix indentation in macro
junniest 211d4aa
Refactor starting tree creation into a method
junniest 993631c
Merge softmax examples into one test
junniest ae10615
Add test for softmax temperature interpolation
junniest 1563be3
Add #[derive(PartialEq)] to Tree
junniest 0a2fbb9
Fix `nj_builder_fake_softmax` test
junniest a86b7d8
Fix comment for argmax option in NJBuilder
junniest 4d45b05
Merge branch 'feature/stochastic-tree-builder' of github.com:acg-team…
junniest 1c5646f
Fix comment to clarify FakeRng::shuffle
junniest 194af51
Remove unnecessary clone() from NJBuilder test
junniest 5fbb6ba
Fix assertions in NJTreeBuilder::softmax test
junniest File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| (((((((((PMarmoset:0.02931142,Tamarin:0.03977325):0.02582298,Squirrel:0.09657681):0.00554173,(Saki:0.03922165,Titi:0.03966531):0.02219892):0.00679583,(Howler:0.08093104,(Spider:0.03837468,Woolly:0.04073718):0.02217513):0.01497244):0.20396660,(((Chimp:0.01029413,Human:0.01228444):0.00350344,Gorilla:0.00833861):0.01134213,(Gibbon:0.06727730,Orangutan:0.03314814):0.00149069):0.03088307):0.04724073,(Colobus:0.00855577,DLangur:0.00980229):0.00782031):0.00711735,Patas:0.03048588):0.00608031,(AGM_cDNA:0.00848307,Tant_cDNA:0.00717932):0.01237098):0.00295050,Baboon:0.00807196,Rhes_cDNA:0.01226490); |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,224 @@ | ||
| use bio::{alignment::distance::levenshtein, io::fasta::Record}; | ||
| use nalgebra::max; | ||
|
|
||
| pub trait EvolutionaryDistance { | ||
| fn dist(&self, a: &Record, b: &Record) -> f64; | ||
| } | ||
|
|
||
| #[derive(Clone, Copy, Debug, Default, Eq, Hash, PartialEq, PartialOrd, Ord)] | ||
| /// Levenshtein distance with Jukes-Cantor correction, meaningful for DNA sequences | ||
| pub struct LevenshteinDNACorrected; | ||
|
|
||
| impl EvolutionaryDistance for LevenshteinDNACorrected { | ||
| fn dist(&self, a: &Record, b: &Record) -> f64 { | ||
| // Distance formula corrected using the Jukes-Cantor model | ||
| // Formula 1.6 from "Computational Molecular Evolution" by Ziheng Yang (2006) | ||
| // d = -3/4 * ln(1 - (4/3) * p) | ||
| // where p is the proportion of differing sites (here, Levenshtein distance / max length). | ||
| // To avoid infinite distance when all characters are different, the maximum | ||
| // proportion of different characters is capped to 3/4=0.75. | ||
| let seq_i = a.seq(); | ||
| let seq_j = b.seq(); | ||
| let dist = levenshtein(seq_i, seq_j) as f64; | ||
| let p = f64::min( | ||
| dist / (max(seq_i.len(), seq_j.len()) as f64), | ||
| 3.0 / 4.0 - f64::EPSILON, | ||
| ); | ||
| -(3.0 / 4.0) * (1.0 - (4.0 / 3.0) * p).ln() | ||
| } | ||
| } | ||
|
|
||
| #[derive(Clone, Copy, Debug, Default, Eq, Hash, PartialEq, PartialOrd, Ord)] | ||
| /// Levenshtein distance with Jukes-Cantor-like correction, meaningful for protein sequences | ||
| pub struct LevenshteinProteinCorrected; | ||
|
|
||
| impl EvolutionaryDistance for LevenshteinProteinCorrected { | ||
| fn dist(&self, a: &Record, b: &Record) -> f64 { | ||
| // Corrected protein distance formula equivalent to Jukes-Cantor for proteins. | ||
| // Formula 2.3 from "Computational Molecular Evolution" by Ziheng Yang (2006) | ||
| // d = -19/20 * ln(1 - (20/19) * p) | ||
| // where p is the proportion of differing sites (here, Levenshtein distance / max length). | ||
| // To avoid infinite distance when all characters are different, the maximum | ||
| // proportion of different characters is capped to 19/20=0.95. | ||
|
|
||
| let seq_i = a.seq(); | ||
| let seq_j = b.seq(); | ||
| let dist = levenshtein(seq_i, seq_j) as f64; | ||
| let p = f64::min( | ||
| dist / (max(seq_i.len(), seq_j.len()) as f64), | ||
| 19.0 / 20.0 - f64::EPSILON, | ||
| ); | ||
| -(19.0 / 20.0) * (1.0 - (20.0 / 19.0) * p).ln() | ||
| } | ||
| } | ||
|
|
||
| #[derive(Clone, Copy, Debug, Default, Eq, Hash, PartialEq, PartialOrd, Ord)] | ||
| /// Simple Levenshtein distance without any correction, meaningful for both DNA | ||
| /// and protein sequences | ||
| pub struct Levenshtein; | ||
|
|
||
| impl EvolutionaryDistance for Levenshtein { | ||
| fn dist(&self, a: &Record, b: &Record) -> f64 { | ||
| let seq_i = a.seq(); | ||
| let seq_j = b.seq(); | ||
| levenshtein(seq_i, seq_j) as f64 | ||
| } | ||
| } | ||
|
|
||
| #[cfg(test)] | ||
| #[cfg_attr(coverage, coverage(off))] | ||
| mod tests { | ||
| use super::*; | ||
|
|
||
| use crate::record_wo_desc as record; | ||
|
|
||
| #[test] | ||
| fn levenshtein_edit_dist() { | ||
| let dna_corr = Levenshtein {}; | ||
|
|
||
| let s1 = record!("s1", b"AAAAAAAA"); | ||
| let s2 = record!("s2", b"AAAAAAAA"); | ||
| assert_eq!(dna_corr.dist(&s1, &s2), 0.0); | ||
| assert_eq!(dna_corr.dist(&s1, &s1), 0.0); | ||
|
|
||
| let s1 = record!("s1", b"AAAAAAAA"); | ||
| let s2 = record!("s2", b"TTTTTTTT"); | ||
| assert_eq!(dna_corr.dist(&s1, &s2), s1.seq().len() as f64); | ||
|
|
||
| let s1 = record!("s1", b"AAAAAAAAAAAAAAAA"); | ||
| let s2 = record!("s2", b"AAAAA"); | ||
| assert_eq!( | ||
| dna_corr.dist(&s1, &s2), | ||
| (s1.seq().len() - s2.seq().len()) as f64 | ||
| ); | ||
|
|
||
| let s1 = record!("s1", b"AAAAATAACAAAGTAAA"); | ||
| let s2 = record!("s2", b"AAAAATCAGT"); | ||
| assert_eq!( | ||
| dna_corr.dist(&s1, &s2), | ||
| (s1.seq().len() - s2.seq().len()) as f64 | ||
| ); | ||
|
|
||
| let s1 = record!("s1", b"AAAAAAAAAAAAAAAA"); | ||
| let s2 = record!("s2", b"TTTTT"); | ||
| assert_eq!(dna_corr.dist(&s1, &s2), s1.seq().len() as f64); | ||
| } | ||
|
|
||
| #[test] | ||
| fn levenshtein_dna_corrected_corners() { | ||
| let dna_corr = LevenshteinDNACorrected {}; | ||
|
|
||
| let s1 = record!("s1", b"AAAAAAAA"); | ||
| let s2 = record!("s2", b"AAAAAAAA"); | ||
| assert_eq!(dna_corr.dist(&s1, &s2), 0.0); | ||
| assert_eq!(dna_corr.dist(&s1, &s1), 0.0); | ||
|
|
||
| let max_proportion = 3.0 / 4.0; | ||
| let max_distance = | ||
| -max_proportion * (1.0 - 1.0 / max_proportion * (max_proportion - f64::EPSILON)).ln(); | ||
| // Max distance possible between sequences (all chars are different), computed with Python | ||
| assert_eq!(max_distance, 26.728641210756745); | ||
|
|
||
| let s1 = record!("s1", b"AAAAAAAA"); | ||
| let s2 = record!("s2", b"TTTTTTTT"); | ||
| assert_eq!(dna_corr.dist(&s1, &s2), max_distance); | ||
|
|
||
| let s1 = record!("s1", b"AAAAAAAA"); | ||
| let s2 = record!("s2", b""); | ||
| assert_eq!(dna_corr.dist(&s1, &s2), max_distance); | ||
| } | ||
|
|
||
| #[test] | ||
| fn levenshtein_dna_corrected() { | ||
| let dna_corr = LevenshteinDNACorrected {}; | ||
|
|
||
| let s1 = record!("s1", b"AAAAAAAA"); | ||
| let s2 = record!("s2", b"AAAAAAAT"); | ||
| // Levenshtein distance is 1, proportion is 1/8=0.125, distance computed with Python | ||
| assert_eq!(dna_corr.dist(&s1, &s2), 0.13674116759546595); | ||
|
|
||
| let s1 = record!("s1", b"AAAAATAACAAAGTAAA"); | ||
| let s2 = record!("s2", b"AAAAATCAGT"); | ||
| // Levenshtein distance is 7, proportion is 7/17=0.4117647058823529, distance computed with Python | ||
| assert_eq!(dna_corr.dist(&s1, &s2), 0.5972485625963819); | ||
| } | ||
|
|
||
| #[test] | ||
| fn levenshtein_dna_corrected_sanity() { | ||
| let dna_corr = LevenshteinDNACorrected {}; | ||
|
|
||
| let s1 = record!("seq1", b"ACGTACGTXXXXXX"); | ||
| let s2 = record!("seq2", b"AAGTACGTXXXXXX"); | ||
| let s3 = record!("seq3", b"AAGTTCGTXXXXXX"); | ||
| let s4 = record!("seq4", b"TTTTTTTTXXXXXX"); | ||
|
|
||
| assert!(dna_corr.dist(&s1, &s2) < dna_corr.dist(&s1, &s3)); | ||
| assert!(dna_corr.dist(&s1, &s2) < dna_corr.dist(&s1, &s4)); | ||
| assert!(dna_corr.dist(&s1, &s3) < dna_corr.dist(&s1, &s4)); | ||
| assert!(dna_corr.dist(&s2, &s3) < dna_corr.dist(&s2, &s4)); | ||
| assert_eq!(dna_corr.dist(&s1, &s4), dna_corr.dist(&s2, &s4)); | ||
| assert!(dna_corr.dist(&s3, &s4) < dna_corr.dist(&s2, &s4)); | ||
|
|
||
| let s5 = record!("seq5", b"GGGGGGGGGGGGGG"); | ||
| assert_eq!(dna_corr.dist(&s1, &s5), dna_corr.dist(&s2, &s5)); | ||
| assert_eq!(dna_corr.dist(&s1, &s5), dna_corr.dist(&s3, &s5)); | ||
| assert_eq!(dna_corr.dist(&s1, &s5), dna_corr.dist(&s4, &s5)); | ||
| } | ||
|
|
||
| #[test] | ||
| fn levenshtein_protein_corrected_corners() { | ||
| let prot_corr = LevenshteinProteinCorrected {}; | ||
|
|
||
| let s1 = record!("s1", b"PPPPPPPP"); | ||
| let s2 = record!("s2", b"PPPPPPPP"); | ||
| assert_eq!(prot_corr.dist(&s1, &s2), 0.0); | ||
| assert_eq!(prot_corr.dist(&s1, &s1), 0.0); | ||
|
|
||
| let max_proportion = 19.0 / 20.0; | ||
| let max_distance = | ||
| -max_proportion * (1.0 - 1.0 / max_proportion * (max_proportion - f64::EPSILON)).ln(); | ||
| // Max distance possible between sequences (all chars are different), computed with Python | ||
| assert_eq!(max_distance, 33.85627886695854); | ||
|
|
||
| let s1 = record!("s1", b"PPPPPPPP"); | ||
| let s2 = record!("s2", b"NNNNNNNN"); | ||
| assert_eq!(prot_corr.dist(&s1, &s2), max_distance); | ||
|
|
||
| let s1 = record!("s1", b"PPPPPPPP"); | ||
| let s2 = record!("s2", b""); | ||
| assert_eq!(prot_corr.dist(&s1, &s2), max_distance); | ||
| } | ||
|
|
||
| #[test] | ||
| fn levenshtein_protein_corrected() { | ||
| let prot_corr = LevenshteinProteinCorrected {}; | ||
|
|
||
| let s1 = record!("s1", b"AAAAAAAA"); | ||
| let s2 = record!("s2", b"AAAAAAAT"); | ||
| // Levenshtein distance is 1, proportion is 1/8=0.125, distance computed with Python | ||
| assert_eq!(prot_corr.dist(&s1, &s2), 0.1340246683469102); | ||
|
|
||
| let s1 = record!("s1", b"AAAAATAACAAAGTAAA"); | ||
| let s2 = record!("s2", b"AAAAATCAGT"); | ||
| // Levenshtein distance is 7, proportion is 7/17=0.4117647058823529, distance computed with Python | ||
| assert_eq!(prot_corr.dist(&s1, &s2), 0.5397578618621736); | ||
| } | ||
|
|
||
| #[test] | ||
| fn levenshtein_protein_corrected_sanity() { | ||
| let prot_corr = LevenshteinProteinCorrected {}; | ||
|
|
||
| let s1 = record!("seq1", b"ARNDCQEGHILKMFPSTWYV"); | ||
| let s2 = record!("seq2", b"ARNDCQEGHILKMFPSTWVV"); | ||
| let s3 = record!("seq3", b"ARNDCQEGHILKMFPSRRVV"); | ||
| let s4 = record!("seq4", b"RRRRRRRRRRRRRRRRRRRR"); | ||
|
|
||
| assert!(prot_corr.dist(&s1, &s2) < prot_corr.dist(&s1, &s3)); | ||
| assert!(prot_corr.dist(&s1, &s2) < prot_corr.dist(&s1, &s4)); | ||
| assert!(prot_corr.dist(&s1, &s3) < prot_corr.dist(&s1, &s4)); | ||
| assert!(prot_corr.dist(&s2, &s3) < prot_corr.dist(&s2, &s4)); | ||
| assert!(prot_corr.dist(&s3, &s4) < prot_corr.dist(&s2, &s4)); | ||
| assert_eq!(prot_corr.dist(&s1, &s4), prot_corr.dist(&s2, &s4)); | ||
| assert!(prot_corr.dist(&s1, &s4) > prot_corr.dist(&s3, &s4)); | ||
| } | ||
| } |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.