Skip to content

Commit 1f60c73

Browse files
authored
Merge pull request #266 from hyanwong/fill-out-topology
Fill out topology counting tutorial
2 parents d1c9b21 + 0a26a26 commit 1f60c73

File tree

4 files changed

+343
-8
lines changed

4 files changed

+343
-8
lines changed

_toc.yml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -12,6 +12,7 @@ parts:
1212
chapters:
1313
- file: analysing_tree_sequences
1414
- file: analysing_trees
15+
- file: incremental_algorithms
1516
- file: counting_topologies
1617
- file: parallelization
1718
- caption: Further tskit tutorials

counting_topologies.md

Lines changed: 303 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -16,13 +16,309 @@ kernelspec:
1616

1717
(sec_counting_topologies)=
1818

19-
# _Counting topologies_
20-
% remove underscores in title when tutorial is complete or near-complete
19+
# Counting topologies
2120

21+
**Yan Wong**
2222

23-
:::{todo}
24-
Add examples of using {meth}`TreeSequence.count_topologies` and introduce {ref}`sec_combinatorics`
25-
in a gentle way.
23+
This tutorial is intended to be a gentle introduction to the combinatorial
24+
treatment of tree topologies in `tskit`. For a more formal introduction,
25+
see the {ref}`sec_combinatorics` section of the
26+
[official `tskit` documentation](tskit:sec_introduction).
2627

27-
See https://github.com/tskit-dev/tutorials/issues/93
28-
:::
28+
The *topology* of a single tree is the term used to describe the branching pattern,
29+
regardless of the lengths of the branches. For example, both trees below have the
30+
same topology, although the branch lengths differ:
31+
32+
```{code-cell}
33+
import tskit
34+
node_labels = {0: "a", 1: "b", 2: "c"} # avoid confusion by using letters to label tips
35+
tree = tskit.Tree.generate_comb(3)
36+
display(tree.draw_svg(node_labels=node_labels, y_axis=True))
37+
38+
deep_tree = tskit.Tree.generate_comb(10).tree_sequence.simplify([0, 1, 2]).first()
39+
display(deep_tree.draw_svg(node_labels=node_labels, y_axis=True))
40+
```
41+
42+
:::{note}
43+
The treatment of topologies in `tskit` is restricted to trees with a single defined root,
44+
without nodes with a single child (i.e. trees must consist of nodes that are either leaves,
45+
or internal nodes with two or more children). For convenience in the examples
46+
below, trees are drawn with the tips flagged as samples, although whether a node is a sample or
47+
not does not change the topology of the tree.
48+
:::
49+
50+
## Tree labellings and shapes
51+
52+
The topology of a tree also takes into account the labelling of tips, so that
53+
the trees below, although they have the same *shape*, count as three
54+
different topologies:
55+
56+
```{code-cell}
57+
:tags: [hide-input]
58+
from string import ascii_lowercase
59+
from IPython.display import SVG
60+
61+
def str_none(s, prefix=None):
62+
if s is not None:
63+
if prefix is None:
64+
return str(s)
65+
else:
66+
return prefix + " = " + str(s)
67+
return None
68+
69+
def draw_svg_trees(trees, node_labels={}, x_lab_attr=None, width=100, height=150, space=10):
70+
w = width + space
71+
h = height + space
72+
trees = list(trees)
73+
s = f'<svg height="{h}" width="{w * len(trees)}" xmlns="http://www.w3.org/2000/svg">'
74+
s += f'<style>.x-axis {{transform: translateY({space}px)}}</style>'
75+
for i, tree in enumerate(trees):
76+
s += tree.draw_svg(
77+
size=(width, height),
78+
canvas_size=(w, h),
79+
root_svg_attributes={"x": i * w},
80+
node_labels=node_labels,
81+
x_label=str_none(getattr(tree.rank(), x_lab_attr or "", None), x_lab_attr)
82+
)
83+
s += '</svg>'
84+
return SVG(s)
85+
86+
draw_svg_trees(tskit.all_tree_labellings(tree), node_labels={u: ascii_lowercase[u] for u in tree.samples()})
87+
```
88+
89+
These are, in fact, the only possible three labellings for a three-tip tree of that shape.
90+
There is only one other possible shape for a three-tip tree, and for this shape,
91+
all labelling orders are equivalent (in other words, there is only one
92+
possible labelling):
93+
94+
```{code-cell}
95+
:tags: [hide-input]
96+
tskit.Tree.generate_star(3).draw_svg(node_labels={})
97+
```
98+
99+
A 3-tip tree therefore has only four possible topologies.
100+
These can be generated with the {func}`~tskit.all_trees` function.
101+
102+
```{code-cell}
103+
generated_trees = tskit.all_trees(3)
104+
print("For a three-tip tree there are", len(list(generated_trees)), "labelled topologies.")
105+
```
106+
107+
Here they are, plotted out with their shapes enumerated from zero:
108+
109+
```{code-cell}
110+
:tags: [hide-input]
111+
draw_svg_trees(
112+
tskit.all_trees(3),
113+
node_labels={u: ascii_lowercase[u] for u in tree.samples()},
114+
x_lab_attr="shape"
115+
)
116+
```
117+
118+
### Enumerating shapes and labellings
119+
120+
For a tree with four tips, more topologies and shapes are possible. As before, we can generate the
121+
topologies using {func}`~tskit.all_trees`. Alternatively, if we only want the (unlabelled) shapes,
122+
we can use the {func}`~tskit.all_tree_shapes` function:
123+
124+
```{code-cell}
125+
print("For a four-tip tree there are", len(list(tskit.all_trees(4))), "labelled topologies.")
126+
127+
generated_trees = tskit.all_tree_shapes(4)
128+
print("These can be categorised into", len(list(generated_trees)), "shapes.")
129+
```
130+
131+
Again, we can give each shape a number or *index*, starting from zero:
132+
133+
```{code-cell}
134+
:tags: [hide-input]
135+
draw_svg_trees(tskit.all_tree_shapes(4), x_lab_attr="shape")
136+
```
137+
138+
Each of these shapes will have a separate number of possible labellings, and trees with
139+
these labellings can be created using {func}`~tskit.all_tree_labellings`:
140+
141+
```{code-cell}
142+
for shape_index, tree in enumerate(tskit.all_tree_shapes(4)):
143+
labellings = tskit.all_tree_labellings(tree)
144+
num_labellings = len(list(labellings))
145+
print(
146+
f"Tree shape {shape_index} for a four-tip tree has "
147+
f"{num_labellings} labelling{'' if num_labellings==1 else 's'}."
148+
)
149+
```
150+
151+
Any tree topology for a tree of $N$ tips can therefore be described by a
152+
shape index combined with a labelling index. This is known as the
153+
*rank* of a tree, and it can be obtained using the
154+
{meth}`Tree.rank` method. For instance, here is the rank of a simulated tree
155+
of 10 tips:
156+
157+
```{code-cell}
158+
:tags: [hide-input]
159+
import msprime
160+
num_tips = 10
161+
simulated_ts = msprime.sim_ancestry(10, ploidy=1, random_seed=123)
162+
simulated_tree = simulated_ts.first()
163+
print("The topology of the simulated tree below can be described as", simulated_tree.rank())
164+
ascii_node_labels = {u: ascii_lowercase[u] for u in simulated_tree.samples()}
165+
simulated_tree.draw_svg(node_labels=ascii_node_labels)
166+
```
167+
168+
169+
A tree with the same topology (i.e. the same shape and labelling, but ignoring
170+
the branch lengths) can be generated using the {meth}`Tree.unrank` method, by
171+
specifying the number of tips and the appropriate `(shape, labelling)` tuple:
172+
173+
```{code-cell}
174+
new_tree = tskit.Tree.unrank(num_tips, (1270, 21580))
175+
new_tree.draw_svg(node_labels=ascii_node_labels)
176+
```
177+
178+
Note that this method generates a single tree in a new tree sequence
179+
whose a default sequence length is 1.0.
180+
181+
## Methods for large trees
182+
183+
The number of possible topologies for a tree with $N$ tips
184+
grows very rapidly with $N$. For instance, with 10 tips, there are
185+
282,137,824 possible topologies.
186+
187+
For this reason, the {func}`~tskit.all_trees`, {func}`~tskit.all_tree_shapes` and
188+
{func}`~tskit.all_tree_labellings` methods do not return a list of trees
189+
but an iterator over the trees. This means it is perfectly possible to start
190+
iterating over (say) all tree shapes for a tree of 100 leaves, but
191+
the iterator will not finish before the death of our galaxy.
192+
193+
```{code-cell}
194+
for num_trees, tree in enumerate(tskit.all_tree_shapes(100)):
195+
shape = tree.rank().shape
196+
b2 = tree.b2_index()
197+
print(f"A 100-tip tree with shape index {shape} has a b2 balance index of {b2}")
198+
if num_trees > 5:
199+
break # better not let this run too long!
200+
```
201+
202+
For similar combinatorial reasons, the {meth}`Tree.rank` method can be
203+
inefficient for large trees. To compare the topology of two trees, you are
204+
therefore recommended to use e.g. the {meth}`Tree.kc_distance` method
205+
rather than comparing ranks directly.
206+
207+
```{code-cell}
208+
simulated_tree = simulated_ts.first(sample_lists=True) # kc_distance requires sample lists
209+
if simulated_ts.first(sample_lists=True).kc_distance(simulated_tree) == 0:
210+
print("Trees are identical")
211+
# To compare to the new_tree we need to fix
212+
# print("The simulated and topology-constructed trees have the same topology")
213+
```
214+
215+
Despite the combinatorial explosion associated with topologies of
216+
many-tip trees, it is still possible to efficiently count
217+
the number of *embedded topologies* in a large tree.
218+
219+
### Embedded topologies
220+
221+
An embedded topology is a a topology involving a subset of the tips of a tree.
222+
If the tips are classified into (say) three groups, red, green, and blue,
223+
we can efficiently count all the embedded three-tip trees which have
224+
one tip from each group using the {meth}`Tree.count_topologies` method.
225+
226+
```{code-cell}
227+
:tags: [hide-input]
228+
229+
newick_species_tree = "((A:100.0,B:100.0):100.0,C:200.0)"
230+
demography = msprime.Demography.from_species_tree(newick_species_tree, initial_size=100)
231+
ts = msprime.sim_ancestry({0: 2, 1: 2, 2: 2}, demography=demography, random_seed=321)
232+
big_tree = ts.first()
233+
styles = [
234+
f".node.sample.p{p.id} > .sym " + "{" + f"fill: {colour}" + "}"
235+
for colour, p in zip(['red', 'green', 'blue'], big_tree.tree_sequence.populations())
236+
]
237+
big_tree.draw_svg(style="".join(styles), node_labels={}, time_scale="rank", x_label="big_tree")
238+
```
239+
240+
In this tree, it is fairly clear that the red and green tips cluster together (although not exclusively),
241+
so that if we took one red, one blue, and one green tip at random, we nearly always see the
242+
same embedded topology. The {meth}`Tree.count_topologies` method does this exhaustively:
243+
244+
```{code-cell}
245+
# By default `count_topologies` chooses one tip from each population, like setting
246+
# sample_sets=[ts.samples(p.id) for p in ts.populations() if len(ts.samples(p.id)) > 0]
247+
248+
topology_counter = big_tree.count_topologies()
249+
250+
colours = ['red', 'green', 'blue']
251+
styles = [f".n{u}>.sym {{fill: {c} }}" for u, c in enumerate(colours)]
252+
253+
embedded_counts = topology_counter[0, 1, 2]
254+
for embedded_tree in tskit.all_trees(3):
255+
rank = embedded_tree.rank()
256+
number_of_instances = embedded_counts[rank]
257+
label = f"{number_of_instances} instances embedded in big_tree"
258+
display(embedded_tree.draw_svg(style="".join(styles), node_labels={}, x_label=label))
259+
```
260+
261+
## Methods over tree sequences
262+
263+
It can be useful to count embedded topologies over an entire tree sequence.
264+
For instance, we might want to know the number of embedded topologies
265+
that support Neanderthals as a sister group to europeans versus africans.
266+
`Tskit` provides the efficient {meth}`TreeSequence.count_topologies` method to
267+
do this [incrementally](sec_incremental), without having to re-count the topologies
268+
independently in each tree.
269+
270+
```{code-cell}
271+
import stdpopsim
272+
species = stdpopsim.get_species("HomSap")
273+
model = species.get_demographic_model("OutOfAfrica_3G09")
274+
contig = species.get_contig("chr1", length_multiplier=0.0002, mutation_rate=model.mutation_rate)
275+
samples = {"YRI": 1000, "CEU": 1000, "CHB": 1000}
276+
engine = stdpopsim.get_engine("msprime")
277+
ts = engine.simulate(model, contig, samples)
278+
print("Simulated", ts.num_trees, "African+European+Chinese trees, each with", ts.num_samples, "tips")
279+
```
280+
281+
Although the trees in this tree sequence are very large, counting the embedded topologies is
282+
quite doable (for speed we are only simulating 0.02% of chromosome 1, but calculating the
283+
average over an entire chromosome simply takes a little longer)
284+
285+
```{code-cell}
286+
from datetime import datetime
287+
names = {"YRI": "African", "CEU": "European", "CHB": "Chinese"}
288+
colours = {"YRI": "yellow", "CEU": "green", "CHB": "blue"}
289+
290+
population_map = {p.metadata["id"]: p.id for p in ts.populations()}
291+
sample_populations = [population_map[name] for name in samples.keys()]
292+
topology_span = {tree.rank(): 0 for tree in tskit.all_trees(len(sample_populations))}
293+
294+
start = datetime.now()
295+
total = 0
296+
for topology_counter, tree in zip(ts.count_topologies(), ts.trees()):
297+
embedded_topologies = topology_counter[sample_populations]
298+
weight = tree.span / ts.sequence_length
299+
for rank, count in embedded_topologies.items():
300+
topology_span[rank] += count * weight
301+
total += count
302+
print(f"Counted {total} embedded topologies in {datetime.now() - start} seconds")
303+
```
304+
305+
```{code-cell}
306+
:tags: [hide-input]
307+
ntips = len(sample_populations)
308+
styles = ".sample text.lab {baseline-shift: super; font-size: 0.7em;}"
309+
node_labels = {}
310+
311+
for p in range(ntips):
312+
name = ts.population(sample_populations[p]).metadata["id"]
313+
node_labels[p] = names[name]
314+
styles += f".n{p}>.sym {{fill: {colours[name]} }}"
315+
316+
total = sum(topology_span.values())
317+
for rank, weight in topology_span.items():
318+
label = f"{weight/total *100:.1f}% of genome"
319+
embedded_tree = tskit.Tree.unrank(ntips, rank)
320+
display(embedded_tree.draw_svg(size=(160, 150), style="".join(styles), node_labels=node_labels, x_label=label))
321+
```
322+
323+
For an example with real data, see {ref}`sec_popgen_topological`
324+
in the {ref}`sec_intro_popgen` tutorial.

incremental_algorithms.md

Lines changed: 32 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,32 @@
1+
---
2+
jupytext:
3+
text_representation:
4+
extension: .md
5+
format_name: myst
6+
format_version: 0.12
7+
jupytext_version: 1.9.1
8+
kernelspec:
9+
display_name: Python 3
10+
language: python
11+
name: python3
12+
---
13+
14+
```{currentmodule} tskit
15+
```
16+
17+
(sec_incremental)=
18+
# _Incremental algorithms_
19+
20+
Much of the [efficiency](sec_what_is_analysis)
21+
of the tskit approach comes from the use of incremental algorithms.
22+
By considering only the difference between adjacent trees,
23+
incremental algorithms avoid having to perform the same
24+
calculation multiple times on different trees.
25+
26+
This tutorial will explain the philosophy behind incremental algorithms,
27+
and provide examples of how to create your own (e.g. using the
28+
{meth}`TreeSequence.edge_diffs` method).
29+
30+
:::{todo}
31+
Create content. See [issue 233](https://github.com/tskit-dev/tutorials/issues/233)
32+
:::

popgen.md

Lines changed: 7 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -316,8 +316,11 @@ could also have been used to produce the plot above. In this case, the advantage
316316
using tree sequences is simply that they allow these sorts of analysis to
317317
{ref}`scale<plot_incremental_calculation>` to datasets of millions of whole genomes.
318318

319+
(sec_popgen_topological)=
319320

320-
However, because this inferred tree sequence stores (an estimate of) the underlying
321+
### Topological analysis
322+
323+
As this inferred tree sequence stores (an estimate of) the underlying
321324
genealogy, we can also derive statistics based on genealogical relationships. For
322325
example, this tree sequence also contains a sample genome based on an ancient
323326
genome, a [Denisovan](https://en.wikipedia.org/wiki/Denisovan) individual. We can
@@ -329,6 +332,9 @@ Show an example of looking at topological relationships between the Denisovan an
329332
various East Asian groups, using the {ref}`sec_counting_topologies` functionality.
330333
:::
331334

335+
See {ref}`sec_counting_topologies` for an introduction to topological methods in
336+
`tskit`.
337+
332338
## Further information
333339

334340
This brief introduction is meant as a simple taster. Many other efficient population

0 commit comments

Comments
 (0)