@@ -16,13 +16,309 @@ kernelspec:
16
16
17
17
(sec_counting_topologies)=
18
18
19
- # _ Counting topologies_
20
- % remove underscores in title when tutorial is complete or near-complete
19
+ # Counting topologies
21
20
21
+ ** Yan Wong**
22
22
23
- :::{todo}
24
- Add examples of using {meth}` TreeSequence.count_topologies ` and introduce {ref}` sec_combinatorics `
25
- in a gentle way.
23
+ This tutorial is intended to be a gentle introduction to the combinatorial
24
+ treatment of tree topologies in ` tskit ` . For a more formal introduction,
25
+ see the {ref}` sec_combinatorics ` section of the
26
+ [ official ` tskit ` documentation] ( tskit:sec_introduction ) .
26
27
27
- See https://github.com/tskit-dev/tutorials/issues/93
28
- :::
28
+ The * topology* of a single tree is the term used to describe the branching pattern,
29
+ regardless of the lengths of the branches. For example, both trees below have the
30
+ same topology, although the branch lengths differ:
31
+
32
+ ``` {code-cell}
33
+ import tskit
34
+ node_labels = {0: "a", 1: "b", 2: "c"} # avoid confusion by using letters to label tips
35
+ tree = tskit.Tree.generate_comb(3)
36
+ display(tree.draw_svg(node_labels=node_labels, y_axis=True))
37
+
38
+ deep_tree = tskit.Tree.generate_comb(10).tree_sequence.simplify([0, 1, 2]).first()
39
+ display(deep_tree.draw_svg(node_labels=node_labels, y_axis=True))
40
+ ```
41
+
42
+ :::{note}
43
+ The treatment of topologies in ` tskit ` is restricted to trees with a single defined root,
44
+ without nodes with a single child (i.e. trees must consist of nodes that are either leaves,
45
+ or internal nodes with two or more children). For convenience in the examples
46
+ below, trees are drawn with the tips flagged as samples, although whether a node is a sample or
47
+ not does not change the topology of the tree.
48
+ :::
49
+
50
+ ## Tree labellings and shapes
51
+
52
+ The topology of a tree also takes into account the labelling of tips, so that
53
+ the trees below, although they have the same * shape* , count as three
54
+ different topologies:
55
+
56
+ ``` {code-cell}
57
+ :tags: [hide-input]
58
+ from string import ascii_lowercase
59
+ from IPython.display import SVG
60
+
61
+ def str_none(s, prefix=None):
62
+ if s is not None:
63
+ if prefix is None:
64
+ return str(s)
65
+ else:
66
+ return prefix + " = " + str(s)
67
+ return None
68
+
69
+ def draw_svg_trees(trees, node_labels={}, x_lab_attr=None, width=100, height=150, space=10):
70
+ w = width + space
71
+ h = height + space
72
+ trees = list(trees)
73
+ s = f'<svg height="{h}" width="{w * len(trees)}" xmlns="http://www.w3.org/2000/svg">'
74
+ s += f'<style>.x-axis {{transform: translateY({space}px)}}</style>'
75
+ for i, tree in enumerate(trees):
76
+ s += tree.draw_svg(
77
+ size=(width, height),
78
+ canvas_size=(w, h),
79
+ root_svg_attributes={"x": i * w},
80
+ node_labels=node_labels,
81
+ x_label=str_none(getattr(tree.rank(), x_lab_attr or "", None), x_lab_attr)
82
+ )
83
+ s += '</svg>'
84
+ return SVG(s)
85
+
86
+ draw_svg_trees(tskit.all_tree_labellings(tree), node_labels={u: ascii_lowercase[u] for u in tree.samples()})
87
+ ```
88
+
89
+ These are, in fact, the only possible three labellings for a three-tip tree of that shape.
90
+ There is only one other possible shape for a three-tip tree, and for this shape,
91
+ all labelling orders are equivalent (in other words, there is only one
92
+ possible labelling):
93
+
94
+ ``` {code-cell}
95
+ :tags: [hide-input]
96
+ tskit.Tree.generate_star(3).draw_svg(node_labels={})
97
+ ```
98
+
99
+ A 3-tip tree therefore has only four possible topologies.
100
+ These can be generated with the {func}` ~tskit.all_trees ` function.
101
+
102
+ ``` {code-cell}
103
+ generated_trees = tskit.all_trees(3)
104
+ print("For a three-tip tree there are", len(list(generated_trees)), "labelled topologies.")
105
+ ```
106
+
107
+ Here they are, plotted out with their shapes enumerated from zero:
108
+
109
+ ``` {code-cell}
110
+ :tags: [hide-input]
111
+ draw_svg_trees(
112
+ tskit.all_trees(3),
113
+ node_labels={u: ascii_lowercase[u] for u in tree.samples()},
114
+ x_lab_attr="shape"
115
+ )
116
+ ```
117
+
118
+ ### Enumerating shapes and labellings
119
+
120
+ For a tree with four tips, more topologies and shapes are possible. As before, we can generate the
121
+ topologies using {func}` ~tskit.all_trees ` . Alternatively, if we only want the (unlabelled) shapes,
122
+ we can use the {func}` ~tskit.all_tree_shapes ` function:
123
+
124
+ ``` {code-cell}
125
+ print("For a four-tip tree there are", len(list(tskit.all_trees(4))), "labelled topologies.")
126
+
127
+ generated_trees = tskit.all_tree_shapes(4)
128
+ print("These can be categorised into", len(list(generated_trees)), "shapes.")
129
+ ```
130
+
131
+ Again, we can give each shape a number or * index* , starting from zero:
132
+
133
+ ``` {code-cell}
134
+ :tags: [hide-input]
135
+ draw_svg_trees(tskit.all_tree_shapes(4), x_lab_attr="shape")
136
+ ```
137
+
138
+ Each of these shapes will have a separate number of possible labellings, and trees with
139
+ these labellings can be created using {func}` ~tskit.all_tree_labellings ` :
140
+
141
+ ``` {code-cell}
142
+ for shape_index, tree in enumerate(tskit.all_tree_shapes(4)):
143
+ labellings = tskit.all_tree_labellings(tree)
144
+ num_labellings = len(list(labellings))
145
+ print(
146
+ f"Tree shape {shape_index} for a four-tip tree has "
147
+ f"{num_labellings} labelling{'' if num_labellings==1 else 's'}."
148
+ )
149
+ ```
150
+
151
+ Any tree topology for a tree of $N$ tips can therefore be described by a
152
+ shape index combined with a labelling index. This is known as the
153
+ * rank* of a tree, and it can be obtained using the
154
+ {meth}` Tree.rank ` method. For instance, here is the rank of a simulated tree
155
+ of 10 tips:
156
+
157
+ ``` {code-cell}
158
+ :tags: [hide-input]
159
+ import msprime
160
+ num_tips = 10
161
+ simulated_ts = msprime.sim_ancestry(10, ploidy=1, random_seed=123)
162
+ simulated_tree = simulated_ts.first()
163
+ print("The topology of the simulated tree below can be described as", simulated_tree.rank())
164
+ ascii_node_labels = {u: ascii_lowercase[u] for u in simulated_tree.samples()}
165
+ simulated_tree.draw_svg(node_labels=ascii_node_labels)
166
+ ```
167
+
168
+
169
+ A tree with the same topology (i.e. the same shape and labelling, but ignoring
170
+ the branch lengths) can be generated using the {meth}` Tree.unrank ` method, by
171
+ specifying the number of tips and the appropriate ` (shape, labelling) ` tuple:
172
+
173
+ ``` {code-cell}
174
+ new_tree = tskit.Tree.unrank(num_tips, (1270, 21580))
175
+ new_tree.draw_svg(node_labels=ascii_node_labels)
176
+ ```
177
+
178
+ Note that this method generates a single tree in a new tree sequence
179
+ whose a default sequence length is 1.0.
180
+
181
+ ## Methods for large trees
182
+
183
+ The number of possible topologies for a tree with $N$ tips
184
+ grows very rapidly with $N$. For instance, with 10 tips, there are
185
+ 282,137,824 possible topologies.
186
+
187
+ For this reason, the {func}` ~tskit.all_trees ` , {func}` ~tskit.all_tree_shapes ` and
188
+ {func}` ~tskit.all_tree_labellings ` methods do not return a list of trees
189
+ but an iterator over the trees. This means it is perfectly possible to start
190
+ iterating over (say) all tree shapes for a tree of 100 leaves, but
191
+ the iterator will not finish before the death of our galaxy.
192
+
193
+ ``` {code-cell}
194
+ for num_trees, tree in enumerate(tskit.all_tree_shapes(100)):
195
+ shape = tree.rank().shape
196
+ b2 = tree.b2_index()
197
+ print(f"A 100-tip tree with shape index {shape} has a b2 balance index of {b2}")
198
+ if num_trees > 5:
199
+ break # better not let this run too long!
200
+ ```
201
+
202
+ For similar combinatorial reasons, the {meth}` Tree.rank ` method can be
203
+ inefficient for large trees. To compare the topology of two trees, you are
204
+ therefore recommended to use e.g. the {meth}` Tree.kc_distance ` method
205
+ rather than comparing ranks directly.
206
+
207
+ ``` {code-cell}
208
+ simulated_tree = simulated_ts.first(sample_lists=True) # kc_distance requires sample lists
209
+ if simulated_ts.first(sample_lists=True).kc_distance(simulated_tree) == 0:
210
+ print("Trees are identical")
211
+ # To compare to the new_tree we need to fix
212
+ # print("The simulated and topology-constructed trees have the same topology")
213
+ ```
214
+
215
+ Despite the combinatorial explosion associated with topologies of
216
+ many-tip trees, it is still possible to efficiently count
217
+ the number of * embedded topologies* in a large tree.
218
+
219
+ ### Embedded topologies
220
+
221
+ An embedded topology is a a topology involving a subset of the tips of a tree.
222
+ If the tips are classified into (say) three groups, red, green, and blue,
223
+ we can efficiently count all the embedded three-tip trees which have
224
+ one tip from each group using the {meth}` Tree.count_topologies ` method.
225
+
226
+ ``` {code-cell}
227
+ :tags: [hide-input]
228
+
229
+ newick_species_tree = "((A:100.0,B:100.0):100.0,C:200.0)"
230
+ demography = msprime.Demography.from_species_tree(newick_species_tree, initial_size=100)
231
+ ts = msprime.sim_ancestry({0: 2, 1: 2, 2: 2}, demography=demography, random_seed=321)
232
+ big_tree = ts.first()
233
+ styles = [
234
+ f".node.sample.p{p.id} > .sym " + "{" + f"fill: {colour}" + "}"
235
+ for colour, p in zip(['red', 'green', 'blue'], big_tree.tree_sequence.populations())
236
+ ]
237
+ big_tree.draw_svg(style="".join(styles), node_labels={}, time_scale="rank", x_label="big_tree")
238
+ ```
239
+
240
+ In this tree, it is fairly clear that the red and green tips cluster together (although not exclusively),
241
+ so that if we took one red, one blue, and one green tip at random, we nearly always see the
242
+ same embedded topology. The {meth}` Tree.count_topologies ` method does this exhaustively:
243
+
244
+ ``` {code-cell}
245
+ # By default `count_topologies` chooses one tip from each population, like setting
246
+ # sample_sets=[ts.samples(p.id) for p in ts.populations() if len(ts.samples(p.id)) > 0]
247
+
248
+ topology_counter = big_tree.count_topologies()
249
+
250
+ colours = ['red', 'green', 'blue']
251
+ styles = [f".n{u}>.sym {{fill: {c} }}" for u, c in enumerate(colours)]
252
+
253
+ embedded_counts = topology_counter[0, 1, 2]
254
+ for embedded_tree in tskit.all_trees(3):
255
+ rank = embedded_tree.rank()
256
+ number_of_instances = embedded_counts[rank]
257
+ label = f"{number_of_instances} instances embedded in big_tree"
258
+ display(embedded_tree.draw_svg(style="".join(styles), node_labels={}, x_label=label))
259
+ ```
260
+
261
+ ## Methods over tree sequences
262
+
263
+ It can be useful to count embedded topologies over an entire tree sequence.
264
+ For instance, we might want to know the number of embedded topologies
265
+ that support Neanderthals as a sister group to europeans versus africans.
266
+ ` Tskit ` provides the efficient {meth}` TreeSequence.count_topologies ` method to
267
+ do this [ incrementally] ( sec_incremental ) , without having to re-count the topologies
268
+ independently in each tree.
269
+
270
+ ``` {code-cell}
271
+ import stdpopsim
272
+ species = stdpopsim.get_species("HomSap")
273
+ model = species.get_demographic_model("OutOfAfrica_3G09")
274
+ contig = species.get_contig("chr1", length_multiplier=0.0002, mutation_rate=model.mutation_rate)
275
+ samples = {"YRI": 1000, "CEU": 1000, "CHB": 1000}
276
+ engine = stdpopsim.get_engine("msprime")
277
+ ts = engine.simulate(model, contig, samples)
278
+ print("Simulated", ts.num_trees, "African+European+Chinese trees, each with", ts.num_samples, "tips")
279
+ ```
280
+
281
+ Although the trees in this tree sequence are very large, counting the embedded topologies is
282
+ quite doable (for speed we are only simulating 0.02% of chromosome 1, but calculating the
283
+ average over an entire chromosome simply takes a little longer)
284
+
285
+ ``` {code-cell}
286
+ from datetime import datetime
287
+ names = {"YRI": "African", "CEU": "European", "CHB": "Chinese"}
288
+ colours = {"YRI": "yellow", "CEU": "green", "CHB": "blue"}
289
+
290
+ population_map = {p.metadata["id"]: p.id for p in ts.populations()}
291
+ sample_populations = [population_map[name] for name in samples.keys()]
292
+ topology_span = {tree.rank(): 0 for tree in tskit.all_trees(len(sample_populations))}
293
+
294
+ start = datetime.now()
295
+ total = 0
296
+ for topology_counter, tree in zip(ts.count_topologies(), ts.trees()):
297
+ embedded_topologies = topology_counter[sample_populations]
298
+ weight = tree.span / ts.sequence_length
299
+ for rank, count in embedded_topologies.items():
300
+ topology_span[rank] += count * weight
301
+ total += count
302
+ print(f"Counted {total} embedded topologies in {datetime.now() - start} seconds")
303
+ ```
304
+
305
+ ``` {code-cell}
306
+ :tags: [hide-input]
307
+ ntips = len(sample_populations)
308
+ styles = ".sample text.lab {baseline-shift: super; font-size: 0.7em;}"
309
+ node_labels = {}
310
+
311
+ for p in range(ntips):
312
+ name = ts.population(sample_populations[p]).metadata["id"]
313
+ node_labels[p] = names[name]
314
+ styles += f".n{p}>.sym {{fill: {colours[name]} }}"
315
+
316
+ total = sum(topology_span.values())
317
+ for rank, weight in topology_span.items():
318
+ label = f"{weight/total *100:.1f}% of genome"
319
+ embedded_tree = tskit.Tree.unrank(ntips, rank)
320
+ display(embedded_tree.draw_svg(size=(160, 150), style="".join(styles), node_labels=node_labels, x_label=label))
321
+ ```
322
+
323
+ For an example with real data, see {ref}` sec_popgen_topological `
324
+ in the {ref}` sec_intro_popgen ` tutorial.
0 commit comments