@@ -20,8 +20,8 @@ kernelspec:

 At its heart, a `tskit` {ref}`tree sequence<sec_what_is>` consists of a list of
 {ref}`sec_terminology_nodes`, and a list of {ref}`sec_terminology_edges` that connect
-those nodes. Therefore a succinct tree sequence is equivalent to a
-[mathematical graph](https://en.wikipedia.org/wiki/Graph_(discrete_mathematics)),
+parent to child nodes. Therefore a succinct tree sequence is equivalent to a
+[directed graph](https://en.wikipedia.org/wiki/Directed_graph),
 which is additionally annotated with genomic positions such that at each
 position, a path through the edges exists which defines a tree. This graph
 interpretation of a tree sequence is tightly connected to the concept of
@@ -147,14 +147,17 @@ ts_arg.draw_svg(
 )
 ```

-The number of children a node has in a local tree can be termed the
-"local arity" of a node. It is clear from the plot above that both red and blue nodes
-can have a local arity of one. The act of `simplification` can
-transform a tree sequence so that all nodes have a local arity of
-2 or more, which is [more efficient](sec_args_disadvantages).
-However, this loses information about the timings
-and topological operations associated with recombination
-events and some common ancestor events. This information is useful for
+The number of children descending from a node in a local tree can be termed the
+"local arity" of that node. It is clear from the plot above that red nodes always
+have a local arity of 1, and blue nodes sometimes do. This may seem an unusual
+state of affairs: tree representations often focus on branch-points, and ignore nodes
+with a single child. Indeed, it is possible to [simplify](sec_args_simplification) the
+ARG above, resulting in a graph whose local trees only contain branch points or tips
+(i.e. local arity is never 1). Such a graph is [more compact](sec_args_disadvantages)
+than the full ARG, but it omits some information about the timings and
+topological operations associated with recombination
+events and some common ancestor events. This information, as captured by the local
+unary nodes, is useful for

 1. Retaining precise information about the time and lineages involved in recombination.
 This is required e.g. to ensure we can always work out the tree editing (or
@@ -214,6 +217,8 @@ represented, in which both parents at a recombination event trace directly back
 same common ancestor.
 :::

+(sec_args_simplification)=
+
 ## Simplification

 If we fully {ref}`simplify<sec_simplification>` the tree above, all remaining nodes
@@ -302,13 +307,15 @@ structures for simulation or inference is therefore infeasible.

 ## ARG formats and `tskit`

-In classical ARGs, nodes often represent events (specifically, _common ancestor_,
-_recombination_, and _sampling_ events), with the genomic regions of inheritance
-encoded by storing a specific breakpoint location on each recombination node.
-In contrast, nodes in a `tskit` ARG correspond to _genomes_, and inherited regions
-are defined by intervals stored on *edges* (via the {attr}`~Edge.left` and
-{attr}`~Edge.right` properties), rather than on nodes. Here, for example, is the
-edge table from our ARG:
+It is worth noting a subtle and somewhat philosophical
+difference between some classical ARG formulations, and the ARG formulation
+used in `tskit`. Classically, nodes in an ARG are taken to represent _events_
+(specifically, "common ancestor", "recombination", and "sampling" events),
+and genomic regions of inheritance are encoded by storing a specific breakpoint location on
+each recombination node. In contrast, [nodes](tskit:sec_data_model_definitions_node) in a `tskit`
+ARG correspond to _genomes_. More crucially, inherited regions are defined by intervals
+stored on *edges* (via the {attr}`~Edge.left` and {attr}`~Edge.right` properties),
+rather than on nodes. Here, for example, is the edge table from our ARG:

 ```{code-cell}
 ts_arg.tables.edges
@@ -325,7 +332,7 @@ simplification possible, and means `tskit` can encode ancestry without having
 to pin down exactly when specific ancestral events took place.


-## Working with the tree sequence graph
+## Working with ARGs in `tskit`

 All tree sequences, including, but not limited to full ARGs, can be treated as
 directed (acyclic) graphs. Although many tree sequence operations operate from left to