Commit 70f1380

Minor edits to terminology_and_concepts.md
1 parent f6dcc9b commit 70f1380

1 file changed

+26
-32
lines changed

terminology_and_concepts.md

Lines changed: 26 additions & 32 deletions
Original file line number | Diff line number | Diff line change
@@ -48,8 +48,6 @@ def basics():
4848
)
4949
tables.mutations.time = np.full_like(tables.mutations.time, tskit.UNKNOWN_TIME)
5050
tables.tree_sequence().dump("data/basics.trees")
51-
52-
5351
5452
def create_notebook_data():
5553
basics()
@@ -71,18 +69,18 @@ concepts behind {program}`tskit`, the tree sequence toolkit.
7169

7270
::::{margin}
7371
:::{note}
74-
See {ref}`sec_intro_downloading_datafiles` to run this tutorial on your own computer
72+
See {ref}`sec_intro_downloading_datafiles` to run this tutorial on your own computer.
7573
:::
7674
::::
7775

7876
A tree sequence is a data structure which describes a set of correlated
7977
evolutionary trees, together with some associated data that specifies, for example,
80-
the location of mutations in the tree sequence. More technically, a tree sequence
81-
stores a biological structure known as an "Ancestral Recombination Graph", or ARG.
78+
the location of mutations in the genome. More technically, a tree sequence
79+
stores a population genetics object known as an Ancestral Recombination Graph (ARG).
8280

83-
Below are the most important {ref}`terms and concepts <tskit:sec_data_model_definitions>`
84-
that you'll encounter in these tutorials, but first we'll {func}`~tskit.load` a tree
85-
sequence from a `.trees` file using the
81+
Below are the most important terms and concepts that you'll encounter in these tutorials.
82+
A concise glossary of these terms and concepts is available at {ref}`here <tskit:sec_data_model_definitions>`.
83+
But first we'll {func}`~tskit.load` a tree sequence from a `.trees` file using the
8684
{ref}`tskit:sec_python_api` (which will be used in the rest of this tutorial):
8785

8886
```{code-cell} ipython3
@@ -97,7 +95,7 @@ ts = tskit.load("data/basics.trees")
9795
:::{note}
9896
{ref}`Workarounds<msprime:sec_ancestry_multiple_chromosomes>` exist
9997
to represent a multi-chromosome genome as a tree
100-
sequence, but are not covered here
98+
sequence, but are not covered here.
10199
:::
102100
::::
103101

@@ -126,13 +124,12 @@ ts.draw_svg(
126124
)
127125
```
128126

129-
Each tree records the lines of descent along which a piece of DNA has been
130-
inherited (ignore for the moment the red symbols, which represent a mutation).
127+
Each tree records the lines of descent along which a piece of DNA has been inherited.
131128
For example, the first tree tells us that DNA from ancestral genome 7 duplicated
132129
to produce two lineages, which ended up in genomes 1 and 4, both of which exist in the
133130
current population. In fact, since this pattern is seen in all trees, these particular
134131
lines of inheritance were taken by all the DNA in this 1000 base pair genome.
135-
132+
The red symbol is a mutation, which we will describe later.
136133

137134
(sec_terminology_nodes)=
138135

@@ -148,7 +145,6 @@ an *internal node*, representing an ancestor in which a single DNA
148145
sequence was duplicated (in forwards-time terminology) or in which multiple sequences
149146
coalesced (in backwards-time terminology).
150147

151-
152148
(sec_terminology_nodes_samples)=
153149

154150
#### Sample nodes
@@ -163,7 +159,6 @@ labelled $0..5$, and also 6 non-sample nodes, labelled $6..11$, in the tree sequ
163159
print("There are", ts.num_nodes, "nodes, of which", ts.num_samples, "are sample nodes")
164160
```
165161

166-
167162
(sec_terminology_edges)=
168163

169164
### Edges
@@ -175,10 +170,9 @@ three trees in the example above has a branch from node 7 to node 1, but those t
175170
branches represent just a single edge.
176171

177172
Each edge is associated with a parent node ID and a child node ID. The time of the parent
178-
node must be
179-
strictly greater than the time of the child node, and the difference in these times is
180-
sometimes referred to as the "length" of the edge. Since trees in a tree sequence are
181-
usually taken to represent marginal trees along a genome, as well as the time dimension
173+
node must be strictly greater than the time of the child node, and the difference in these times
174+
is sometimes referred to as the "length" of the edge. Since trees in a tree sequence are
175+
usually taken to represent local trees along a genome, as well as the time dimension
182176
each edge also has a genomic _span_, defined by a *left* and a *right* position
183177
along the genome. There are 15 edges in the tree sequence above. Here's an example of
184178
one of them:
@@ -240,9 +234,6 @@ children_of_7 = first_tree.children(7)
240234
print("Node 7's parent is", parent_of_7, "and childen are", children_of_7, "in the first tree")
241235
```
242236

243-
244-
245-
246237
(sec_terminology_individuals_and_populations)=
247238

248239
### Individuals and populations
@@ -332,6 +323,8 @@ homozygous for "T", Bob is homozygous for "G", and Cat is heterozygous "T/G".
332323
In other words the ancestral state and the details of any mutations at that site,
333324
when coupled with the tree topology at the site {attr}`~Site.position`, is sufficient to
334325
define the allelic state possessed by each sample.
326+
See description for {attr}`~Mutation.parent` on how tskit handles multiple mutations along
327+
a path in a tree.
335328

336329
Note that even though the genome is 1000 base pairs long, the tree sequence only contains
337330
a single site, because we usually only bother defining *variable* sites in a tree
@@ -340,7 +333,6 @@ that genomic location). It is perfectly possible to have a site with no mutation
340333
(or silent mutations) --- i.e. a "monomorphic" site --- but such sites are not normally
341334
used in further analysis.
342335

343-
344336
(sec_terminology_provenance)=
345337

346338
### Provenance
@@ -354,7 +346,6 @@ call to msprime that produced it, and the second the call to
354346
provenance entries are sufficient to exactly recreate the tree sequence, but this
355347
is not always possible.
356348

357-
358349
(sec_concepts)=
359350

360351
## Concepts
@@ -385,17 +376,18 @@ with 3 or more children in a particular tree (these are known as *polytomies*).
385376
### Tree changes, ancestral recombinations, and SPRs
386377

387378
The process of recombination usually results in trees along a genome where adjacent
388-
trees differ by only a few "tree edit" or SPR (subtree-prune-and-regraft) operations.
379+
trees differ by only a few "tree edit" or subtree-prune-and-regraft (SPR) operations.
389380
The result is a tree sequence in which very few edges
390381
{ref}`change from tree to tree<fig_what_is_edge_diffs>`.
391382
This is the underlying reason that `tskit` is so
392383
efficient, and is well illustrated in the example tree sequence above.
393384

394385
In this (simulated) tree sequence, each tree differs from the next by a single SPR.
395-
The subtree defined by node 7 in the first tree has been pruned and regrafted onto the
396-
branch between 0 and 10, to create the second tree. The second and third trees have the
397-
same topology, but differ because their ultimate coalesence happened in a different
398-
ancestor (easy to spot in a simulation, but hard to detect in real data). This is also
386+
The subtree defined by node 7 in the first tree has been pruned (away from node 11) and
387+
regrafted onto the branch between 0 and 9, to create the second tree.
388+
The second and third trees have the same topology,
389+
but differ because their ultimate coalesence happened in a different ancestor
390+
(easy to spot in a simulation, but hard to detect in real data). This is also
399391
caused by a single SPR: looking at the second tree, either the subtree below node 8 or
400392
the subtree below node 9 must have been pruned and regrafted higher up on the same
401393
lineage to create the third tree. Because this is a fully {ref}`simplified<sec_simplification>`
@@ -409,7 +401,7 @@ positions (an "infinite sites" model of breakpoints), then the number of trees i
409401
sequence equals the number of ancestral recombination events plus one. If recombinations
410402
can occur at the same physical position (e.g. if the genome is treated as a set of
411403
discrete integer positions, as in the simulation that created this tree sequence) then
412-
moving from one tree to the next in a tree sequence might require multiple SPRs if
404+
moving from one tree to the next in a tree sequence might require multiple SPRs if
413405
there are multiple, overlaid ancestral recombination events.
414406

415407
(sec_concepts_args)=
@@ -418,11 +410,11 @@ there are multiple, overlaid ancestral recombination events.
418410

419411
::::{margin}
420412
:::{note}
421-
There is a subtle distinction between common ancestry and coalescence. In particular, all coalescent nodes are common ancestor events, but not all common ancestor events in an ARG result in coalescence in a local tree.
413+
There is a subtle distinction between common ancestry and coalescence. In particular, all coalescent nodes are common ancestor events, but not all common ancestor events in an ARG result in coalescence in all local trees.
422414
:::
423415
::::
424416

425-
The term "Ancestral Recombination Graph", or ARG, is commonly used to describe a genetic
417+
The term Ancestral Recombination Graph (ARG), is commonly used to describe a genetic
426418
genealogy. In particular, many (but not all) authors use it to mean a genetic
427419
genealogy in which details of the position and potentially the timing of all
428420
recombination and common ancestor events are explictly stored. For clarity
@@ -438,7 +430,7 @@ which omits these extra nodes. This is for two main reasons:
438430
2. The number of recombination and non-coalescing common ancestor events in the genealogy
439431
quickly grows to dominate the total number of nodes in the tree sequence,
440432
without actually contributing to the mutations inherited by the samples.
441-
In other words, these nodes are redundant to the storing of genome data.
433+
In other words, these nodes are redundant to the storing of genomic data.
442434

443435
Therefore, compared to a full ARG, you can think of a simplified tree sequence as
444436
storing the trees *created by* recombination events, rather than attempting to record the
@@ -450,6 +442,8 @@ way to put it:
450442
> whereas a [simplified] tree sequence encodes the outcome of those events"
451443
> ([Kelleher _et al._, 2019](https://doi.org/10.1534/genetics.120.303253))
452444
445+
[Wong _et al._, 2024](https://doi.org/10.1093/genetics/iyae100)
446+
review this topic in detail.
453447

454448
### Tables
455449
