@@ -18,20 +18,21 @@ kernelspec:
1818# Getting started with {program}` tskit `
1919
2020You've run some simulations or inference methods, and you now have a
21- {class}` TreeSequence ` object; what now? This tutorial is aimed
21+ {class}` TreeSequence ` object; what now? This tutorial is aimed for
2222users who are new to {program}` tskit ` and would like to get some
2323basic tasks completed. We'll look at five fundamental things you might
24- need to do:
24+ want to do:
2525{ref}` process trees<sec_processing_trees> ` ,
26- {ref}` sites & mutations<sec_processing_sites_and_mutations> ` , and
27- {ref}` genotypes<sec_processing_genotypes> ` ,
26+ {ref}` process sites & mutations<sec_processing_sites_and_mutations>` ,
27+ {ref}` extract genotypes<sec_processing_genotypes>` ,
2828{ref}` compute statistics<sec_tskit_getting_started_compute_statistics> ` , and
2929{ref}` save or export data<sec_tskit_getting_started_exporting_data> ` .
3030Throughout, we'll also provide pointers to where you can learn more.
3131
32+
3233:::{note}
3334The examples in this
34- tutorial are all written using the {ref}` sec_python_api ` , but it's also possible to
35+ tutorial are all written using the {ref}` tskit: sec_python_api` , but it's also possible to
3536{ref}` use R <sec_tskit_r> ` , or access the API in other languages, notably
3637{ref}` C<sec_c_api> ` and [ Rust] ( https://github.com/tskit-dev/tskit-rust ) .
3738:::
@@ -43,7 +44,6 @@ twenty diploid individuals. To make it a bit more interesting, we'll simulate th
4344of a {ref}` selective sweep<msprime:sec_ancestry_models_selective_sweeps> ` in the middle
4445of the chromosome, then throw some neutral mutations onto the resulting tree sequence.
4546
46-
4747``` {code-cell} ipython3
4848import msprime
4949
@@ -61,7 +61,7 @@ ts = msprime.sim_ancestry(
6161 recombination_rate=1e-8,
6262 random_seed=1234, # only needed for repeatabilty
6363 )
64- # Optionally add finite-site mutations to the ts using the Jukes & Cantor model, creating SNPs
64+ # Optionally add finite-site mutations to the ts using the default Jukes & Cantor model, creating SNPs
6565ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=4321)
6666ts
6767```
@@ -250,13 +250,16 @@ for nmuts, count in enumerate(np.bincount(num_muts)):
250250
251251(sec_processing_genotypes)=
252252
253- ## Processing genotypes
253+ ## Extracting genotypes
254254
255255At each site, the sample nodes will have a particular allelic state (or be flagged as
256256{ref}` tskit:sec_data_model_missing_data ` ). The
257- {meth}` TreeSequence.variants ` method gives access to the
258- full variation data. For efficiency, the {attr}` ~Variant.genotypes `
259- at a site are returned as a [ numpy] ( https://numpy.org ) array of integers:
257+ {meth}` TreeSequence.variants ` method gives access to
258+ full variation data - possible allelic states and node genotypes at each site.
259+ Since nodes are by definition haploid, their genotype at a site is their allelic state
260+ at the site. For efficiency, its attribute {attr}` ~Variant.genotypes `
261+ at a site returns a [ numpy] ( https://numpy.org ) array of integers
262+ representing allelic states:
260263
261264``` {code-cell} ipython3
262265import numpy as np
@@ -275,20 +278,16 @@ Tree sequences are optimised to look at all samples at one site, then all sample
275278adjacent site, and so on along the genome. It is much less efficient look at all the
276279sites for a single sample, then all the sites for the next sample, etc. In other words,
277280you should generally iterate over sites, not samples. Nevertheless, all the alleles for
278- a single sample can be obtained via the
279- {meth}` TreeSequence.haplotypes ` method.
281+ a single sample can be obtained via the {meth}` TreeSequence.haplotypes ` method.
280282:::
281283
282-
283284To find the actual allelic states at a site, you can refer to the
284285{attr}` ~Variant.alleles ` provided for each {class}` Variant ` :
285286the genotype value is an index into this list. Here's one way to print them out; for
286287clarity this example also prints out the IDs of both the sample nodes (i.e. the genomes)
287288and the diploid {ref}` individuals <sec_nodes_or_individuals> ` in which each sample
288289node resides.
289290
290-
291-
292291```` {code-cell} ipython3
293292samp_ids = ts.samples()
294293print(" ID of diploid individual: ", " ".join([f"{ts.node(s).individual:3}" for s in samp_ids]))
@@ -309,7 +308,6 @@ such as indels, leading to allelic states which need not be one of these 4 lette
309308even be a single letter.
310309:::
311310
312-
313311(sec_tskit_getting_started_compute_statistics)=
314312
315313## Computing statistics
@@ -353,17 +351,17 @@ is the spectrum for a section of the tree sequence between 5 and 5.5Mb, which we
353351created by deleting the regions outside that interval using
354352{meth}` TreeSequence.keep_intervals ` . Unsurprisingly,
355353as we noted when looking at the trees, there's a far higher proportion of singletons in
356- the region of the sweep.
354+ the region of the sweep compared to the entire genome .
357355
358356(sec_tskit_getting_started_compute_statistics_windowing)=
359357
360358### Windowing
361359
362360It is often useful to see how statistics vary in different genomic regions. This is done
363- by calculating them in {ref}` tskit:sec_stats_windows ` along the genome. For this,
361+ by calculating them in {ref}` windows< tskit:sec_stats_windows> ` along the genome. For this,
364362let's look at a single statistic, the genetic {meth}` ~TreeSequence.diversity ` (π). As a
365- site statistic this measures the average number of genetic differences between two
366- randomly chosen samples, whereas as a branch length statistic it measures the average
363+ site statistic, it measures the average number of genetic differences between two
364+ randomly chosen samples, whereas as a branch statistic, it measures the average
367365branch length between them. We'll plot how the value of π changes using 10kb windows,
368366plotting the resulting diversity between positions 4 and 6 Mb:
369367
@@ -380,15 +378,15 @@ ax1.set_yscale("log")
380378ax1.set_ylim(1e-6, 1e-3)
381379ax2.stairs(ts.diversity(windows=windows, mode="branch"), windows/1_000, baseline=None)
382380ax2.set_xlabel("Genome position (kb)")
383- ax2.set_title("Branch-length- based calculation")
381+ ax2.set_title("Branch-based calculation")
384382ax2.set_xlim(4e3, 6e3)
385383ax2.set_yscale("log")
386384plt.show()
387385```
388386
389387There's a clear drop in diversity in the region of the selective sweep. And as expected,
390- the statistic based on branch- lengths gives a less noisy signal.
391-
388+ the statistic based on branch lengths gives a less noisy signal than the statistic based
389+ on randomly occuring mutations at sites.
392390
393391(sec_tskit_getting_started_exporting_data)=
394392
@@ -434,17 +432,16 @@ print(small_ts.as_nexus(precision=3, include_alignments=False))
434432
435433### VCF
436434
437- The standard way of interchanging genetic variation data is the Variant Call Format,
435+ The standard way of interchanging genetic variation data is the Variant Call Format (VCF) ,
438436for which tskit has basic support:
439437
440438``` {code-cell} ipython3
441439import sys
442440small_ts.write_vcf(sys.stdout)
443441```
444442
445- The write_vcf method takes a file object as a parameter; to get it to write out to the
446- notebook here we ask it to write to stdout.
447-
443+ The {meth}` TreeSequence.write_vcf ` method takes a file object as a parameter;
444+ to get it to write out to the notebook here we ask it to write to stdout.
448445
449446(sec_tskit_getting_started_exporting_data_allel)=
450447
@@ -466,8 +463,8 @@ print(h.n_variants, h.n_haplotypes)
466463h
467464```
468465
469- Sckit. allel has a wide-ranging and efficient suite of tools for working with genotype
470- data, so should provide anything that's needed . For example, it gives us an
466+ {program} ` scikit- allel` has a wide-ranging and efficient suite of tools for working
467+ with genotype data for most of users' needs . For example, it gives us an
471468another way to compute the pairwise diversity statistic (that we calculated
472469{ref}` above<sec_tskit_getting_started_compute_statistics_windowing> `
473470using the native {meth}` TreeSequence.diversity ` method):
@@ -477,6 +474,23 @@ ac = h.count_alleles()
477474allel.mean_pairwise_difference(ac)
478475```
479476
477+ Comparative call with tskit is:
478+
479+ ``` {code-cell} ipython3
480+ # Per site (not span-normalised to match scikit-allel)
481+ small_ts.diversity(windows="sites", span_normalise=False)
482+ # Per site (span-normalised to account for variable span of windows defined by sites)
483+ small_ts.diversity(windows="sites")
484+
485+ # Per whole genome
486+ small_ts.diversity(span_normalise=False)
487+ # ... equivalent call to the above
488+ sum(small_ts.diversity(windows="sites", span_normalise=False))
489+ small_ts.diversity()
490+ ```
491+
492+ See {ref}` windows<tskit:sec_stats_windows> ` and {ref}` windows<tskit:sec_stats_span_normalise> `
493+ for details about the `` span_normalise=False `` .
480494
481495(sec_tskit_getting_started_key_points)=
482496
@@ -496,15 +510,15 @@ in rough order of importance:
496510 * {ref}` sec_terminology_nodes ` (i.e. genomes) can belong to
497511 {ref}` individuals<sec_terminology_individuals_and_populations> ` . For example,
498512 sampling a diploid individual results in an {class}` Individual ` object which
499- possesses two distinct {ref}` sample nodes<sec_terminology_nodes_samples> ` .
513+ links to two distinct {ref}` sample nodes<sec_terminology_nodes_samples> ` .
500514* Key tree sequence methods
501515 * {meth}` ~TreeSequence.samples() ` returns an array of node IDs specifying the
502516 nodes that are marked as samples
503517 * {meth}` ~TreeSequence.node ` returns the node object for a given integer node ID
504518 * {meth}` ~TreeSequence.trees ` iterates over all the trees
505519 * {meth}` ~TreeSequence.sites ` iterates over all the sites
506- * {meth}` ~TreeSequence.variants ` iterates over all the sites with their genotypes
507- and alleles
520+ * {meth}` ~TreeSequence.variants ` iterates over all the sites with their list of
521+ unique alleles and sample/node? genotypes
508522 * {meth}` ~TreeSequence.simplify() ` reduces the number of sample nodes in the tree
509523 sequence to a specified subset
510524 * {meth}` ~TreeSequence.keep_intervals() ` (or its complement,
@@ -519,4 +533,3 @@ in rough order of importance:
519533 sequence, for example {meth}` ~TreeSequence.allele_frequency_spectrum ` ,
520534 {meth}` ~TreeSequence.diversity ` , and {meth}` ~TreeSequence.Fst ` ; these can
521535 also be calculated in windows along the genome.
522-
0 commit comments