This folder contains gtdb*dump.tar.gz files from the Genome Taxonomy
Database, ready for use with Ete (see also Ete's documentation for more
details).
To create the gtdb*dump.tar.gz files, we first get the archaeal and
bacterial taxonomies from their releases (for example, for the latest
release, ar53_taxonomy and bac120_taxonomy).
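To get a feel for what those taxonomy files contain, here is a minimal sketch that splits one line into its ranks. It assumes the usual layout of these tables (an accession, a tab, then a semicolon-separated lineage with `d__`, `p__`, ... prefixes). The function name `parse_taxonomy_line` and the example line are made up for illustration and are not part of any GTDB or Ete tool.

```python
# Sketch only: split one line of a GTDB taxonomy table into rank -> name.
# Assumption: each line looks like "<accession>\t<d__...;p__...;...;s__...>".
RANK_PREFIXES = {
    "d": "domain", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species",
}


def parse_taxonomy_line(line):
    """Return (accession, {rank: name}) for one tab-separated line."""
    accession, lineage = line.rstrip("\n").split("\t", 1)
    ranks = {}
    for taxon in lineage.split(";"):
        # "p__Examplota" -> prefix "p", name "Examplota"
        prefix, _, name = taxon.partition("__")
        ranks[RANK_PREFIXES[prefix]] = name
    return accession, ranks


# Made-up example line (not copied from a release file).
example = (
    "RS_GCF_000000000.1\t"
    "d__Bacteria;p__Examplota;c__Examplia;o__Examplales;"
    "f__Examplaceae;g__Examplus;s__Examplus demo"
)
accession, ranks = parse_taxonomy_line(example)
print(accession, ranks["domain"], ranks["species"])
```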
Then, we use Nick Youngblut's gtdb_to_taxdump (which can also be found in tools -> third party) to convert the GTDB taxonomy to NCBI taxdump format. To do so, we run:

gtdb_to_taxdump.py ar53_taxonomy.tsv.gz bac120_taxonomy.tsv.gz

and then we put the 4 resulting .dmp files into a tar.gz:

tar -czf gtdb_latest_dump.tar.gz *.dmp

Let's download release 226 as an example:

wget https://github.com/etetoolkit/ete-data/raw/main/gtdb_taxonomy/gtdb226/gtdb226dump.tar.gz

(Note that we download the raw dump file, .../ete-data/raw/main/...,
and not .../ete-data/blob/main/....)
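If the wrong link is used by mistake, the saved file is likely to be a web page rather than an archive. As a quick sanity check, the sketch below uses Python's standard `tarfile` module to confirm that the download is a tar archive and to list its .dmp members. The helper name `list_dump_members` is hypothetical and not part of Ete.

```python
import os
import tarfile


def list_dump_members(path):
    """Return the sorted .dmp member names of the tar archive at `path`.

    Raises ValueError if the file is not a tar archive, which is what
    one would expect if a web page was saved instead of the raw file.
    """
    if not tarfile.is_tarfile(path):
        raise ValueError(f"{path} is not a tar archive")
    # "r:*" lets tarfile detect the compression (gzip here) by itself.
    with tarfile.open(path, "r:*") as archive:
        return sorted(n for n in archive.getnames() if n.endswith(".dmp"))


# Only runs the check if the file from the wget step is present.
if os.path.exists("gtdb226dump.tar.gz"):
    print(list_dump_members("gtdb226dump.tar.gz"))
```

The check reads only the archive index, so it stays quick even for large dumps.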
We can then run the following Python code to use it in Ete:
from ete4 import GTDBTaxa
gtdb = GTDBTaxa()
gtdb.update_taxonomy_database('gtdb226dump.tar.gz')