This repository allows you to:
- Generate xmls and manifests necessary for genome submission
- Link the genomes you want to submit with the samples/runs used to generate them
- Upload bins and MAGs in fasta format to ENA (European Nucleotide Archive) with webin-cli
When you submit genomes to the ENA, you need to register a sample for every genome, containing all the relevant metadata describing the genome and the sample of origin. The genome_uploader acts as the main linker to preserve sample metadata as much as possible. For every genome to register, you need an INSDC run or assembly accession associated with the genome so that the script can inherit its relevant metadata. On top of those metadata, the script adds metadata specified by the user that are specific to the genome, like taxonomy, statistics, or the tools used to generate it. The metadata that ENA requires are described in the checklist for MAGs and for bins, respectively.
The genome_uploader takes as input one tsv (tab-separated values) table in the following format:
| genome_name | genome_path | accessions | assembly_software | binning_software | binning_parameters | stats_generation_software | completeness | contamination | genome_coverage | metagenome | co-assembly | broad_environment | local_environment | environmental_medium | rRNA_presence | NCBI_lineage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ERR4647712_crispatus | path/to/ERR4647712.fa.gz | ERR4647712 | megahit_v1.2.9 | MGnify-genomes-generation-pipeline_v1.0.0 | default | CheckM2_v1.0.1 | 100 | 0.38 | 14.2 | chicken gut metagenome | False | chicken | gut | mucosa | True | d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Lactobacillaceae;g__Lactobacillus;s__Lactobacillus crispatus |
With columns indicating:
- genome_name: genome id (unique string identifier)
- accessions: run(s) or assembly(ies) the genome was generated from (DRR/ERR/SRRxxxxxx for runs, DRZ/ERZ/SRZxxxxxx for assemblies). If the genome was generated by a co-assembly of multiple runs, separate them with a comma.
- assembly_software: assemblerName_vX.X
- binning_software: binnerName_vX.X
- binning_parameters: binning parameters
- stats_generation_software: software_vX.X
- completeness: `float`
- contamination: `float`
- rRNA_presence: `True/False` if all among 5S, 16S, and 23S genes, and at least 18 tRNA genes, have been detected in the genome
- NCBI_lineage: full NCBI lineage - format: `x;y;z;...`. The same organism can be described in two different ways: either in tax ids (`integers`) or `strings`. For example, the lineage for E. coli can be:
  - `Bacteria;Pseudomonadati;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia`
  - `2;1224;1236;91347;543;561;562`
- metagenome: needs to be listed in the taxonomy tree here (you might need to press "Tax tree - Show" in the right most section of the page)
- co-assembly: `True/False`, whether the genome was generated from a co-assembly. N.B. the script only supports co-assemblies generated from the same project.
- genome_coverage: genome coverage against raw reads
- genome_path: path to genome to upload (already compressed)
- broad_environment: `string` (explanation following)
- local_environment: `string` (explanation following)
- environmental_medium: `string` (explanation following)
According to the ENA checklist guidelines, broad_environment describes the broad ecological context of a sample (e.g. desert, taiga, coral reef); local_environment is more local (e.g. lake, harbour, cliff); environmental_medium is either the material displaced by the sample, or the one in which the sample was embedded prior to the sampling event (e.g. air, soil, water).
For host-associated metagenomic samples, the three variables can be defined similarly to the following example for the chicken gut metagenome: "chicken digestive system", "digestive tube", "caecum". More information can be found at ERC000050 for bins and ERC000047 for MAGs under field names "broad-scale environmental context", "local environmental context", "environmental medium"
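The two accepted NCBI_lineage notations described above (tax ids or names) can be told apart with a small check before filling in the table. The following is a minimal sketch, not part of genome_uploader: the function name `lineage_kind` is hypothetical, and rejecting lineages that mix the two notations is an assumption rather than documented behaviour.

```python
def lineage_kind(lineage: str) -> str:
    """Classify a ';'-separated lineage as 'taxids' or 'names'.

    Hypothetical helper: treating mixed lineages as an error is an assumption.
    """
    fields = [field.strip() for field in lineage.split(";") if field.strip()]
    if not fields:
        raise ValueError("empty lineage")
    numeric = [field.isdigit() for field in fields]
    if all(numeric):
        return "taxids"
    if not any(numeric):
        return "names"
    raise ValueError(f"lineage mixes tax ids and names: {lineage!r}")


print(lineage_kind("2;1224;1236;91347;543;561;562"))
print(lineage_kind("d__Bacteria;p__Firmicutes;c__Bacilli"))
```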
An example of input tsv table can be found here
All fields above are mandatory for MAG submission (see ENA's MAGs checklist here). However, if you are registering bins, you can decide whether to omit the following fields: completeness, contamination and rRNA_presence (see ENA's bins checklist here). These values are used together to determine MAG quality according to MIMAG criteria (described here).
If you already generated these for your bins, our recommendation is to include them for shareability and to describe your sample more accurately.
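Before launching the uploader, it can help to check that the table header contains every mandatory column for the chosen submission type. The sketch below is an illustration only: the column names are copied from the example table above, while `read_header` and `missing_columns` are hypothetical helpers and do not reflect how genome_uploader validates its input.

```python
import csv
import io

# Column names copied from the example table in this README.
ALL_COLUMNS = [
    "genome_name", "genome_path", "accessions", "assembly_software",
    "binning_software", "binning_parameters", "stats_generation_software",
    "completeness", "contamination", "genome_coverage", "metagenome",
    "co-assembly", "broad_environment", "local_environment",
    "environmental_medium", "rRNA_presence", "NCBI_lineage",
]
# Fields that may be omitted when registering bins.
OPTIONAL_FOR_BINS = {"completeness", "contamination", "rRNA_presence"}


def read_header(tsv_text: str) -> list:
    """Return the column names from the first line of a TSV document."""
    return next(csv.reader(io.StringIO(tsv_text), delimiter="\t"))


def missing_columns(header, genome_type: str) -> list:
    """List mandatory columns absent from `header` for 'mags' or 'bins'."""
    if genome_type not in ("mags", "bins"):
        raise ValueError("genome_type must be 'mags' or 'bins'")
    required = set(ALL_COLUMNS)
    if genome_type == "bins":
        required -= OPTIONAL_FOR_BINS
    return sorted(required - set(header))


header = read_header("genome_name\tgenome_path\taccessions\n")
print(missing_columns(header, "bins"))
```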
Raw-read runs or assemblies from which genomes were generated should already be available on the INSDC (ENA by EBI, GenBank by NCBI, or DDBJ) for this script to work. Therefore, at least a DRR|ERR|SRR accession (for runs) or an ERZ|SRZ|DRZ accession (for assemblies) should be available.
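The accession prefixes above lend themselves to a quick sanity check of the accessions column. This is a minimal sketch under stated assumptions: the regular expressions accept any number of digits after the prefix (the exact length is not specified here), and `classify_accessions` is a hypothetical helper, not a genome_uploader function.

```python
import re

# Assumption: a prefix followed by one or more digits.
RUN_RE = re.compile(r"^[DES]RR\d+$")        # DRR / ERR / SRR
ASSEMBLY_RE = re.compile(r"^[DES]RZ\d+$")   # DRZ / ERZ / SRZ


def classify_accessions(field: str) -> dict:
    """Map each comma-separated accession to 'run' or 'assembly'."""
    kinds = {}
    for accession in (part.strip() for part in field.split(",")):
        if RUN_RE.match(accession):
            kinds[accession] = "run"
        elif ASSEMBLY_RE.match(accession):
            kinds[accession] = "assembly"
        else:
            raise ValueError(f"unrecognised accession: {accession!r}")
    return kinds


print(classify_accessions("ERR4647712"))
```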
If you are working with your own private data on ENA, add the --private flag to access private metadata through the ENA API. If you are working on public data, omit the flag. If you are handling both private and public data, submit them as two separate batches.
If uploading TPA (Third PArty) genomes, you will need to contact ENA support before using the script. They will provide instructions on how to correctly register a TPA project where to submit your genomes. If both TPA and non-TPA genomes need to be uploaded, please divide them in two batches and use the --tpa flag only with TPA genomes.
Files to be uploaded will need to be compressed (e.g. already in .gz format).
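One way to confirm that a genome file is really gzip-compressed, rather than just named `.gz`, is to look at its first two bytes. The helper below is a hypothetical pre-flight check, not something genome_uploader is known to perform.

```python
GZIP_MAGIC = b"\x1f\x8b"  # the two-byte signature that starts every gzip stream


def is_gzipped(path) -> bool:
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC
```

Calling `is_gzipped()` on each genome_path value in the table before submission gives an early warning about uncompressed files.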
No more than 5000 genomes can be submitted at the same time. If you have more than 5000, split your table into smaller ones and launch the genome_uploader for each table.
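Splitting a large table into batches of at most 5000 genomes can be scripted. The sketch below works on rows already read into memory; `split_rows` is a hypothetical helper, and each chunk repeats the header so it can be written out as a standalone TSV.

```python
def split_rows(header, rows, max_rows=5000):
    """Yield tables (header + rows) holding at most `max_rows` genomes each."""
    for start in range(0, len(rows), max_rows):
        yield [header] + rows[start:start + max_rows]


# Illustrative data only: 12001 placeholder genome names.
rows = [[f"genome_{i}"] for i in range(12001)]
for index, chunk in enumerate(split_rows(["genome_name"], rows), start=1):
    print(index, len(chunk) - 1)  # chunk number and genomes in that chunk
```

Each chunk can then be written to its own file (for example with `csv.writer(handle, delimiter="\t")`) and passed to a separate genome_uploader run.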
The following command installs all necessary dependencies into the genomeuploader environment:

    conda install bioconda::genome-uploader

Install genome_uploader with:

    pip install genome_uploader

Additionally, you need to download the webin-cli.jar from the latest release.
This tool requires your ENA Webin credentials to function. You can provide these by setting environment variables or using an environment file.
Create a file named .env in your home directory (~/.env), your current working directory (./.env), or specify a custom file (default is .env).
Add the following lines with your credentials:

    ENA_WEBIN=your_username_here
    ENA_WEBIN_PASSWORD=your_password_here

Alternatively, set the credentials as environment variables:

    export ENA_WEBIN=your_username_here
    export ENA_WEBIN_PASSWORD=your_password_here

Run genome_uploader with input TSV:

    genome_upload \
      -u UPLOAD_STUDY \
      --genome_info METADATA_FILE \
      (--mags | --bins) \
      --centre_name CENTRE_NAME \
      [--out] [--force] [--live] [--tpa]

where

- `-u UPLOAD_STUDY`: study accession for genomes upload to ENA (in format ERPxxxxxx or PRJEBxxxxxx)
- `--genome_info METADATA_FILE`: genomes metadata file in tsv format
- `-m, --mags, --b, --bins`: select either of these for bin or MAG upload. If in doubt, check which definition fits best according to ENA
- `--out`: output folder (default: working directory)
- `--force`: forces reset of sample xmls generation. This is useful if you changed something in your tsv table, or if ENA metadata haven't been downloaded correctly (you can check this in `ENA_backup.json`).
- `--live`: registers genomes on ENA's live server. Omitting this option lets you validate samples beforehand (it will need the `-test` option in the upload command for the test submission to work)
- `--centre_name CENTRE_NAME`: name of the centre generating and uploading genomes
- `--tpa`: if uploading TPA (Third PArty) generated genomes
- `--private`: if data is private
Note
It is recommended to validate your genomes in test mode (i.e. without the --live argument in the registration step) before attempting the final upload.
Test runs use ENA's TEST server. Launching the registration in test mode adds a timestamp to the genome name so that the test process can be executed multiple times.
If no errors occur, then re-run the command with the --live argument for a live registration to ENA's REAL server.
Sample xmls won't be regenerated automatically if a previous xml already exists. If any metadata or value in the tsv table changes, --force will allow xml regeneration.
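The test-then-live workflow described in this note can be scripted by building the two command lines from the same inputs. This is a minimal sketch that uses only the flags documented above; `build_genome_upload_cmd` is a hypothetical wrapper, and the study accession, file name and centre name are placeholders.

```python
import shlex


def build_genome_upload_cmd(study, genome_info, genome_type, centre_name,
                            live=False, force=False):
    """Assemble a genome_upload argument list (hypothetical helper)."""
    if genome_type not in ("mags", "bins"):
        raise ValueError("genome_type must be 'mags' or 'bins'")
    cmd = [
        "genome_upload",
        "-u", study,
        "--genome_info", genome_info,
        f"--{genome_type}",
        "--centre_name", centre_name,
    ]
    if force:
        cmd.append("--force")  # regenerate sample xmls after editing the tsv
    if live:
        cmd.append("--live")   # omit for a test-mode registration
    return cmd


# Placeholder values; replace with your own study, table and centre name.
test_cmd = build_genome_upload_cmd("PRJEBxxxxxx", "genomes.tsv", "mags", "MY_CENTRE")
live_cmd = build_genome_upload_cmd("PRJEBxxxxxx", "genomes.tsv", "mags", "MY_CENTRE", live=True)
print(shlex.join(test_cmd))
print(shlex.join(live_cmd))
```

Either list can be passed to `subprocess.run(cmd, check=True)` once the test registration has completed without errors.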
The script produces the following files and folders:
bin_upload/MAG_upload
├── manifests
│ └── ...
├── manifests_test # folder generated for validation in test mode
│ └── ...
├── ENA_backup.json # backup file to prevent re-download of metadata from ENA. Regeneration can be forced with --force
├── genome_samples.xml # xml generated to register samples on ENA before the upload
├── registered_bins/MAGs.tsv # list of genomes registered on ENA in live mode - needed for manifest generation
├── registered_bins/MAGs_test.tsv # list of genomes registered on ENA in test mode - needed for manifest generation
└── submission.xml # xml used for genome registration on ENA

An example of output files and folder structure submitted in test mode can be found under the examples folder.
Once manifest files are generated, it is necessary to use ENA's webin-cli resource to upload genomes.
More information about ENA's webin-cli can be found in the ENA docs.
We recommend using a pre-installed webin_cli_handler script.
Note
First, validate your submission with --mode validate.
Second, upload to ENA's TEST server using the --test flag together with --mode submit (make sure you have already validated your run in the "Generate files for upload" step).
Finally, upload to ENA's REAL server using --mode submit without --test.
Run live execution:

    webin_cli_handler \
      --manifest *.manifest \
      --context genome \
      --mode submit \
      [--test]

If you do not have ena-webin-cli installed, add the --download-webin-cli flag and the tool will be downloaded automatically. It requires a recent Java version to work, as described in the official repo.
If you want to use a local webin-cli .jar (for example, the one downloaded after the pip installation), provide its path with --webin-cli-jar.
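The three-step sequence recommended above (validate, submit to the TEST server, submit to the REAL server) can be expressed as a list of commands for one manifest. The sketch below uses only the webin_cli_handler options documented in this README; `webin_stages` and `run_stages` are hypothetical helpers and the manifest file name is a placeholder.

```python
import subprocess


def webin_stages(manifest: str) -> list:
    """Return the validate, test-submit and live-submit commands in order."""
    base = ["webin_cli_handler", "--manifest", manifest, "--context", "genome"]
    return [
        base + ["--mode", "validate"],
        base + ["--mode", "submit", "--test"],
        base + ["--mode", "submit"],
    ]


def run_stages(manifest: str, include_live: bool = False) -> None:
    """Run the stages in order, stopping at the first failing command."""
    stages = webin_stages(manifest)
    if not include_live:
        stages = stages[:2]  # keep the live submission as an explicit opt-in
    for cmd in stages:
        subprocess.run(cmd, check=True)


for cmd in webin_stages("my_genome.manifest"):
    print(" ".join(cmd))
```

Keeping the live submission behind `include_live=True` is a design choice of this sketch, so that a scripted run cannot reach ENA's REAL server by accident.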
Other options:
webin_cli_handler
-h, --help show this help message and exit
-m, --manifest MANIFEST
Manifest text file containing file and metadata fields
-c, --context {genome,transcriptome,sequence,polysample,reads,taxrefset}
Submission type: genome, transcriptome, sequence, polysample, reads, taxrefset
--mode {submit,validate}
submit or validate
--test Specify to use test server instead of live
--workdir WORKDIR Path to working directory
--download-webin-cli Specify if you do not have ena-webin-cli installed
--download-webin-cli-directory DOWNLOAD_WEBIN_CLI_DIRECTORY
Path to save webin-cli into
--download-webin-cli-version DOWNLOAD_WEBIN_CLI_VERSION
Version of ena-webin-cli to download, default: latest
--webin-cli-jar WEBIN_CLI_JAR
Path to pre-downloaded webin-cli.jar file to execute
--retries RETRIES Number of retry attempts (default: 3)
--retry-delay RETRY_DELAY
Initial retry delay in seconds (default: 5)
--java-heap-size-initial JAVA_HEAP_SIZE_INITIAL
Java initial heap size in GB (default: 10)
--java-heap-size-max JAVA_HEAP_SIZE_MAX
Java maximum heap size in GB (default: 10)

ENA's test servers reset every day. This means that if you try to register the same set of samples more than once in a single day, the request will fail because the automatically generated aliases would already exist on ENA's servers and be rejected as duplicates. To prevent this issue, when you register samples in test mode, the genome_uploader appends a timestamp to each generated alias. This ensures that you can repeat your tests multiple times without running into duplicate-alias errors.
However, when debugging or checking the script’s behavior in development mode, you might want the aliases to remain consistent across runs, so that repeated submissions refer to the same sample. To allow this, you can use the --test-suffix flag when running genome_upload.py, which lets you define a custom suffix instead of the automatic timestamp. This gives you more control over how sample aliases are generated during testing.
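The alias behaviour described above can be illustrated as follows. This is only a sketch of the idea, not the tool's implementation: the underscore separator, the timestamp format, the unchanged alias in live mode and the `make_alias` function itself are all assumptions.

```python
from datetime import datetime, timezone


def make_alias(genome_name, live, test_suffix=None, now=None):
    """Build a sample alias (hypothetical; format details are assumptions)."""
    if live:
        return genome_name  # assumption: live aliases are left unchanged
    if test_suffix is not None:
        # A fixed suffix keeps aliases identical across repeated test runs.
        return f"{genome_name}_{test_suffix}"
    now = now or datetime.now(timezone.utc)
    # A timestamp makes each test-mode alias unique.
    return f"{genome_name}_{now.strftime('%Y%m%d%H%M%S')}"


print(make_alias("ERR4647712_crispatus", live=False, test_suffix="dev"))
print(make_alias("ERR4647712_crispatus", live=False))
```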