Python script to upload bins and MAGs to ENA (European Nucleotide Archive)

EBI-Metagenomics/genome_uploader

ENA public Bins and MAGs uploader

This repository allows you to:

  • Generate xmls and manifests necessary for genome submission
  • Link the genomes you want to submit with the samples/runs used to generate them
  • Upload bins and MAGs in fasta format to ENA (European Nucleotide Archive) with webin-cli

How it works

When you submit genomes to ENA, you need to register a sample for every genome, containing all the relevant metadata describing the genome and the sample of origin. The genome_uploader acts as the main linker to preserve sample metadata as much as possible. For every genome to register, you need an INSDC run or assembly accession associated with the genome so that the script can inherit its relevant metadata. On top of those metadata, the script adds user-specified metadata that are specific to the genome, such as taxonomy, statistics, or the tools used to generate it. The metadata that ENA requires are described in the checklists for MAGs and for bins, respectively.

Prepare Input TSV

The genome_uploader takes as input one tsv (tab-separated values) table in the following format:

genome_name | genome_path | accessions | assembly_software | binning_software | binning_parameters | stats_generation_software | completeness | contamination | genome_coverage | metagenome | co-assembly | broad_environment | local_environment | environmental_medium | rRNA_presence | NCBI_lineage
ERR4647712_crispatus | path/to/ERR4647712.fa.gz | ERR4647712 | megahit_v1.2.9 | MGnify-genomes-generation-pipeline_v1.0.0 | default | CheckM2_v1.0.1 | 100 | 0.38 | 14.2 | chicken gut metagenome | False | chicken | gut | mucosa | True | d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Lactobacillaceae;g__Lactobacillus;s__Lactobacillus crispatus

With columns indicating:

  • genome_name: genome id (unique string identifier)
  • accessions: run(s) or assembly(ies) the genome was generated from (DRR/ERR/SRRxxxxxx for runs, DRZ/ERZ/SRZxxxxxx for assemblies). If the genome was generated by a co-assembly of multiple runs, separate them with a comma.
  • assembly_software: assemblerName_vX.X
  • binning_software: binnerName_vX.X
  • binning_parameters: binning parameters
  • stats_generation_software: software_vX.X
  • completeness: float
  • contamination: float
  • rRNA_presence: True if the 5S, 16S, and 23S rRNA genes and at least 18 tRNA genes have all been detected in the genome, False otherwise
  • NCBI_lineage: full NCBI lineage - format: x;y;z;.... The same organism can be described in two different ways: either in tax ids (integers) or strings. For example, the lineage for E. coli can be:
    • Bacteria;Pseudomonadati;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia
    • 2;1224;1236;91347;543;561;562
  • metagenome: needs to be listed in the taxonomy tree here (you might need to press "Tax tree - Show" in the rightmost section of the page)
  • co-assembly: True/False, whether the genome was generated from a co-assembly. N.B. the script only supports co-assemblies generated from the same project.
  • genome_coverage: genome coverage against raw reads
  • genome_path: path to genome to upload (already compressed)
  • broad_environment: string (explanation following)
  • local_environment: string (explanation following)
  • environmental_medium: string (explanation following)

According to the ENA checklist guidelines:

  • broad_environment describes the broad ecological context of a sample: desert, taiga, coral reef, ...
  • local_environment describes the more local context: lake, harbour, cliff, ...
  • environmental_medium is either the material displaced by the sample, or the one in which the sample was embedded prior to the sampling event: air, soil, water, ...

For host-associated metagenomic samples, the three variables can be defined similarly to the following example for the chicken gut metagenome: "chicken digestive system", "digestive tube", "caecum". More information can be found at ERC000050 for bins and ERC000047 for MAGs under the field names "broad-scale environmental context", "local environmental context", and "environmental medium".

An example of input tsv table can be found here
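
Before launching the uploader, it can help to sanity-check the table. The snippet below is a minimal sketch, not part of genome_uploader: `check_table` is a hypothetical helper that verifies the header against the column names listed above, checks that accessions look like DRR/ERR/SRR runs or DRZ/ERZ/SRZ assemblies, and flags duplicate genome names. The `.gz` suffix check assumes gzip compression, which the tool may not strictly require.

```python
import csv
import io
import re

# Column names copied from the header shown above.
EXPECTED_COLUMNS = [
    "genome_name", "genome_path", "accessions", "assembly_software",
    "binning_software", "binning_parameters", "stats_generation_software",
    "completeness", "contamination", "genome_coverage", "metagenome",
    "co-assembly", "broad_environment", "local_environment",
    "environmental_medium", "rRNA_presence", "NCBI_lineage",
]

# Runs: DRR/ERR/SRR + digits. Assemblies: DRZ/ERZ/SRZ + digits.
ACCESSION_RE = re.compile(r"^[DES]R[RZ]\d+$")


def check_table(handle):
    """Return a list of human-readable problems found in a TSV handle."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]

    problems = []
    seen_names = set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        name = row["genome_name"] or ""
        if name in seen_names:
            problems.append(f"line {line_no}: duplicate genome_name {name!r}")
        seen_names.add(name)

        # Co-assemblies list several accessions separated by commas.
        for acc in (row["accessions"] or "").split(","):
            if not ACCESSION_RE.match(acc.strip()):
                problems.append(f"line {line_no}: unexpected accession {acc!r}")

        # Assumption: compressed genomes carry a .gz suffix.
        if not (row["genome_path"] or "").endswith(".gz"):
            problems.append(f"line {line_no}: genome_path is not a .gz file")

        if row["co-assembly"] not in ("True", "False"):
            problems.append(f"line {line_no}: co-assembly must be True or False")
    return problems
```

For a file on disk, `check_table(open("genomes.tsv", newline=""))` returns an empty list when none of these checks fail (`genomes.tsv` is a placeholder name).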

Warnings

Mandatory vs Optional Fields

All fields above are mandatory for MAG submission (see ENA's MAGs checklist here). However, if you are registering bins, you can decide whether to omit the following fields: completeness, contamination and rRNA_presence (see ENA's bins checklist here). These values are used together to determine MAG quality according to MIMAG criteria (described here).

If you already generated these for your bins, our recommendation is to include them for shareability and to describe your sample more accurately.
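
To illustrate how the three values combine, here is a minimal sketch of the published MIMAG draft-quality tiers: high quality requires more than 90% completeness, less than 5% contamination, and rRNA/tRNA presence; medium quality requires at least 50% completeness and less than 10% contamination; low quality covers less than 50% completeness with less than 10% contamination. The function name and the "unclassified" label for genomes at or above 10% contamination are assumptions made for this example, and the uploader's own logic may differ.

```python
def mimag_quality(completeness, contamination, rrna_presence):
    """Classify a genome into a MIMAG draft-quality tier.

    completeness and contamination are percentages (0-100);
    rrna_presence is the boolean from the rRNA_presence column.
    """
    if contamination >= 10:
        # Outside all three MIMAG draft tiers (label chosen for this sketch).
        return "unclassified"
    if completeness > 90 and contamination < 5 and rrna_presence:
        return "high"
    if completeness >= 50:
        return "medium"
    return "low"
```

With the example row above (completeness 100, contamination 0.38, rRNA_presence True), this returns "high"; the same genome without detected rRNA genes would be "medium".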

Existing accessions in the INSDC

Raw-read runs or assemblies from which genomes were generated should already be available on the INSDC (ENA by EBI, GenBank by NCBI, or DDBJ) for this script to work. Therefore, at least a DRR|ERR|SRR accession (for runs) or an ERZ|SRZ|DRZ accession (for assemblies) should be available.

If you are working with your own private data on ENA, you will need to add the --private flag to access private metadata through the ENA API. If you are working on public data, you can omit the flag. However, if you are handling both private and public data, you will need to submit them as two separate batches.

TPA generation and upload

If uploading TPA (Third PArty) genomes, you will need to contact ENA support before using the script. They will provide instructions on how to correctly register a TPA project to which you can submit your genomes. If both TPA and non-TPA genomes need to be uploaded, please divide them into two batches and use the --tpa flag only with TPA genomes.

Compress your fasta files

Files to be uploaded will need to be compressed (e.g. already in .gz format).
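
If your fasta files are not compressed yet, they can be gzipped with the standard library. This is a minimal sketch; `gzip_fasta` is a hypothetical helper name, and it writes the compressed copy next to the original file.

```python
import gzip
import shutil
from pathlib import Path


def gzip_fasta(path):
    """Gzip a fasta file and return the path of the compressed copy.

    Files that already end in .gz are returned unchanged.
    """
    src = Path(path)
    if src.suffix == ".gz":
        return src
    dst = src.with_name(src.name + ".gz")
    # Stream the content so large genomes are never held fully in memory.
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```

The returned path is what would go in the genome_path column of the input table.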

Split your input tables

No more than 5000 genomes can be submitted at the same time. If you have more than 5000, split your table into smaller ones and launch the genome_uploader for each table.
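
Splitting can be scripted. The following is a minimal sketch, assuming the table has a single header row; `split_table` and the `_partN.tsv` naming are choices made for this example and are not part of genome_uploader.

```python
import csv
from pathlib import Path

MAX_GENOMES = 5000  # submission limit stated above


def split_table(tsv_path, out_dir, chunk_size=MAX_GENOMES):
    """Split a TSV into files of at most chunk_size rows, repeating the header."""
    tsv_path = Path(tsv_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    with open(tsv_path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        rows = list(reader)

    written = []
    for start in range(0, len(rows), chunk_size):
        part = out_dir / f"{tsv_path.stem}_part{start // chunk_size + 1}.tsv"
        with open(part, "w", newline="") as out:
            writer = csv.writer(out, delimiter="\t", lineterminator="\n")
            writer.writerow(header)
            writer.writerows(rows[start:start + chunk_size])
        written.append(part)
    return written
```

Each returned file can then be passed to a separate genome_uploader run through --genome_info.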

Installation

Installation with conda (recommended)

The following command installs all necessary dependencies into the genomeuploader environment:

conda install bioconda::genome-uploader

Installation with pip

Install genome_uploader with:

pip install genome_uploader

Additionally, you need to download the webin-cli.jar from the latest release.

Setting ENA Credentials

This tool requires your ENA Webin credentials to function. You can provide these by setting environment variables or using an environment file.

Using an environment file

Create a file named .env in your home directory (~/.env), your current working directory (./.env), or specify a custom file (default is .env).

Add the following lines with your credentials:

ENA_WEBIN=your_username_here
ENA_WEBIN_PASSWORD=your_password_here

Alternatively, set the environment variables directly in your shell

export ENA_WEBIN=your_username_here
export ENA_WEBIN_PASSWORD=your_password_here
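
A quick pre-flight check can confirm that both variables are discoverable before a run. The sketch below is not the tool's own loader: `read_env_file` and `find_credentials` are hypothetical helpers, the parser only handles plain KEY=value lines, and the precedence (shell variables over ./.env over ~/.env) is an assumption.

```python
import os
from pathlib import Path

REQUIRED = ("ENA_WEBIN", "ENA_WEBIN_PASSWORD")


def read_env_file(path):
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def find_credentials(env_files=(Path(".env"), Path.home() / ".env"), environ=None):
    """Return (credentials, missing_names) from env files and the environment."""
    environ = os.environ if environ is None else environ
    merged = {}
    for path in env_files:
        if Path(path).is_file():
            for key, value in read_env_file(path).items():
                merged.setdefault(key, value)  # earlier files win
    # Assumption: variables exported in the shell override file values.
    merged.update({k: environ[k] for k in REQUIRED if environ.get(k)})
    missing = [k for k in REQUIRED if not merged.get(k)]
    return merged, missing
```

Calling `find_credentials()` and checking that the second return value is empty avoids starting a long run with incomplete credentials; only the variable names, never the values, should be printed.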

Run

Generate files for upload

Run genome_uploader with input TSV:

genome_upload \
  -u UPLOAD_STUDY \
  --genome_info METADATA_FILE \
  (--mags | --bins) \
  --centre_name CENTRE_NAME \
  [--out] [--force] [--live] [--tpa] [--private]

where

  • -u UPLOAD_STUDY: study accession for genomes upload to ENA (in format ERPxxxxxx or PRJEBxxxxxx)
  • --genome_info METADATA_FILE : genomes metadata file in tsv format
  • -m, --mags, -b, --bins: select one of these for MAG or bin upload. If in doubt, check which definition fits best according to ENA
  • --out: output folder (default: working directory)
  • --force: forces reset of sample xmls generation. This is useful if you changed something in your tsv table, or if ENA metadata haven't been downloaded correctly (you can check this in ENA_backup.json).
  • --live: registers genomes on ENA's live server. Omitting this option allows you to validate samples beforehand (the upload command will then need the -test option for the test submission to work)
  • --centre_name CENTRE_NAME: name of the centre generating and uploading genomes
  • --tpa: if uploading TPA (Third PArty) generated genomes
  • --private: if data is private

Note

It is recommended to validate your genomes in test mode (i.e. without the --live argument in the registration step) before attempting the final upload. A test run proceeds on ENA's TEST server. Launching the registration in test mode adds a timestamp to the genome name to allow multiple executions of the test process. If no errors occur, re-run the command with the --live argument for a live registration on ENA's REAL server.

Sample xmls won't be regenerated automatically if a previous xml already exists. If any metadata or value in the tsv table changes, --force will allow xml regeneration.
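
When the registration step is driven from a pipeline, the command line can be assembled programmatically. The sketch below is a hypothetical wrapper that uses only the options documented above and checks the study accession format (ERPxxxxxx or PRJEBxxxxxx) before anything is launched; the accession and file names in the usage note are placeholders.

```python
import re
import shlex

# Study accessions are expected as ERPxxxxxx or PRJEBxxxxxx.
STUDY_RE = re.compile(r"^(ERP|PRJEB)\d+$")


def build_genome_upload_cmd(study, genome_info, kind, centre_name, out=None,
                            live=False, force=False, tpa=False, private=False):
    """Return the genome_upload argument list for the documented options."""
    if not STUDY_RE.match(study):
        raise ValueError(f"unexpected study accession: {study!r}")
    if kind not in ("mags", "bins"):
        raise ValueError("kind must be 'mags' or 'bins'")

    cmd = ["genome_upload", "-u", study, "--genome_info", genome_info,
           f"--{kind}", "--centre_name", centre_name]
    if out:
        cmd += ["--out", out]
    # Boolean switches are appended only when enabled; without --live the
    # registration runs in test mode.
    for flag, enabled in (("--force", force), ("--live", live),
                          ("--tpa", tpa), ("--private", private)):
        if enabled:
            cmd.append(flag)
    return cmd
```

For example, `shlex.join(build_genome_upload_cmd("PRJEB000000", "genomes.tsv", "mags", "MY_CENTRE"))` gives a test-mode command string; the same list can be passed to `subprocess.run(cmd, check=True)`, and adding `live=True` produces the live registration command once the test run succeeds.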

Produced files:

The script produces the following files and folders:

bin_upload/MAG_upload
├── manifests
│    └── ...
├── manifests_test                  # folder generated for validation in test mode
│    └── ...
├── ENA_backup.json                 # backup file to prevent re-download of metadata from ENA. Regeneration can be forced with --force
├── genome_samples.xml              # xml generated to register samples on ENA before the upload
├── registered_bins/MAGs.tsv        # list of genomes registered on ENA in live mode - needed for manifest generation
├── registered_bins/MAGs_test.tsv   # list of genomes registered on ENA in test mode - needed for manifest generation
└── submission.xml                  # xml used for genome registration on ENA

An example of output files and folder structure submitted in test mode can be found under the examples folder.

Upload genomes

Once manifest files are generated, it is necessary to use ENA's webin-cli resource to upload genomes.

More information about ENA's webin-cli can be found in the ENA docs.

We recommend using a pre-installed webin_cli_handler script.

Note

First, validate your submission with --mode validate.
Second, upload to ENA's TEST server using the --test flag and --mode submit (make sure you have already validated your run in the Generate files for upload step). Finally, upload to ENA's REAL server using --mode submit without --test.

Run live execution:

webin_cli_handler \
  --manifest *.manifest \
  --context genome \
  --mode submit \
  [--test]

If you do not have ena-webin-cli installed, add the --download-webin-cli flag and the tool will be downloaded automatically. According to the official repository, it requires a recent Java version to work.
If you want to use your local Java .jar (downloaded after pip installation), provide it with --webin-cli-jar.
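
The validate, test-submit, live-submit sequence can be scripted over a folder of manifests. This is a minimal sketch under the assumption that webin_cli_handler is invoked once per manifest file; `webin_commands` and the stage names are hypothetical, and only options listed in this README are used.

```python
from pathlib import Path

# Options per stage, following the order recommended in the note above.
STAGES = {
    "validate": ["--mode", "validate"],
    "test": ["--mode", "submit", "--test"],
    "live": ["--mode", "submit"],
}


def webin_commands(manifest_dir, stage):
    """Return one webin_cli_handler argument list per *.manifest file."""
    extra = STAGES[stage]
    return [
        ["webin_cli_handler", "--manifest", str(manifest),
         "--context", "genome", *extra]
        for manifest in sorted(Path(manifest_dir).glob("*.manifest"))
    ]
```

Each returned list can be run with `subprocess.run(cmd, check=True)`, so that a failing manifest stops the batch before the next stage starts. Keeping the stages as separate calls makes it harder to reach the live server by accident.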

Other options:

webin_cli_handler 

  -h, --help            show this help message and exit
  -m, --manifest MANIFEST
                        Manifest text file containing file and metadata fields
  -c, --context {genome,transcriptome,sequence,polysample,reads,taxrefset}
                        Submission type: genome, transcriptome, sequence, polysample, reads, taxrefset
  --mode {submit,validate}
                        submit or validate
  --test                Specify to use test server instead of live
  --workdir WORKDIR     Path to working directory
  --download-webin-cli  Specify if you do not have ena-webin-cli installed
  --download-webin-cli-directory DOWNLOAD_WEBIN_CLI_DIRECTORY
                        Path to save webin-cli into
  --download-webin-cli-version DOWNLOAD_WEBIN_CLI_VERSION
                        Version of ena-webin-cli to download, default: latest
  --webin-cli-jar WEBIN_CLI_JAR
                        Path to pre-downloaded webin-cli.jar file to execute
  --retries RETRIES     Number of retry attempts (default: 3)
  --retry-delay RETRY_DELAY
                        Initial retry delay in seconds (default: 5)
  --java-heap-size-initial JAVA_HEAP_SIZE_INITIAL
                        Java initial heap size in GB (default: 10)
  --java-heap-size-max JAVA_HEAP_SIZE_MAX
                        Java maximum heap size in GB (default: 10)

Devs section

Testing submission in normal mode vs strict submission

ENA's test servers reset every day. This means that if you try to register the same set of samples more than once in a single day, the request will fail because the automatically generated aliases would be duplicates of those already on ENA's servers. To prevent this issue, when you register samples in test mode, the genome_uploader appends a timestamp to each generated alias. This ensures that you can repeat your tests multiple times without running into duplicate-alias errors.

However, when debugging or checking the script’s behavior in development mode, you might want the aliases to remain consistent across runs, so that repeated submissions refer to the same sample. To allow this, you can use the --test-suffix flag when running genome_upload.py, which lets you define a custom suffix instead of the automatic timestamp. This gives you more control over how sample aliases are generated during testing.
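
The idea can be pictured with a small sketch. The alias layout here (an underscore followed by either a UTC timestamp or the custom suffix) is purely illustrative and is not taken from the genome_uploader source; `make_test_alias` is a hypothetical name.

```python
from datetime import datetime, timezone


def make_test_alias(alias, test_suffix=None, now=None):
    """Append a custom suffix if given, otherwise a timestamp (hypothetical format)."""
    if test_suffix is not None:
        # Fixed suffix: repeated runs produce the same alias.
        return f"{alias}_{test_suffix}"
    now = now or datetime.now(timezone.utc)
    # Timestamp suffix: every run produces a fresh alias.
    return f"{alias}_{now.strftime('%Y%m%d%H%M%S')}"
```

With a fixed suffix, two runs yield the same alias and so refer to the same test sample; with the timestamp, each run registers a new one.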
