Xpression_collector is a simple Python-based pipeline for fetching RNA-seq reads from SRA and processing them to get the expression TPM files for further downstream analyses. Following are the salient features of the pipeline:
-> Optional BUSCO analysis of the PEP file obtained from the input CDS file
-> Reattempting capability to handle network disruptions during SRA prefetch and fasterq-dump
-> Iterative removal of SRA files after fetching and kallisto quantification of every accession to free-up storage space
-> Gene expression quantification by pseudopalignment using kallisto
-> Removal of invalid RNA-seq samples
-> Optional removal of isoforms
-> v0.2.1. is out and is the latest stable release!
-> v0.2.1 major changes include:
-> Modified isoform removal to handle multi-line FASTA files
-> Added compressed file processing support to multi-line FASTA file handling function
-> v0.2 is out and is the latest stable release!
-> v0.2 updates include:
-> buscolineage flag option entry has been simplified
-> The pipeline can be restarted now after disruptions
-> SRA file fetching and kallisto runs take place in an interleaved manner improving processing time
-> Better error handling and network issue tolerant fall backs in SRA file fetching
-> You can now merge already exisiting TPM/ repr_TPM files with the ones produced in the current run of Xpression_collector
git clone https://github.com/PuckerLab/Xpression_collect
Mandatory dependencies
- Tools - kallisto, sratoolkit
- Python libraries - pandas (v2.3.1 or greater), matplotlib (v3.10.5 or greater)
Optional dependencies
- Tools - BUSCO
- Python libraries - pyfiglet (v1.0.2 or greater), rich (v15.0.0 or greater)
docker pull shakunthalan/xpression_collector:latest
It is recommended to run the docker image as a user and not root:
docker run --rm -u $(id -u) -v /path/to/data:/path/to/data shakunthalan/xpression_collector:latest
Note: If you are using a docker image of the tool, you need not specify the dependencies' full paths while running the tool; The dependencies are built-in in the docker image and that makes it simple to use the tool across systems without the need for local installations.
This method of installation installs all the dependencies in a conda environment using the environment.yml file in this repository
git clone https://github.com/PuckerLab/Xpression_collect
cd Xpression_collect
conda env create -f environment.yml
conda activate xpression_collector
Usage:
python3 Xpression_collector.py --cds <CDS_FILE>
--sra <TXT_FILE_WITH_SRA_ACCESSIONS_LIST>
--out <OUTPUT_DIR>
MANDATORY:
--cds STR Full path to CDS file
Either provide a list of SRA accessions to fetch or path to already available main folder with subfolders named according to the samples/ accessions
--sra STR Full path to TXT file with one SRA accession per line
--out STR Full path to output directory
OPTIONAL:
--sample_name STR Name of the species; default is sample
--gff STR Full path to GFF file if isoform removal needs to be performed
--gff_config STR Full path to GFF config file specifying the different GFF attributes - child_attribute, child_parent_linker, parent_attribute;
default attributes are ID, Parent, ID respectively
--min_sra_file_size INT Minimum file size cutoff in MB to check prefetched SRA file sizes to cath sralite files; default cutoff is 1MB
--annotation_qc STR yes or no for BUSCO-based QC of the PEP file; default is yes
--threads STR Total number of cores for running the pipeline; default is 4
--parallel_prefetch STR Number of SRA accessions to be prefetched parallely; default is 2
--attempts STR Number of attempts at prefetching and accession from SRA; default is just 1 attempt
--wait STR Base wait time for sleep in case of network delays for prefetch; Increases exponentially
with a base of 2 for each reattempt; default is 20 seconds
--remove_isoforms STR Optional step to remove isoforms; yes or no; default is yes
--merge_tpms STR Full path to config file containing the full paths to the filtered_tpm, and/or repr_filtered_tpms to be merged
--min INT Minimum percentage expression of top 100 genes
--max INT Maximum percentage expression of top 100 genes
--black STR SRA IDs to be removed or blacklisted in a TXT file with one SRA accession ID per line
--busco STR Full path to BUSCO; Specify busco_docker if BUSCO is installed via docker
--busco_lineage STR Full path to the config file to specify the BUSCO lineage> <Tab separated TXT file with sample name
(should be the same as sample_name) in the first column and BUSCO lineage in the second column
--busco_version STR Version of BUSCO> default is v6.0.0
--container_version STR Container version of the BUSCO docker image
--docker_host_path STR Host path to be mounted on for running the docker image
-docker_container_path STR Container path of the docker image
--organism_type STR eukaryote or prokaryote for BUSCO analysis; default is eukaryote
--kallisto STR Full path to the kallisto executable
--prefetch STR Full path to the prefetch executable
--fasterq-dump STR Full path to the fasterq-dump executable
--uncompressed STR Provide this flag if your read files are uncompressed
--clean_up STR Clean up the TMP folder after a successful run; default is yes
-> The pipeline has the ability to restart after disruptions. Simply repeat the command you used before the disruption
-> The TMP folder gets cleaned up after a successful run of the pipeline. You can retain it in case you want to take a deeper look
-> --parallel_prefetch flag determines the number of parallel prefetch processes at a time
The optimal number of parallel prefetches cannot be defined precisely as it is influenced by the network bandwidth and the disk space available to store .sra files
But some performance optimizations have shown that a prefetch process uses anywhere between 30-45% of a CPU
Based on this the pipeline runs two parallel prefetches by default with a dynamic queueing mechanism
In the dynamic queueing and pushing design, once an accession's prefetch is done, it is pushed to the next step of fasterq-dump+pigz and kallisto which function serially since both of these are CPU-bound tasks showing maximum performance when utilizing all available cores at a time. The vacant spot in the prefetch block is now taken by the next accession in line making it a combinatorial scheme utilizing parallelization and serialization as necessary
Hence it is recommended to use the default parameter for --parallel_prefetch flag unless you are sure about abundance of storage space and excellent network bandwidth
-> Alternative isoforms removal needs the GFF file to be given along with the CDS file and it is important for the CDS FASTA headers and the transcript identifiers in the GFF file to match to process without errors
-> Detailed guidelines on preparing the config file for --gff_config flag are:
The GFF config TXT file is a simple tab separated TXT file that:
Needs 3 columns in the order
(i) child_attribute: attribute field of the mRNA or transcript feature in the file like ID
(ii) child_parent_linker: attribute field of the mRNA or transcript, CDS, exon features that link them with their
respective parent feature like Parent - Note: base assumption by the tool is that all child levels
have the same child-parent linker attribute fields.
For eg., if Parent is the child-parent linker in the mRNA feature line,
then Parent will be the child-parent linker for all other child-level feature lines in the GFF
(iii) parent_attribute: attribute field of the gene feature like ID
-> In case the headers in your CDS file are formatted differently when compared to the transcript identifiers in the GFF file supplied, you can use a helper called Fasta_fix.py to tackle this issue. The helper script and its detailed usage can be found in https://github.com/ShakNat/DupyliCate
-> https://github.com/ncbi/sra-tools
-> Bray N, Pimentel H, Melsted P, Pachter L. Near-optimal probabilistic RNA-seq quantification. Nature Biotechnology 34, 525–527 (2016). https://doi.org/10.1038/nbt.3519
-> Delaney K. Sullivan, Kyung Hoi (Joseph) Min, Kristján Eldjárn Hjörleifsson, Laura Luebbert, Guillaume Holley, Lambda Moses, Johan Gustafsson, Nicolas L. Bray, Harold Pimentel, A. Sina Booeshaghi, Páll Melsted, Lior Pachter. kallisto, bustools and kb-python for quantifying bulk, single-cell and single-nucleus RNA-seq. Nature Protocols 20, 587-607 (2025). https://doi.org/10.1038/s41596-024-01057-0
-> Fredrik Tegenfeldt, Dmitry Kuznetsov, Mosè Manni, Matthew Berkeley, Evgeny M Zdobnov, Evgenia V Kriventseva, OrthoDB and BUSCO update: annotation of orthologs with wider sampling of genomes, Nucleic Acids Research, Volume 53, Issue D1, 6 January 2025, Pages D516–D522, https://doi.org/10.1093/nar/gkae987
Kindly cite this repository if you use the Xpression_collector pipeline in your work