bksnake - Bulk RNASeq Snakemake Workflow

Public version of bksnake - biokit snakemake - bulk RNASeq Snakemake workflow.

bksnake is a Snakemake (Moelder et al., 2021) workflow for bulk RNASeq data analysis. It uses STAR aligner (Dobin et al., 2012) for read mapping and FeatureCounts from the Subread package (Liao et al., 2014) for gene quantification. Reference genomes with RefSeq, Ensembl, and Gencode gene annotations are available for several species such as Homo sapiens (hg38, chm13), Mus musculus (mm10, mm39), Rattus norvegicus (rn6, rn7), Macaca mulatta (Mmul_10), Macaca fascicularis (mfa5, mfa6), Chlorocebus sabaeus (Vero_WHO_p1.0), Sus scrofa (ss11), and Oryctolagus cuniculus (oc2). The generation of these reference genomes and annotation files is documented in a separate repository on github (see refsnake). Data quality and RNASeq metrics are determined using FastQC (Andrews et al.), RSeQC (Wang et al., 2012), MultiQC (Ewels et al., 2015), and Picard tools (Broad Institute). In addition, diagnostic plots for data quality assessment such as BioQC tissue heterogeneity (Zhang et al., 2017) or Principal Component Analysis are provided in an HTML report generated using the Snakemake optional parameter --report.

Optionally, genome coverage files (BigWig) and read alignment files (BAM/CRAM) can be generated.
Sample matching can be cross-checked using SNP fingerprints and Picard tools.
Input read trimming with Cutadapt (Martin, 2010) and generation of unmapped reads are also available.

The pipeline can be launched via a helper tool, run.py, or directly with Snakemake. All parameters are specified within a configuration yaml file or explicitly on the command line using run.py.

Key Requirements:

Input fastq files, a metadata file, and reference genome/STAR index files must be available locally.
Snakemake and Singularity must be installed and configured.
All software tools are pulled from public Singularity or Docker image repositories.
It is recommended to run the pipeline in a high-performance cluster environment.

Overview of the Analysis Workflow (top)

Directed acyclic graph of a representative analysis workflow.

DAG of the workflow.

Requirements (top)

This workflow requires the following tools in the system path

To pull Singularity images from the GitHub package registry, you must specify your username and a GitHub read package token:

export SINGULARITY_DOCKER_USERNAME=<username>
export SINGULARITY_DOCKER_PASSWORD=<github read package token>

Preparation (top)

Download workflow (top)

Clone this repository to your local working directory

git clone https://github.com/bedapub/bksnake.git

Reference genome (top)

Note on File Structure: For bksnake versions with tag v0.0.8 and earlier, the reference files have a different structure. For newer versions, the subfolder gtf has been removed, and refseq and ensembl folders were renamed to refseq_<version> and ensembl_<version>, respectively. GTF and other files were also renamed.

Human hg38

Download prepared reference genome annotation files from Zenodo into a genome "root" directory and "untar" the downloaded files there. At the end, create a symbolic link "hg38" pointing to the data folder for the human genome hg38 files.

genome_dir=<genome root directory>

mkdir -p $genome_dir
cd $genome_dir

wget https://zenodo.org/record/8017214/files/file.dat
tar –xvzf file.dat
ln -s genomes_2022-07-15_hg38 hg38

In folder genomes_2022-07-15_hg38, or hg38, are several subfolders containing fasta and gtf files for RefSeq and Ensembl annotations as well as index files for the STAR aligner version 2.7.10b.

<genome root directory>/hg38
├── fasta
├── gtf
│   ├── ensembl
│   └── refseq
├── star_2.7.10b

Specify the genome "root" directory in the pipeline configuration file (config/config.yaml), parameter genome_dir (see below).

Other species (top)

Reference genomes of other species can be generated, for example, with the refsnake Snakemake workflow.

Metadata (top)

The main input to the workflow is a tab-delimited text file containing metadata information about all samples. See an example file, resources/test-data/metadata.txt, in the previously cloned pipeline directory. The metadata file is composed of one header line and sample information on subsequent lines, one sample per line, organized by several columns as follows:

#ID: unique sample label
GROUP: name of sample condition
FASTQ1: name of forward read fastq file (mate 1, R1)
FASTQ2: Name of the reverse read fastq file (mate 2, R2). Optional for single-end.
Raw: path to the folder containing the input fastq files
Organism: Species name (e.g., Human, Rat, Mouse, Pig)

Notes

Columns #ID and GROUP may not contain white-spaces and other special characters.
Additional columns for experiment or specimen information can be added, but they are not used by the pipeline.
For single-end libraries, the FASTQ2 column is not necessary, or the field can be left blank if the dataset contains a mix of paired- and single-end libraries.

Example (top)

#ID	GROUP	FASTQ1	FASTQ2	Raw	Organism
GSM5362225	HOXB9-KO	10k_SRR14749833_1.fastq.gz	10k_SRR14749833_2.fastq.gz	resources/test-data/fastq	Human
GSM5362224	HOXB9-T133A	10k_SRR14749832_1.fastq.gz	10k_SRR14749832_2.fastq.gz	resources/test-data/fastq	Human
GSM5362223	HOXB9	10k_SRR14749831_1.fastq.gz	10k_SRR14749831_2.fastq.gz	resources/test-data/fastq	Human

Configuration (top)

The workflow requires several parameters to be configured, most of which can be set through a yaml configuration file. A template file named config.yaml is provided in the config directory. It's recommended to create a local copy of the template and make modifications there. Note that many of these parameters can also be specified through the wrapper script run.py, as explained in the next section. Parameters specified on the command line through the wrapper script run.py will overwrite parameters set in the configuration file.

To learn about all possible parameters, execute:

python run.py --help

It is important to specify the following parameters in the config.yaml

genome root and sub- directory: genome_dir: VALUE
singularity images directory: singularity: prefix: VALUE
snakemake path: snakemake: path: VALUE
sample metadata file: metadata: file: VALUE and metadata: group_name: VALUE
genome version (optional): species: VALUE

Usage (top)

Run on a cluster with LSF scheduler, up to 100 jobs in parallel

python run.py --config <path to config.yaml> --jobs 100

Run locally, using up to 12 cores

python run.py --config <path to config.yaml> --cores 12

Run the pipeline with the test data set and specify four optional parameters on the command line

python run.py \
    --config config/config.yaml \
    --metadata-file resources/test-data/metadata.txt \
    --snakemake-path=<path to snakemake> \
    --genome-dir <genome root directory> \
    --outdir test-data_output \
    --jobs 50

By this, sample metadata, in particular the path for the raw input data, i.e. "fastq" files, are given by the input metadata file located at resources/test-data/metadata.txt. The test data set consists of three sub samples fastq files from A549 cell line samples from GEO study GSM5362223. By the command above, the Snakemake jobs will be submitted to the cluster queue. A maximum of 50 jobs will be processed simultaneously. All output will be written to a new folder named test-data_output. All other pipeline configuration parameters will be used from the default config template file, config/config.yaml.

Output (top)

Overview

Folder structure of a typical workflow run (* = optional output).

├── annot                            genome annotation files
├── bam*                             read mappings in BAM format (optional)
├── bw*                              read coverage BigWig files (optional)
├── config.yaml                      workflow configuration file
├── cram*                            read mappings in CRAM format (optional)
├── cutadapt*                        Cudapapt output (optional)
├── fastq*                           copy of input reads (optional)
├── fastqc                           FASTQC output files
├── fc                               FeatureCounts output files
├── gct                              gene counts and normalized gene counts in GCT file format for RefSeq/Ensembl/Gencode annotations
├── log                              log and output files from the tools used
├── metrics                          RSeQC files, and CrosscheckFingerprints files from Picard tools for human data and hg38/GRCh38p14 genome (optional)
├── multiqc_data                     MultiQC data files
├── multiqc_report_ensembl.html      MultiQC report for Ensembl annotations
├── multiqc_report_gencode.html      MultiQC report for Gencode annotations (human and mouse)
├── multiqc_report_refseq.html       MultiQC report for RefSeq annotations
├── qc                               some QC plots, e.g. PCA or BioQC
├── report.html                      Snakemake report, containing workflow and QC metrics
├── rulegraph.pdf                    workflow DAG in pdf format
├── rulegraph.png                    workflow DAG in png format
├── samples.txt                      sample metadata in tab-delimited file
└── unmapped*                        unmapped reads in FASTQ file format (optional)

Explanations (top)

"annot" folder

Contains copies of reference genome and annotations files (i.e. FASTA, GTF, etc) In addition, sample metadata in tab-delimited text file, phenoData.meta and in cls format phenoData.cls.

"bam" folder (optional)

Contains all aligned reads in BAM file. Only generated if the parameter keep_bam_files is True in the pipeline configuration.

"bw" folder (optional)

Read coverage files in BigWig format. May be used for graphical visualization of the genome coverage by external tools such as JBrowse or IGV. Only generated if the parameter generate_bw_files is True in the pipeline configuration.

"cutadapt" folder (optional)

Output files from the sequencing read trimming tool Cutadapt. Only generated if the parameter cutadapt: run is True in the pipeline configuration.

"config.yaml" file

All configuration parameters stored in a single yaml file.

"cram" folder (optional)

Contains all aligned reads in CRAM file. Only generated if the parameter generate_cram_files is True in the pipeline configuration.

"fastq" folder (optional)

Contains a copy of the input FASTQ files. Only generated if the parameter keep_fastq_files is True in the pipeline configuration.

"fastqc" folder

Contains all output files from the read quality control tool FastQC.

"fc" folder

Intermediate output files, summary and compressed counts file, from the gene quantification step by using FeatureCounts tool from the Subreads package. Parameters for FeatureCounts can be specified via the input configuration file.

"gct" folder

Contains - if available - RefSeq, Ensembl, or Gencode (<db>) annotated gene counts, normalized counts, log2-transformed counts in GCT file format.

<db>_count.gct: RefSeq gene counts
<db>_count_collapsed.gct: RefSeq gene counts collapsed to human orthologous gene symbols (using resources/geneids.chip)
<db>_tpm.gct: normalized RefSeq transcript per million mapped reads (tpm)
<db>_tpm_collapsed.gct: human orthologs of normalized counts (tpm)
<db>_log2tpm.gct: log2-transformed normalized RefSeq transcript per million mapped reads (tpm)

"log" folder

Contains several log files from analysis tools used by the pipeline. Mainly used for debugging purposes.

"metrics" folder (partially optional)

The strandedness of reads is derived by using the RSeQC tool (Wang et al., 2012).

There is an optional rule for cross-checking human sample matching using SNP fingerprints and the Picard tool (BroadPicard-Tool). A haplotype map is a collection of "blocks" of SNPs that are in tight linkage with SNPs of the same block and low linkage with SNPs of different blocks (see Javed et al., 2020). Here, we utilize fingerprint maps from naumanjaved on github. In order to use these features, set the config file parameter crosscheck_fingerprints: True. As of now, this tools works only for the human genome hg38/GRCh38p14.

Also optional are "junction_annotation" and "junction_saturation" from the RSeQC tool. See config file parameter junction_annotation.

"multiqc_data" folder

Output files from the MultiQC tool with RefSeq, Ensembl, Gencode gene annotations (e.g. for Picard RNASeq metrics).

"multiqc_report_refseq.html" file

HTML summary report from the MultiQC tool based on genome annotations from <db> where <db> is refseq, ensembl, gencode (if available).

"qc" folder

Plots for quality control purposes, using genome annotations from <db>.

<db>_bioQC.pdf: Heatmap representation of the BioQC enrichment scores for detecting detecting such tissue heterogeneity (Zhang et al. 2017)
<db>_bioQC_thr2.txt: BioQC enrichment scores above threshold 2
<db>_bioQC.txt: All BioQC enrichment scores
<db>_log2tpm_pca.pdf: Plot of the main components from the principal component analysis on the basis of the log2-transformed normalized gene counts (log2tpm).
<db>_log2tpm_pca.txt: Coordinates of the components from the principal component analysis on the basis of the log2-transformed normalized gene counts (log2tpm).
md5sum.txt: md5 check sums for all fastq files.

"rulegraph.pdf" and "rulegraph.png" files

Graphical representation of the entire workflow. A directed acyclic graph, DAG, generated by Snakemake

"samples.txt" file

Tab-delimited, human readable text file containing study and sample metadata in tabular form (one sample per line).

"unmapped" folder (optional)

Contains unmapped sequencing reads in FASTQ files. Only generated if the parameter generate_unmapped is True in the pipeline configuration.

Release notes (top)

0.2.1 Enable processing of single- and paired-end libraries simultaenoulsy.
0.2.0 STAR index creation and custom BED files.
0.1.13 n/a.

Contributing (top)

Any contribution, feedback and bug report are highly welcome. For major changes, please open an issue first to discuss what you would like to change. Thank you!

License (top)

GNU GPLv3

(To the top of the page)

Name		Name	Last commit message	Last commit date
Latest commit History 196 Commits
config		config
profile_slurm		profile_slurm
resources		resources
workflow		workflow
.bumpversion.cfg		.bumpversion.cfg
.gitignore		.gitignore
README.md		README.md
VERSION		VERSION
lsf.yaml		lsf.yaml
run.py		run.py

bedapub/bksnake

Folders and files

Latest commit

History

Repository files navigation