Home
This is a collection of pipelines to be used for NGS (both DNA and RNA) analyses, from alignment to variant calling.
Start by creating a clone of the repository:
cd /path/to/some/directory
git clone https://github.com/pughlab/pipeline-suite/
Additionally, the report generation portion of this tool requires installation of the BPG plotting package for R: https://CRAN.R-project.org/package=BoutrosLab.plotting.general

The pipeline-suite runs using parameters provided in YAML format. There are two types of required configuration files:
- data configs:
  - dna_fastq_config.yaml and rna_fastq_config.yaml, which can be generated using create_fastq_yaml.pl
  - bam_config.yaml, generated by any tool which outputs BAMs that are required for downstream steps
- pipeline configs (fastqc_tool_config.yaml, dna_pipeline_config.yaml and rna_pipeline_config.yaml):
  - these specify common parameters, including:
    - project name
    - sequencing type (wgs, exome, rna or targeted), sequencing center and platform
    - ref_type (i.e., hg19 or hg38)
    - path to desired output directory (will be created if this is the initial run)
    - paths to tool-specific reference files/directories
    - desired versions of tools
    - and, for each tool, memory and run time parameters for each step
For tool-specific details for RNA-Seq configuration, see: RNA-Seq.
For tool-specific details for DNA-Seq configuration, see: DNA-Seq.
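
Because a typo in a pipeline config may only surface after jobs have been submitted, a quick pre-flight check on the parsed config can help. The following is a minimal sketch, not part of pipeline-suite: it assumes the YAML has already been loaded into a Python dict, and the key names `project_name`, `seq_type` and `output_dir` are hypothetical placeholders (only `ref_type` and the four sequencing types are taken from the list above). Check the template configs in the repository for the real key names.

```python
# Sequencing types named in the configuration overview above.
SEQ_TYPES = {"wgs", "exome", "rna", "targeted"}

# HYPOTHETICAL key names (except ref_type, which the overview names);
# replace them with the keys used in the repository's template configs.
REQUIRED_KEYS = ("project_name", "seq_type", "ref_type", "output_dir")


def check_pipeline_config(config):
    """Return a list of human-readable problems found in a parsed config dict."""
    # Report every required key that is absent or empty.
    problems = [
        f"missing required key: {key}"
        for key in REQUIRED_KEYS
        if not config.get(key)
    ]
    # Report a sequencing type outside the four listed in the overview.
    seq_type = config.get("seq_type")
    if seq_type and seq_type not in SEQ_TYPES:
        problems.append(f"unrecognised seq_type: {seq_type}")
    return problems
```

An empty list means no problems were detected by this sketch; it says nothing about tool-specific parameters, which are described on the RNA-Seq and DNA-Seq pages.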
- Run FASTQC to verify fastq quality:
module load perl
perl /path/to/collect_fastqc_metrics.pl \
-d /path/to/fastq_config.yaml \
-t /path/to/fastqc_tool_config.yaml \
-c slurm \
{optional: --rna, --dry-run }
Review the FASTQC results before running downstream pipelines. In particular, ensure that read length is consistent, GC content is similar (typically between 40-60%) and files are unique (no duplicated md5sums).
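
The uniqueness check can also be done independently of the FASTQC step. The sketch below is an illustration only and is not how collect_fastqc_metrics.pl works internally; the function names `md5sum` and `find_duplicate_files` are hypothetical. It hashes each file in chunks and reports any groups of files that share an MD5 digest.

```python
import hashlib
from collections import defaultdict


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_files(paths):
    """Group paths by MD5 digest; return only digests shared by more than one file."""
    by_digest = defaultdict(list)
    for path in paths:
        by_digest[md5sum(path)].append(str(path))
    return {
        digest: sorted(files)
        for digest, files in by_digest.items()
        if len(files) > 1
    }
```

Passing the list of fastq paths from the data config to `find_duplicate_files` returns an empty dict when every file is unique; any entry in the result points to files with identical content.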
- Prepare interval files (ie, for WXS):
For WXS or targeted-sequencing panels, provide a bed file containing the target regions (listing at minimum: chromosome, start and end positions). The variant calling pipelines MuTect and Mutect2 add 100bp of padding to each region provided. For consistency, this padding must be added manually prior to variant calling with other tools (ie, Strelka, SomaticSniper, VarDict and VarScan). The script below adds this padding and additionally creates a bgzipped version of the padded interval file, as required by Strelka.
module load perl
perl /path/to/format_interval_bed.pl \
-b /path/to/base/intervals.bed \
-r /path/to/reference.fa
Make sure you have write permission on the directory containing the intervals bed file, as the script writes its output files to the same directory as the original bed file!
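
To make the padding step concrete, here is a minimal sketch of what adding 100bp to each region means for a tab-separated bed record. It is an illustration under stated assumptions, not the implementation of format_interval_bed.pl: the function names `pad_bed_line` and `pad_bed` are hypothetical, starts are clamped at 0, and the sketch neither clamps ends to chromosome lengths nor merges regions that overlap after padding.

```python
def pad_bed_line(line, padding=100):
    """Pad one tab-separated BED record by `padding` bp on each side."""
    fields = line.rstrip("\n").split("\t")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    # BED starts are 0-based, so a padded start must not go below 0.
    padded_start = max(0, start - padding)
    padded_end = end + padding
    # Keep any extra columns (name, score, ...) unchanged.
    return "\t".join([chrom, str(padded_start), str(padded_end)] + fields[3:])


def pad_bed(lines, padding=100):
    """Yield padded records; header and comment lines pass through unchanged."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        if line.startswith(("#", "track", "browser")):
            yield line.rstrip("\n")
        else:
            yield pad_bed_line(line, padding)
```

For example, the record `chr1	150	300	exon1` becomes `chr1	50	400	exon1`, while a region starting at position 20 is padded back only as far as 0.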
- Run DNA (or RNA) pipeline:
module load perl
perl /path/to/pughlab_dnaseq_pipeline.pl \
-t /path/to/dna_pipeline_config.yaml \
-d /path/to/dna_fastq_config.yaml \
--preprocessing \
--variant_calling \
--create_report \
-c slurm
This will generate the directory structure in the output directory (provided in /path/to/dna_pipeline_config.yaml), including a "logs/run_DNA_pipeline_TIMESTAMP/" directory containing a file "run_DNASeq_pipeline.log", which lists the individual tool commands. These commands can be run separately if "--dry-run" was set, or if a stage fails and you don't need to re-run the entire pipeline. (Note: re-running the entire pipeline will not regenerate files that already exist.)