Commit 6f82d87

Author: Pablo Riesgo Ferreiro

Merge branch 'develop' into 'master'

Release 1.0.0

See merge request tron/tron-bam-preprocessing!2

2 parents 786b16f + 013bf4e, commit 6f82d87

19 files changed, +76491 -137 lines changed

.gitignore

Lines changed: 8 additions & 0 deletions
@@ -1 +1,9 @@
 .idea
+work
+output
+.nextflow*
+report.html*
+timeline.html*
+trace.txt*
+dag.dot*
+*.swp

.gitlab-ci.yml

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+image: openjdk:11.0.10-jre-buster
+
+
+before_script:
+  - java -version
+  - apt-get update && apt-get --assume-yes install wget make procps
+  - wget -qO- https://get.nextflow.io | bash && cp nextflow /usr/local/bin/nextflow
+  - nextflow help
+  - wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
+  - mkdir /root/.conda
+  - bash Miniconda3-latest-Linux-x86_64.sh -b && cp /root/miniconda3/bin/* /usr/local/bin/
+  - rm -f Miniconda3-latest-Linux-x86_64.sh
+  - conda --version
+
+stages:
+  - test
+
+test:
+  stage: test
+  script:
+    - make clean test

Makefile

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+clean:
+	rm -rf output
+	#rm -rf work
+	rm -f report.html*
+	rm -f timeline.html*
+	rm -f trace.txt*
+	rm -f dag.dot*
+	rm -f .nextflow.log*
+	rm -rf .nextflow*
+
+test:
+	nextflow main.nf -profile test,conda --output output/test1
+	nextflow main.nf -profile test,conda --skip_bqsr --output output/test2
+	nextflow main.nf -profile test,conda --skip_realignment --output output/test3
+	nextflow main.nf -profile test,conda --skip_deduplication --output output/test4
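
Reviewer note: the `test` target runs the pipeline once with defaults and once per optional step skipped. The following Python sketch only rebuilds those four command lines so the test matrix is easier to see; the names `BASE`, `VARIANTS` and `build_test_commands` are hypothetical and not part of this commit.

```python
import shlex

# Common prefix of every invocation in the Makefile `test` target.
BASE = ["nextflow", "main.nf", "-profile", "test,conda"]

# None is the default run; each other entry disables one optional step.
VARIANTS = [None, "--skip_bqsr", "--skip_realignment", "--skip_deduplication"]


def build_test_commands(output_root="output"):
    """Return one argument list per test run, numbered test1..test4."""
    commands = []
    for index, flag in enumerate(VARIANTS, start=1):
        command = list(BASE)
        if flag is not None:
            command.append(flag)
        command += ["--output", f"{output_root}/test{index}"]
        commands.append(command)
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them; Nextflow is not required here.
    for command in build_test_commands():
        print(shlex.join(command))
```

Each run publishes to its own folder, so a failure in one variant does not overwrite the results of another.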

README.md

Lines changed: 17 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
# tron-bam-preprocessing
1+
# TRONflow BAM preprocessing pipeline
22

33
Nextflow pipeline for the preprocessing of BAM files based on Picard and GATK.
44

@@ -9,24 +9,21 @@ In order to have a variant calling ready BAM file there are a number of operatio
 
 GATK has been providing a well known best practices document on BAM preprocessing, the latest best practices for GATK4 (https://software.broadinstitute.org/gatk/best-practices/workflow?id=11165) does not perform anymore realignment around indels as opposed to best practices for GATK3 (https://software.broadinstitute.org/gatk/documentation/article?id=3238). This pipeline is based on both Picard and GATK. These best practices have been implemented a number of times, see for instance this implementation in Workflow Definition Language https://github.com/gatk-workflows/gatk4-data-processing/blob/master/processing-for-variant-discovery-gatk4.wdl.
 
-At TRON we have a number of implementations of the BAM preprocessing pipeline, each one of those varies depending on the context. For instance, the script to run mutect has this pipeline embedded, see /code/iCaM/scripts/mutect.sh. This is repeated in some other places.
 
 ## Objectives
 
-We aim at providing a single implementation of the BAM preprocessing pipeline that can be used across different situations. For this purpose there are some required steps and some optional steps. This is implemented as a Nextflow pipeline to simplify parallelization of execution in the cluster. The default configuration uses reference genome hg19, if another reference is needed the adequate resources must be provided. The reference genome resources for hg19 are installed in /projects/data/gatk_bundle/hg19 and they were downloaded from https://software.broadinstitute.org/gatk/download/bundle
+We aim at providing a single implementation of the BAM preprocessing pipeline that can be used across different situations. For this purpose there are some required steps and some optional steps. This is implemented as a Nextflow pipeline to simplify parallelization of execution in the cluster. The default configuration uses reference genome hg19, if another reference is needed the adequate resources must be provided. The reference genome resources for hg19 were downloaded from https://software.broadinstitute.org/gatk/download/bundle
 
-The input is a configuration file so multiple BAMs can run easily. The output is another tab-separated values file with the absolute paths of the preprocessed and indexed BAMs.
+The input is a tab-separated values file where each line corresponds to one input BAM. The output is another tab-separated values file with the absolute paths of the preprocessed and indexed BAMs.
 
 ## Implementation
 
 Steps:
 
 * **Clean BAM**. Sets the mapping quality to 0 for all unmapped reads and avoids soft clipping going beyond the reference genome boundaries. Implemented in Picard
 * **Reorder chromosomes**. Makes the chromosomes in the BAM follow the same order as the reference genome. Implemented in Picard
-* **Sort by query name**. Ensuring the order by query name allows to find duplicates also in the unpaired and secondary alignment reads. Implemented in Picard
 * **Add read groups**. GATK requires that some headers are adde to the BAM, also we want to flag somehow the normal and tumor BAMs in the header as some callers, such as Mutect2 require it. Implemented in Picard.
-* **Mark duplicates** (optional). Identify the PCR and the optical duplications and marks those reads. Implemented in Picard
-* **Sort by coordinates**. This order is required by all GATK tools. Implemented in Picard
+* **Mark duplicates** (optional). Identify the PCR and the optical duplications and marks those reads. This uses the parallelized version on Spark, it is reported to scale linearly up to 16 CPUs.
 * **Realignment around indels** (optional). This procedure is important for locus based variant callers, but for any variant caller doing haplotype assembly it is not needed. This is computing intensive as it first finds regions for realignment where there are indication of indels and then it performs a local realignment over those regions. Implemented in GATK3, deprecated in GATK4
 * **Base Quality Score Recalibration (BQSR)** (optional). It aims at correcting systematic errors in the sequencer when assigning the base call quality errors, as these scores are used by variant callers it improves variant calling in some situations. Implemented in GATK4
 

@@ -35,13 +32,11 @@ Steps:
 ## How to run it
 
 ```
--bash-4.2$ nextflow main.nf --help
+$ nextflow run tron-bioinformatics/tronflow-bam-preprocessing -r v1.0.0 --help
 N E X T F L O W ~ version 19.07.0
-Launching `bam_preprocessing.nf` [intergalactic_shannon] - revision: e707c77d7b
+Launching `main.nf` [intergalactic_shannon] - revision: e707c77d7b
 Usage:
-bam_preprocessing.nf --input_files input_files
-
-This workflow is based on the implementation at /code/iCaM/scripts/mutect.sh
+main.nf --input_files input_files
 
 Input:
 * input_files: the path to a tab-separated values file containing in each row the sample name, sample type (eg: tumor or normal) and path to the BAM file
@@ -61,7 +56,15 @@ Optional input:
 * skip_bqsr: optionally skip BQSR
 * skip_realignment: optionally skip realignment
 * skip_deduplication: optionally skip deduplication
-* output: the folder where to publish output, if not provided they will be moved to "output" folder inside the workflow folder
+* output: the folder where to publish output, if not provided they will be moved to "output" folder inside the workflow folder* prepare_bam_cpus: default 3
+* platform: the platform to be added to the BAM header. Valid values: [ILLUMINA, SOLID, LS454, HELICOS and PACBIO] (default: ILLUMINA)
+* prepare_bam_memory: default 8g
+* mark_duplicates_cpus: default 16
+* mark_duplicates_memory: default 64g
+* realignment_around_indels_cpus: default 2
+* realignment_around_indels_memory: default 32g
+* bqsr_cpus: default 3
+* bqsr_memory: default 4g
 
 Output:
 * Preprocessed and indexed BAMs
@@ -70,4 +73,5 @@ Optional input:
 Optional output:
 * Recalibration report
 * Realignment intervals
+* Duplication metrics
 ```
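
Reviewer note: the updated README describes the `input_files` table as tab-separated, one BAM per row, with sample name, sample type and BAM path. A minimal Python sketch of writing such a file follows. The sample names, the paths and the helper `write_input_files` are hypothetical, and the absence of a header row is an assumption, since the README does not say whether one is expected.

```python
import csv
from pathlib import Path

# Hypothetical samples: (sample name, sample type, path to the BAM file).
SAMPLES = [
    ("patient1_tumor", "tumor", "/data/patient1/tumor.bam"),
    ("patient1_normal", "normal", "/data/patient1/normal.bam"),
]


def write_input_files(rows, destination):
    """Write one tab-separated line per BAM; no header row (assumption)."""
    destination = Path(destination)
    with destination.open("w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for name, sample_type, bam in rows:
            writer.writerow([name, sample_type, bam])
    return destination


if __name__ == "__main__":
    path = write_input_files(SAMPLES, "input_files.tsv")
    print(path.read_text(), end="")
```

The resulting file would then presumably be passed to the pipeline with the `--input_files` option shown in the help text above.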

environment.yml

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,10 @@
1+
# You can use this file to create a conda environment for this pipeline:
2+
# conda env create -f environment.yml
3+
name: tronflow-bam-preprocessing-1.0.0
4+
channels:
5+
- conda-forge
6+
- bioconda
7+
- defaults
8+
dependencies:
9+
- bioconda::gatk4=4.2.0.0
10+
- bioconda::gatk=3.8
