README.md (17 additions & 13 deletions)
@@ -1,4 +1,4 @@
-# tron-bam-preprocessing
+# TRONflow BAM preprocessing pipeline
Nextflow pipeline for the preprocessing of BAM files based on Picard and GATK.
@@ -9,24 +9,21 @@ In order to have a variant calling ready BAM file there are a number of operatio
GATK provides a well-known best practices document on BAM preprocessing. The latest best practices for GATK4 (https://software.broadinstitute.org/gatk/best-practices/workflow?id=11165) no longer perform realignment around indels, unlike the best practices for GATK3 (https://software.broadinstitute.org/gatk/documentation/article?id=3238). This pipeline is based on both Picard and GATK. These best practices have been implemented a number of times; see for instance this implementation in Workflow Definition Language (https://github.com/gatk-workflows/gatk4-data-processing/blob/master/processing-for-variant-discovery-gatk4.wdl).
-At TRON we have a number of implementations of the BAM preprocessing pipeline, each one of those varies depending on the context. For instance, the script to run mutect has this pipeline embedded, see /code/iCaM/scripts/mutect.sh. This is repeated in some other places.
## Objectives
-We aim at providing a single implementation of the BAM preprocessing pipeline that can be used across different situations. For this purpose there are some required steps and some optional steps. This is implemented as a Nextflow pipeline to simplify parallelization of execution in the cluster. The default configuration uses reference genome hg19, if another reference is needed the adequate resources must be provided. The reference genome resources for hg19 are installed in /projects/data/gatk_bundle/hg19 and they were downloaded from https://software.broadinstitute.org/gatk/download/bundle
+We aim to provide a single implementation of the BAM preprocessing pipeline that can be used across different situations. For this purpose there are some required steps and some optional steps. This is implemented as a Nextflow pipeline to simplify parallelization of execution in the cluster. The default configuration uses reference genome hg19; if another reference is needed, the appropriate resources must be provided. The reference genome resources for hg19 were downloaded from https://software.broadinstitute.org/gatk/download/bundle
-The input is a configuration file so multiple BAMs can run easily. The output is another tab-separated values file with the absolute paths of the preprocessed and indexed BAMs.
+The input is a tab-separated values file where each line corresponds to one input BAM. The output is another tab-separated values file with the absolute paths of the preprocessed and indexed BAMs.
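To make the input format concrete, here is a minimal Python sketch that writes and re-reads such a table. The three columns (sample name, sample type, path to the BAM) follow the help text quoted later in this README; the sample names, the paths, and the absence of a header line are assumptions made only for illustration, and the helper functions are not part of the pipeline.

```python
import csv
import io

# Hypothetical samples: (sample name, sample type, path to the BAM file).
SAMPLES = [
    ("sample_1", "tumor", "/data/sample_1.tumor.bam"),
    ("sample_1_normal", "normal", "/data/sample_1.normal.bam"),
]


def write_input_table(rows, handle):
    """Write one tab-separated row per input BAM (assumed: no header line)."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for name, sample_type, bam_path in rows:
        writer.writerow([name, sample_type, bam_path])


def read_input_table(handle):
    """Parse the table back, rejecting rows that do not have three columns."""
    rows = []
    reader = csv.reader(handle, delimiter="\t")
    for line_number, fields in enumerate(reader, start=1):
        if len(fields) != 3:
            raise ValueError(
                f"line {line_number}: expected 3 columns, got {len(fields)}"
            )
        rows.append(tuple(fields))
    return rows


buffer = io.StringIO()
write_input_table(SAMPLES, buffer)
table_text = buffer.getvalue()
parsed = read_input_table(io.StringIO(table_text))
print(table_text, end="")
```

Checking the column count before launching the workflow is a cheap way to catch a malformed row early; the layout of the output table is not specified here, so it is not modelled.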
## Implementation
Steps:
* **Clean BAM**. Sets the mapping quality to 0 for all unmapped reads and avoids soft clipping going beyond the reference genome boundaries. Implemented in Picard.
* **Reorder chromosomes**. Makes the chromosomes in the BAM follow the same order as the reference genome. Implemented in Picard.
-* **Sort by query name**. Ensuring the order by query name allows to find duplicates also in the unpaired and secondary alignment reads. Implemented in Picard
* **Add read groups**. GATK requires that some headers are added to the BAM. We also want to flag the normal and tumor BAMs in the header, as some callers, such as Mutect2, require it. Implemented in Picard.
-* **Mark duplicates** (optional). Identify the PCR and the optical duplications and marks those reads. Implemented in Picard
-* **Sort by coordinates**. This order is required by all GATK tools. Implemented in Picard
+* **Mark duplicates** (optional). Identifies PCR and optical duplicates and marks those reads. This uses the parallelized version on Spark, which is reported to scale linearly up to 16 CPUs.
* **Realignment around indels** (optional). This procedure is important for locus-based variant callers, but it is not needed for any variant caller that performs haplotype assembly. It is computationally intensive, as it first finds regions for realignment where there are indications of indels and then performs a local realignment over those regions. Implemented in GATK3, deprecated in GATK4.
* **Base Quality Score Recalibration (BQSR)** (optional). It aims to correct systematic errors made by the sequencer when assigning base call quality scores. As these scores are used by variant callers, it improves variant calling in some situations. Implemented in GATK4.
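The split between required and optional steps can be sketched as a small planning function. This is only an illustration of the ordering described in the list as it reads after this change; the `plan_steps` helper and its `skip` argument are hypothetical and do not correspond to the pipeline's real parameters, and the diff hunk may not show the complete list of steps.

```python
# Steps in the order given in the README after this change.
# The second field records whether the README marks the step as optional.
STEPS = [
    ("Clean BAM", False),
    ("Reorder chromosomes", False),
    ("Add read groups", False),
    ("Mark duplicates", True),
    ("Realignment around indels", True),
    ("Base Quality Score Recalibration (BQSR)", True),
]


def plan_steps(skip=()):
    """Return the step names to run, in order, leaving out skipped optional steps.

    Raises ValueError if `skip` names an unknown step or a required one.
    """
    optional = {name for name, is_optional in STEPS if is_optional}
    known = {name for name, _ in STEPS}
    for name in skip:
        if name not in known:
            raise ValueError(f"unknown step: {name}")
        if name not in optional:
            raise ValueError(f"step is required and cannot be skipped: {name}")
    skipped = set(skip)
    return [name for name, _ in STEPS if name not in skipped]


# Example: a haplotype-assembling caller does not need indel realignment.
plan = plan_steps(skip=["Realignment around indels"])
for position, name in enumerate(plan, start=1):
    print(position, name)
```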
@@ -35,13 +32,11 @@ Steps:
## How to run it
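The invocation documented in this section can also be assembled programmatically, for instance from a wrapper script. The sketch below builds the command line from the parameters named in the help text (`input_files`, `output`, `prepare_bam_cpus`, `platform`) and checks the platform against the listed valid values. It assumes each parameter is passed as `--name value`, which is how Nextflow handles pipeline parameters; the `build_command` helper and the file names are hypothetical.

```python
import shlex

PIPELINE = "tron-bioinformatics/tronflow-bam-preprocessing"
# Valid values as listed in the help text of the pipeline.
VALID_PLATFORMS = ("ILLUMINA", "SOLID", "LS454", "HELICOS", "PACBIO")


def build_command(input_files, output=None, platform="ILLUMINA",
                  prepare_bam_cpus=None, revision="v1.0.0"):
    """Return the nextflow command as a list of arguments (hypothetical helper)."""
    if platform not in VALID_PLATFORMS:
        raise ValueError(f"invalid platform: {platform}")
    command = ["nextflow", "run", PIPELINE, "-r", revision,
               "--input_files", input_files]
    if output is not None:
        command += ["--output", output]
    command += ["--platform", platform]
    if prepare_bam_cpus is not None:
        command += ["--prepare_bam_cpus", str(prepare_bam_cpus)]
    return command


command = build_command("input_files.tsv", output="preprocessed")
# shlex.join quotes arguments safely for display or for a shell script.
print(shlex.join(command))
```

Returning a list rather than a single string lets the caller hand it directly to `subprocess.run` without shell quoting concerns.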
```
--bash-4.2$ nextflow main.nf --help
+$ nextflow run tron-bioinformatics/tronflow-bam-preprocessing -r v1.0.0 --help
This workflow is based on the implementation at /code/iCaM/scripts/mutect.sh
+main.nf --input_files input_files
Input:
* input_files: the path to a tab-separated values file containing in each row the sample name, sample type (eg: tumor or normal) and path to the BAM file
* output: the folder where to publish output, if not provided they will be moved to "output" folder inside the workflow folder
+* output: the folder where to publish output, if not provided they will be moved to "output" folder inside the workflow folder
+* prepare_bam_cpus: default 3
+* platform: the platform to be added to the BAM header. Valid values: [ILLUMINA, SOLID, LS454, HELICOS and PACBIO] (default: ILLUMINA)