-
-
Notifications
You must be signed in to change notification settings - Fork 62
Tutorial DRAFT2
Lets learn CWL based on a simple flat three steps workflow. Workflow will make a BigWig file from a bam file. BigWig file will represent the reads or fragment coverage for the particular genome (mm9/mm10/hg19/xenTro7/etc). To do so I will use bedtools genomecov
subcommand then I have to sort
and finally to run bedGraphToBigWig
from UCSC userApps package. I would like to have a workflow that accepts ChIP-Seq and RNA-Seq data. To create a coverage file I will use bedtools genomecov with following args:
- -g GenomeFile - chromosome sizes for the genome
- -ratio For all the data I have normalization coefficient
- -split To show splicing for RNA-Seq data (A small notice ChIP-seq data do not have splices but might have a gap in a read if some one uses bowtie2 for instance, so this param should not affect ChIP-seq data too much I'm going to use it by default, but of course we can make it optional)
- -bg To output data in BedGraph format
- -fs To use fragmentize instead of reads
- -strand, -pc - other optional args
These args describe different behavior for different data sets that I have RNA/ChIP single/pair stranded/unstranded.
It is good to wrap each tool in separate .cwl file to avoid possible ID conflicts
I'm going to create three separate wrappers for each individual tool.
I know a few things about bedtools - base command and subcommand can I write a CWL? Yes! I'm ready to write few lines of CWL. I have to declare class CommandLineTool and specify required properties: baseCommand, inputs, outputs. Lets create a first CWL file bedtools-genomecov.cwl
, don't forget to make it executable (chmod +x bedtools-genomecov.cwl
) with next lines:
#!/usr/bin/env cwl-runner
class: CommandLineTool
description: |
Tool: bedtools genomecov (aka genomeCoverageBed)
Sources: https://github.com/arq5x/bedtools2
Summary: Compute the coverage of a feature file among a genome.
Usage: bedtools genomecov [OPTIONS] -i <bed/gff/vcf> -g <genome>
inputs: []
outputs: []
baseCommand: ["bedtools", "genomecov"]
Congratulations, now I can execute this tool directly or run cwltool bedtools-genomecov.cwl
. It will show the help from bedtools genomecov. Great, I can provide as closest as possible CWL description.
Tool: bedtools genomecov (aka genomeCoverageBed)
Version: v2.25.0
Summary: Compute the coverage of a feature file among a genome.
Usage: bedtools genomecov [OPTIONS] -i <bed/gff/vcf> -g <genome>
Options:
-ibam The input file is in BAM format.
Note: BAM _must_ be sorted by position
...
-bg Report depth in BedGraph format. For details, see:
genome.ucsc.edu/goldenPath/help/bedgraph.html
...
Time to add all the inputs that I'm going to use:
#!/usr/bin/env cwl-runner
class: CommandLineTool
description: |
Tool: bedtools genomecov (aka genomeCoverageBed)
Sources: https://github.com/arq5x/bedtools2
Summary: Compute the coverage of a feature file among a genome.
Usage: bedtools genomecov [OPTIONS] -i <bed/gff/vcf> -g <genome>
inputs:
- id: "#input"
type: File
description: |
The input file can be in BAM format
(Note: BAM _must_ be sorted by position)
or <bed/gff/vcf>
inputBinding:
position: 1
prefix: "-ibam"
- id: "#genomeFile"
type: File
description: 'Input genome file.'
inputBinding:
position: 2
prefix: "-g"
- id: "#dept"
type: boolean
default: true
inputBinding:
position: 4
prefix: -bg
- id: "#scale"
type: ["null",float ]
description: |
Scale the coverage by a constant factor.
Each coverage value is multiplied by this factor before being reported.
Useful for normalizing coverage by, e.g., reads per million (RPM).
- Default is 1.0; i.e., unscaled.
- (FLOAT)
inputBinding:
position: 4
prefix: -scale
- id: "#split"
type: ["null",boolean]
description: |
reat "split" BAM or BED12 entries as distinct BED intervals.
when computing coverage.
For BAM files, this uses the CIGAR "N" and "D" operations
to infer the blocks for computing coverage.
For BED12 files, this uses the BlockCount, BlockStarts, and BlockEnds
fields (i.e., columns 10,11,12).
inputBinding:
position: 4
prefix: "-split"
- id: "#strand"
type: ["null", string]
description: |
Calculate coverage of intervals from a specific strand.
With BED files, requires at least 6 columns (strand is column 6).
- (STRING): can be + or -
inputBinding:
position: 4
prefix: "-strand"
- id: "#pairchip"
type: ["null", boolean]
description: "pair-end chip seq experiment"
inputBinding:
position: 4
prefix: "-pc"
- id: "#fragmentsize"
type: ["null", int]
description: "fixed fragment size"
inputBinding:
position: 4
prefix: "-fs"
outputs: []
baseCommand: ["bedtools", "genomecov"]
inputs is a list of command input parameters and I have to provide at least the id and type, id has to be uniq. To make an argument optional I have to add "null" type to the type's list, for instance, type: ["null", int]
optional argument of type int. inputBinding makes the parameter appear in the final command line. To add a prefix before parameter use prefix. To bind to a specific position use position. If you need a parameter for the preprocessing in CWL and not for the final command line do not use inputBinding object. For the bedtools genomecov actual order of arguments is not important so I can avoid to use position.
I'm ready to run my first job. Job is a json file with all required parameters for cwl:
{
"input": {
"class": "File",
"path": "./test-files/rna.SRR948778.bam"
},
"genomeFile": {
"class": "File",
"path": "./test-files/mm10-chrNameLength.txt"
},
"scale":1,
"split": true,
"dept":true
}
To run a job run cwltool bedtools-genomecov.cwl bedtools-genomecov-job.json
, cwltool checks all required parameters and types as well as the provided files are exist if everything consistent cwltool executes the command. And I've got the bedGraph output to stdout.
I would like to have cwl description as close as possible to the original tool. From the help I know that for genomecov I can use either -d, -bg or -bga but not together. It also requires -i if it is bed/gff/vcf file formats or -ibam if it is bam format and I have to have index file together with bam.
To achieve -d, -bg or -bga behavior I can use enum type.
- id: "#dept"
type:
name: "JustDepts"
type: enum
symbols: ["-bg","-bga","-d"]
inputBinding:
position: 4
And I have to replace the "dept": true
with "dept":"-bg"
in the job file.
I can safely assume that gff file has .gff extension and bam file has .bam it might be different in other environments. In draft2 it's probably the only way to check the file type. So, I'm going to change the parameter 'input' (id:"#input") to make it checks type of the file and add either -i or -ibam prefix to the provided file. To do that I use JavaScript and nodejs engine. In inputBinding in valueFrom object I add engine: node-engine.cwl and a script. Script executes a code as it is a function we have to return a value. To access inputs IDes in JavaScript, CWL provides $job object with keys as IDes, for example: $job['input'] holds object with id "#input". To access file path I can use $job['input'].path then check the extension. My function returns an array [prefix,file_path] (example: ['-ibam','my.bam']) so in final command line they appear in this order.
var prefix = ((/.*\.bam$/i).test($job['input'].path))?'-ibam':'-i'; //either -ibam or -i
return [prefix,$job['input'].path]; //use the original path and add the prefix
The final CWL block:
- id: "#input"
type: File
description: |
The input file can be in BAM format
(Note: BAM _must_ be sorted by position)
or <bed/gff/vcf>
inputBinding:
position: 1
valueFrom:
engine: node-engine.cwl
script: |
{
var prefix = ((/.*\.bam$/i).test($job['input'].path))?'-ibam':'-i';
return [prefix,$job['input'].path];
}
I can also assume that if it is a bam file the corresponding index is in the same directory and after .bam it has .bai, example: "my.bam.bai". In terms of CWL it is a secondary file and it will go into secondaryFiles object, but why do we need it? The secondary files do not form the command line, but it is the instruction to CWL platform to check their existence, map them into a docker image or copy them to a location in a cloud. So it is just a best practice :). I add almost the same code to the secondaryFiles object except I return empty array [] if it is not a bam file. And return an object of type File if it is .bam {"path":"my.bam.bai", "class": "File"} now it is not a string.
- id: "#input"
type: File
description: |
The input file can be in BAM format
(Note: BAM _must_ be sorted by position)
or <bed/gff/vcf>
inputBinding:
position: 1
secondaryFiles:
- engine: node-engine.cwl
script: |
{
if ((/.*\.bam$/i).test($job['input'].path))
return {"path": $job['input'].path+".bai", "class": "File"};
return [];
}
valueFrom:
engine: node-engine.cwl
script: |
{
var prefix = ((/.*\.bam$/i).test($job['input'].path))?'-ibam':'-i';
return [prefix,$job['input'].path];
}
And the final thing is to redirect stdout to a file. It is good to provide future file name to save stdout. To do that we need a new parameter in inputs it holds file name type string. I would reference this name in outputs and stdout objects of CWL. To reference input I use special case engine cwl:JsonPointer and in the script string I provide /job/#id in my case it is /job/genomecoverage. I remember that I have to add this parameter into the job file "genomecoverageout": "./rna.SRR948778.bedGraph"
- id: "#genomecoverageout"
type: string
outputs:
- id: "#genomecoverage"
type: File
description: "The file containing the genome coverage"
outputBinding:
glob:
engine: cwl:JsonPointer
script: /job/genomecoverageout
stdout:
engine: cwl:JsonPointer
script: /job/genomecoverageout
The latest workable CWL bedtools-genomecov.cwl
To pass my data to the bedGraphToBigWig tool. I have to sort it by first column and then by second column within same first key. To do that from command line sort -k1,1 -k2,2n filename. Argument -k can be repeat as many times as I'd like. For arguments that can be repeated I use CWL array:
- id: "#key"
type:
type: array
items: string
description: |
-k, --key=POS1[,POS2]
start a key at POS1, end it at POS2 (origin 1)
However, prefix will be used just before generated string and I can only separate elements with a character. Idea, I do not add inputBinding and going to use this array to form the command line in arguments.
arguments:
- valueFrom:
engine: node-engine.cwl
script: $job['key'].map(function(i) {return "-k"+i;})
If I am crazy enough I can add regular expression to check each parameter /\d+,\d+n?/g - I am not :). linux-sort.cwl
To make a wrapper for this one I can use what I've learned no new tricks. ucsc-bedGraphToBigWig.cwl
I have all three tools wrapped in CWL. I know tools input parameters and output data. I'm ready to create my workflow. Workflow has it's own input and output lists it might be confused, but actually not, remember that from the workflow you need a final result but not an intermediate. In my bam-to-bigwig workflow, I would like to provide bam file, normalization coefficient, split when it is RNA-seq, fragment size when it single-end ChIP-Seq, force to use pair-end fragment when it is pair-end ChIP parameters and get bigWig file as output.
#!/usr/bin/env cwl-runner
class: Workflow
inputs:
- id: "#input"
type: File
- id: "#genomeFile"
type: File
- id: "#scale"
type: float
- id: "#pairchip"
type: ["null",boolean]
- id: "#fragmentsize"
type: ["null",int]
- id: "#strand"
type: ["null",string]
- id: "#bigWig"
type: string
outputs:
- id: "#outfile"
type: File
source: "#bigwig.bigWigOut"