- Setup
git clone https://github.com/leahkemp/pipeface.git
cd pipeface
Note: Variant annotation is only available for hg38
Get a copy of the hg38 reference genome
rsync -a -P rsync://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/analysisSet/hg38.analysisSet.fa.gz .
Check download was successful by checking md5sum
md5sum hg38.analysisSet.fa.gz
Expected md5sum
6d3c82e1e12b127d526395294526b9c8 hg38.analysisSet.fa.gz
gunzip and build index
gunzip hg38.analysisSet.fa.gz
samtools faidx hg38.analysisSet.fa
Get a copy of the hs1 reference genome
rsync -a -P rsync://hgdownload.soe.ucsc.edu/goldenPath/hs1/bigZips/hs1.fa.gz .
Check download was successful by checking md5sum
md5sum hs1.fa.gz
Expected md5sum
a493d5402cc86ecc3f54f6346d980036 hs1.fa.gz
gunzip and build index
gunzip hs1.fa.gz
samtools faidx hs1.fa
Note: You can create a BED file defining the tandem repeat regions you wish to call; alternatively, you can use the catalog below.
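The md5sum comparisons for the reference downloads above are done by eye. As a minimal sketch, they could be scripted so that decompression only happens when the checksum matches. The function name verify_md5 is hypothetical and not part of pipeface; it assumes GNU md5sum is available.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of pipeface): verify a download against its expected md5sum.
verify_md5() {
    local file="$1" expected="$2"
    # md5sum -c reads "<hash>  <file>" lines and exits non-zero on a mismatch
    printf '%s  %s\n' "$expected" "$file" | md5sum -c --quiet -
}

# Example usage, decompressing only when the checksum matches:
#   verify_md5 hg38.analysisSet.fa.gz 6d3c82e1e12b127d526395294526b9c8 && gunzip hg38.analysisSet.fa.gz
#   verify_md5 hs1.fa.gz a493d5402cc86ecc3f54f6346d980036 && gunzip hs1.fa.gz
```

The same helper could be applied to any other download in this section that lists an expected md5sum.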
Get a copy of the Broad Institute tandem repeat catalog
wget https://github.com/broadinstitute/tandem-repeat-catalog/releases/download/v1.0.2/variation_clusters_and_isolated_TRs_v1.0.2.hg38.TRGT.bed.gz
Check download was successful by checking md5sum (before decompressing, since the expected md5sum is for the .gz file)
md5sum variation_clusters_and_isolated_TRs_v1.0.2.hg38.TRGT.bed.gz
Expected md5sum
d50345a1967c507bcdd3cf35c4db27d0 variation_clusters_and_isolated_TRs_v1.0.2.hg38.TRGT.bed.gz
gunzip
gunzip variation_clusters_and_isolated_TRs_v1.0.2.hg38.TRGT.bed.gz
Prepare file for LongTR
cat variation_clusters_and_isolated_TRs_v1.0.2.hg38.TRGT.bed | sed 's/ID.*MOTIFS=//' | sed 's/;.*//' | awk 'length($4) > 1' | awk '$3 - $2 <= 1000' > variation_clusters_and_isolated_TRs_v1.0.2.hg38.TRGT.longtr.bed
Note: checking relatedness is only available for duo/trio mode
wget -O sites.hg38.v0.2.19.vcf.gz https://github.com/brentp/somalier/files/3412456/sites.hg38.vcf.gz
wget -O sites.chm13v2.T2T.v0.2.19.vcf.gz https://github.com/brentp/somalier/files/9954286/sites.chm13v2.T2T.vcf.gz
Clone the Rerio GitHub repository
git clone https://github.com/nanoporetech/rerio
Get a copy of the clair3 models (ONT, via Rerio)
python3 rerio/download_model.py --clair3
Get a copy of the clair3 models (PacBio HiFi Revio)
wget http://www.bio8.cs.hku.hk/clair3/clair3_models/hifi_revio.tar.gz
Untar
tar -xvf hifi_revio.tar.gz
Specify the sample ID, family ID (optional), file path to the data, data type, file path to regions of interest bed file (optional) and file path to clair3 model (if running Clair3) for each data file to be processed. Eg:
sample_id,family_id,family_position,file,data_type,regions_of_interest,clair3_model
sample_01,,,/path/to/PGXXXX240090.fastq.gz,ont,/path/to/regions.bed,/path/to/clair3_models/ont/r1041_e82_400bps_sup_v420/
sample_01,,,/path/to/PGXXXX240091.fastq.gz,ont,/path/to/regions.bed,/path/to/clair3_models/ont/r1041_e82_400bps_sup_v420/
sample_02,,,/path/to/PGXXXX240092.fastq,ont,/path/to/regions.bed,/path/to/clair3_models/ont/r1041_e82_400bps_sup_v420/
sample_03,,,/path/to/PGXXOX240065.bam,ont,NONE,/path/to/clair3_models/ont/r1041_e82_400bps_sup_v420/
sample_04,,,/path/to/m84088_240403_023825_s1.hifi_reads.bc2034.bam,pacbio,NONE,/path/to/clair3_models/hifi_revio/
sample_04,,,/path/to/m84088_240403_043745_s2.hifi_reads.bc2035.bam,pacbio,NONE,/path/to/clair3_models/hifi_revio/
Note: In singleton mode, family_id will only be used to organise the output files into subdirectories of family_id (if provided)
Specify the sample ID, family ID, family position, file path to the data, data type, file path to regions of interest bed file (optional) and file path to clair3 model (if running Clair3) for each data file to be processed. Eg:
sample_id,family_id,family_position,file,data_type,regions_of_interest,clair3_model
sample_01,family01,proband,/path/to/PGXXOX240065.bam,ont,NONE,NONE
sample_01,family01,proband,/path/to/PGXXOX240066.bam,ont,NONE,NONE
sample_02,family01,father,/path/to/PGXXOX240067.bam,ont,NONE,NONE
sample_03,family01,mother,/path/to/PGXXOX240068.bam,ont,NONE,NONE
sample_04,family02,proband,/path/to/PGXXOX240069.bam,ont,NONE,NONE
sample_05,family02,father,/path/to/PGXXOX240070.bam,ont,NONE,NONE
sample_06,family02,mother,/path/to/PGXXOX240071.bam,ont,NONE,NONE
Note: In duo/trio mode, family_id and family_position are used to define the joint SNP/indel calling/merging
Note: In duo mode, a proband and either a father or mother must be defined in the family_position column for every family_id
Note: In trio mode, a proband, father and mother must be defined in the family_position column for every family_id
Note: Files with the same value in the sample_id column will be merged; this is used to handle multiple sequencing runs of the same sample
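The duo and trio notes above can be checked before launching a run. The following is a minimal sketch, not part of pipeface: the function name check_family_positions is hypothetical, it assumes the column order shown in the example CSV, and it treats duo mode as "a proband plus at least one parent". Rows with an empty family_id are ignored.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check (not part of pipeface): confirm each family_id
# has the family_position entries that duo or trio mode needs.
check_family_positions() {
    local mode="$1" csv="$2"
    case "$mode" in
        duo|trio) ;;
        *) echo "mode must be 'duo' or 'trio'" >&2; return 2 ;;
    esac
    awk -F',' -v mode="$mode" '
        BEGIN { bad = 0 }
        NR == 1 || NF == 0 { next }          # skip the header row and blank lines
        $2 != "" { families[$2] = 1; seen[$2, $3] = 1 }
        END {
            for (f in families) {
                p  = ((f, "proband") in seen)
                fa = ((f, "father")  in seen)
                mo = ((f, "mother")  in seen)
                # trio: all three positions; duo: proband plus at least one parent
                ok = (mode == "trio") ? (p && fa && mo) : (p && (fa || mo))
                if (!ok) { printf "family %s: incomplete for %s mode\n", f, mode; bad = 1 }
            }
            exit bad
        }
    ' "$csv"
}

# Example usage:
#   check_family_positions trio /path/to/in_data.csv && echo "family positions look complete"
```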
Requirements:
- leave family_id and family_position empty if not required
- give all entries for a given sample_id the same family_id (this is currently not error checked)
- set regions_of_interest to 'NONE' if not required
- similarly, set clair3_model to 'NONE' if not required (ie. if you have not selected clair3 as the SNP/indel caller)
- provide full file paths
- multiple entries for a given sample_id are required to have the same file extension in the file column (eg. '.bam', '.fastq.gz' or '.fastq')
- for entries in the file column, the file extension must be either '.bam', '.fastq.gz' or '.fastq' (as appropriate)
- for entries in the file column, files containing methylation data should be provided in uBAM format (and not FASTQ format)
- entries in the data_type column must be either 'ont' or 'pacbio' (as appropriate)
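Several of the requirements above are mechanical, so they can be checked with a short script before a run. This is a sketch only and not part of pipeface: the function name check_in_data is hypothetical, and it assumes the column order from the example CSV with no quoted or comma-containing fields. It covers file extensions, full paths, data_type values, and consistency of file extension and family_id across entries sharing a sample_id.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check for in_data.csv (not part of pipeface).
check_in_data() {
    awk -F',' '
        BEGIN { bad = 0 }
        NR == 1 || NF == 0 { next }          # skip the header row and blank lines
        {
            sample = $1; family = $2; file = $4; dtype = $5
            # the file extension must be .bam, .fastq.gz or .fastq
            if (file ~ /\.fastq\.gz$/)      { ext = ".fastq.gz" }
            else if (file ~ /\.fastq$/)     { ext = ".fastq" }
            else if (file ~ /\.bam$/)       { ext = ".bam" }
            else { ext = "?"; printf "line %d: unsupported file extension: %s\n", NR, file; bad = 1 }
            if (file !~ /^\//) { printf "line %d: not a full file path: %s\n", NR, file; bad = 1 }
            if (dtype != "ont" && dtype != "pacbio") { printf "line %d: data_type must be ont or pacbio: %s\n", NR, dtype; bad = 1 }
            if (sample in first_ext) {
                # entries sharing a sample_id need one file extension and one family_id
                if (first_ext[sample] != ext)    { printf "line %d: %s mixes file extensions\n", NR, sample; bad = 1 }
                if (first_fam[sample] != family) { printf "line %d: %s has differing family_id values\n", NR, sample; bad = 1 }
            } else { first_ext[sample] = ext; first_fam[sample] = family }
        }
        END { exit bad }
    ' "$1"
}

# Example usage:
#   check_in_data /path/to/in_data.csv && echo "in_data.csv passes the basic checks"
```

The check does not confirm that the files exist or that uBAM input actually carries methylation tags; those would need separate tooling.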
Specify the path to in_data.csv. Eg:
"in_data": "/path/to/in_data.csv",
Specify the input data format ('ubam_fastq'). Eg:
"in_data_format": "ubam_fastq",
Specify the path to the reference genome and its index. Eg:
"ref": "/path/to/hg38.fa",
"ref_index": "/path/to/hg38.fa.fai",
Optionally turn on haploid-aware mode (for XY samples only). Eg:
"haploidaware": "yes",
"sex": "XY",
"parbed": "/path/to/par.bed",
OR
"haploidaware": "no",
"sex": "NONE",
"parbed": "NONE"
Optionally specify the path to the tandem repeat bed file. Set to 'NONE' if not required. Eg:
"tandem_repeat": "/path/to/tandem_repeat.bed",
Specify the mode to run the pipeline in ('singleton', 'duo' or 'trio'). Eg:
"mode": "singleton",
Specify the SNP/indel caller to use ('clair3', 'deepvariant' or 'deeptrio'). Eg:
"snp_indel_caller": "deepvariant",
Note: Running DeepVariant/DeepTrio on ONT data assumes r10 data
Note: In singleton mode, Clair3 and DeepVariant are available
Note: In duo mode, only DeepVariant is available
Note: In trio mode, only DeepTrio is available
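The mode and caller notes above amount to a small compatibility table. As a sketch (the function name check_caller is hypothetical and not part of pipeface), the allowed combinations can be encoded in a single case statement:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of pipeface): check that the chosen SNP/indel
# caller is available for the chosen mode, following the notes above.
check_caller() {
    local mode="$1" caller="$2"
    case "${mode}:${caller}" in
        singleton:clair3|singleton:deepvariant) return 0 ;;  # singleton: Clair3 or DeepVariant
        duo:deepvariant)                        return 0 ;;  # duo: DeepVariant only
        trio:deeptrio)                          return 0 ;;  # trio: DeepTrio only
        *)
            echo "snp_indel_caller '${caller}' is not available in '${mode}' mode" >&2
            return 1 ;;
    esac
}

# Example usage:
#   check_caller singleton deepvariant && echo "combination is supported"
```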
Specify the SV caller to use ('sniffles', 'cutesv' or 'both'). Eg:
"sv_caller": "sniffles",
Specify whether variant annotation should be carried out ('yes' or 'no'). Eg:
"annotate": "yes",
Note: variant annotation is only available for hg38
Specify whether alignment depth should be calculated ('yes' or 'no'). Eg:
"calculate_depth": "yes",
Specify whether base modifications should be analysed ('yes' or 'no'). Eg:
"analyse_base_mods": "yes",
Note: processing base modifications assumes base modifications are present in the input data and the input data is in unaligned BAM (uBAM) format
Optionally run tandem repeat calling and specify the path to an appropriate tandem repeat regions bed file. Set to 'NONE' if not required. Eg:
"tr_calling": "yes",
"tr_call_regions": "/path/to/variation_clusters_and_isolated_TRs_v1.0.2.hg38.TRGT.longtr.bed",
OR
"tr_calling": "no",
"tr_call_regions": "NONE"
Optionally run relatedness checks and specify the path to an appropriate somalier sites file. Set to 'NONE' if not required. Eg:
"check_relatedness": "yes",
"sites": "/path/to/sites.hg38.vcf.gz",
OR
"check_relatedness": "no",
"sites": "NONE"
Note: checking relatedness is only available for duo/trio mode
Specify the directory in which to write the pipeline outputs (please provide a full path). Eg:
"outdir": "/path/to/results/"
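The fragments above can be collected into a single file. The sketch below assumes they all belong to one JSON object; the function name write_example_config and the output file name are hypothetical, and pipeface may expect keys or a file name not shown here. It uses only the keys and example values listed above, for a singleton run with the optional steps switched off, and includes both tandem repeat settings because both are described.

```shell
#!/usr/bin/env bash
# Sketch only: write the parameter fragments above into one JSON file.
# The function name and output path are assumptions, not pipeface conventions.
write_example_config() {
    local out="$1"
    cat > "$out" <<'EOF'
{
    "in_data": "/path/to/in_data.csv",
    "in_data_format": "ubam_fastq",
    "ref": "/path/to/hg38.fa",
    "ref_index": "/path/to/hg38.fa.fai",
    "haploidaware": "no",
    "sex": "NONE",
    "parbed": "NONE",
    "tandem_repeat": "NONE",
    "mode": "singleton",
    "snp_indel_caller": "deepvariant",
    "sv_caller": "sniffles",
    "annotate": "yes",
    "calculate_depth": "yes",
    "analyse_base_mods": "yes",
    "tr_calling": "no",
    "tr_call_regions": "NONE",
    "check_relatedness": "no",
    "sites": "NONE",
    "outdir": "/path/to/results/"
}
EOF
    # If python3 is available, confirm the file parses as JSON (catches stray commas)
    if command -v python3 >/dev/null 2>&1; then
        python3 -m json.tool "$out" > /dev/null
    fi
}

# Example usage (the file name is a placeholder):
#   write_example_config my_parameters.json
```

Replace every /path/to/ placeholder with a full path before use.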