-
Notifications
You must be signed in to change notification settings - Fork 11
Replacing impute2 with Beagle; Support for hg19 and hg38 reference genomes #20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from all commits
ea53bd3
3836328
bba909d
e0d4a50
9586a6a
3109026
5e86bfa
3b3d7fb
68f24bc
4ba122f
1dd2cf7
530ba17
78e1612
1888521
368a3c5
1a8069a
de2b5a0
a96db03
eff8785
e49047a
9600f4b
27451e6
c8a953f
0a6de5a
fd79e51
c9aa6a0
2160ea7
f880d0a
f11ed17
9a41797
1061eeb
13bdfe2
cfec6c9
944304d
72f796b
232cbd0
94b4eb4
081e236
524037d
14954cd
a2e30c1
df9e662
778079b
118f4c6
87f6439
209a232
5e7fdbc
e8a7fc7
5004ae1
a8b2f55
19b3100
5c4ad09
3d16416
346e68a
b139c9a
9b86c36
52c6c5a
7b0d1ff
f52f086
e6a5845
30555e1
09bd58f
fe1b82b
d5706f9
232a81d
e3fc99f
1229a24
eb26938
e783d7e
6549fe0
779d5b0
fc26eb6
8655540
8a29539
8ffd230
2d0ec65
8eda5e9
ab85dc9
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -4,3 +4,4 @@ build/ | |
| *.swp | ||
| *.DS_Store | ||
| #.+ | ||
| hg19_GRCh37_1000genomes | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,4 +1,4 @@ | ||
| dependson=COWorkflowsBasePlugin:1.2.1 | ||
| dependson=COWorkflowsBasePlugin:1.4.2 | ||
| RoddyAPIVersion=3.2 | ||
| GroovyVersion=2.4 | ||
| JDKVersion=1.8 |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,2 +1,2 @@ | ||
| 5.0 | ||
| 6.0 | ||
| 0 |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,107 @@ | ||
| ### Reference files for GRCh38 | ||
vinjana marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
|
||
| Information on downloading and parsing files needed for GRCh38 reference genome | ||
|
|
||
|
|
||
| #### Reference genome | ||
|
|
||
| We are using the GRCh38 version hosted at `http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa`, | ||
|
|
||
| The above reference genome (GRCh38_decoy_ebv_phiX_alt_hla) can be downloaded and processed using the repositiry at `https://github.com/DKFZ-ODCF/setup-reference-data`. | ||
|
|
||
| List of files needed from the above process | ||
| - GRCh38_decoy_ebv_phiX_alt_hla_chr.fa | ||
| - GRCh38_decoy_ebv_phiX_alt_hla_chr.fa.chrLenOnlyACGT_realChromosomes.tsv | ||
| - GRCh38_decoy_ebv_phiX_alt_hla_chr.fa.chrLength.tsv | ||
|
|
||
|
|
||
| #### dbSNP version 135 | ||
|
|
||
| Downloaded from ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/00-All.vcf.gz | ||
|
|
||
| Post-processing for usage in ACEseq: | ||
| ``` | ||
| DBSNP_VERSION=135 | ||
| zcat 00-All.vcf.gz | | ||
| awk '/^#/{print} /VC=SNV/{ v=$8; sub(/.*dbSNPBuildID=/, "", v); sub(/;.*/, "", v); if (v~/^[0-9]+$/ && int(v)<='$DBSNP_VERSION') print }' | | ||
| bgzip > 00-All.SNV.vcf.gz | ||
| tabix -p vcf 00-All.SNV.vcf.gz | ||
| ``` | ||
|
|
||
| #### Generating mappability track | ||
| The mappability file (`GRCh38_Mappability_Align_100mer_m2e2.ALT_HLA.bedGraph.gz`) is created using the `create_mappability.sh` bash script. | ||
|
|
||
| The tools used have to be in the `tools` folder. The tools `gem-2-wig`, `gem-indexer`, `gem-mappability` are from the package *gem-tools* from the Vlaams Instituut voor Biotechnologie (VIB). | ||
| Information about the tools can be found here: https://wiki.bits.vib.be/index.php/Create_a_mappability_track#Install_and_run_the_GEM_library_tools. The tools were downloaded from https://sourceforge.net/projects/gemlibrary/files/gem-library/Binary%20pre-release%202/. The corresponding paper should be this one: https://www.researchgate.net/publication/221776385_Fast_Computation_and_Applications_of_Genome_Mappability | ||
|
|
||
| The comptatible gemtools version (1.7.1-i3) was downloaded from "http://barnaserver.com/gemtools/releases/GEMTools-static-i3-1.7.1.tar.gz". | ||
| For hg38 `gem-mappability` an option of `-m2 -e2` were added to keep the score compatabiltble with the hg19 versions downloaded from UCSC. | ||
|
|
||
| The two tools `wigToBigWig` and `BigWigToBedGraph` are downloaded from http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/ | ||
|
|
||
| Finally, `tabix` from htslib is used (see http://www.htslib.org/doc/tabix.html) | ||
|
|
||
| Run the script `sh create_mappability.sh` | ||
|
|
||
|
|
||
| #### Replication time | ||
| Replication time from individual cell lines from GRCh37 ENCODE data were lifted over to GRCh38, and averages were re-calculated. | ||
| The new R-object is uploaded to the `$repo_root/installation/GRCh38_related_files/time_mean_10KB.Rda` | ||
|
|
||
|
|
||
| #### Gaps and centromeres | ||
| - `gap.txt.gz` downloaded from `http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/gap.txt.gz` | ||
| - `centromeres.txt.gz` downloaded from `http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/centromeres.txt.gz` | ||
| - `centromeres_merged.txt.gz` created with the following command | ||
| ``` | ||
| zcat centromeres.txt.gz | sort -k 2 -V | awk 'BEGIN {printf "#bin\tchrom\tchromStart\tchromEnd\n"; chr=""} $2!=chr { if (end != ""){printf $1"\t" chr "\t" start "\t" end "\n"}; chr=$2; start=$3} {end=$4} END {printf $1"\t" chr "\t" start "\t" end "\n"}' | gzip > centromeres_merged.txt.gz | ||
| ``` | ||
| - `gap_with_centromeres.txt.gz` file created with the following command | ||
| ``` | ||
| zcat centromeres_merged.txt.gz | tail -n +2 | awk ' BEGIN {OFS="\t"} {print $0, "1", "N", $4-$3, "centromere", "no"}' | gzip > gap_with_centromeres.txt.gz | ||
| ``` | ||
|
|
||
|
|
||
| #### GC content | ||
| The GC-content (`gc_content_hg38.txt`) is calculated directly from the reference genome with the script `calc_gc_content.py`. | ||
| ``` | ||
| python3 calc_gc_content -v -i ${reference_path}/GRCh38_decoy_ebv_phiX_alt_hla_chr.fa -o gc_content_hg38.txt | ||
| ``` | ||
| The -v flag increases verbosity. Note that you have to use python3! | ||
|
|
||
|
|
||
| #### Beagle reference files | ||
| The variants were downloaded for all the chromosomes mapped to hg38 reference genome | ||
| Info: https://www.internationalgenome.org/announcements/Variant-calls-from-1000-Genomes-Project-data-on-the-GRCh38-reference-assemlby/ | ||
| FTP site: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20190312_biallelic_SNV_and_INDEL/ | ||
|
|
||
| Added 'chr' prefix and converted the VCF files to BREF format using Beagle. Refer to the script `beagle_vcf_to_bref.sh` | ||
|
|
||
| This step will generate the following files, | ||
| - `ALL.chr${CHR_NAME}.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.CHR.bref3` | ||
| - `ALL.chrX.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.CHR.bref3` | ||
|
|
||
|
|
||
| #### Beagle genetic map files | ||
| Downloaded from `http://bochet.gcc.biostat.washington.edu/beagle/genetic_maps/plink.GRCh38.map.zip` | ||
|
|
||
| Add `chr` prefix | ||
|
|
||
| ``` | ||
| for chr in `seq 1 22` X ; do cat plink.chr${chr}.GRCh38.map | sed 's/^/chr/' > plink.chr${chr}.GRCh38.CHR.map; done | ||
| ``` | ||
|
|
||
|
|
||
| #### Local controls for no-control workflow (optional) | ||
| These are lift-over files from the hg19 workflow. New hg38 native files will be generated for the next versions. | ||
|
|
||
|
|
||
| #### Exclusion list or blacklist files | ||
| The exclusion list contains the 'homodel' region present in almost all of the 216 samples in the ICGC medulloblastoma project. | ||
| `artifact.homoDels.potentialArtifacts.hg38.txt` | ||
|
|
||
| #### Hg38 cytoband | ||
| Cytoband file was copied from ANNOVAR database files. Only the coordinates from Chr1-22, X, and Y were kept. | ||
| ``` | ||
| cat ANNOVAR/annovar_April2018/humandb/hg38_cytoBand.txt | grep -v "_" | grep -v "^chrM" > hg38_cytoBand.txt | ||
| ``` | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,14 @@ | ||
| #!/bin/bash | ||
| # Convert BEAGLE VCF to BREF format | ||
| # Script takes in the VCF files with header and adds 'chr' prefix to the variants | ||
| module load java/1.8.0_131 | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The script uses a specific version of Java (1.8.0_131). If the script is intended to be used in different environments, consider checking for the Java version dynamically or allowing the user to specify the Java version. |
||
|
|
||
| for chr in `seq 1 22` X Y | ||
| do | ||
| echo $chr | ||
| (zcat ALL.chr${chr}.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz \ | ||
| | head -n 300 \ | ||
| | grep "#" ; zcat ALL.chr${chr}.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz \ | ||
| | grep -v "#" | sed 's/^/chr/') \ | ||
| | java -jar bref3.18May20.d20.jar > ALL.chr${chr}.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.CHR.bref | ||
|
Comment on lines
+9
to
+13
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The script assumes that the input VCF files follow a specific naming convention. If the naming convention changes, the script will fail. Consider making the input file name a parameter to the script. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The output file name is hardcoded. Consider making the output file name a parameter to the script. |
||
| done | ||
|
Comment on lines
+6
to
+14
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The script processes files for chromosomes 1 to 22, X, and Y. If there are other chromosomes to be processed, consider making the list of chromosomes a parameter to the script. |
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,59 @@ | ||
| #!/usr/bin/env python3 | ||
| import sys | ||
| import argparse | ||
| import re | ||
|
|
||
| assert sys.version_info.major >= 3, "Python 3 is required for string operations" | ||
|
|
||
| parser = argparse.ArgumentParser() | ||
| parser.add_argument("-i", "--input", help="Input sequence - fasta file format", required=True) | ||
| parser.add_argument("-o", "--output", help="Output file name", default='gc_content.txt') | ||
| parser.add_argument("-t", "--threshold", help="Threshold", default=0.0) | ||
| parser.add_argument("--stepsize", help="Threshold (regions with gc lower than threshold are discarded)", default=10000) | ||
| parser.add_argument("-v", "--verbose", action = 'store_true', help="Add verbosity") | ||
| args = parser.parse_args() | ||
|
|
||
| threshold = float(args.threshold) | ||
| stepsize = int(args.stepsize) | ||
|
|
||
| output_data = ['\t'.join(['chromosome', 'start', 'end', 'gc_content'])] | ||
|
|
||
| if args.verbose: print('Load data') | ||
| with open(args.input, "r") as f: | ||
| data = f.read().splitlines() | ||
|
|
||
| if args.verbose: print('Calculate chromosome stats') | ||
| chr_names, chr_borders = [], [] | ||
|
|
||
| for i in range(len(data)): | ||
| if '>' in data[i]: | ||
| chr_names.append(data[i].split('>')[1].split(' ')[0]) | ||
| chr_borders.append(i) | ||
| chr_borders = chr_borders + [len(data)] | ||
|
|
||
| for i in range(len(chr_names)): | ||
|
|
||
| if not re.match("^(chr)?[0-9,X,Y]{1,2}$", chr_names[i]): | ||
| continue | ||
|
|
||
| if args.verbose: print('Calculating Chromosome {} - length = {} bp'.format(chr_names[i], 61*(chr_borders[i+1] - chr_borders[i]))) | ||
| current_chr = data[chr_borders[i]:chr_borders[i+1]][1:] | ||
|
|
||
| current_chr = ''.join(current_chr) | ||
|
|
||
| for j in range(0, len(current_chr)//stepsize): | ||
| current_data = current_chr[(stepsize*j):(stepsize*(j+1))] | ||
|
|
||
| gc = len(re.findall('[gcGC]', current_data)) | ||
| gcta = len(re.findall('[gctaGCTA]', current_data)) | ||
| if gcta == 0: gcta = 1. | ||
| gc_content = gc/gcta | ||
|
|
||
| if gc_content > threshold: | ||
| output_data.append('\t'.join([chr_names[i], str(stepsize*j), str(stepsize*(j+1)), '{:1.6f}'.format(gc_content)])) | ||
| if args.verbose: print('Finished with calculation. Writing to file {}'.format(args.output)) | ||
|
|
||
| with open(args.output, 'w') as f: | ||
| for line in output_data: | ||
| f.write("%s\n" % line) | ||
| if args.verbose: print('GC-content calculated succesfully') |
| Original file line number | Diff line number | Diff line change | ||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,37 @@ | ||||||||||||||||
| #!/bin/bash | ||||||||||||||||
| set -ue -o pipefail | ||||||||||||||||
|
|
||||||||||||||||
vinjana marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||
| nr_cores=4 | ||||||||||||||||
| kmer=100 | ||||||||||||||||
| reference_genome="${reference_path}/GRCh38_decoy_ebv_phiX_alt_hla_chr.fa" | ||||||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The - reference_genome="${reference_path}/GRCh38_decoy_ebv_phiX_alt_hla_chr.fa"
+ reference_path="/path/to/reference"
+ reference_genome="${reference_path}/GRCh38_decoy_ebv_phiX_alt_hla_chr.fa"Commitable suggestion
Suggested change
|
||||||||||||||||
|
|
||||||||||||||||
| mkdir -p ref out | ||||||||||||||||
| # copy the reference but only chr1-22, X and Y | ||||||||||||||||
| echo "copy reference genome" | ||||||||||||||||
| awk '/^>chrM/ {exit} /^>/ {print $1} !/^>/ {print}' ${reference_genome} > ref/reference.fa | ||||||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This line assumes that the reference genome file is formatted in a specific way. If the format changes, this line may not work as expected. Consider adding a comment to explain the expected format of the reference genome file. |
||||||||||||||||
| echo "gem-indexer" | ||||||||||||||||
| ./gemtools-1.7.1-i3/bin/gem-indexer -m force-slow-algorithm -T ${nr_cores} -c dna -i ref/reference.fa -o out/index | ||||||||||||||||
| echo "gem-mappability" | ||||||||||||||||
| ./gemtools-1.7.1-i3/bin/gem-mappability -m2 -e2 -T ${nr_cores} -I out/index.gem -l ${kmer} -o out/outfiles | ||||||||||||||||
| echo "gem-2-wig" | ||||||||||||||||
| ./gemtools-1.7.1-i3/bin/gem-2-wig -I out/index.gem -i out/outfiles.mappability -o out/outfiles | ||||||||||||||||
| echo "wigToBigWig" | ||||||||||||||||
|
|
||||||||||||||||
| ./tools/wigToBigWig out/outfiles.wig out/outfiles.sizes out/outfiles.bigWig | ||||||||||||||||
| echo "bigWigToBedGraph" | ||||||||||||||||
| ./tools/bigWigToBedGraph out/outfiles.bigWig out/outfiles.bedGraph | ||||||||||||||||
|
|
||||||||||||||||
| echo "create FAKE scores for HLA and ALT contigs" | ||||||||||||||||
| echo "Remove primary chrs and phix contigs" | ||||||||||||||||
| grep ">" GRCh38_decoy_ebv_phiX_alt_hla_chr.fa > contigs | ||||||||||||||||
| cat contigs | grep -v HLA | cut -f1,7 -d " " | sed -e 's/ /\t1\t/g' -e 's/LN://' -e 's/>//' -e 's/$/\t1.0/' > contigs_fake_mappability.bedGraph | ||||||||||||||||
| cat contigs | grep HLA | cut -f1,2 -d " " | sed -e 's/\t/ /' | cut -f1,3 -d " " | sed -e 's/ /\t1\t/g' -e 's/>//' -e 's/$/\t1.0/' >> contigs_fake_mappability.bedGraph | ||||||||||||||||
|
|
||||||||||||||||
| echo "Filter lines and compress" | ||||||||||||||||
| (awk '$4 > 0.0 {print $0}' out/outfiles.bedGraph ; cat contigs_fake_mappability.bedGraph ) | ./tools/bgzip > GRCh38_Mappability_Align_100mer_m2e2_ALT_HLA.bedGraph.gz | ||||||||||||||||
|
|
||||||||||||||||
| echo "Create Index with Tabix" | ||||||||||||||||
| ./tools/tabix -p bed GRCh38_Mappability_Align_100mer_m2e2_ALT_HLA.bedGraph.gz | ||||||||||||||||
|
|
||||||||||||||||
| echo "cleaning" | ||||||||||||||||
| rm -r ref out | ||||||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The script removes the - mkdir -p ref out
+ ref_dir=$(mktemp -d -t ref-XXXXXXXXXX)
+ out_dir=$(mktemp -d -t out-XXXXXXXXXX)And replace all Commitable suggestion
Suggested change
|
||||||||||||||||
Uh oh!
There was an error while loading. Please reload this page.