Supplementary genome reference files in addition to the basic FASTA and GTF as well as notes and commentary.
Disclaimer: Reference genomes are complicated and there are many caveats, even for commonly used ones (see: which human reference genome to use). Use at your own risk.
Genomics bioinformatic tools generally require a reference file/index, but the exact requirements can vary.
This repository includes notes for setting up a reference genome directory that can be used for common analysis types (RNA-seq, ChIP-seq, whole genome/exome sequencing) as well as some pre-generated supplementary files.
The organization scheme is inspired by Illumina iGenomes, a collection of sequence and annotation files for commonly analyzed genomes.
The files are placed in separate directories based on the genome reference version, such as hg38 or mm10.
Within each genome directory, the files are named based on the type.
The two primary files that are required:
genome.fa- genome sequence in FASTA formatgenes.gtf- gene annotations in GTF format
These can usually be downloaded from genome browsers like Ensembl, UCSC, or NCBI. The best source will vary depending on the species of interest. There are also species-specific resources like FlyBase or WormBase. The differences can be minor (different contig names), but they could be more substantial (thousands of additional genes).
There are a few issues to be aware of to avoid potential downstream problems.
Both genome.fa and genes.gtf have to be valid FASTA and GTF files, but they should also be compatible with each other.
If a FASTA index and a sequence dictionary can be successfully generated (see below), the FASTA is likely compatible with all tools. One exception is that contig names (such as "chr1") with spaces will cause issues for some applications. Check contig names with:
grep "^>" genome.faContig names should be identical in the FASTA and the GTF. Create a list of contigs in both files and check that the GTF does not have more contigs than the FASTA:
cat genome.fa.fai | cut -f 1 | sort > contigs.fa.txt
cat genes.gtf | cut -f 1 | sort | uniq > contigs.gtf.txt
wc -l contigs.*.txtThis counts the total number of contigs from each file.
Check that all the contig names in the GTF are present in the FASTA:
diff --side-by-side --suppress-common-lines contigs.fa.txt contigs.gtf.txtOnly the non-matching contigs will be output. If the FASTA has additional contigs, that should be fine.
The GTF should include gene_name and gene_id attributes for all records.
Check the number of genes with and without those attributes:
cat genes.gtf | grep -F -v "transcript_id" | grep -F -w "gene" | wc -l
cat genes.gtf | grep -F -v "transcript_id" | grep -F -w "gene" | grep -F "gene_name" | wc -l
cat genes.gtf | grep -F -v "transcript_id" | grep -F -w "gene" | grep -F "gene_id" | wc -lThe three numbers should be identical.
The GTF should include gene, transcript, and exon features (column 3) for all genes.
Generate FASTA index genome.fa.fai using samtools:
samtools faidx genome.faGenerate sequence dictionary genome.dict using Picard (used by Picard and GATK):
java -Xms16G -Xmx16G -jar /path/to/picard.jar CreateSequenceDictionary \
REFERENCE=genome.fa OUTPUT=genome.dictGenerate chrom.sizes (used by bedtools and UCSC utilities):
cut -f 1,2 genome.fa.fai > chrom.sizesGenerate genome BED file using bedtools:
bedtools makewindows -g chrom.sizes -w 999999999 \
| LC_ALL=C sort -k1,1 \
> genome.bedGenerate Bowtie 2 index (also compatible with Bowtie 1.2.3+):
mkdir Bowtie2
ln -s ../genome.fa Bowtie2/genome.fa
bowtie2-build --threads 4 Bowtie2/genome.fa Bowtie2/genomeGenerate BWA index:
mkdir BWA
ln -s ../genome.fa BWA/genome.fa
bwa index -a bwtsw BWA/genome.faGenerate gene predictions in refFlat format (must be gzipped for Picard CollectRnaSeqMetrics):
# gtfToGenePred is part of UCSC Genome Browser Utilities
gtfToGenePred -genePredExt -geneNameAsName2 genes.gtf refFlat.tmp
paste <(cut -f 12 refFlat.tmp) <(cut -f 1-10 refFlat.tmp) > refFlat.txt
rm -f refFlat.tmp
gzip refFlat.txtGenerate STAR index:
mkdir STAR
STAR --runMode genomeGenerate --sjdbOverhang 100 --runThreadN 4 \
--genomeDir STAR --genomeFastaFiles genome.fa --sjdbGTFfile genes.gtfGenerate kallisto index:
kallisto index --make-unique --index kallisto.idx transcripts.fa Generate Salmon index:
salmon index --type quasi --threads 4 \
--transcripts transcripts.fa \
--index SalmonGenerate Salmon (version 1.2+) index for decoy-aware selective alignment:
mkdir Salmon
cat transcripts.fa genome.fa > Salmon/gentrome.fa
gzip -v Salmon/gentrome.fa
# a default k of 31 is optimized for reads >75bp, consider a smaller k with shorter reads
salmon index --threads 4 --kmerLen 23 \
--transcripts Salmon/gentrome.fa.gz \
--decoys Salmon/decoys.txt \
--gencode \
--index Salmon