In this example, we'll show you how to prepare whole genome alignments and single gene alignments for use with viral-mcorr. We will be working with the set of 191 SARS-like coronavirus whole genome sequences used in our paper (link here). These were also used as the basis to create the Nextstrain build for SARS-like betacoronaviruses https://nextstrain.org/groups/blab/sars-like-cov.
- To get started, follow the instructions [link will go here] to install
viral-mcorrand download or clone this repository. You will also need to install ViralMSA: https://github.com/niemasd/ViralMSA. We used Minimap2 https://github.com/lh3/minimap2 to perform the alignment (as recommended by ViralMSA). You will also need to download this GitHub repository to your computer as well as install an in-house program we built to make XMFA files:
cd ~/Downloads
git clone https://github.com/kussell-lab/virus_alignment_example.git
cd virus_alignment_example/geneMSA
go install .
go install github.com/kussell-lab/ReferenceAlignmentGenerator/CollectGeneAlignments@latest- The unaligned sequences are provided in the file "sequences.fasta" take them, and align them to the NCBI reference
genome for SARS-CoV-2, provided in this repository and also here.
You can do this using the following commandline prompt:
ViralMSA.py -s sequences.fasta -r ncbi_reference/GCF_009858895.2_ASM985889v3_genomic.fasta -e EMAIL -o viralmsa_output
- Take the aligned genomes, then split them into separate files for each consensus genome using split_fasta.py
python3 split_fasta.py ./ viral_msa_output/sequences.fasta.aln- Use CollectGeneAlignments to make an XMFA file for the genomes
CollectGeneAlignments genomefile_list ncbi_reference/GCF_009858895.2_ASM985889v3_genomic.gff aligned_genomes aligned_sl-cov.xmfa --progress --appendix=".fa"- Remove orf1a CDS region (overlaps with the orf1ab region, and throws off
mcorrViralGenomeandmcorrLDGenome, which assemble all CDS regions into one continuous coding sequence before it begins analysis)
python3 remove_extragene.py ./ aligned_sl-cov.xmfa- At this point
aligned_sl-cov.xmfais ready for use withmcorrViralGenomeandmcorrLDGenome. To prepare single gene multi-fasta files for use withmcorr-gene-aln, just use the following commands:
geneMSA aligned_sl-cov.xmfaThis will output alignments for single genes into the folder genes
The TL;DR version is you can replicate the above steps by executing the shell script in this repository align_viralgenomes.sh:
bash align_viralgenomes.sh <sequences fasta> <reference fasta> <reference gff> <outdir> <out prefix>sequences fastais the fasta file containing the unaligned sequences.reference fastais the fasta file of the reference genomereference gffis the GFF3 file for the reference genomeoutdiris the output directory for the aligned xmfa files for all genes and single genesout prefixis the prefix for the xmfa file of whole genome alignments
To run through the example with the shell script use the following commandline prompts:
cd ~/Downloads
cd sars-like_cov_alignment_example
bash align_viralgenomes.sh sequences.fasta ncbi_reference/GCF_009858895.2_ASM985889v3_genomic.fasta ncbi_reference/GCF_009858895.2_ASM985889v3_genomic.gff example_out aligned_sl-cov