Assign reads from a short- or long-read sequencing library to influenza A serotypes, plus influenza B/C/D, with the iav_serotype command line tool.
The basic premise is to competitively align reads to an up-to-date database of influenza A sequences tagged with segment and serotype information, then determine whether a read (or read pair) aligns better to one particular serotype than to the others.
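The premise above can be illustrated with a minimal sketch. This is not the scoring used by iav_serotype; the function name `assign_serotype`, the `min_margin` parameter, and the "unaligned" label are hypothetical, and the alignment scores are made-up inputs. It only shows the general idea of keeping the best score per serotype and calling a read ambiguous when no single serotype wins.

```python
def assign_serotype(alignments, min_margin=0):
    """Pick the best-supported serotype for one read (or read pair).

    alignments: iterable of (serotype, score) tuples, one per alignment.
    min_margin: hypothetical parameter; the winner must beat the runner-up
                by more than this many score units.
    """
    best = {}
    for serotype, score in alignments:
        # Keep only the highest score seen for each serotype.
        best[serotype] = max(best.get(serotype, score), score)

    if not best:
        return "unaligned"  # hypothetical label for reads with no alignments

    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] <= min_margin:
        return "ambiguous"  # no serotype aligns clearly better
    return ranked[0][0]


if __name__ == "__main__":
    print(assign_serotype([("H1N1", 140), ("H3N2", 120), ("H1N1", 100)]))
    print(assign_serotype([("H1N1", 100), ("H3N2", 100)]))
```

With the default margin, a read is assigned only when one serotype has a strictly higher best score than every other serotype.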
Input: paired-end short reads - OR - long reads
Output: 1. files of serotype-specific reads. 2. Read summary table. 3. Plot of serotype distribution in sample
NOTE: As of v0.2.0, the codebase has been completely rewritten to avoid R dependencies and limitations. The scoring is slightly different to account for the total alignment length of paired reads. This affects the final results. If you are updating from an earlier version, I recommend a clean installation.
- Clone this GitHub repo
- Create a conda environment using the .yaml file, e.g.
mamba env create -f influenza_a_serotype/environment/iav_serotype.yaml
- Activate the environment
conda activate iav_serotype
- Use pip to install the command line tool.
cd influenza_a_serotype
pip install .
(you should now be able to call iav_serotype from the command line to bring up the help menu)
- Download and unpack database files. Choose either the full database or the lite database.
cd to the directory where you want them to live.
Full database (v1.25, ~1.7 GB total)
NOTE: this database has all available full-length influenza A genome segments. It does not contain influenza B, C, or D genomes.
wget https://zenodo.org/records/11509609/files/Influenza_A_segment_sequences.tar.gz
tar -xvf Influenza_A_segment_sequences.tar.gz
You should now have these 2 files:
DBs/v1.25/Influenza_A_segment_info1.tsv
DBs/v1.25/Influenza_A_segment_sequences.fna
Lite database (lite1.1, 4.3 MB total)
NOTE: This is a dramatically reduced database with representatives from most of the influenza A serotypes plus influenza B, C, and D. It allows searching with a fraction of the compute time and memory footprint at a slight cost in accuracy.
wget https://zenodo.org/records/17354032/files/Influenza_Lite_DB.tar.gz
tar -xvf Influenza_Lite_DB.tar.gz
You should now have these 2 files:
liteDB_v1.1/Influenza_A_segment_info1.tsv
liteDB_v1.1/Influenza_A_segment_sequences.fna
- (optional) Set the database location as a conda environment variable
conda env config vars set IAVS_DB=/path/to/DBs/v1.25
or
conda env config vars set IAVS_DB=/path/to/liteDB_v1.1
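Before a run, it can help to confirm that the database directory contains the two files listed above. The following is a small sketch, not part of iav_serotype; the helper name `missing_db_files` is hypothetical, and it assumes the environment has been reactivated so that IAVS_DB is visible.

```python
import os
from pathlib import Path

# File names taken from the unpacked database listings above.
EXPECTED_FILES = (
    "Influenza_A_segment_info1.tsv",
    "Influenza_A_segment_sequences.fna",
)


def missing_db_files(db_dir):
    """Return the expected database files that are absent from db_dir."""
    db_dir = Path(db_dir)
    return [name for name in EXPECTED_FILES if not (db_dir / name).is_file()]


if __name__ == "__main__":
    db = os.environ.get("IAVS_DB")
    if db is None:
        print("IAVS_DB is not set; pass the database with --db instead")
    else:
        missing = missing_db_files(db)
        print("missing files:", ", ".join(missing) if missing else "none")
```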
Currently, each run requires either one set of paired-end short reads or one or more long-read files. All reads must be decompressed and have the .fastq extension.
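Since reads must be decompressed and end in .fastq, gzipped inputs need a preparation step. Here is a minimal sketch using only the Python standard library; the function name `decompress_fastq` is hypothetical, and renaming .fq to .fastq is an assumption about what the extension requirement implies.

```python
import gzip
import shutil
from pathlib import Path


def decompress_fastq(gz_path):
    """Decompress reads.fastq.gz (or reads.fq.gz) next to the original file.

    Returns the path of the decompressed file, which always ends in .fastq.
    """
    gz_path = Path(gz_path)
    if gz_path.suffix != ".gz":
        raise ValueError(f"expected a .gz file, got {gz_path.name}")

    out_path = gz_path.with_suffix("")  # drop the trailing .gz
    if out_path.suffix == ".fq":
        out_path = out_path.with_suffix(".fastq")  # normalize the extension

    # Stream the data so large read files are not held in memory.
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path
```

For example, `decompress_fastq("my_reads/virome.R1.fastq.gz")` would write `my_reads/virome.R1.fastq` and leave the compressed file in place.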
short paired-end reads:
iav_serotype -r my_reads/virome.R1.fastq my_reads/virome.R2.fastq -s my_virome_iav -o iav_project --db /path/to/DBs/v1.25
long reads:
iav_serotype -r long_reads/virome1.fastq long_reads/virome2.fastq long_reads/virome3.fastq -s my_lr_virome_iav -o iav_project --db /path/to/DBs/v1.25
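Because each run takes a single set of paired-end reads, processing many samples means one call per pair. The sketch below builds one command per sample using only the flags shown above (-r, -s, -o, --db). The function names, the `*.R1.fastq` / `*.R2.fastq` naming convention, and the use of the file prefix as the sample name are assumptions for illustration. It prints the commands for review rather than running them.

```python
import shlex
from pathlib import Path


def build_command(r1, r2, sample, out_dir, db_dir):
    """Assemble an iav_serotype call for one paired-end sample."""
    return [
        "iav_serotype",
        "-r", str(r1), str(r2),
        "-s", sample,
        "-o", str(out_dir),
        "--db", str(db_dir),
    ]


def paired_commands(reads_dir, out_dir, db_dir):
    """Build one command per <sample>.R1.fastq / <sample>.R2.fastq pair."""
    commands = []
    for r1 in sorted(Path(reads_dir).glob("*.R1.fastq")):
        sample = r1.name[: -len(".R1.fastq")]
        r2 = r1.with_name(sample + ".R2.fastq")
        if r2.is_file():  # skip samples without a mate file
            commands.append(build_command(r1, r2, sample, out_dir, db_dir))
    return commands


if __name__ == "__main__":
    for cmd in paired_commands("my_reads", "iav_project", "/path/to/DBs/v1.25"):
        # Review the output, then run each line (e.g. with subprocess.run).
        print(shlex.join(cmd))
```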
- {sample}_per_serotype_summary.tsv <- Main summary table for serotype counts
- {sample}_influenza_A.sorted.bam <- filtered, sorted alignment file
- {sample}_per_read_summary.tsv <- per-read summary file
- {sample}_read_serotype_assignment.pdf <- plot of serotype counts
- {sample}_{serotype}.txt <- serotype-specific read IDs
- {sample}_{serotype}.R1.fastq <- serotype-specific reads (optional)
- {sample}_{serotype}.R2.fastq <- serotype-specific reads (optional)
- {sample}_read_stats.tsv <- input read stats table
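The .tsv outputs listed above can be loaded with the Python standard library. The column names are not documented in this section, so this sketch makes no assumption about them and simply returns whatever header the file contains. The function name `read_summary` and the example path are hypothetical.

```python
import csv


def read_summary(path):
    """Load a tab-separated summary table.

    Returns (column_names, rows), where each row is a dict keyed by column.
    """
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        rows = list(reader)
        return reader.fieldnames, rows


if __name__ == "__main__":
    # Example path, assuming sample name "my_virome_iav" and output in iav_project.
    columns, rows = read_summary("iav_project/my_virome_iav_per_serotype_summary.tsv")
    print("columns:", columns)
    print("number of rows:", len(rows))
```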
Clustered representative sequences from most influenza A serotypes, plus sequences from influenza B, C, and D.
NOTE: Use database v1.25 with iav_serotype v0.1.3 or later. A small number of added reference sequences have short sequences that were likely assembled into the genome by mistake. These will cause assignment of non-specific reads as "ambiguous" IAV in iav_serotype v0.1.1, but these alignments are filtered out starting in iav_serotype v0.1.2. Further, in iav_serotype v0.1.3, the minimap2 flag -f 100000 is added to account for very high prevalence minimizers in reference. Thank you.
Description:
Fresh download of all complete influenza A sequences + metadata tables from NCBI Virus. Consists of 981,537 complete segments.
- Search and download date=2024-06-05, NCBI virus taxid=2955291, length filter( 3000 > 700 )
- Then, sequences and metadata were parsed to remove any duplicate accessions and any sequences with uninformative serotype labels, e.g. "H3", "mixed", "HxNx".
- Finally, low complexity regions were masked with bbmask.sh with default parameters for low-complexity filtering (available with bbtools)
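The parsing step described above (dropping duplicate accessions and uninformative serotype labels) can be sketched as follows. This is not the script used to build the database. The function names, the assumption that an informative label has the form H<number>N<number>, and the reading of the length filter as 700 < length < 3000 are all illustrative guesses.

```python
import re

# Assumption: informative labels look like "H5N1"; "H3", "mixed", "HxNx" do not match.
_FULL_SEROTYPE = re.compile(r"H\d+N\d+")


def is_informative_serotype(label):
    """True when the label names both an H and an N subtype number."""
    return bool(_FULL_SEROTYPE.fullmatch(label.strip()))


def passes_length_filter(length, min_len=700, max_len=3000):
    """Assumed reading of the length filter: strictly between 700 and 3000."""
    return min_len < length < max_len


def filter_records(records):
    """Keep the first record per accession that passes both checks.

    records: iterable of (accession, serotype_label, sequence_length).
    """
    seen = set()
    kept = []
    for accession, serotype, length in records:
        if accession in seen:
            continue  # duplicate accession
        if not is_informative_serotype(serotype):
            continue  # e.g. "H3", "mixed", "HxNx"
        if not passes_length_filter(length):
            continue
        seen.add(accession)
        kept.append((accession, serotype, length))
    return kept
```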
NOTE: It was pointed out to me that several sequences with uninformative serotype labels, e.g. "H3", "mixed", "HxNx", were included in both versions of the database. Since these poorly labeled sequences represent only ~1,600 of ~210,000 sequences in v1.1, performance should not be substantially affected, but they will be removed in future releases.
Added sequences and metadata rows to v1.0, searches on April 26, 2024:
- FluDB query Influenza A "complete", collection date(06-01-2022 - 05-31-2023)
- NCBI virus taxid=2955291, length filter( 3000 > 700 ), collection date(06-01-2023 - 04-26-2024)
Then, sequences and metadata were parsed to remove any duplicate accessions.
Using the NCBI datasets command line tool (April 23, 2024), this is how I downloaded Influenza A assemblies:
datasets download genome taxon 2955291 --assembly-level complete --exclude-atypical --filename Influenza_A_dataset1 --include gbff
This retrieved about 8,000 genomes/64,000 segments. There are about 1,000,000 segments on GenBank, so I am not sure why they weren't all retrieved.
After unzipping the downloaded directory I used the following script to process these data into a .fasta and table (.tsv)
python parse_iav_genbank/influenza_a_gbf_to_fna_and_table.py ncbi_dataset
Note: I'm working on getting a more complete database, as I think there is relevant diversity not included here.
- Add Influenza B, C, and D