Skip to content

mtisza1/influenza_a_serotype

Repository files navigation

influenza_a_serotype

Assign reads from a short- or long-read sequencing library to influenza A serotypes, plus Influenza B/C/D, with iav_serotype command line tool.

The basic premise is to competitively align reads to an up-to-date database of influenza A sequences tagged with information about segment and serotype. Then, determined if a read (pair) aligns better to a particular serotype.

Input: paired-end short reads - OR - long reads

Output: 1. files of serotype-specific reads. 2. Read summary table. 3. Plot of serotype distribution in sample

NOTE: As of v0.2.0 and later, the codebase is completely rewritten to avoid R dependencies and limitations. The scoring is slightly different to account for total alignment length of paired reads. This effects the final results. If you are updating from an earlier version, I recommend a clean installaiton.

Installation

  1. Clone this github repo

  2. Create conda environment using .yaml file, e.g.

mamba env create -f influenza_a_serotype/environment/iav_serotype.yaml

  1. activate environment

conda activate iav_serotype

  1. Use pip to install command line tool.

cd influenza_a_serotype

pip install .

(you should now be able to call iav_serotype from the command line to bring up the help menu)

  1. Download and unpack database files. Choose from full database or lite databases. cd to the directory you want them to live.

Full database (v1.25, ~1.7 GB total)

NOTE: this database has all available full-length influenza A genome segments. It does not contain influenza B, C, or D genomes.

wget https://zenodo.org/records/11509609/files/Influenza_A_segment_sequences.tar.gz

tar -xvf Influenza_A_segment_sequences.tar.gz

You should now have these 2 files:

DBs/v1.25/Influenza_A_segment_info1.tsv

DBs/v1.25/Influenza_A_segment_sequences.fna

Lite database (lite1.1, 4.3 MB total)

NOTE: This database is a dramatically reduced database with representatives from most of the influenza A serotypes plus influenza B, C, and D. This allows searching with a fraction of the compute time and memory footprint at the slight expense of accuracy.

wget https://zenodo.org/records/17354032/files/Influenza_Lite_DB.tar.gz

tar -xvf Influenza_Lite_DB.tar.gz

You should now have these 2 files:

liteDB_v1.1/Influenza_A_segment_info1.tsv

liteDB_v1.1/Influenza_A_segment_sequences.fna

  1. (optional) set database as conda environmental variable

conda env config vars set IAVS_DB=/path/to/DBs/v1.25

or

conda env config vars set IAVS_DB=/path/to/liteDB_v1.1

Run iav_serotype

Right now, requirement is either 1 set of paired-end short reads per run, or 1 or more long read files. Any and all reads must be decompressed with the .fastq extension.

examples:

short paired-end reads:

iav_serotype -r my_reads/virome.R1.fastq my_reads/virome.R2.fastq -s my_virome_iav -o iav_project --db /path/to/DBs/v1.25

long reads:

iav_serotype -r long_reads/virome1.fastq long_reads/virome2.fastq long_reads/virome3.fastq -s my_lr_virome_iav -o iav_project --db /path/to/DBs/v1.25

Standard Outputs

  • {sample}_per_serotype_summary.tsv <- Main summary table for serotype counts
  • {sample}_influenza_A.sorted.bam <- filtered, sorted alignment file
  • {sample}_per_read_summary.tsv <- per-read summary file
  • {sample}_read_serotype_assignment.pdf <- plot of serotype counts
  • {sample}_{serotype}.txt <- serotype-specific read IDs
  • {sample}_{serotype}.R1.fastq <- serotype-specific reads (optional)
  • {sample}_{serotype}.R2.fastq <- serotype-specific reads (optional)
  • {sample}_read_stats.tsv <- input read stats table

Database notes

lite1.1

Clustered representative sequences from most influenza A serotypes, plus sequences from influenza B, C, and D.

v1.25

NOTE: Use database v1.25 with iav_serotype v0.1.3 or later. A small number of added reference sequences have short sequences that were likely assembled into the genome by mistake. These will cause assignment of non-specific reads as "ambiguous" IAV in iav_serotype v0.1.1, but these alignments are filtered out starting in iav_serotype v0.1.2. Further, in iav_serotype v0.1.3, the minimap2 flag -f 100000 is added to account for very high prevalence minimizers in reference. Thank you.

Description:

Fresh download of all complete influenza A sequences + metadata tables from NCBI virus. Consists of 981,537 complete segements.

  1. Search and download date=2024-06-05, NCBI virus taxid=2955291, length filter( 3000 > 700 )

  2. Then, sequences and metadata were parsed to remove any duplicate accessions and any sequences with uninformative serotype labels, e.g. "H3", "mixed", "HxNx".

  3. Finally, low complexity regions were masked with bbmask.sh with default parameters for low-complexity filtering (available with bbtools)

Zenodo

v1.1

NOTE: it was pointed out to me that several sequences with uninformative serotype labels were included in both version of the database, e.g. "H3", "mixed", "HxNx". Since these poorly-labeled sequences only represent ~1,600/~210,000 sequences in v1.1, performance should not be substantially effected, but these will be removed in future releases.

Added sequences and metadata rows to v1.0, searches on April 26, 2024:

  1. FluDB query Influenza A "complete", collection date(06-01-2022 - 05-31-2023)

  2. NCBI virus taxid=2955291, length filter( 3000 > 700 ), collection date(06-01-2023 - 04-26-2024)

Then, sequences and metadata were parsed to remove any duplicate accessions.

v1.0

using NCBI datasets command line tool, (April 23, 2024), this is how I downloaded Influenza A assmeblies:

datasets download genome taxon 2955291 --assembly-level complete --exclude-atypical --filename Influenza_A_dataset1 --include gbff

This retrieved about about 8,000 genomes/64,000 segments. There are about 1,000,000 segments on GenBank, so I am not sure why they weren't all retrieved.

After unzipping the downloaded directory I used the following script to process these data into a .fasta and table (.tsv)

python parse_iav_genbank/influenza_a_gbf_to_fna_and_table.py ncbi_dataset

Note: I'm working on getting a more complete database, as I think there is relevant diversity not included here.

Future updates

  1. Add Influenza B, C, and D

About

assign sequencing reads to influenza A serotypes

Resources

License

Stars

Watchers

Forks

Languages