strvcf_annotator

STR (Short Tandem Repeat) annotation tool for VCF files.

strvcf_annotator is a Python library and CLI tool for annotating variants in VCF files that overlap short tandem repeat (STR) regions. The tool converts SNPs, SNVs and indels into full repeat sequences and adds STR metadata.

Installation

The package is available on PyPI. To install it, run:

pip install strvcf-annotator

# Install from source
git clone https://github.com/acg-team/strvcf_annotator.git
cd strvcf_annotator
pip install -e .

# Dev dependencies
pip install -r requirements_dev.txt

Quick Start

Command Line

# Annotate a single VCF
strvcf-annotator --input input.vcf --str-bed repeats.bed --output output.vcf

# Batch-process a directory
strvcf-annotator --input-dir vcf_files/ --str-bed repeats.bed --output-dir annotated/

# With verbose logging
strvcf-annotator --input input.vcf --str-bed repeats.bed --output output.vcf --verbose

Library Usage

from strvcf_annotator import STRAnnotator

# Create the annotator
annotator = STRAnnotator('repeats.bed')

# Annotate a single file
annotator.annotate_vcf_file('input.vcf', 'output.vcf')

# Batch processing
annotator.process_directory('vcf_files/', 'annotated/')

# Streaming processing
import pysam
vcf_in = pysam.VariantFile('input.vcf')
for record in annotator.annotate_vcf_stream(vcf_in):
    print(f"Repeat unit: {record.info['RU']}")

Input format

BED file with STR regions

CHROM   START   END     PERIOD  RU
chr1    100     115     3       CAG
chr1    200     212     4       ATCG
chr2    300     318     3       GAT

CHROM: Chromosome name
START: Start position (0-based, BED format)
END: End position (0-based, exclusive)
PERIOD: Repeat unit length
RU: Repeat unit sequence
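
As a rough illustration of the coordinate convention described above, the sketch below parses the three example rows and derives the number of reference repeat copies from the region length. The `STRRegion` type and both helper functions are hypothetical names used only for this example; they are not part of the strvcf_annotator API.

```python
from typing import NamedTuple


class STRRegion(NamedTuple):
    """One BED row (hypothetical helper type, not part of the library)."""
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive
    period: int  # repeat unit length
    unit: str    # repeat unit sequence


def parse_bed_line(line: str) -> STRRegion:
    """Parse one whitespace-separated BED row into an STRRegion."""
    chrom, start, end, period, unit = line.split()[:5]
    return STRRegion(chrom, int(start), int(end), int(period), unit)


def reference_copies(region: STRRegion) -> float:
    """Region length divided by the repeat unit length."""
    return (region.end - region.start) / region.period


rows = [
    "chr1\t100\t115\t3\tCAG",
    "chr1\t200\t212\t4\tATCG",
    "chr2\t300\t318\t3\tGAT",
]
regions = [parse_bed_line(row) for row in rows]
copies = [reference_copies(region) for region in regions]

# Because BED is 0-based and half-open, the first region covers the
# 1-based VCF positions start + 1 .. end.
first_vcf_span = (regions[0].start + 1, regions[0].end)

print(copies)          # [5.0, 3.0, 6.0]
print(first_vcf_span)  # (101, 115)
```

A half-open interval makes the region length simply END minus START, which is why no "+ 1" appears in the copy-number calculation.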

VCF file

A standard VCF file with variant records. It must contain:

FORMAT field GT (genotype)

Output format

The annotated VCF contains additional fields:

INFO fields

RU: Repeat unit
PERIOD: Repeat period (unit length)
REF: Reference copy number
PERFECT: TRUE if both alleles are perfect repeats

FORMAT fields

REPCN: Genotype expressed as repeat copy numbers
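
To make the relationship between GT, the allele sequences, and REPCN concrete, here is a minimal sketch that counts consecutive repeat units in each allele and maps the genotype indices to copy numbers. The function names are hypothetical, the counting rule (leading perfect copies only) is an assumption, and missing genotypes (".") are not handled; the library may treat imperfect repeats differently.

```python
from typing import List


def count_copies(allele: str, unit: str) -> int:
    """Count back-to-back copies of `unit` starting at the allele's first base."""
    if not unit:
        raise ValueError("repeat unit must not be empty")
    copies = 0
    while allele.startswith(unit, copies * len(unit)):
        copies += 1
    return copies


def is_perfect(allele: str, unit: str) -> bool:
    """True if the allele consists of whole copies of `unit` and nothing else."""
    return count_copies(allele, unit) * len(unit) == len(allele)


def repcn(gt: str, alleles: List[str], unit: str) -> List[int]:
    """Translate a genotype such as '0/1' into per-allele copy numbers."""
    indices = [int(i) for i in gt.replace("|", "/").split("/")]
    return [count_copies(alleles[i], unit) for i in indices]


alleles = ["CAGCAGCAG", "CAGCAGCAGCAG"]  # index 0 = REF, index 1 = ALT
print(repcn("0/1", alleles, "CAG"))                   # [3, 4]
print(all(is_perfect(a, "CAG") for a in alleles))     # True
```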

Example

##INFO=<ID=RU,Number=1,Type=String,Description="Repeat unit">
##INFO=<ID=PERIOD,Number=1,Type=Integer,Description="Repeat period">
##INFO=<ID=REF,Number=1,Type=Integer,Description="Reference copy number">
##INFO=<ID=PERFECT,Number=1,Type=String,Description="Perfect repeat indicator">
##FORMAT=<ID=REPCN,Number=2,Type=Integer,Description="Repeat copy number">

#CHROM  POS  ID  REF        ALT           QUAL  FILTER  INFO                                FORMAT    Sample1
chr1    101  .   CAGCAGCAG  CAGCAGCAGCAG  .     .       RU=CAG;PERIOD=3;REF=3;PERFECT=TRUE  GT:REPCN  0/1:3,4
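
The record above lists REF and ALT as full repeat sequences. The sketch below shows one way a short indel could be rewritten into that form, assuming the variant lies entirely inside the STR region and the region's reference sequence is already known. `expand_to_region` and its parameters are hypothetical; the tool's actual procedure may differ, for example when the anchor base of a left-aligned indel falls just outside the region.

```python
from typing import Tuple


def expand_to_region(region_seq: str, region_start: int,
                     pos: int, ref: str, alt: str) -> Tuple[str, str]:
    """Rewrite a variant as (full REF, full ALT) spanning the whole region.

    `region_start` and `pos` are 1-based VCF coordinates.
    """
    offset = pos - region_start
    if offset < 0 or offset + len(ref) > len(region_seq):
        raise ValueError("variant is not fully inside the STR region")
    if region_seq[offset:offset + len(ref)] != ref:
        raise ValueError("REF allele does not match the region sequence")
    prefix = region_seq[:offset]
    suffix = region_seq[offset + len(ref):]
    return region_seq, prefix + alt + suffix


# Insertion of one CAG unit inside a three-copy region starting at position 101.
full_ref, full_alt = expand_to_region("CAGCAGCAG", 101, 103, "G", "GCAG")
print(full_ref, full_alt)  # CAGCAGCAG CAGCAGCAGCAG

# Deletion of one CAG unit from the same region.
_, deleted = expand_to_region("CAGCAGCAG", 101, 103, "GCAG", "G")
print(deleted)             # CAGCAG
```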

Extending functionality

Creating a custom parser

from strvcf_annotator.parsers.base import BaseVCFParser

class CustomParser(BaseVCFParser):
    def get_genotype(self, record, sample_idx):
        # Your logic for extracting the genotype
        pass

    def has_variant(self, record, sample_idx):
        # Your logic for determining if there is a variant
        pass

    def extract_info(self, record, sample_idx):
        # Your logic for extracting additional fields
        pass

    def validate_record(self, record):
        # Your logic for validating the record
        pass

# Usage
from strvcf_annotator import STRAnnotator

annotator = STRAnnotator('repeats.bed', parser=CustomParser())

Documentation

Full documentation is available at:

https://acg-team.github.io/strvcf_annotator/

Troubleshooting

Issue: Unnormalized VCF

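This tool only accepts normalized VCFs. Normalize the file with bcftools before running the annotator. The example below writes a normalized, bgzip-compressed VCF and indexes it:

# Replace reference.fa with the exact reference used for the VCF
# -m -any splits multiallelic records; adjust this option to your data
bcftools norm -f reference.fa -m -any -O z -o normalized.vcf.gz input.vcf
bcftools index -t normalized.vcf.gz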

Issue: Unsorted VCF

The tool automatically sorts the VCF in memory, but for large files pre-sorting is recommended:

bcftools sort input.vcf -o sorted.vcf
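
To decide whether pre-sorting is needed, a quick check along the following lines may help. It is only a sketch: `first_unsorted_record` is a hypothetical helper that works on uncompressed VCF text lines, and it only verifies that each contig forms one contiguous block with non-decreasing positions. It does not compare contig order against the VCF header, which bcftools sort does take into account.

```python
from typing import Iterable, Optional


def first_unsorted_record(lines: Iterable[str]) -> Optional[int]:
    """Return the 1-based index of the first out-of-order data line, or None."""
    seen_contigs = set()
    last_chrom = None
    last_pos = -1
    data_lines = (row for row in lines if row.strip() and not row.startswith("#"))
    for number, row in enumerate(data_lines, start=1):
        chrom, pos_text = row.split("\t", 2)[:2]
        pos = int(pos_text)
        if chrom != last_chrom:
            if chrom in seen_contigs:
                return number  # contig reappears after a different contig
            seen_contigs.add(chrom)
            last_chrom, last_pos = chrom, pos
        elif pos < last_pos:
            return number      # position goes backwards within a contig
        else:
            last_pos = pos
    return None


sorted_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID",
    "chr1\t101\t.",
    "chr1\t205\t.",
    "chr2\t300\t.",
]
unsorted_lines = ["chr1\t205\t.", "chr1\t101\t."]

print(first_unsorted_record(sorted_lines))    # None
print(first_unsorted_record(unsorted_lines))  # 2
```

For a real file, the lines could come from `open('input.vcf')`; compressed files would need to be read through `gzip.open(..., 'rt')` instead.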

Issue: Reference mismatch

If you see warnings about a reference mismatch, check:

That the STR BED file is correct
That the reference genome versions match (for example, between the VCF and the BED file)
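
As a starting point for checking the BED file, the sketch below flags rows whose columns are internally inconsistent. `check_bed_row` is a hypothetical helper, not a library function, and the rules it applies are assumptions: a region whose length is not a whole multiple of PERIOD may still be legitimate (for example, a partial final repeat), so treat the messages as hints rather than errors.

```python
from typing import List


def check_bed_row(chrom: str, start: int, end: int,
                  period: int, unit: str) -> List[str]:
    """Return human-readable problems found in one STR BED row."""
    problems = []
    if start < 0 or end <= start:
        problems.append(f"{chrom}:{start}-{end}: END must be greater than START")
    if len(unit) != period:
        problems.append(
            f"{chrom}:{start}-{end}: PERIOD is {period} but RU '{unit}' "
            f"has length {len(unit)}"
        )
    if period > 0 and (end - start) % period != 0:
        problems.append(
            f"{chrom}:{start}-{end}: region length is not a multiple of PERIOD"
        )
    return problems


print(check_bed_row("chr1", 100, 115, 3, "CAG"))   # []
for message in check_bed_row("chr1", 200, 212, 3, "ATCG"):
    print(message)
```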

Contributing

Contributions are welcome! For major changes, please open an issue first to discuss what you’d like to change. Please ensure:

All tests pass
Code follows existing style
New features include tests
Documentation is updated

License

MIT License

Credits

Test BED files were taken from the ConSTRain repository: https://github.com/acg-team/ConSTRain
