STR (Short Tandem Repeat) annotation tool for VCF files.
strvcf_annotator is a Python library and CLI tool for annotating variants in VCF files that overlap short tandem repeat (STR) regions. The tool converts SNPs, SNVs and indels into full repeat sequences and adds STR metadata.
The package is available on PyPI. To install it, run:
pip install strvcf-annotator

# Install from source
git clone https://github.com/acg-team/strvcf_annotator.git
cd strvcf_annotator
pip install -e .

# Dev dependencies
pip install -r requirements_dev.txt

# Annotate a single VCF
strvcf-annotator --input input.vcf --str-bed repeats.bed --output output.vcf
# Batch-process a directory
strvcf-annotator --input-dir vcf_files/ --str-bed repeats.bed --output-dir annotated/
# With verbose logging
strvcf-annotator --input input.vcf --str-bed repeats.bed --output output.vcf --verbose

from strvcf_annotator import STRAnnotator
# Create the annotator
annotator = STRAnnotator('repeats.bed')
# Annotate a single file
annotator.annotate_vcf_file('input.vcf', 'output.vcf')
# Batch processing
annotator.process_directory('vcf_files/', 'annotated/')
# Streaming processing
import pysam
vcf_in = pysam.VariantFile('input.vcf')
for record in annotator.annotate_vcf_stream(vcf_in):
    print(f"Repeat unit: {record.info['RU']}")

CHROM START END PERIOD RU
chr1 100 115 3 CAG
chr1 200 212 4 ATCG
chr2 300 318 3 GAT

- CHROM: Chromosome name
- START: Start position (0-based, BED format)
- END: End position (0-based, exclusive)
- PERIOD: Repeat unit length
- RU: Repeat unit sequence
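The column layout above can be read with a few lines of Python. This is a minimal sketch, not the loader used by strvcf_annotator: the names `StrRegion`, `parse_str_bed` and `overlaps` are hypothetical, and it assumes whitespace-separated columns, an optional `CHROM ...` header row, and that PERIOD always equals the length of RU. The `overlaps` helper shows how a 1-based VCF position maps onto the 0-based, end-exclusive BED interval.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class StrRegion:
    """One STR region from the BED file (hypothetical helper type)."""
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive
    period: int  # repeat unit length
    unit: str    # repeat unit sequence


def parse_str_bed(lines: Iterable[str]) -> List[StrRegion]:
    """Parse BED rows, skipping blank lines, comments and a header row."""
    regions = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith("#") or fields[0] == "CHROM":
            continue
        chrom, start, end, period, unit = fields[:5]
        region = StrRegion(chrom, int(start), int(end), int(period), unit)
        # Assumption: PERIOD is the length of RU, as in the rows shown above.
        if region.period != len(region.unit):
            raise ValueError(f"PERIOD does not match RU length: {line!r}")
        regions.append(region)
    return regions


def overlaps(region: StrRegion, chrom: str, pos: int, ref: str) -> bool:
    """True if a VCF record (1-based POS, REF allele) overlaps the region."""
    var_start = pos - 1              # convert to 0-based
    var_end = var_start + len(ref)   # exclusive end of the REF allele
    return (chrom == region.chrom
            and var_start < region.end
            and var_end > region.start)


bed_rows = [
    "CHROM START END PERIOD RU",
    "chr1 100 115 3 CAG",
    "chr1 200 212 4 ATCG",
]
regions = parse_str_bed(bed_rows)
```

With these rows, a record at `chr1`, POS 101 with REF `CAGCAGCAG` falls inside the first region, because POS 101 corresponds to 0-based offset 100.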
A standard VCF with variants. FORMAT fields:
- Required: GT (genotype)
- Optional: AD (allelic depth), DP (total depth)
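Before running the annotator, it can help to confirm that the header declares the fields listed above. The following is a rough pre-flight sketch in plain Python; `declared_format_ids` and `check_input_header` are hypothetical helpers, not part of strvcf_annotator, and the check only looks at `##FORMAT` header lines, not at the records themselves.

```python
import re
from typing import Dict, Iterable, Set

# Matches header lines such as: ##FORMAT=<ID=GT,Number=1,...>
FORMAT_ID = re.compile(r"^##FORMAT=<ID=([^,>]+)")


def declared_format_ids(header_lines: Iterable[str]) -> Set[str]:
    """Collect the IDs of all FORMAT fields declared in the header."""
    ids = set()
    for line in header_lines:
        match = FORMAT_ID.match(line)
        if match:
            ids.add(match.group(1))
    return ids


def check_input_header(header_lines: Iterable[str]) -> Dict[str, bool]:
    """Raise if GT is missing; report which optional fields are declared."""
    ids = declared_format_ids(header_lines)
    if "GT" not in ids:
        raise ValueError("VCF header declares no GT FORMAT field")
    return {name: name in ids for name in ("AD", "DP")}


example_header = [
    "##fileformat=VCFv4.2",
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    '##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Total depth">',
]
optional_present = check_input_header(example_header)
```

For the example header, `optional_present` reports that DP is declared and AD is not.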
The annotated VCF contains additional fields:
- RU: Repeat unit
- PERIOD: Repeat period (unit length)
- REF: Reference copy number
- PERFECT: TRUE if both alleles are perfect repeats
- REPCN: Genotype expressed as repeat copy numbers
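To make the relationship between these fields concrete, here is an illustrative sketch of how REF, PERFECT and REPCN values could be derived from allele sequences. It is not the tool's implementation: the function names are hypothetical, and it assumes that the copy number is the number of whole repeat units that fit in the allele and that an allele is "perfect" when it consists of exact copies of the unit with nothing left over.

```python
from typing import Optional, Sequence, Tuple


def copy_number(sequence: str, unit: str) -> int:
    """Number of whole repeat units that fit in the sequence (by length)."""
    return len(sequence) // len(unit)


def is_perfect(sequence: str, unit: str) -> bool:
    """True if the sequence is one or more exact copies of the unit."""
    copies, remainder = divmod(len(sequence), len(unit))
    return copies > 0 and remainder == 0 and sequence == unit * copies


def repcn(gt: Sequence[Optional[int]], alleles: Sequence[str],
          unit: str) -> Tuple[Optional[int], ...]:
    """Translate genotype allele indices into repeat copy numbers.

    Missing alleles (None) are passed through unchanged.
    """
    return tuple(
        None if index is None else copy_number(alleles[index], unit)
        for index in gt
    )


# Alleles as (REF, ALT) for a CAG repeat; genotype 0/1.
alleles = ("CAGCAGCAG", "CAGCAGCAGCAG")
ref_copies = copy_number(alleles[0], "CAG")
genotype_copies = repcn((0, 1), alleles, "CAG")
both_perfect = all(is_perfect(allele, "CAG") for allele in alleles)
```

Under these assumptions, a reference allele with three CAG units and an alternate allele with four give a reference copy number of 3 and a heterozygous REPCN of 3,4.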
##INFO=<ID=RU,Number=1,Type=String,Description="Repeat unit">
##INFO=<ID=PERIOD,Number=1,Type=Integer,Description="Repeat period">
##INFO=<ID=REF,Number=1,Type=Integer,Description="Reference copy number">
##INFO=<ID=PERFECT,Number=1,Type=String,Description="Perfect repeat indicator">
##FORMAT=<ID=REPCN,Number=2,Type=Integer,Description="Repeat copy number">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr1 101 . CAGCAGCAG CAGCAGCAGCAG . . RU=CAG;PERIOD=3;REF=3;PERFECT=TRUE GT:REPCN 0/1:3,4

from strvcf_annotator import STRAnnotator
from strvcf_annotator.parsers.base import BaseVCFParser

class CustomParser(BaseVCFParser):
    def get_genotype(self, record, sample_idx):
        # Your logic for extracting the genotype
        pass

    def has_variant(self, record, sample_idx):
        # Your logic for determining if there is a variant
        pass

    def extract_info(self, record, sample_idx):
        # Your logic for extracting additional fields
        pass

    def validate_record(self, record):
        # Your logic for validating the record
        pass

# Usage
annotator = STRAnnotator('repeats.bed', parser=CustomParser())

# Install the package in editable (dev) mode
pip install -e .

This tool only accepts normalized VCFs. Please normalize with bcftools before running. Example (produces a normalized, indexed VCF):
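# Replace reference.fa with the exact reference used for the VCF
# -m needs a mode argument; -any (split multiallelic sites) is an assumed choice, adjust as needed
bcftools norm -f reference.fa -m -any -Oz -o normalized.vcf.gz input.vcf
bcftools index -t normalized.vcf.gz

The tool automatically sorts the VCF in memory, but for large files pre-sorting is recommended: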
bcftools sort input.vcf -o sorted.vcf

If you see warnings about a reference mismatch, check:
- The correctness of the STR BED file
- Matching reference genome versions
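One quick way to spot mismatched genome builds or naming schemes is to compare chromosome names between the VCF and the BED file, since a `1` versus `chr1` difference is a common symptom. The sketch below is only a diagnostic aid under that assumption; the helper names are hypothetical and it reads plain-text lines, so a compressed VCF would need to be decompressed first.

```python
from typing import Dict, Iterable, List, Set


def first_column_names(lines: Iterable[str], skip_header: str = "") -> Set[str]:
    """Collect first-column values, ignoring comments and blank lines."""
    names = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        name = line.split(None, 1)[0]
        if skip_header and name == skip_header:
            continue
        names.add(name)
    return names


def contig_report(vcf_lines: Iterable[str],
                  bed_lines: Iterable[str]) -> Dict[str, List[str]]:
    """Report which chromosome names are shared or unique to each file."""
    vcf_names = first_column_names(vcf_lines)
    bed_names = first_column_names(bed_lines, skip_header="CHROM")
    return {
        "shared": sorted(vcf_names & bed_names),
        "vcf_only": sorted(vcf_names - bed_names),
        "bed_only": sorted(bed_names - vcf_names),
    }


vcf_example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSample1",
    "1\t101\t.\tCAGCAGCAG\tCAGCAGCAGCAG\t.\t.\t.\tGT\t0/1",
]
bed_example = ["CHROM START END PERIOD RU", "chr1 100 115 3 CAG"]
report = contig_report(vcf_example, bed_example)
```

An empty `shared` list, as in this example, suggests that the two files use different chromosome naming conventions or reference builds.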
Contributions are welcome! For major changes, please open an issue first to discuss what you’d like to change. Please ensure:
- All tests pass
- Code follows existing style
- New features include tests
- Documentation is updated
MIT License
Test BED files were taken from the ConSTRain repository (https://github.com/acg-team/ConSTRain).