This repository contains analyses of genome-wide short tandem repeat (STR) variation in human populations.
The pipeline was applied to the following datasets:
- 1000 Genomes Project (1KGP)
- Human Genome Diversity Project (HGDP)
- Simons Genome Diversity Project (SGDP)
- H3Africa
Before running the pipeline, ensure the following input files are available.
STR genotype VCFs for 1KGP and H3Africa are obtained from:
- EnsembleTR
GitHub: https://github.com/gymreklab/EnsembleTR
(Multi-caller consensus STR genotyping; HipSTR calls extracted via METHODS[2] == 1.)
STR genotypes of HGDP and SGDP are generated using HipSTR:
- HipSTR:
links to be added
The sample metadata file should include the following columns:
- Sample ID (used to align with matrix columns)
- Continental population labels
- Regional population labels
All core analysis scripts are located in the scripts/ directory.
preprocess_vcfs.py
This script converts EnsembleTR or HipSTR VCFs into per-chromosome and genome-wide STR matrices.
Main steps:
- parse STR genotypes (NCOPY, GB)
- filter loci by motif period and variance
- optional: keep only HipSTR calls from EnsembleTR (METHODS[2] == 1)
- produce:
  - str_info.tsv (metadata for STR loci)
  - str_matrix.tsv (samples × STR loci)
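As a rough illustration of the parsing steps above, the sketch below reads one VCF record using only the standard library. It is not the code in preprocess_vcfs.py. It assumes that METHODS is a `|`-separated string of 0/1 flags with HipSTR at index 2, that the motif length is stored in a PERIOD INFO tag, and that the two NCOPY allele values are summed per sample. The function names, the period cutoff of 6, and the example record are all hypothetical.

```python
def parse_info(info_field):
    """Turn a VCF INFO string such as 'A=1;B=2;FLAG' into a dict."""
    info = {}
    for item in info_field.split(";"):
        key, _, value = item.partition("=")
        info[key] = value
    return info


def hipstr_called(info):
    """True if the third METHODS flag is set.

    Assumption: METHODS is a '|'-separated string of 0/1 flags and
    index 2 corresponds to HipSTR.
    """
    flags = info.get("METHODS", "").split("|")
    return len(flags) > 2 and flags[2] == "1"


def copy_number_sums(fields):
    """Sum the NCOPY allele values per sample; None marks a missing call."""
    keys = fields[8].split(":")          # FORMAT column
    idx = keys.index("NCOPY")
    sums = []
    for sample in fields[9:]:            # one column per sample
        parts = sample.split(":")
        value = parts[idx] if idx < len(parts) else "."
        tokens = value.split(",")
        if any(tok in (".", "") for tok in tokens):
            sums.append(None)
        else:
            sums.append(sum(float(tok) for tok in tokens))
    return sums


# Hypothetical record, made up for illustration only.
record = (
    "chr1\t100\t.\tACAC\tACACAC\t.\t.\t"
    "PERIOD=2;METHODS=0|0|1|1\tGT:NCOPY\t0/1:2,3\t.:."
)
fields = record.split("\t")
info = parse_info(fields[7])

row = None
# Keep the locus only if HipSTR called it and the motif is short enough.
if hipstr_called(info) and int(info["PERIOD"]) <= 6:
    row = copy_number_sums(fields)
```

Each retained record yields one vector of per-sample values; stacking these vectors gives the STR matrix, and the parsed INFO fields can populate the locus metadata table.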
integrate_cohorts.py
Used when merging data from 1KGP, HGDP, SGDP, and H3Africa.
Steps:
- identify overlapping loci using PyRanges
- apply variance filtering to each dataset
- remove batch-driven loci using population-level STR mean comparisons
- output merged STR matrices aligned to the same loci
This enables joint population structure analyses.
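The variance and batch filters listed above could look roughly like this in NumPy. The thresholds (`min_var`, `max_diff`), the function names, and the choice to compare the same population genotyped in two cohorts are assumptions made for illustration; the criteria used in integrate_cohorts.py may differ.

```python
import numpy as np


def variance_filter(matrix, min_var=0.1):
    """Mask of loci (columns) whose variance across samples is >= min_var."""
    return np.nanvar(matrix, axis=0) >= min_var


def batch_driven_loci(cohort_a, cohort_b, max_diff=1.0):
    """Mask of loci whose mean differs between two cohorts by more than max_diff.

    Both inputs are samples x loci matrices for the same population,
    restricted to the same loci in the same order.
    """
    mean_a = np.nanmean(cohort_a, axis=0)
    mean_b = np.nanmean(cohort_b, axis=0)
    return np.abs(mean_a - mean_b) > max_diff


# Toy matrices (3 samples x 3 loci), values invented for illustration.
cohort_a = np.array([[10, 5, 7], [12, 5, 7], [14, 5, 8]], dtype=float)
cohort_b = np.array([[11, 5, 12], [13, 5, 12], [12, 5, 13]], dtype=float)

keep = (
    variance_filter(cohort_a)
    & variance_filter(cohort_b)
    & ~batch_driven_loci(cohort_a, cohort_b)
)
merged = np.vstack([cohort_a[:, keep], cohort_b[:, keep]])
```

In this toy case the second locus is dropped for lack of variance and the third for a cohort-level mean shift, leaving one shared locus in the merged matrix.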
unsupervised_clustering.py, genetic_distances.py, geo_distances.py
These scripts provide a direct comparison of population structure resolution between STRs and SNPs.
Features:
- PCA-based dimensionality reduction
- K-means clustering
- Adjusted Rand Index (ARI) evaluation at:
  - continental population levels
  - regional population levels
- hierarchical clustering of genetic distance matrices (STRs and SNPs)
- comparison of STR/SNP distances to geographic distances using Mantel tests
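For readers unfamiliar with the Mantel test mentioned in the last item, here is a minimal permutation-based sketch in NumPy. It correlates the upper triangles of two distance matrices and builds a null distribution by jointly permuting the rows and columns of one of them. The function name, the Pearson statistic, the one-sided p-value, and the toy coordinates are assumptions for illustration and may not match geo_distances.py.

```python
import numpy as np


def mantel_test(dist_a, dist_b, n_perm=999, seed=0):
    """Return (Pearson r, one-sided permutation p-value) for two square distance matrices."""
    a = np.asarray(dist_a, dtype=float)
    b = np.asarray(dist_b, dtype=float)
    upper = np.triu_indices_from(a, k=1)   # each pair counted once
    r_obs = np.corrcoef(a[upper], b[upper])[0, 1]

    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(a.shape[0])
        # Permute rows and columns together so the matrix stays a valid distance matrix.
        a_perm = a[np.ix_(perm, perm)]
        r_perm = np.corrcoef(a_perm[upper], b[upper])[0, 1]
        hits += int(r_perm >= r_obs)
    return r_obs, (hits + 1) / (n_perm + 1)


# Toy example: five populations on a line, genetic distance proportional to geography.
coords = np.array([0.0, 1.0, 3.0, 6.0, 10.0])
geo = np.abs(coords[:, None] - coords[None, :])
gen = 0.5 * geo
r, p = mantel_test(gen, geo)
```

The same call can be made once with an STR-based distance matrix and once with a SNP-based one to compare how strongly each tracks geographic distance.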
supervised_assignment.py
Two classifiers are implemented:
- Random Forest
- Naive Bayes
Outputs include accuracy metrics at both continental and regional levels for STRs and SNPs.
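To show what a Naive Bayes assignment on STR features involves, the sketch below implements a small Gaussian Naive Bayes in NumPy. Which Naive Bayes variant supervised_assignment.py uses is not stated here, so the Gaussian likelihood, the class name `TinyGaussianNB`, the variance floor, and the toy data are assumptions; in practice a library implementation would normally be preferred.

```python
import numpy as np


class TinyGaussianNB:
    """Gaussian Naive Bayes: each locus is modelled independently per population."""

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        self.mean_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        # Small floor avoids division by zero for loci that are constant in a class.
        self.var_ = np.array([X[y == c].var(axis=0) for c in self.classes_]) + 1e-6
        self.log_prior_ = np.log(np.array([(y == c).mean() for c in self.classes_]))
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        diff = X[:, None, :] - self.mean_[None, :, :]
        # Log-density summed over loci: result has shape (n_samples, n_classes).
        log_lik = -0.5 * (
            np.log(2 * np.pi * self.var_)[None, :, :] + diff**2 / self.var_[None, :, :]
        ).sum(axis=2)
        return self.classes_[np.argmax(log_lik + self.log_prior_, axis=1)]


# Toy training data: 6 samples x 2 loci with invented population labels.
X_train = [[10, 5], [11, 5], [10, 6], [15, 9], [16, 9], [15, 10]]
y_train = ["popA", "popA", "popA", "popB", "popB", "popB"]
model = TinyGaussianNB().fit(X_train, y_train)

predicted = model.predict([[10.5, 5.5], [15.5, 9.5]])
accuracy = float(np.mean(predicted == np.array(["popA", "popB"])))
```

Running the same procedure with continental labels and then with regional labels, on STR and SNP matrices in turn, gives the kind of accuracy comparison described above.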
dNMF.py
The model assumes ancestral components are encoded jointly in STR expansions and contractions.
Input:
- STR genotype matrices from 1KGP and HGDP+SGDP
Output:
- Ancestry coefficient matrices
- STR contribution matrices
Using dNMF, we detected:
- K = 12 ancestral components in 1KGP
- K = 11 ancestral components in HGDP+SGDP
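The internals of dNMF.py are not described here. Purely as an illustration of how expansions and contractions can be encoded jointly in a non-negative factorization, the sketch below centers each locus, splits the result into a positive (expansion) block and a negative (contraction) block, and factorizes the stacked matrix with standard multiplicative-update NMF. The splitting scheme, function names, and parameters are assumptions and should not be read as the repository's model.

```python
import numpy as np


def split_directions(matrix):
    """Center each locus, then stack expansions and contractions as non-negative blocks."""
    centered = matrix - matrix.mean(axis=0)
    expansions = np.clip(centered, 0, None)
    contractions = np.clip(-centered, 0, None)
    return np.hstack([expansions, contractions])   # samples x (2 * loci)


def nmf(V, k, n_iter=200, seed=0, eps=1e-9):
    """Factorize V ~ W @ H with multiplicative updates (Frobenius loss)."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], k)) + eps   # samples x components
    H = rng.random((k, V.shape[1])) + eps   # components x (2 * loci)
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Toy STR matrix: 6 samples x 4 loci of invented copy numbers.
toy = np.random.default_rng(1).integers(5, 20, size=(6, 4)).astype(float)
V = split_directions(toy)
W, H = nmf(V, k=2)
```

Under these assumptions, `W` plays the role of an ancestry coefficient matrix and `H` the role of an STR contribution matrix, with the first half of its columns describing expansions and the second half contractions.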
STR_signatures.py
- Identify top STR loci per direction per component
- Define direction-specific STR signatures
- Perform enrichment tests for STR motif lengths and genomic regions
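The steps above can be sketched as follows, assuming a components × loci matrix of non-negative contributions for a single direction. Ranking loci by contribution and using a one-sided hypergeometric test for enrichment are illustrative choices, and the function names and toy numbers are hypothetical; STR_signatures.py may use different selection rules or tests.

```python
from math import comb

import numpy as np


def top_loci(contrib, locus_ids, n_top=2):
    """Return the n_top highest-contribution loci for each component (row)."""
    order = np.argsort(-np.asarray(contrib, dtype=float), axis=1)[:, :n_top]
    return [[locus_ids[j] for j in row] for row in order]


def hypergeom_enrichment(n_total, n_category, n_selected, n_hits):
    """One-sided P(X >= n_hits) when n_selected loci are drawn from n_total,
    of which n_category belong to the category of interest."""
    upper = min(n_category, n_selected)
    tail = sum(
        comb(n_category, x) * comb(n_total - n_category, n_selected - x)
        for x in range(n_hits, upper + 1)
    )
    return tail / comb(n_total, n_selected)


# Toy contributions for 2 components x 4 loci (one direction), invented values.
contrib = [[0.1, 0.9, 0.3, 0.0],
           [0.7, 0.2, 0.0, 0.8]]
locus_ids = ["L1", "L2", "L3", "L4"]
signatures = top_loci(contrib, locus_ids, n_top=2)

# Example: 10 loci in total, 4 are dinucleotide repeats,
# a signature of 3 loci contains 2 of them.
p_value = hypergeom_enrichment(n_total=10, n_category=4, n_selected=3, n_hits=2)
```

Applying `top_loci` separately to the expansion and contraction contributions gives direction-specific signatures, and the enrichment function can be reused for any categorical annotation such as motif length or genomic region.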
This repository includes Jupyter notebooks and R scripts for generating figures, located in the notebooks/ and R_scripts/ directories, respectively.
- Python 3.10 or newer
- numpy
- pandas
- scipy
- scikit-learn
- statsmodels
- pyranges
The packages can be installed via pip or conda, for example:
pip install numpy pandas scipy scikit-learn statsmodels pyranges
