A Python tool for parsing VCF (Variant Call Format) files and storing genetic variants in a PostgreSQL database with aggregated frequency data across different sample collections.
VarMTdb processes VCF files and stores genetic variants in a relational database schema designed for efficient querying of variant frequencies across populations and gene-based analysis. The tool automatically creates the database structure and handles the data insertion process.
- Automated Database Setup: Creates PostgreSQL database and tables automatically
- VCF Processing: Parses VCF files using pysam
- Aggregated Storage: Stores variant frequencies across different sample collections
- Conflict Handling: Uses PostgreSQL UPSERT operations to handle duplicate variants
- Performance Optimization: Creates essential database indexes for fast querying
- Web Interface: Streamlit app for querying variants by chromosome, gene, and allele frequency
- Single-Container Deployment: Docker image bundling PostgreSQL and the Streamlit UI
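To make the conflict handling and aggregated storage concrete, here is a minimal sketch of an UPSERT. It uses Python's built-in sqlite3 module as a stand-in so it runs without a server; SQLite (3.24 or newer) accepts the same `INSERT ... ON CONFLICT ... DO UPDATE` form as PostgreSQL. The table name, the columns, and the choice to overwrite counts on conflict are assumptions made for illustration, not VarMT's actual schema (see docs/database_documentation.md for that).

```python
import sqlite3

# Hypothetical table: one row per (variant, collection) pair.
SCHEMA = """
CREATE TABLE variant_frequency (
    chrom TEXT NOT NULL,
    pos INTEGER NOT NULL,
    ref TEXT NOT NULL,
    alt TEXT NOT NULL,
    collection TEXT NOT NULL,
    allele_count INTEGER NOT NULL,
    allele_number INTEGER NOT NULL,
    PRIMARY KEY (chrom, pos, ref, alt, collection)
)
"""

# With a PostgreSQL driver such as psycopg2 the placeholders would be %s, not ?.
UPSERT = """
INSERT INTO variant_frequency
    (chrom, pos, ref, alt, collection, allele_count, allele_number)
VALUES (?, ?, ?, ?, ?, ?, ?)
ON CONFLICT (chrom, pos, ref, alt, collection) DO UPDATE SET
    allele_count = excluded.allele_count,
    allele_number = excluded.allele_number
"""


def upsert_variant(conn, row):
    """Insert a row, or update the counts if the same key already exists."""
    conn.execute(UPSERT, row)


def pooled_allele_frequency(conn, chrom, pos, ref, alt):
    """Allele frequency pooled over every collection that has the variant."""
    ac, an = conn.execute(
        "SELECT SUM(allele_count), SUM(allele_number) FROM variant_frequency "
        "WHERE chrom = ? AND pos = ? AND ref = ? AND alt = ?",
        (chrom, pos, ref, alt),
    ).fetchone()
    return ac / an if an else None


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
upsert_variant(conn, ("chr1", 12345, "A", "G", "cohort_a", 10, 100))
upsert_variant(conn, ("chr1", 12345, "A", "G", "cohort_b", 30, 100))
# Same variant and collection again: the existing row is updated, not duplicated.
upsert_variant(conn, ("chr1", 12345, "A", "G", "cohort_a", 20, 100))

row_count = conn.execute("SELECT COUNT(*) FROM variant_frequency").fetchone()[0]
pooled_af = pooled_allele_frequency(conn, "chr1", 12345, "A", "G")
```

Keying on the variant plus its collection is one way to keep per-collection frequencies separate while still allowing a pooled value at query time.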
Docker is the recommended way to run VarMT. Everything (PostgreSQL, the Streamlit UI, and the CLI) lives in one container.
docker build -t varmt .
docker run -d \
--name varmt-app \
-v varmt-pgdata:/data/postgresql \
-p 8502:8501 \
varmt
- -v varmt-pgdata:/data/postgresql: named volume so the database survives container restarts
- -p 8502:8501: exposes the Streamlit UI on host port 8502 (change if needed)
- On first start, the container initializes PostgreSQL and creates the varmt database and tables
./docker/ingest.sh /path/to/your.vcf
The script copies the VCF into the container and runs the ingestion. Each call adds a new collection, so run it once per VCF file.
./docker/indexing.sh
Creates the indexes that make queries fast on large datasets. Only needs to be run once; PostgreSQL maintains the indexes automatically for subsequent ingestions.
Browse to http://localhost:8502 to query the variants. If running on a remote server, set up an SSH tunnel:
ssh -L 1455:localhost:8502 user@server
Then open http://localhost:1455 on your local machine.
To add more data, run ./docker/ingest.sh /path/to/another.vcf again; collections accumulate over time. There is no need to restart the container or recreate indexes.
Use a manual installation only if you don't want to run Docker. It requires a PostgreSQL server you manage yourself.
conda env create -f environment.yml
conda activate varMTenv
Using a .pgpass file is recommended for security reasons. You can also pass the password directly via -p.
With .pgpass (recommended):
python3 src/vcf2db.py -d database_name -u username -l host -c -t -i -x -v path/to/vcf
With password on command line:
python3 src/vcf2db.py -d database_name -u username -p password -l host -c -t -i -x -v path/to/vcf
The Streamlit app requires config/db_connection.yml with database connection parameters. It supports .pgpass authentication (omit the password in the config) or direct password specification.
streamlit run src/streamlit_app.py
The following arguments apply to both the Docker workflow (when calling vcf2db.py via docker exec) and manual installation.
Database Connection:
- -d, --database: PostgreSQL database name (required)
- -u, --username: PostgreSQL username (required)
- -p, --password: PostgreSQL password (optional if using .pgpass)
- -l, --host: PostgreSQL host name (required)
Database Operations:
- -c, --create: Delete and recreate database if existing
- -t, --tables: Create the database tables
- -i, --insert: Insert VCF data into tables
- -x, --indexes: Create indexes on tables
Data Input:
- -v, --vcf: Path to VCF file or directory containing VCF files
- -r, --reference_genome: Reference genome version (optional, default=GRCh38)
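The argument list above can be sketched with argparse as follows. This is an illustrative reconstruction from the documented flags only, not the contents of src/vcf2db_cli.py; the use of `store_true` switches and the helper `connection_kwargs` are assumptions. The helper shows one way the optional password can work: when no password is supplied, libpq-based drivers such as psycopg2 fall back to the .pgpass file.

```python
import argparse


def build_parser():
    """Parser mirroring the documented flags (a sketch, not the real CLI)."""
    parser = argparse.ArgumentParser(prog="vcf2db.py")

    conn = parser.add_argument_group("Database Connection")
    conn.add_argument("-d", "--database", required=True, help="PostgreSQL database name")
    conn.add_argument("-u", "--username", required=True, help="PostgreSQL username")
    conn.add_argument("-p", "--password", default=None, help="optional if using .pgpass")
    conn.add_argument("-l", "--host", required=True, help="PostgreSQL host name")

    ops = parser.add_argument_group("Database Operations")
    ops.add_argument("-c", "--create", action="store_true", help="delete and recreate database")
    ops.add_argument("-t", "--tables", action="store_true", help="create the database tables")
    ops.add_argument("-i", "--insert", action="store_true", help="insert VCF data into tables")
    ops.add_argument("-x", "--indexes", action="store_true", help="create indexes on tables")

    data = parser.add_argument_group("Data Input")
    data.add_argument("-v", "--vcf", help="VCF file or directory of VCF files")
    data.add_argument("-r", "--reference_genome", default="GRCh38")
    return parser


def connection_kwargs(args):
    """Hypothetical helper: keyword arguments in the shape psycopg2.connect accepts.

    The password key is left out when none was given, so that libpq
    looks the password up in ~/.pgpass instead.
    """
    kwargs = {"dbname": args.database, "user": args.username, "host": args.host}
    if args.password is not None:
        kwargs["password"] = args.password
    return kwargs


args = build_parser().parse_args(
    ["-d", "myvariantdb", "-u", "postgres", "-l", "localhost", "-i", "-v", "data/"]
)
kwargs = connection_kwargs(args)
```

Because each database operation is an independent switch, the steps can be combined in a single call (`-c -t -i -x`) or run one at a time.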
- Create Database and Tables:
python3 src/vcf2db.py -d myvariantdb -u postgres -l localhost -c -t -v data/
- Process VCF Files:
python3 src/vcf2db.py -d myvariantdb -u postgres -l localhost -i -v data/
- Create Indexes (after data loading):
python3 src/vcf2db.py -d myvariantdb -u postgres -l localhost -x
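The effect of the indexing step can be illustrated with a small experiment. The sketch below again uses sqlite3 as a stand-in so it runs anywhere; PostgreSQL's planner and its EXPLAIN output look different, and the table and index names here are hypothetical. It compares the query plan before and after creating an index, then shows that rows inserted later are still found through it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE variant (chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, gene TEXT)"
)
# Load data first, index afterwards (the order used in the examples above).
conn.executemany(
    "INSERT INTO variant VALUES (?, ?, ?, ?, ?)",
    [("chr1", pos, "A", "G", f"GENE{pos % 50}") for pos in range(1, 5001)],
)

QUERY = "SELECT chrom, pos, ref, alt FROM variant WHERE gene = ?"


def query_plan(conn):
    """Return the planner's description of how QUERY would be executed."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY, ("GENE7",)).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


plan_before = query_plan(conn)  # full table scan
conn.execute("CREATE INDEX idx_variant_gene ON variant (gene)")
plan_after = query_plan(conn)   # lookup through idx_variant_gene

# A later insert needs no re-indexing: the database keeps the index current.
conn.execute("INSERT INTO variant VALUES ('chr2', 42, 'C', 'T', 'GENE7')")
hits = conn.execute(QUERY, ("GENE7",)).fetchall()

print(plan_before)
print(plan_after)
```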
varMT/
├── Dockerfile                      # Container definition
├── supervisord.conf                # Manages PostgreSQL + Streamlit inside the container
├── environment.txt                 # Pip dependencies (used by Docker build)
├── environment.yml                 # Conda environment spec (manual install)
├── docker/
│   ├── init-db.sh                  # First-run: initdb + create varmt DB and tables
│   ├── db_connection.yml           # In-container Streamlit DB config (localhost)
│   ├── ingest.sh                   # Wrapper: docker cp VCF + run vcf2db.py
│   └── indexing.sh                 # Wrapper: run vcf2db.py -x for indexing
├── src/
│   ├── vcf2db.py                   # Main CLI application
│   ├── vcf2db_cli.py               # Command line interface
│   ├── streamlit_app.py            # Streamlit web interface entry point
│   ├── pages/
│   │   └── 1_Advanced_Search.py    # Advanced gene search page
│   ├── queries/
│   │   └── variant_queries.py      # Centralized SQL queries
│   └── utils/
│       ├── db_utils.py             # Database utility functions
│       ├── streamlit_db.py         # DatabaseClient for Streamlit queries
│       ├── vep_utils.py            # VEP annotation parsing utilities
│       ├── csv_parser.py           # CSV/gene mapping file parser
│       └── setup_logging.py        # Logging configuration
├── config/
│   └── db_connection.yml           # Local Streamlit DB config (not tracked)
├── tests/                          # Pytest suite
├── res/
│   └── data/
│       └── subset_hg19.vcf         # Sample VCF file
└── docs/
    ├── database_documentation.md
    └── ERschema.png
- Port already in use when starting the container: another service is bound to the host port. Use a different port (e.g. -p 8503:8501).
- initdb: directory exists but is not empty: a previous container left partial data in the volume. Remove it with docker volume rm varmt-pgdata and start fresh.
- Streamlit shows "connection failed": ensure the container is running (docker ps) and that you're hitting the host port mapped to 8501.
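For the port conflict case, a small helper can suggest a free host port before starting the container. This is a sketch that is not part of VarMT; the function names are hypothetical. It only detects ports that accept TCP connections on localhost, so a port that is reserved but not listening would not be noticed.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


def first_free_port(start=8502, attempts=20):
    """Return the first port in [start, start + attempts) with no listener, or None."""
    for port in range(start, start + attempts):
        if not port_in_use(port):
            return port
    return None


if __name__ == "__main__":
    port = first_free_port()
    if port is None:
        print("No free port found in the scanned range")
    else:
        print(f"Try: docker run ... -p {port}:8501 varmt")
```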
The sample dataset provided (res/data/subset_hg19.vcf) is publicly available (source: 1000 Genomes Project).