VarMT - Genetic Variant Database Tool

A Python tool for parsing VCF (Variant Call Format) files and storing genetic variants in a PostgreSQL database with aggregated frequency data across different sample collections.

Overview

VarMT processes VCF files and stores genetic variants in a relational database schema designed for efficient querying of variant frequencies across populations and for gene-based analysis. The tool creates the database structure automatically and handles data insertion.

Features

Automated Database Setup: Creates PostgreSQL database and tables automatically
VCF Processing: Parses VCF files using pysam
Aggregated Storage: Stores variant frequencies across different sample collections
Conflict Handling: Uses PostgreSQL UPSERT operations to handle duplicate variants
Performance Optimization: Creates essential database indexes for fast querying
Web Interface: Streamlit app for querying variants by chromosome, gene, and allele frequency
Single-Container Deployment: Docker image bundling PostgreSQL and the Streamlit UI
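
To illustrate the conflict handling mentioned above, here is a minimal sketch of an UPSERT. It uses Python's built-in sqlite3 module as a stand-in so it runs without a server; SQLite (3.24 or later) accepts the same INSERT ... ON CONFLICT ... DO UPDATE form that PostgreSQL does. The table name, column names, and the choice to overwrite counts on conflict are assumptions made for illustration, not VarMT's actual schema or behavior (see docs/database_documentation.md for that).

```python
import sqlite3

# Hypothetical table: one row per (variant, collection) pair.
SCHEMA = """
CREATE TABLE variant_frequency (
    chrom TEXT NOT NULL,
    pos INTEGER NOT NULL,
    ref TEXT NOT NULL,
    alt TEXT NOT NULL,
    collection TEXT NOT NULL,
    allele_count INTEGER NOT NULL,
    allele_number INTEGER NOT NULL,
    PRIMARY KEY (chrom, pos, ref, alt, collection)
)
"""

# On a key collision, update the existing row instead of failing.
UPSERT = """
INSERT INTO variant_frequency
    (chrom, pos, ref, alt, collection, allele_count, allele_number)
VALUES (?, ?, ?, ?, ?, ?, ?)
ON CONFLICT (chrom, pos, ref, alt, collection) DO UPDATE SET
    allele_count = excluded.allele_count,
    allele_number = excluded.allele_number
"""


def upsert_variant(conn, row):
    """Insert a row, or overwrite its counts if the key already exists."""
    conn.execute(UPSERT, row)


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
upsert_variant(conn, ("chr1", 12345, "A", "G", "cohort_a", 3, 100))
# Same variant and collection again: the row is updated, no duplicate-key error.
upsert_variant(conn, ("chr1", 12345, "A", "G", "cohort_a", 5, 100))
rows = conn.execute(
    "SELECT allele_count, allele_number FROM variant_frequency"
).fetchall()
```

After both calls the table still holds a single row for the variant. With a PostgreSQL driver the statement has the same shape, but the placeholder style depends on the driver.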

Quick Start (Docker)

Docker is the recommended way to run VarMT: everything (PostgreSQL, the Streamlit UI, and the CLI) lives in one container.

1. Build the image

docker build -t varmt .

2. Start the container

docker run -d \
    --name varmt-app \
    -v varmt-pgdata:/data/postgresql \
    -p 8502:8501 \
    varmt

-v varmt-pgdata:/data/postgresql — named volume so the database survives container restarts
-p 8502:8501 — exposes the Streamlit UI on host port 8502 (change if needed)
On first start, the container initializes PostgreSQL and creates the varmt database and tables

3. Ingest a VCF file

./docker/ingest.sh /path/to/your.vcf

The script copies the VCF into the container and runs the ingestion. Each call adds a new collection — run it once per VCF file.
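
Because the script takes one VCF per call, ingesting a whole directory means calling it in a loop. The following is a small sketch of such a loop, assuming it is run from the repository root and that the files end in .vcf; the helper names build_ingest_commands and ingest_all are made up for this example and are not part of VarMT.

```python
import subprocess
from pathlib import Path


def build_ingest_commands(vcf_dir, script="./docker/ingest.sh"):
    """Return one ingest command per .vcf file in vcf_dir, sorted by name."""
    return [[script, str(path)] for path in sorted(Path(vcf_dir).glob("*.vcf"))]


def ingest_all(vcf_dir):
    """Run the ingest commands one after another, stopping at the first failure."""
    for cmd in build_ingest_commands(vcf_dir):
        subprocess.run(cmd, check=True)
```

Calling ingest_all("/path/to/vcfs") would then add one collection per file. Building the command list separately makes it easy to print and review the commands before anything is executed.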

4. Create indexes (one-time, after first ingestion)

./docker/indexing.sh

Creates the indexes that make queries fast on large datasets. Only needs to be run once; PostgreSQL maintains the indexes automatically for subsequent ingestions.

5. Open the web UI

Browse to http://localhost:8502 to query the variants. If running on a remote server, set up an SSH tunnel:

ssh -L 1455:localhost:8502 user@server

Then open http://localhost:1455 on your local machine.

Adding more collections

Just run ./docker/ingest.sh /path/to/another.vcf again — collections accumulate over time. No need to restart the container or recreate indexes.

Manual Installation

Use this only if you don't want to run Docker. Requires a PostgreSQL server you manage yourself.

Environment Setup

conda env create -f environment.yml
conda activate varMTenv

Database Authentication

Using a .pgpass file is recommended, because it keeps the password out of your shell history and process listings. You can also pass the password directly via -p.
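
For reference, each line of a .pgpass file has the form hostname:port:database:username:password, and on Unix-like systems PostgreSQL clients ignore the file unless its permissions are 0600 or stricter. The sketch below shows one way to write such an entry from Python. The function names are invented for this example, and the connection values are placeholders taken from the examples in this README (5432 is PostgreSQL's default port).

```python
import os
from pathlib import Path


def pgpass_line(host, port, database, username, password):
    """Format one .pgpass entry: hostname:port:database:username:password."""
    def escape(value):
        # Backslashes and colons inside a field must be backslash-escaped.
        return str(value).replace("\\", "\\\\").replace(":", "\\:")

    return ":".join(escape(v) for v in (host, port, database, username, password))


def append_pgpass(path, line):
    """Append an entry and restrict the file to owner read/write (0600)."""
    path = Path(path)
    with path.open("a", encoding="utf-8") as handle:
        handle.write(line + "\n")
    os.chmod(path, 0o600)
```

A typical call would be append_pgpass(Path.home() / ".pgpass", pgpass_line("localhost", 5432, "myvariantdb", "postgres", "your-password")).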

Running

With .pgpass (recommended):

python3 src/vcf2db.py -d database_name -u username -l host -c -t -i -x -v path/to/vcf

With password on command line:

python3 src/vcf2db.py -d database_name -u username -p password -l host -c -t -i -x -v path/to/vcf

Web Interface

Requires config/db_connection.yml with database connection parameters. Supports .pgpass authentication (omit password in config) or direct password specification.

streamlit run src/streamlit_app.py

CLI Reference

These arguments apply to both the Docker workflow (when calling vcf2db.py via docker exec) and manual installation.

Database Connection:

-d, --database: PostgreSQL database name (required)
-u, --username: PostgreSQL username (required)
-p, --password: PostgreSQL password (optional if using .pgpass)
-l, --host: PostgreSQL host name (required)

Database Operations:

-c, --create: Drop and recreate the database if it already exists (destroys existing data)
-t, --tables: Create the database tables
-i, --insert: Insert VCF data into tables
-x, --indexes: Create indexes on tables

Data Input:

-v, --vcf: Path to VCF file or directory containing VCF files
-r, --reference_genome: Reference genome version (optional, default=GRCh38)
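
As a rough picture of how these options fit together, the sketch below declares the same flags with argparse. It is a reconstruction from the list above, not the project's actual parser: the help strings are paraphrased, and treating -v as optional is an assumption based on the indexing example, which omits it.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented vcf2db.py options (illustrative)."""
    parser = argparse.ArgumentParser(prog="vcf2db.py")

    conn = parser.add_argument_group("Database Connection")
    conn.add_argument("-d", "--database", required=True, help="PostgreSQL database name")
    conn.add_argument("-u", "--username", required=True, help="PostgreSQL username")
    conn.add_argument("-p", "--password", default=None, help="omit when using .pgpass")
    conn.add_argument("-l", "--host", required=True, help="PostgreSQL host name")

    ops = parser.add_argument_group("Database Operations")
    ops.add_argument("-c", "--create", action="store_true", help="drop and recreate the database")
    ops.add_argument("-t", "--tables", action="store_true", help="create the tables")
    ops.add_argument("-i", "--insert", action="store_true", help="insert VCF data")
    ops.add_argument("-x", "--indexes", action="store_true", help="create indexes")

    data = parser.add_argument_group("Data Input")
    data.add_argument("-v", "--vcf", help="VCF file or directory of VCF files")
    data.add_argument("-r", "--reference_genome", default="GRCh38")
    return parser


# Parse the "Process VCF Files" example from the manual workflow.
args = build_parser().parse_args(
    ["-d", "myvariantdb", "-u", "postgres", "-l", "localhost", "-i", "-v", "data/"]
)
```

Each operation flag is an independent switch, which is why the examples in this README can combine -c, -t, -i, and -x in one call or spread them over several.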

Step-by-Step Workflow (Manual)

Create Database and Tables:

python3 src/vcf2db.py -d myvariantdb -u postgres -l localhost -c -t -v data/

Process VCF Files:

python3 src/vcf2db.py -d myvariantdb -u postgres -l localhost -i -v data/

Create Indexes (after data loading):

python3 src/vcf2db.py -d myvariantdb -u postgres -l localhost -x
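
The three steps above can also be driven from a short script, which stops as soon as one step fails. This is only a sketch: it reuses the example database name, user, and host from the commands above, assumes .pgpass authentication, and the helper names are hypothetical.

```python
import subprocess

# Connection arguments shared by every step (values from the examples above).
BASE = ["python3", "src/vcf2db.py", "-d", "myvariantdb", "-u", "postgres", "-l", "localhost"]


def workflow_commands(vcf_path):
    """Return the three manual-workflow commands in execution order."""
    return [
        BASE + ["-c", "-t", "-v", vcf_path],  # create database and tables
        BASE + ["-i", "-v", vcf_path],        # insert VCF data
        BASE + ["-x"],                        # create indexes after loading
    ]


def run_workflow(vcf_path):
    """Run each step in order; check=True raises on the first non-zero exit."""
    for cmd in workflow_commands(vcf_path):
        subprocess.run(cmd, check=True)
```

Note that the first command includes -c, so running run_workflow("data/") against an existing database drops it before recreating it.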

Project Structure

varMT/
├── Dockerfile                   # Container definition
├── supervisord.conf             # Manages PostgreSQL + Streamlit inside the container
├── environment.txt              # Pip dependencies (used by Docker build)
├── environment.yml              # Conda environment spec (manual install)
├── docker/
│   ├── init-db.sh               # First-run: initdb + create varmt DB and tables
│   ├── db_connection.yml        # In-container Streamlit DB config (localhost)
│   ├── ingest.sh                # Wrapper: docker cp VCF + run vcf2db.py
│   └── indexing.sh              # Wrapper: run vcf2db.py -x for indexing
├── src/
│   ├── vcf2db.py                # Main CLI application
│   ├── vcf2db_cli.py            # Command line interface
│   ├── streamlit_app.py         # Streamlit web interface entry point
│   ├── pages/
│   │   └── 1_Advanced_Search.py # Advanced gene search page
│   ├── queries/
│   │   └── variant_queries.py   # Centralized SQL queries
│   └── utils/
│       ├── db_utils.py          # Database utility functions
│       ├── streamlit_db.py      # DatabaseClient for Streamlit queries
│       ├── vep_utils.py         # VEP annotation parsing utilities
│       ├── csv_parser.py        # CSV/gene mapping file parser
│       └── setup_logging.py     # Logging configuration
├── config/
│   └── db_connection.yml        # Local Streamlit DB config (not tracked)
├── tests/                       # Pytest suite
├── res/
│   └── data/
│       └── subset_hg19.vcf      # Sample VCF file
└── docs/
    ├── database_documentation.md
    └── ERschema.png

Troubleshooting

Port already in use when starting the container: another service is bound to the host port. Use a different port (e.g. -p 8503:8501).
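initdb: directory exists but is not empty: a previous container left partial data in the volume. Remove the container first (docker rm -f varmt-app), since Docker refuses to remove a volume that is still in use, then run docker volume rm varmt-pgdata and start fresh. This deletes any data already ingested.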
Streamlit shows "connection failed": ensure the container is running (docker ps) and that you're hitting the host port mapped to 8501.
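
For the port conflict case, a quick way to find a usable host port before starting the container is to try binding to it. The helper below is a minimal sketch with a made-up name (is_port_free); it only reports whether the port is free at the moment of the check, so another process could still claim it afterwards.

```python
import socket


def is_port_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can currently bind to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use": something else holds the port.
            return False
    return True
```

For example, if is_port_free(8502) returns False, try is_port_free(8503) and pass the first free port to docker run via -p.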

Sample Data

The sample dataset provided (res/data/subset_hg19.vcf) is publicly available (source: 1000 Genomes Project).
