A comprehensive web-based genomic annotation platform for exploring and analyzing the genomes of Phocaeicola vulgatus and Bacteroides uniformis
- About
- Features
- Architecture
- Prerequisites
- Quick Start
- Installation
- Configuration
- Database Setup
- Data Import
- Development
- Deployment
- API Documentation
- License
The transversal theme aims to mechanistically understand the complex role that human-associated microbiomes play in human health and disease. Our current knowledge of bacterial gene functions comes primarily from very few model bacteria and fails to capture the genetic diversity within the gut microbiome. One of the goals of METT is to systematically tackle the vast genetic material in the gut microbiome and to establish new model microbes. The Flagship Project of METT has focused efforts on annotating the genomes of Phocaeicola vulgatus and Bacteroides uniformis, two of the most prevalent and abundant bacterial species of the human microbiome.
The current version is a web-based genomic annotation editing platform designed to browse the genomes of the type strains B. uniformis (ATCC8492) and P. vulgatus (ATCC8482). The annotation data generated by METT is organised in an FTP directory hosted at EBI and contains structural annotations (such as Prokka and Mobilome predictions) as well as functional annotations (including biosynthetic gene clusters and carbohydrate-active enzymes).
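The raw annotation files can be inspected directly from the FTP site; for example, the release directory can be listed with curl (a quick check, assuming anonymous FTP access and the annotation path used in the FTP configuration further below):
# List the v1_2024-04-15 annotation release on the EBI FTP server
curl -s "ftp://ftp.ebi.ac.uk/pub/databases/mett/annotations/v1_2024-04-15/"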
Type Strains:
- B. uniformis (ATCC8492)
- P. vulgatus (ATCC8482)
- Comprehensive Genome Browser: Interactive visualization of genome annotations
- Multi-strain Analysis: Compare annotations across multiple bacterial strains
- Functional Annotations:
  - Gene essentiality data
  - Proteomics evidence
  - Metabolic pathway information
  - Gene-reaction associations
- Experimental Data Integration:
  - Fitness correlation analysis
  - Mutant growth data
  - Thermal proteome profiling (TPP)
  - Protein-protein interaction networks
- Advanced Search: Elasticsearch-powered search across all annotations
- Sequence Analysis: PyHMMER integration for homology searches
- RESTful API: Comprehensive API for programmatic access
- Docker & Kubernetes Ready: Production-ready containerization
- Real-time Updates: Celery-based async task processing
┌─────────────────────────────────────────────────────────────┐
│ METT Data Portal │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────────────────┐ │
│ │ React UI │ ◄─────► │ Django REST API │ │
│ │ (Frontend) │ │ (Backend) │ │
│ └──────────────┘ └──────────────────────────┘ │
│ │ │
│ ┌────────┴────────┐ │
│ │ │ │
│ ┌────────▼────────┐ ┌────▼─────────┐ │
│ │ PostgreSQL │ │ Elasticsearch│ │
│ │ (Relational) │ │ (Search) │ │
│ └─────────────────┘ └──────────────┘ │
│ │
│ ┌─────────────────┐ │
│ │ Celery │ │
│ │ (Task Queue) │ │
│ └─────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Technology Stack:
- Backend: Django 5.x, Django Ninja (API), Celery
- Frontend: React, TypeScript, SCSS
- Database: PostgreSQL 13+, Elasticsearch 8+
- Task Queue: Celery with Redis/RabbitMQ
- Containerization: Docker, Kubernetes
- Package Management: uv (Python), npm (JavaScript)
- Python: 3.13+ (Download)
- Node.js: 18+ (Download)
- PostgreSQL: 13+ (Download)
- Elasticsearch: 8+ (Download)
- Docker: (Optional, for containerized deployment) (Download)
- uv: Python package manager (Install)
- RAM: 16GB minimum, 32GB recommended
- Storage: 20GB+ for full dataset
- OS: Linux, macOS, or Windows (WSL2)
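A quick sanity check that the toolchain meets these requirements (a minimal sketch; it assumes the tools are on PATH and Elasticsearch is already running on its default port):
python --version       # expect 3.13+
node --version         # expect 18+
psql --version         # expect 13+
uv --version
docker --version       # optional
curl -s http://localhost:9200 | grep number   # Elasticsearch version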
# 1. Clone the repository
git clone <repository-url>
cd mett-dataportal
# 2. Set up Python environment
cd dataportal_api
uv pip install -r uv.lock
python manage.py migrate
# 3. Set up Elasticsearch
python manage.py create_es_index
# 4. Set up Frontend
cd ../dataportal-app
npm install
# 5. Run development servers
# 5a. Setup auth mechanism (temporary)
cd dataportal_api
python manage.py migrate
python manage.py seed_roles
# Terminal 1 - Backend
cd dataportal_api
python manage.py runserver
# Terminal 2 - Frontend
cd dataportal-app
npm start
cd dataportal_api
# Generate lock file (if needed)
uv lock
# Install uv (uv will respect the active conda env.)
pip install uv
# Sync development dependencies from the existing lock
uv sync
# Install dependencies (production)
uv pip install -r uv.lock --no-dev
# Install pre-commit hooks
pre-commit install
Create a .env file in dataportal_api/:
# Database
DATABASE_URL=postgresql://user:password@localhost:5432/mett_db
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_DB=mett_db
POSTGRES_USER=user
POSTGRES_PASSWORD=password
# Elasticsearch
ELASTICSEARCH_HOST=localhost
ELASTICSEARCH_PORT=9200
ELASTICSEARCH_SCHEME=http
ES_INDEX_VERSION=2025.09.03
# Django
SECRET_KEY=your-secret-key-here
DEBUG=True
ALLOWED_HOSTS=localhost,127.0.0.1
# Celery
CELERY_BROKER_URL=redis://localhost:6379/0
CELERY_RESULT_BACKEND=redis://localhost:6379/0
# CORS
CORS_ALLOWED_ORIGINS=http://localhost:3000
# FTP Configuration
FTP_SERVER=ftp.ebi.ac.uk
FTP_BASE_PATH=/pub/databases/mett/annotations/v1_2024-04-15
Alternatively, use the provided environment setup scripts:
# Development
source set-env-dev.sh
# Production
source set-env-prod.sh
cd dataportal-app
# Install dependencies
npm install
# Development server
npm start
# Production build
npm run build
# Backend
docker build -t mett-dataportal:latest -f dataportal_api/Dockerfile .
# Frontend
docker build -t mett-dataportal-app:latest -f dataportal-app/Dockerfile .
# Backend
docker run --rm -it -p 8000:8000 mett-dataportal:latest
# Frontend
docker run --rm -it -p 3000:80 mett-dataportal-app:latest
# Start all services
docker-compose up -d
# Stop all services
docker-compose down
The project uses Pydantic for configuration management. Configuration files can be found in dataportal_api/dataportal/settings.py.
- Development: config/local.env
- Production: environment variables should be set via Kubernetes secrets or the Docker environment
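To see which settings Django actually resolves once the environment is loaded, Django's built-in diffsettings command is a convenient check (run from dataportal_api with the appropriate .env or set-env script sourced):
cd dataportal_api
python manage.py diffsettings   # lists settings that differ from Django defaults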
cd dataportal_api
# Run all migrations
python manage.py migrate
# PyHMMER specific migrations
python manage.py migrate pyhmmer_search
# Celery Beat migrations (for scheduled tasks)
python manage.py migrate django_celery_beat
# Create all indices with default version
$ python manage.py create_es_index
# Create indices with specific version
$ python manage.py create_es_index --es-version v5
# Create specific model index
$ python manage.py create_es_index --model GeneFitnessCorrelationDocument --es-version 2025.09.03
# Recreate indices (delete and create)
$ python manage.py create_es_index --es-version 2025.09.03 --if-exists recreate
Indices use the pattern {index_name}_{version} (e.g., feature_index_2025.09.03).
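To confirm that the versioned indices were created, Elasticsearch's cat indices API can be queried directly (assuming Elasticsearch is reachable on localhost:9200 and the version used above):
# List all indices carrying the 2025.09.03 version suffix
curl -s "http://localhost:9200/_cat/indices/*2025.09.03?v"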
python manage.py import_species \
--index species_index \
--csv ../data-generators/data/species.csv
Basic Strains (Contigs Only):
$ python manage.py import_strains \
--es-index strain_index \
--map-tsv ../data-generators/data/gff-assembly-prefixes.tsv \
--ftp-server ftp.ebi.ac.uk \
--ftp-directory /pub/databases/mett/all_hd_isolates/deduplicated_assemblies/ \
--set-type-strains BU_ATCC8492 PV_ATCC8482 \
--gff-server ftp.ebi.ac.uk \
--gff-base /pub/databases/mett/annotations/v1_2024-04-15/
Complete Import (Strains + Drug Data):
$ python manage.py import_strains \
--es-index strain_index \
--map-tsv ../data-generators/data/gff-assembly-prefixes.tsv \
--ftp-server ftp.ebi.ac.uk \
--ftp-directory /pub/databases/mett/all_hd_isolates/deduplicated_assemblies/ \
--set-type-strains BU_ATCC8492 PV_ATCC8482 \
--gff-server ftp.ebi.ac.uk \
--gff-base /pub/databases/mett/annotations/v1_2024-04-15/ \
--include-mic \
--mic-bu-file ../data-generators/Sub-Projects-Data/SP5/BU_growth_inhibition.csv \
--mic-pv-file ../data-generators/Sub-Projects-Data/SP5/PV_growth_inhibition.csv \
--include-metabolism \
--metab-bu-file ../data-generators/Sub-Projects-Data/SP5/SP5_drug_metabolism_BU_v0.csv \
--metab-pv-file ../data-generators/Sub-Projects-Data/SP5/SP5_drug_metabolism_PV_v0.csv
Incremental Updates:
Add Drug MIC data only:
$ python manage.py import_strains \
--es-index strain_index \
--skip-strains \
--include-mic \
--mic-bu-file ../data-generators/Sub-Projects-Data/SP5/BU_growth_inhibition.csv \
--mic-pv-file ../data-generators/Sub-Projects-Data/SP5/PV_growth_inhibition.csv
Add Drug Metabolism data only:
$ python manage.py import_strains \
--es-index strain_index \
--skip-strains \
--include-metabolism \
--metab-bu-file ../data-generators/Sub-Projects-Data/SP5/SP5_drug_metabolism_BU_v0.csv \
--metab-pv-file ../data-generators/Sub-Projects-Data/SP5/SP5_drug_metabolism_PV_v0.csv
$ python manage.py import_features \
--index feature_index \
--ftp-server ftp.ebi.ac.uk \
--ftp-root /pub/databases/mett/annotations/v1_2024-04-15 \
--mapping-task-file ../data-generators/data/gff-assembly-prefixes.tsv \
--essentiality-dir ../data-generators/Sub-Projects-Data/SP1/essentiality/ \
--proteomics-dir ../data-generators/Sub-Projects-Data/proteomics_evidence/
Essentiality Data (if skipped in the step above):
$ python manage.py import_features \
--index feature_index \
--ftp-server ftp.ebi.ac.uk \
--ftp-root /pub/databases/mett/annotations/v1_2024-04-15 \
--skip-core-genes \
--essentiality-dir ../data-generators/Sub-Projects-Data/SP1/essentiality/
Proteomics Evidence (if skipped in the step above):
$ python manage.py import_features \
--index feature_index \
--ftp-server ftp.ebi.ac.uk \
--ftp-root /pub/databases/mett/annotations/v1_2024-04-15 \
--skip-core-genes \
--proteomics-dir ../data-generators/Sub-Projects-Data/proteomics_evidence/
Metabolic Gene-Reaction Data:
$ python manage.py import_features \
--index feature_index \
--ftp-server ftp.ebi.ac.uk \
--ftp-root /pub/databases/mett/annotations/v1_2024-04-15 \
--skip-core-genes \
--gene-rx-dir ../data-generators/Sub-Projects-Data/SP3/GEMs/gene_rx/ \
--met-rx-dir ../data-generators/Sub-Projects-Data/SP3/GEMs/met_rx/ \
--rx-gpr-dir ../data-generators/Sub-Projects-Data/SP3/GEMs/gpr/
STRING DBXREF Data:
Single file:
python manage.py import_dbxref \
--index feature_index \
--tsv ../data-generators/stringdb-mapper/output/bu_to_string_raw.tsv \
--db-name STRING
Directory with multiple files:
python manage.py import_dbxref \
--index feature_index \
--tsv-dir ../data-generators/stringdb-mapper/output \
--db-name STRING
$ python manage.py import_fitness_lfc \
--index feature_index \
--fitness-dir ../data-generators/Sub-Projects-Data/SP1/Fitness_data
$ python manage.py import_mutant_growth \
--index feature_index \
--mutant-growth-dir ../data-generators/Sub-Projects-Data/SP3/Pvul_caecal
$ python manage.py ingest_pooled_ttp \
--index feature_index \
--csv-file ../data-generators/Sub-Projects-Data/SP2/pooled_TPP.csv \
--pool-metadata ../data-generators/Sub-Projects-Data/SP2/pool_metadata.csv
$ python manage.py import_fitness_correlations \
--index fitness_correlation_index \
--correlation-dir ../data-generators/Sub-Projects-Data/SP1/Fitness_corr_data \
--preload-gff \
--ftp-server ftp.ebi.ac.uk \
--ftp-directory /pub/databases/mett/annotations/v1_2024-04-15/
# Basic import
$ python manage.py import_ppi_with_genes \
--index ppi_index \
--pattern "*.csv" \
--csv-folder ../data-generators/Sub-Projects-Data/SP2/
# With refresh optimization (recommended for large datasets)
$ python manage.py import_ppi_with_genes \
--index ppi_index \
--pattern "*.csv" \
--csv-folder ../data-generators/Sub-Projects-Data/SP2/ \
--refresh-every-rows 500000
# Alternative: --refresh-every-secs 120
$ python manage.py import_operons \
--index operon_index \
--operons-dir ../data-generators/Sub-Projects-Data/SP3/Operons/ \
--preload-gff \
--ftp-server ftp.ebi.ac.uk \
--ftp-directory /pub/databases/mett/annotations/v1_2024-04-15/
$ python manage.py import_orthologs_with_genes \
--index ortholog_index \
--ortholog-directory ../data-generators/Sub-Projects-Data/SP3/Orthologs/PairwiseOrthologs/ \
--ftp-server ftp.ebi.ac.uk \
--ftp-directory /pub/databases/mett/annotations/v1_2024-04-15/
Generate index files for FASTA and GFF3 files:
cd data-generators/index-scripts
# Process FASTA files
./process_fasta.sh
# Process GFF3 files
./process_gff3.sh
See the Index Scripts README for details.
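The internals of those scripts are described in that README; purely as an illustration of what FASTA/GFF3 indexing for a genome browser typically involves (hypothetical file names; assumes samtools, bgzip, and tabix are installed, and is not necessarily what the scripts themselves do):
# FASTA: create a .fai index
samtools faidx assembly.fa
# GFF3: sort, bgzip-compress, then index with tabix
sort -k1,1 -k4,4n annotations.gff3 | bgzip > annotations.gff3.gz
tabix -p gff annotations.gff3.gz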
This project uses automated code formatting and linting:
Pre-commit hooks are configured to run automatically:
# Install hooks
pre-commit install
# Run manually on all files
pre-commit run --all-files
# Format code with Black
black .
# Lint and fix with Ruff
ruff check --fix .
cd dataportal_api
# Run all tests
pytest
# Run specific test directory
pytest dataportal/tests/services/ -v
# Run with coverage
pytest --cov=dataportal --cov-report=html
pytest dataportal/tests/services/ --cov=dataportal.services --cov-report=html
# Run specific test modules
pytest dataportal/tests/test_api.py -v
# Run PyHMMER integration tests
python pyhmmer_search/run_integration_tests.py
cd dataportal-app
# Run tests
npm test
# Run tests with coverage
npm test -- --coverage
# Run in watch mode
npm test -- --watch
# 1. Start Redis (for Celery)
docker run -d -p 6379:6379 redis:7
# 2. Start Django development server
cd dataportal_api
python manage.py runserver
# 3. Start Celery worker (in another terminal)
cd dataportal_api
celery -A dataportal worker -l info
# 4. Start Celery beat (for scheduled tasks, in another terminal)
cd dataportal_api
celery -A dataportal beat -l info
# 5. Start React development server (in another terminal)
cd dataportal-app
npm run dev
# Label nodes for PyHMMER data affinity
kubectl label node <node-name> mett-pyhmmer-data=true
# Verify labels
kubectl get nodes -l mett-pyhmmer-data=true
# Deploy PostgreSQL
kubectl apply -f k8s/postgres/production/
# Deploy Elasticsearch
kubectl apply -f k8s/elasticsearch/production/
# Deploy Application
kubectl apply -f k8s/mett-app/overlays/production/
Kubernetes configurations are organized by environment:
- Development: k8s/*/dev/
- Production: k8s/*/production/
See individual README files in each directory for details.
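Once the manifests are applied, the rollout can be checked with standard kubectl commands (the namespace and deployment names below are placeholders; use the names defined in your overlays):
# Check pod status in the target namespace
kubectl get pods -n <namespace>
# Wait for a deployment to finish rolling out
kubectl rollout status deployment/<deployment-name> -n <namespace>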
The API is built with Django Ninja and provides comprehensive endpoints for:
- Species: /api/species/
- Strains: /api/strains/
- Features/Genes: /api/features/
- Protein-Protein Interactions: /api/ppi/
- Operons: /api/operons/
- Orthologs: /api/orthologs/
- Fitness Data: /api/fitness/
- PyHMMER Search: /api/pyhmmer/
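As a quick smoke test against the development server (the exact response schema depends on the deployed API version; this just assumes the defaults used elsewhere in this README):
# List species via the REST API
curl -s "http://localhost:8000/api/species/"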
When running the development server:
- Swagger UI (Dev): http://localhost:8000/api/docs
- Swagger UI (Production): http://www.gut-microbes.org/api/docs
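Django Ninja also serves the raw OpenAPI schema alongside the Swagger UI; by default it is exposed as openapi.json under the API root, though the exact path depends on how the API is configured:
curl -s "http://localhost:8000/api/openapi.json"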
- All API endpoints accept an optional format query parameter.
- Supported values: json (default) and tsv.
- When format=tsv, only the data payload is serialized as tab-separated text and returned as text/tab-separated-values.
- Example: GET /api/genomes?species=BU&format=tsv
- Endpoints that already stream binary/TSV payloads ignore this parameter and keep their existing behavior.
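For example, the documented TSV request can be issued with curl against the local development server and saved to a file:
curl -s "http://localhost:8000/api/genomes?species=BU&format=tsv" -o genomes_BU.tsv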
This project is part of the METT (Microbiome Engineering Transversal Theme) initiative.
For licensing information, please contact the project maintainers.