
METT Data Portal

A comprehensive web-based genomic annotation platform for exploring and analyzing the genomes of Phocaeicola vulgatus and Bacteroides uniformis.



About

The METT transversal theme aims to mechanistically understand the complex role that human-associated microbiomes play in human health and disease. Our current knowledge of bacterial gene function comes primarily from a very small number of model bacteria and fails to capture the genetic diversity within the gut microbiome. The goals of METT are to systematically tackle the vast uncharacterized genetic material in the gut microbiome and to establish new model microbes. The METT Flagship Project has focused its efforts on annotating the genomes of Phocaeicola vulgatus and Bacteroides uniformis, two of the most prevalent and abundant bacterial species of the human microbiome.

The current version is a web-based genomic annotation platform designed to browse the genomes of the type strains B. uniformis (ATCC8492) and P. vulgatus (ATCC8482). The annotation data generated by METT is organised in an FTP directory hosted at EBI and contains structural annotations (such as Prokka and mobilome predictions) as well as functional annotations (including biosynthetic gene clusters and carbohydrate-active enzymes).
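
The release directory can be browsed directly on the EBI FTP server. As a quick sketch (using the v1 annotation path referenced in the FTP configuration later in this README):

# List the contents of the v1 annotation release
curl -s "ftp://ftp.ebi.ac.uk/pub/databases/mett/annotations/v1_2024-04-15/"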

Type Strains:

  • B. uniformis (ATCC8492)
  • P. vulgatus (ATCC8482)

Features

  • Comprehensive Genome Browser: Interactive visualization of genome annotations
  • Multi-strain Analysis: Compare annotations across multiple bacterial strains
  • Functional Annotations:
    • Gene essentiality data
    • Proteomics evidence
    • Metabolic pathway information
    • Gene-reaction associations
  • Experimental Data Integration:
    • Fitness correlation analysis
    • Mutant growth data
    • Thermal proteome profiling (TPP)
    • Protein-protein interaction networks
  • Advanced Search: Elasticsearch-powered search across all annotations
  • Sequence Analysis: PyHMMER integration for homology searches
  • RESTful API: Comprehensive API for programmatic access
  • Docker & Kubernetes Ready: Production-ready containerization
  • Real-time Updates: Celery-based async task processing

Architecture

┌─────────────────────────────────────────────────────────────┐
│                     METT Data Portal                        │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌──────────────┐         ┌──────────────────────────┐      │
│  │   React UI   │ ◄─────► │   Django REST API        │      │
│  │  (Frontend)  │         │   (Backend)              │      │
│  └──────────────┘         └──────────────────────────┘      │
│                                    │                        │
│                           ┌────────┴────────┐               │
│                           │                 │               │
│                  ┌────────▼────────┐  ┌────▼─────────┐      │
│                  │   PostgreSQL    │  │ Elasticsearch│      │
│                  │   (Relational)  │  │   (Search)   │      │
│                  └─────────────────┘  └──────────────┘      │
│                                                             │
│                  ┌─────────────────┐                        │
│                  │     Celery      │                        │
│                  │  (Task Queue)   │                        │
│                  └─────────────────┘                        │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Technology Stack:

  • Backend: Django 5.x, Django Ninja (API), Celery
  • Frontend: React, TypeScript, SCSS
  • Database: PostgreSQL 13+, Elasticsearch 8+
  • Task Queue: Celery with Redis/RabbitMQ
  • Containerization: Docker, Kubernetes
  • Package Management: uv (Python), npm (JavaScript)

Prerequisites

Required Software

  • Python 3.13
  • Node.js and npm
  • PostgreSQL 13+
  • Elasticsearch 8+
  • Redis (or RabbitMQ) for Celery
  • uv (Python package management)
  • Docker (optional, for containerized setups)

System Requirements

  • RAM: 16GB minimum, 32GB recommended
  • Storage: 20GB+ for full dataset
  • OS: Linux, macOS, or Windows (WSL2)

Quick Start

Development Setup (Minimal)

# 1. Clone the repository
git clone <repository-url>
cd mett-dataportal

# 2. Set up Python environment (install uv first: pip install uv)
cd dataportal_api
uv pip install -r uv.lock
python manage.py migrate

# 3. Set up Elasticsearch
python manage.py create_es_index

# 4. Set up Frontend
cd ../dataportal-app
npm install

# 5. Run development servers

# 5a. Set up the auth mechanism (temporary)
cd dataportal_api
python manage.py migrate
python manage.py seed_roles

# Terminal 1 - Backend
cd dataportal_api
python manage.py runserver

# Terminal 2 - Frontend
cd dataportal-app
npm start
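
Once both servers are running, a quick smoke test (assuming the default ports used above) is to check that each responds:

# Backend: Swagger UI should respond at /api/docs
curl -I http://localhost:8000/api/docs

# Frontend dev server
curl -I http://localhost:3000/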

Installation

Backend Setup

1. Install Python Dependencies

cd dataportal_api

# Install uv (uv will respect the active conda env)
pip install uv

# Generate lock file (if needed)
uv lock

# Sync dependencies from the existing lock (development)
uv sync

# Install dependencies (production)
uv pip install -r uv.lock --no-dev

# Install pre-commit hooks
pre-commit install

2. Environment Variables

Create a .env file in dataportal_api/:

# Database
DATABASE_URL=postgresql://user:password@localhost:5432/mett_db
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_DB=mett_db
POSTGRES_USER=user
POSTGRES_PASSWORD=password

# Elasticsearch
ELASTICSEARCH_HOST=localhost
ELASTICSEARCH_PORT=9200
ELASTICSEARCH_SCHEME=http
ES_INDEX_VERSION=2025.09.03

# Django
SECRET_KEY=your-secret-key-here
DEBUG=True
ALLOWED_HOSTS=localhost,127.0.0.1

# Celery
CELERY_BROKER_URL=redis://localhost:6379/0
CELERY_RESULT_BACKEND=redis://localhost:6379/0

# CORS
CORS_ALLOWED_ORIGINS=http://localhost:3000

# FTP Configuration
FTP_SERVER=ftp.ebi.ac.uk
FTP_BASE_PATH=/pub/databases/mett/annotations/v1_2024-04-15

Alternatively, use the provided environment setup scripts:

# Development
source set-env-dev.sh

# Production
source set-env-prod.sh
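
These scripts export the same variables as the .env example above. A hypothetical excerpt (see the actual scripts in the repository for the real values):

# Illustrative only -- mirrors the .env example above
export DEBUG=True
export POSTGRES_HOST=localhost
export ELASTICSEARCH_HOST=localhost
export CELERY_BROKER_URL=redis://localhost:6379/0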

Frontend Setup

cd dataportal-app

# Install dependencies
npm install

# Development server
npm start

# Production build
npm run build

Docker Setup

Build Docker Image

# Backend
docker build -t mett-dataportal:latest -f dataportal_api/Dockerfile .

# Frontend
docker build -t mett-dataportal-app:latest -f dataportal-app/Dockerfile .

Run with Docker

# Backend
docker run --rm -it -p 8000:8000 mett-dataportal:latest

# Frontend
docker run --rm -it -p 3000:80 mett-dataportal-app:latest

Docker Compose (Recommended)

# Start all services
docker-compose up -d

# Stop all services
docker-compose down

Configuration

Pydantic Configuration

The project uses Pydantic for configuration management. The main settings module is dataportal_api/dataportal/settings.py.

Environment Files

  • Development: config/local.env
  • Production: Environment variables should be set via Kubernetes secrets or Docker environment

Database Setup

PostgreSQL Migrations

cd dataportal_api

# Run all migrations
python manage.py migrate

# PyHMMER specific migrations
python manage.py migrate pyhmmer_search

# Celery Beat migrations (for scheduled tasks)
python manage.py migrate django_celery_beat

Elasticsearch Indices

Create Indices

# Create all indices with default version
$ python manage.py create_es_index

# Create indices with specific version
$ python manage.py create_es_index --es-version v5

# Create specific model index
$ python manage.py create_es_index --model GeneFitnessCorrelationDocument --es-version 2025.09.03

# Recreate indices (delete and create)
$ python manage.py create_es_index --es-version 2025.09.03 --if-exists recreate

Index Naming Convention

Indices use the pattern: {index_name}_{version} (e.g., feature_index_2025.09.03)
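
To check which versions of an index exist, Elasticsearch can be queried directly (assuming the host and port from the environment configuration above):

# List all versions of the feature index
curl -s "http://localhost:9200/_cat/indices/feature_index_*?v"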


Data Import

Core Data Import

1. Species Data

python manage.py import_species \
  --index species_index \
  --csv ../data-generators/data/species.csv

2. Strain Data

Basic Strains (Contigs Only):

$ python manage.py import_strains \
  --es-index strain_index \
  --map-tsv ../data-generators/data/gff-assembly-prefixes.tsv \
  --ftp-server ftp.ebi.ac.uk \
  --ftp-directory /pub/databases/mett/all_hd_isolates/deduplicated_assemblies/ \
  --set-type-strains BU_ATCC8492 PV_ATCC8482 \
  --gff-server ftp.ebi.ac.uk \
  --gff-base /pub/databases/mett/annotations/v1_2024-04-15/

Complete Import (Strains + Drug Data):

$ python manage.py import_strains \
  --es-index strain_index \
  --map-tsv ../data-generators/data/gff-assembly-prefixes.tsv \
  --ftp-server ftp.ebi.ac.uk \
  --ftp-directory /pub/databases/mett/all_hd_isolates/deduplicated_assemblies/ \
  --set-type-strains BU_ATCC8492 PV_ATCC8482 \
  --gff-server ftp.ebi.ac.uk \
  --gff-base /pub/databases/mett/annotations/v1_2024-04-15/ \
  --include-mic \
  --mic-bu-file ../data-generators/Sub-Projects-Data/SP5/BU_growth_inhibition.csv \
  --mic-pv-file ../data-generators/Sub-Projects-Data/SP5/PV_growth_inhibition.csv \
  --include-metabolism \
  --metab-bu-file ../data-generators/Sub-Projects-Data/SP5/SP5_drug_metabolism_BU_v0.csv \
  --metab-pv-file ../data-generators/Sub-Projects-Data/SP5/SP5_drug_metabolism_PV_v0.csv

Incremental Updates:

Add Drug MIC data only:

$ python manage.py import_strains \
  --es-index strain_index \
  --skip-strains \
  --include-mic \
  --mic-bu-file ../data-generators/Sub-Projects-Data/SP5/BU_growth_inhibition.csv \
  --mic-pv-file ../data-generators/Sub-Projects-Data/SP5/PV_growth_inhibition.csv

Add Drug Metabolism data only:

$ python manage.py import_strains \
  --es-index strain_index \
  --skip-strains \
  --include-metabolism \
  --metab-bu-file ../data-generators/Sub-Projects-Data/SP5/SP5_drug_metabolism_BU_v0.csv \
  --metab-pv-file ../data-generators/Sub-Projects-Data/SP5/SP5_drug_metabolism_PV_v0.csv

Feature Annotations

Core Gene Features Import

$ python manage.py import_features \
  --index feature_index \
  --ftp-server ftp.ebi.ac.uk \
  --ftp-root /pub/databases/mett/annotations/v1_2024-04-15 \
  --mapping-task-file ../data-generators/data/gff-assembly-prefixes.tsv \
  --essentiality-dir ../data-generators/Sub-Projects-Data/SP1/essentiality/ \
  --proteomics-dir ../data-generators/Sub-Projects-Data/proteomics_evidence/

Incremental Feature Updates

Essentiality Data (if skipped in the step above):

$ python manage.py import_features \
  --index feature_index \
  --ftp-server ftp.ebi.ac.uk \
  --ftp-root /pub/databases/mett/annotations/v1_2024-04-15 \
  --skip-core-genes \
  --essentiality-dir ../data-generators/Sub-Projects-Data/SP1/essentiality/

Proteomics Evidence (if skipped in the step above):

$ python manage.py import_features \
  --index feature_index \
  --ftp-server ftp.ebi.ac.uk \
  --ftp-root /pub/databases/mett/annotations/v1_2024-04-15 \
  --skip-core-genes \
  --proteomics-dir ../data-generators/Sub-Projects-Data/proteomics_evidence/

Metabolic Gene-Reaction Data:

$ python manage.py import_features \
  --index feature_index \
  --ftp-server ftp.ebi.ac.uk \
  --ftp-root /pub/databases/mett/annotations/v1_2024-04-15 \
  --skip-core-genes \
  --gene-rx-dir ../data-generators/Sub-Projects-Data/SP3/GEMs/gene_rx/ \
  --met-rx-dir ../data-generators/Sub-Projects-Data/SP3/GEMs/met_rx/ \
  --rx-gpr-dir ../data-generators/Sub-Projects-Data/SP3/GEMs/gpr/

STRING DBXREF Data

Single file:

python manage.py import_dbxref \
    --index feature_index \
    --tsv ../data-generators/stringdb-mapper/output/bu_to_string_raw.tsv \
    --db-name STRING

Directory with multiple files:

python manage.py import_dbxref \
    --index feature_index \
    --tsv-dir ../data-generators/stringdb-mapper/output \
    --db-name STRING

Experimental Data

Fitness Data

$ python manage.py import_fitness_lfc \
  --index feature_index \
  --fitness-dir ../data-generators/Sub-Projects-Data/SP1/Fitness_data

Mutant Growth Data

$ python manage.py import_mutant_growth \
  --index feature_index \
  --mutant-growth-dir ../data-generators/Sub-Projects-Data/SP3/Pvul_caecal

Thermal Proteome Profiling (TPP)

$ python manage.py ingest_pooled_ttp \
  --index feature_index \
  --csv-file ../data-generators/Sub-Projects-Data/SP2/pooled_TPP.csv \
  --pool-metadata ../data-generators/Sub-Projects-Data/SP2/pool_metadata.csv

Fitness Correlation Data

$ python manage.py import_fitness_correlations \
  --index fitness_correlation_index \
  --correlation-dir ../data-generators/Sub-Projects-Data/SP1/Fitness_corr_data \
  --preload-gff \
  --ftp-server ftp.ebi.ac.uk \
  --ftp-directory /pub/databases/mett/annotations/v1_2024-04-15/

Network Data

Protein-Protein Interactions (PPI)

# Basic import
$ python manage.py import_ppi_with_genes \
  --index ppi_index \
  --pattern "*.csv" \
  --csv-folder ../data-generators/Sub-Projects-Data/SP2/

# With refresh optimization (recommended for large datasets)
$ python manage.py import_ppi_with_genes \
  --index ppi_index \
  --pattern "*.csv" \
  --csv-folder ../data-generators/Sub-Projects-Data/SP2/ \
  --refresh-every-rows 500000
  # Alternative: --refresh-every-secs 120

Operons

$ python manage.py import_operons \
  --index operon_index \
  --operons-dir ../data-generators/Sub-Projects-Data/SP3/Operons/ \
  --preload-gff \
  --ftp-server ftp.ebi.ac.uk \
  --ftp-directory /pub/databases/mett/annotations/v1_2024-04-15/

Ortholog Pairs

$ python manage.py import_orthologs_with_genes \
  --index ortholog_index \
  --ortholog-directory ../data-generators/Sub-Projects-Data/SP3/Orthologs/PairwiseOrthologs/ \
  --ftp-server ftp.ebi.ac.uk \
  --ftp-directory /pub/databases/mett/annotations/v1_2024-04-15/

Index File Generation

Generate index files for FASTA and GFF3 files:

cd data-generators/index-scripts

# Process FASTA files
./process_fasta.sh

# Process GFF3 files
./process_gff3.sh
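
These scripts wrap standard indexing steps; a minimal sketch of what that typically involves, assuming samtools, bgzip, and tabix are installed (the actual scripts may differ):

# FASTA: create a .fai index for random access
samtools faidx assembly.fa

# GFF3: sort, compress with bgzip, and index with tabix
sort -k1,1 -k4,4n annotations.gff3 | bgzip > annotations.gff3.gz
tabix -p gff annotations.gff3.gz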

See the README in data-generators/index-scripts/ for details.


Development

Code Style

This project uses automated code formatting and linting:

  • Black: Python code formatting
  • Ruff: Python linting

Pre-commit Hooks

Pre-commit hooks are configured to run automatically:

# Install hooks
pre-commit install

# Run manually on all files
pre-commit run --all-files

Manual Formatting

# Format code with Black
black .

# Lint and fix with Ruff
ruff check --fix .

Testing

Backend Tests

cd dataportal_api

# Run all tests
pytest

# Run specific test directory
pytest dataportal/tests/services/ -v

# Run with coverage
pytest --cov=dataportal --cov-report=html
pytest dataportal/tests/services/ --cov=dataportal.services --cov-report=html

# Run specific test modules
pytest dataportal/tests/test_api.py -v

# Run PyHMMER integration tests
python pyhmmer_search/run_integration_tests.py

Frontend Tests

cd dataportal-app

# Run tests
npm test

# Run tests with coverage
npm test -- --coverage

# Run in watch mode
npm test -- --watch

Local Development Workflow

# 1. Start Redis (for Celery)
docker run -d -p 6379:6379 redis:7

# 2. Start Django development server
cd dataportal_api
python manage.py runserver

# 3. Start Celery worker (in another terminal)
cd dataportal_api
celery -A dataportal worker -l info

# 4. Start Celery beat (for scheduled tasks, in another terminal)
cd dataportal_api
celery -A dataportal beat -l info

# 5. Start React development server (in another terminal)
cd dataportal-app
npm run dev
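
To confirm the Celery worker is up, it can be pinged with Celery's inspect command:

# Ping running Celery workers
celery -A dataportal inspect ping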

Deployment

Kubernetes Deployment

Prerequisites

# Label nodes for PyHMMER data affinity
kubectl label node <node-name> mett-pyhmmer-data=true

# Verify labels
kubectl get nodes -l mett-pyhmmer-data=true

Deploy Resources

# Deploy PostgreSQL
kubectl apply -f k8s/postgres/production/

# Deploy Elasticsearch
kubectl apply -f k8s/elasticsearch/production/

# Deploy Application
kubectl apply -f k8s/mett-app/overlays/production/
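
After applying the manifests, verify that the pods come up (namespace flags omitted; the deployment name below is a placeholder):

# Check pod status and rollout progress
kubectl get pods
kubectl rollout status deployment/<deployment-name>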

Configuration

Kubernetes configurations are organized by environment:

  • Development: k8s/*/dev/
  • Production: k8s/*/production/

See individual README files in each directory for details.


API Documentation

Endpoints Overview

The API is built with Django Ninja and provides comprehensive endpoints for:

  • Species: /api/species/
  • Strains: /api/strains/
  • Features/Genes: /api/features/
  • Protein-Protein Interactions: /api/ppi/
  • Operons: /api/operons/
  • Orthologs: /api/orthologs/
  • Fitness Data: /api/fitness/
  • PyHMMER Search: /api/pyhmmer/
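
For example, fetching the species list from a local development server (port 8000, as used throughout this README):

curl -s http://localhost:8000/api/species/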

Interactive API Documentation

Interactive Swagger documentation is available at:

  • Development: http://localhost:8000/api/docs
  • Production: http://www.gut-microbes.org/api/docs

Response Formats

  • All API endpoints accept an optional format query parameter.
  • Supported values: json (default) and tsv.
  • When format=tsv, only the data payload is serialized as tab-separated text and returned as text/tab-separated-values.
  • Example: GET /api/genomes?species=BU&format=tsv (see the curl sketch below)
  • Endpoints that already stream binary/TSV payloads ignore this parameter and keep their existing behavior.
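
The format parameter in practice, against a local development server:

# JSON (default)
curl -s "http://localhost:8000/api/genomes?species=BU"

# The same payload as tab-separated values
curl -s "http://localhost:8000/api/genomes?species=BU&format=tsv"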

License

This project is part of the METT (Microbiome Engineering Transversal Theme) initiative.

For licensing information, please contact the project maintainers.

