[MoML '25 / MLSB '25] An MCMC framework for protein design that integrates AlphaFold2 structural predictions with ESM2 evolutionary priors, enabling principled exploration of sequence-structure space beyond gradient-based optimization. Based on ColabDesign.

flagshippioneering/RelaxedSequenceSampling

Relaxed Sequence Sampling (RSS) for Diverse Protein Design

An MCMC framework for protein design that integrates AlphaFold2 structural predictions with ESM2 evolutionary priors, enabling principled exploration of sequence-structure space beyond gradient-based optimization.

Installation

git clone https://github.com/flagshippioneering/relaxedsequencesampling.git
cd relaxedsequencesampling
uv sync

Quick Start

Start MLflow Tracking Server (Optional)

mlflow server --host 0.0.0.0 --port 5000

Run Protein Design

# Using RSS (recommended)
python train.py --config config_exp/rss.yaml

# Using default gradient descent
python train.py --config config_exp/rso.yaml

# Or specify parameters directly
python train.py \
  --design_type rss \
  --pdb_filename 1brs.pdb \
  --chain A \
  --binder_chain D \
  --iters 1000 \
  --beta_t 1.0 \
  --esm_weight 0.2

Design Types

default - Gradient Descent Optimization

  • Method: Standard gradient descent on soft sequence logits
  • Optimizer: SGD or Adam
  • Use Case: Fast, deterministic optimization when exploration is less critical
  • Key Parameters:
    • eta_init: Learning rate (default: 0.01)
    • optimizer: "sgd" or "adam"
    • norm_seq_grad: Normalize sequence gradients
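
As a rough illustration of a normalized gradient step on soft sequence logits, here is a minimal pure-Python sketch. The helper name `normalized_sgd_step` and the choice to rescale the gradient to unit L2 norm are assumptions made for illustration; the normalization actually applied by `norm_seq_grad` in this repository may differ.

```python
import math

def normalized_sgd_step(logits, grad, eta=0.01, norm_seq_grad=True):
    """One gradient-descent step on an L x A matrix of sequence logits.

    Hypothetical helper: when norm_seq_grad is set, the gradient is rescaled
    to unit L2 norm so that eta directly controls the size of the update.
    """
    if norm_seq_grad:
        norm = math.sqrt(sum(g * g for row in grad for g in row))
        scale = 1.0 / max(norm, 1e-8)  # guard against a zero gradient
    else:
        scale = 1.0
    return [
        [x - eta * scale * g for x, g in zip(x_row, g_row)]
        for x_row, g_row in zip(logits, grad)
    ]

# Toy example: two positions, two tokens, gradient with L2 norm 5.
logits = [[0.0, 0.0], [0.0, 0.0]]
grad = [[3.0, 0.0], [0.0, 4.0]]
updated = normalized_sgd_step(logits, grad, eta=0.01)
```

With normalization on, the total update has length `eta` regardless of the raw gradient scale, which can make the learning rate easier to reason about across targets.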

rss - Relaxed Sequence Sampling (MCMC)

  • Method: MCMC with Metropolis-Adjusted Langevin Algorithm (MALA) + masked PLM jumps
  • Exploration: Better exploration of sequence space via stochastic sampling
  • Use Case: When you need diverse, high-quality sequences with proper uncertainty quantification
  • Three-Phase Schedule:
    1. Pre-relax (optional): Deterministic descent to find low-energy region
    2. Warm-up (optional): SGLD without MH correction for rapid mixing
    3. Main MALA: Full MCMC with detailed balance for sampling
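
The MALA update at the core of the main phase can be sketched on a toy problem. The example below uses a quadratic energy as a stand-in for the design loss (in this repository the energy would come from AlphaFold2 and ESM2, which are not modelled here); the function names are illustrative, not the repository's API. Setting `use_mh=False` gives the unadjusted Langevin update that the warm-up phase describes.

```python
import math
import random

def energy(x):
    """Toy quadratic energy standing in for the design loss (assumption)."""
    return 0.5 * sum(v * v for v in x)

def grad_energy(x):
    """Gradient of the toy energy."""
    return list(x)

def log_q(x_to, x_from, eta, beta):
    """Log density (up to a constant) of the Langevin proposal x_from -> x_to."""
    g = grad_energy(x_from)
    sq = sum((t - f + eta * beta * gi) ** 2 for t, f, gi in zip(x_to, x_from, g))
    return -sq / (4.0 * eta)

def mala_step(x, eta, beta, rng, use_mh=True):
    """One Langevin proposal targeting exp(-beta * energy), optionally MH-corrected."""
    g = grad_energy(x)
    prop = [xi - eta * beta * gi + math.sqrt(2.0 * eta) * rng.gauss(0.0, 1.0)
            for xi, gi in zip(x, g)]
    if not use_mh:
        return prop, True  # unadjusted Langevin: always move
    log_alpha = (beta * (energy(x) - energy(prop))
                 + log_q(x, prop, eta, beta) - log_q(prop, x, eta, beta))
    if log_alpha >= 0.0 or rng.random() < math.exp(log_alpha):
        return prop, True
    return x, False  # rejected: stay at the current state

rng = random.Random(0)
x, accepted, samples = [3.0, -3.0], 0, []
for step in range(2000):
    x, ok = mala_step(x, eta=0.1, beta=1.0, rng=rng)
    accepted += ok
    if step >= 500:  # discard early iterations as burn-in
        samples.append(x)
accept_rate = accepted / 2000
second_moment = sum(v * v for s in samples for v in s) / (2 * len(samples))
```

The Metropolis-Hastings ratio compares the forward and reverse proposal densities, which is what restores detailed balance with respect to the tempered target; dropping it trades exactness for speed.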

RSS Parameters

Core MCMC Parameters

  • beta_t (default: 1.0): Inverse temperature for target distribution
    • Higher values concentrate sampling on low-energy sequences; lower values allow more exploration
  • eta_init (default: 0.01): Initial step size for Langevin walks
  • eta_t_main (default: 0.0001): Step size for main MALA phase
  • use_mh (default: True): Use Metropolis-Hastings correction for detailed balance
  • stateless (default: True): Stateless AF2 evaluation for proper MALA acceptance

ESM2 Integration

  • esm_weight (default: 0.2): Weight for ESM2 language model loss (λ)
  • esm_model_name (default: "esm2_t30_150M_UR50D"): ESM2 model variant
  • esm_loss_type (default: "cross_entropy"): Loss type for ESM2 scoring
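
To make the role of `esm_weight` concrete, the sketch below adds a cross-entropy term between a soft sequence and language-model logits to a structural loss. The additive form `structure_loss + esm_weight * plm_loss`, the helper names, and the toy inputs are assumptions for illustration only; they are not taken from the repository's code.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one position's token logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def plm_cross_entropy(soft_seq, plm_logits):
    """Mean over positions of -sum_a p(a) * log q(a).

    soft_seq: per-position token probabilities of the design (rows sum to 1).
    plm_logits: per-position token logits from a language model (assumed given).
    """
    total = 0.0
    for probs, logits in zip(soft_seq, plm_logits):
        log_q = log_softmax(logits)
        total -= sum(p * lq for p, lq in zip(probs, log_q))
    return total / len(soft_seq)

def total_energy(structure_loss, soft_seq, plm_logits, esm_weight=0.2):
    """Hypothetical combined objective: structure term plus weighted PLM term."""
    return structure_loss + esm_weight * plm_cross_entropy(soft_seq, plm_logits)

# One position, two tokens, uninformative language-model logits.
value = total_energy(1.0, [[1.0, 0.0]], [[0.0, 0.0]], esm_weight=0.2)
```

Under this assumed form, `esm_weight = 0` recovers the purely structural objective, and larger values pull samples toward sequences the language model finds plausible.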

Jump Kernel (Masked PLM)

  • p_jump (default: 0.3): Probability of jump vs walk at each step
  • kappa (default: 0.3): Mask probability scaling (gradient-informed masking)
  • tau (default: 2.0): Temperature for ESM2 token sampling
  • gamma (default: 1.0): Swap-bias update strength
  • mask_budget_frac (default: 0.2): Expected fraction of sequence to mask
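
A masked jump proposal can be sketched as follows. This is a simplified illustration under stated assumptions: mask probabilities are taken proportional to per-position gradient magnitudes and scaled to match `mask_budget_frac`, and replacement tokens are drawn from `softmax(logits / tau)`. The roles of `kappa` and `gamma`, the acceptance step for jumps, and the actual ESM2 forward pass on the masked sequence are not modelled; all function names are hypothetical.

```python
import math
import random

def mask_probabilities(grad_norms, mask_budget_frac=0.2):
    """Per-position mask probabilities proportional to gradient magnitude,
    scaled so the expected masked fraction is about mask_budget_frac."""
    total = sum(grad_norms)
    n = len(grad_norms)
    if total == 0.0:
        return [mask_budget_frac] * n
    return [min(1.0, mask_budget_frac * n * g / total) for g in grad_norms]

def sample_token(logits, tau, rng):
    """Sample a token index from softmax(logits / tau)."""
    scaled = [v / tau for v in logits]
    m = max(scaled)
    weights = [math.exp(v - m) for v in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def jump_proposal(seq, grad_norms, plm_logits, rng, tau=2.0, mask_budget_frac=0.2):
    """Mask positions at random and resample them from the given PLM logits.

    plm_logits is assumed to be precomputed; in practice it would come from
    running the language model on the masked sequence.
    """
    probs = mask_probabilities(grad_norms, mask_budget_frac)
    new_seq, masked = list(seq), []
    for i, p in enumerate(probs):
        if rng.random() < p:
            masked.append(i)
            new_seq[i] = sample_token(plm_logits[i], tau, rng)
    return new_seq, masked

rng = random.Random(0)
seq = [0, 1, 2, 3, 4]
grad_norms = [1.0, 1.0, 1.0, 1.0, 6.0]
plm_logits = [[0.0] * 20 for _ in seq]
new_seq, masked = jump_proposal(seq, grad_norms, plm_logits, rng)
```

A sampler would choose between this jump and a Langevin walk at each step, for example with `rng.random() < p_jump`. A higher `tau` flattens the token distribution and so yields more varied substitutions.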

Phase Control

  • use_prerelax (default: True): Enable deterministic pre-relaxation
  • prerelax_iters (default: 100): Number of pre-relax iterations
  • use_warmup (default: True): Enable SGLD warm-up phase
  • warmup_iters (default: 300): Number of warm-up iterations
  • warmup_beta (default: 1.0): Beta for warm-up phase
  • warmup_eta (default: 0.003): Step size for warm-up
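
The phase parameters above imply a simple schedule over iterations. The sketch below maps an iteration index to a phase name, assuming the phases run back to back in the order pre-relax, warm-up, main MALA. Whether pre-relax and warm-up iterations count toward `iters` is not stated here, so treat that, and the helper name `phase_at`, as assumptions.

```python
def phase_at(it, use_prerelax=True, prerelax_iters=100,
             use_warmup=True, warmup_iters=300):
    """Return the phase name for a 0-based iteration index (hypothetical helper)."""
    n_pre = prerelax_iters if use_prerelax else 0
    n_warm = warmup_iters if use_warmup else 0
    if it < n_pre:
        return "prerelax"  # deterministic descent
    if it < n_pre + n_warm:
        return "warmup"    # SGLD without MH correction
    return "mala"          # full MALA with MH correction

# Phase boundaries under the default settings.
boundaries = [(i, phase_at(i)) for i in (0, 99, 100, 399, 400)]
```

Disabling a phase with `use_prerelax=False` or `use_warmup=False` simply shifts the later phases earlier in this sketch.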

Optimization Settings

  • use_eta_rm (default: True): Adaptive step size via Robbins-Monro
  • clip_grad (default: True): Clip gradients to prevent instability
  • center_logits (default: True): Center proposals in softmax-invariant subspace
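
The three settings above correspond to standard techniques, sketched here in plain Python. The helper names, the clipping threshold, and the Robbins-Monro constants (including the target acceptance rate of 0.574, a commonly cited value for MALA) are assumptions for illustration and may not match the values used in this repository.

```python
import math

def center_logits(logits):
    """Subtract each position's mean logit; the softmax is unchanged."""
    return [[v - sum(row) / len(row) for v in row] for row in logits]

def clip_by_global_norm(grad, max_norm=1.0):
    """Rescale the gradient if its global L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for row in grad for g in row))
    if norm <= max_norm:
        return grad
    return [[g * max_norm / norm for g in row] for row in grad]

def robbins_monro_eta(eta, accept_prob, t, target=0.574, c=1.0, decay=0.6):
    """Stochastic-approximation update of log(eta) toward a target acceptance rate.

    The gain c / (t + 1)**decay shrinks over time, so adaptation fades out.
    """
    gain = c / (t + 1) ** decay
    return eta * math.exp(gain * (accept_prob - target))

centered = center_logits([[1.0, 2.0, 3.0]])
clipped = clip_by_global_norm([[3.0, 4.0]], max_norm=1.0)
eta_up = robbins_monro_eta(0.01, accept_prob=1.0, t=0)    # accepting too often: grow eta
eta_down = robbins_monro_eta(0.01, accept_prob=0.0, t=0)  # rejecting too often: shrink eta
```

Centering works because adding a constant to every logit at a position leaves the softmax unchanged, so the proposal loses nothing by staying in the zero-mean subspace.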

Sampling & Logging

  • iters (default: 1000): Total MCMC iterations
  • num_samples (default: 10): Number of PDB samples to save
  • sampling_freq (default: 50): Save sample every N iterations
  • log_every_mlflow (default: 25): Log metrics every N iterations

Output Structure

Results are saved to --output_dir (default: ./results):

results/
└── {pdb_name}_{mlflow_run_id}/
    ├── binder_{pdb_name}_{chain}_{timestamp}.pdb
    ├── sequence_with_targets.json
    ├── sampled_sequences.json
    └── sampled_pdbs/
        ├── binder_{pdb_name}_{chain}_{iter}_{timestamp}.pdb
        └── ...
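
Given the layout above, a small helper can collect the outputs of a run for downstream analysis. This is a sketch that relies only on the directory and file-name patterns shown; the function names are hypothetical and the JSON files are not parsed because their schema is not documented here.

```python
from pathlib import Path

def list_runs(output_dir="./results"):
    """Return run directories ({pdb_name}_{mlflow_run_id}) under output_dir."""
    root = Path(output_dir)
    if not root.is_dir():
        return []
    return sorted(p for p in root.iterdir() if p.is_dir())

def list_sampled_pdbs(run_dir):
    """Return sampled PDB paths under <run_dir>/sampled_pdbs, sorted by name."""
    sampled = Path(run_dir) / "sampled_pdbs"
    if not sampled.is_dir():
        return []
    return sorted(sampled.glob("binder_*.pdb"))

if __name__ == "__main__":
    # Print each run and the number of sampled structures it contains.
    for run in list_runs():
        print(run.name, len(list_sampled_pdbs(run)))
```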

Configuration Files

See config_exp/ for example configurations:

  • rss.yaml: RSS with recommended parameters
  • rso.yaml: Default gradient descent baseline

MLflow Tracking

Point a run at an MLflow tracking server:

python train.py \
  --mlflow_tracking_host 127.0.0.1 \
  --mlflow_tracking_port 5000 \
  --enable_ml_flow True

View results at http://127.0.0.1:5000

Original ColabDesign

This repository extends ColabDesign with RSS capabilities. Flagship Pioneering's contributions to this code are licensed under CC BY-SA 4.0. See License.
