# Multimodal Protein Language Model
This documentation provides an overview, installation instructions, usage examples, and an API reference for the multimodal_protein_language_model repository by ayyucedemirbas. The model supports sequence-to-structure/function prediction using a transformer-based encoder-decoder architecture with mixture-of-experts layers and optional structural image input.
## Overview

The `MultimodalProteinModel` integrates:
- Protein Sequence Encoder based on transformer layers with mixture-of-experts routing.
- Protein Structure/Function Decoder generating structural tokens.
- Image Encoder for optional 2D structural data to perform multimodal fusion.
- Custom learning rate scheduler following the "Attention Is All You Need" warmup strategy.
Use cases include predicting protein secondary/tertiary structures, binding sites, or functional motifs, optionally guided by structural images.
## Repository Structure

```
multimodal_protein_language_model/
├── README.md        # Minimal original readme
├── LICENSE          # License file
├── encoder.py       # Transformer encoder with MoE layers
├── decoder.py       # Transformer decoder with MoE layers
├── layers.py        # Core MultiheadAttention, MixtureOfExperts, positional encoding
├── model.py         # Complete MultimodalProteinModel class
├── preprocessing.py # Sequence and structure tokenization utilities
└── training.py      # High-level training routine and entry point
```
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/ayyucedemirbas/multimodal_protein_language_model.git
   cd multimodal_protein_language_model
   ```

2. Create a virtual environment (recommended):

   ```bash
   python3 -m venv venv
   source venv/bin/activate
   ```

3. Install dependencies:

   ```bash
   pip install tensorflow numpy
   ```
## Preprocessing

Two helper functions live in preprocessing.py:

- `preprocess_protein_sequence(sequence: str, max_length: int, vocab: dict) -> tf.Tensor`: converts an amino acid sequence to integer tokens and pads/truncates to `max_length`.
- `preprocess_structure_data(structure_data: List[str], max_length: int, vocab: dict) -> tf.Tensor`: converts structure tokens (e.g., secondary structure labels) to integers, adds start/end tokens, and pads/truncates.
Example:

```python
from preprocessing import preprocess_protein_sequence, preprocess_structure_data

# Sample vocab: special tokens take indices 0-3, amino acids start at 4
aa_vocab = {aa: i + 4 for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY")}
aa_vocab.update({"<PAD>": 0, "<START>": 1, "<END>": 2, "<UNK>": 3})
seq_tensor = preprocess_protein_sequence("ACDIPK", max_length=10, vocab=aa_vocab)
```
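Structure labels are tokenized the same way. A short sketch for `preprocess_structure_data`, using an illustrative 3-state vocabulary (the actual vocab construction in training.py may differ):

```python
# Illustrative 3-state secondary-structure vocab (H/E/C plus specials)
struct_vocab = {"<PAD>": 0, "<START>": 1, "<END>": 2, "<UNK>": 3,
                "H": 4, "E": 5, "C": 6}
struct_tensor = preprocess_structure_data(["H", "E", "C", "C"],
                                          max_length=10, vocab=struct_vocab)
```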
## Encoder

- Layers: Embedding, positional encoding, `num_layers` of `EncoderLayer`.
- `EncoderLayer`: multi-head self-attention (with dropout and layer norm) + mixture-of-experts feed-forward.
```python
from encoder import ProteinEncoder

encoder = ProteinEncoder(
    num_layers=6, d_model=512, num_heads=8,
    d_ff=2048, num_experts=8, k=2,
    amino_acid_vocab_size=24, max_position=1024,
    dropout_rate=0.1
)
# input_seq_tensor: a batch of token IDs, e.g. tf.expand_dims(seq_tensor, 0)
enc_output = encoder(input_seq_tensor)
```
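The mixture-of-experts feed-forward routes each token to its `k` highest-scoring experts. A minimal sketch of this top-k routing, with assumed layer shapes (an illustration, not the code in layers.py):

```python
import tensorflow as tf

num_experts, k, d_model, d_ff = 8, 2, 512, 2048
# One small feed-forward network per expert
experts = [
    tf.keras.Sequential([
        tf.keras.layers.Dense(d_ff, activation="relu"),
        tf.keras.layers.Dense(d_model),
    ])
    for _ in range(num_experts)
]
gate = tf.keras.layers.Dense(num_experts)  # produces routing logits

def moe_forward(x):
    logits = gate(x)                               # (batch, seq, num_experts)
    weights, indices = tf.math.top_k(logits, k=k)  # keep the k best experts
    weights = tf.nn.softmax(weights, axis=-1)      # renormalize their scores
    out = tf.zeros_like(x)
    for i, expert in enumerate(experts):
        # Gate weight for expert i at each position (0 where not selected)
        w = tf.reduce_sum(tf.where(indices == i, weights, 0.0), axis=-1)
        out += w[..., None] * expert(x)
    return out

y = moe_forward(tf.random.normal([2, 16, d_model]))  # (2, 16, 512)
```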
## Decoder

- Layers: Embedding, positional encoding, `num_layers` of `DecoderLayer`.
- `DecoderLayer`: masked self-attention + encoder-decoder cross-attention + MoE feed-forward.
```python
from decoder import ProteinDecoder

decoder = ProteinDecoder(
    num_layers=6, d_model=512, num_heads=8,
    d_ff=2048, num_experts=8, k=2,
    target_vocab_size=structure_vocab_size,
    max_position=1024
)
# target_tokens: <START>-prefixed structure tokens (teacher forcing)
logits, attn_weights = decoder(target_tokens, enc_output)
```
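At inference time there is no ground-truth structure sequence, so tokens must be generated step by step. A minimal greedy-decoding sketch, assuming the two-argument `decoder` call above and the illustrative `struct_vocab` from the preprocessing example (the repository's exact inference path may differ):

```python
import tensorflow as tf

start_id, end_id = struct_vocab["<START>"], struct_vocab["<END>"]
tokens = tf.constant([[start_id]])  # (batch=1, length=1)
for _ in range(10):  # cap on generated length
    logits, _ = decoder(tokens, enc_output)
    next_id = tf.argmax(logits[:, -1, :], axis=-1, output_type=tf.int32)
    tokens = tf.concat([tokens, next_id[:, None]], axis=-1)
    if int(next_id[0]) == end_id:  # stop once <END> is produced
        break
```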
## Image Encoder & Fusion

- Image Encoder: three Conv2D + MaxPool blocks, Flatten, Dense to `d_model`.
- Fusion: concatenate sequence features with repeated image features, then project via `Dense(d_model)`.
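A sketch of an encoder of that shape; the filter counts, kernel sizes, and image size here are assumptions (see model.py for the actual definition):

```python
import tensorflow as tf

d_model = 512

# Three Conv2D + MaxPool blocks, then Flatten and Dense to d_model
# (filter counts are assumptions, not the values in model.py)
image_encoder = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", padding="same"),
    tf.keras.layers.MaxPool2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu", padding="same"),
    tf.keras.layers.MaxPool2D(),
    tf.keras.layers.Conv2D(128, 3, activation="relu", padding="same"),
    tf.keras.layers.MaxPool2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(d_model),
])

# Fusion: tile the image vector across sequence positions, concatenate
# with the sequence features, and project back down to d_model.
fusion_proj = tf.keras.layers.Dense(d_model)

def fuse(seq_features, image_vector):
    seq_len = tf.shape(seq_features)[1]
    tiled = tf.repeat(image_vector[:, None, :], seq_len, axis=1)
    return fusion_proj(tf.concat([seq_features, tiled], axis=-1))
```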
## Learning Rate Scheduler

```python
from model import CustomLearningRateScheduler

lr_schedule = CustomLearningRateScheduler(d_model=512, warmup_steps=4000)
optimizer = tf.keras.optimizers.Adam(lr_schedule)
```
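As noted in the overview, this is the warmup schedule from "Attention Is All You Need": the rate grows linearly for the first `warmup_steps` steps, then decays with the inverse square root of the step number:

```
lrate = d_model**-0.5 * min(step**-0.5, step * warmup_steps**-1.5)
```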
## Training

`train_multimodal_protein_model(...)` orchestrates preprocessing, dataset creation, model compilation, and training. Its main arguments:

- `protein_seqs`: list of strings (amino acid sequences).
- `structure_data`: list of lists/strings of structure labels.
- `structural_images`: optional array of image tensors.
- `batch_size`, `epochs`, model hyperparameters, and `checkpoint_path`.
Example Usage:
```python
from training import train_multimodal_protein_model

# Dummy data
protein_seqs = ["ACDEFGHIKLMNPQRS"]
structure_data = [["H","E","C","C"]]
# Train
model, history, aa_vocab, struct_vocab = train_multimodal_protein_model(
protein_seqs, structure_data, epochs=5, batch_size=2
)
```
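To exercise the optional image modality, pass `structural_images` as well; the image shape below is an assumption (check training.py for the expected format):

```python
import numpy as np

# One dummy grayscale structural image per sequence (shape assumed)
structural_images = np.random.rand(1, 64, 64, 1).astype("float32")

model, history, aa_vocab, struct_vocab = train_multimodal_protein_model(
    protein_seqs, structure_data, structural_images=structural_images,
    epochs=5, batch_size=2,
)
```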
## API Reference

### layers.py

- `MultiheadAttention`: `call([q, k, v], mask=None, training=None)` → `(output, attn_weights)`
- `ExpertLayer`: feed-forward expert sub-layer.
- `MixtureOfExperts`: `call(x, training=None)` → gated MoE output.
- `positional_encoding(position, d_model)` → tensor of shape `(1, position, d_model)`
### model.py

- `MultimodalProteinModel`:
  - `call((protein_seq, structure_targets, structural_image), training)` → `(logits, attention_weights)`
  - `train_step(data)` → dict with `'loss'` and `'accuracy'`
  - `create_masks(inp, tar)` → `(enc_padding_mask, combined_mask, dec_padding_mask)`
  - `metrics` property → `[loss_tracker, accuracy_metric]`
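A minimal forward pass per the `call` signature above; `protein_batch`, `structure_batch`, and `image_batch` are placeholder names for preprocessed, batched tensors:

```python
# training=False disables dropout for inference
logits, attention_weights = model(
    (protein_batch, structure_batch, image_batch), training=False
)
```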
### training.py

- `train_multimodal_protein_model(...)` → `(model, history, amino_acid_vocab, structure_vocab)`
## License

This project is licensed under the GNU General Public License, Version 3. Feel free to use and modify it.