# Multimodal Protein Language Model
This documentation provides an overview, installation instructions, usage examples, and an API reference for the multimodal_protein_language_model repository by ayyucedemirbas. The model supports sequence-to-structure/function prediction using a transformer-based encoder-decoder architecture with mixture-of-experts layers and optional structural image input.
## Overview

The `MultimodalProteinModel` integrates:
- Protein Sequence Encoder based on transformer layers with mixture-of-experts routing.
- Protein Structure/Function Decoder generating structural tokens.
- Image Encoder for optional 2D structural data to perform multimodal fusion.
- Custom learning rate scheduler following the "Attention Is All You Need" warmup strategy.
Use cases include predicting protein secondary/tertiary structures, binding sites, or functional motifs, optionally guided by structural images.
## Repository Structure

```
multimodal_protein_language_model/
├── README.md        # Minimal original readme
├── LICENSE          # License file
├── encoder.py       # Transformer encoder with MoE layers
├── decoder.py       # Transformer decoder with MoE layers
├── layers.py        # Core MultiheadAttention, MixtureOfExperts, positional encoding
├── model.py         # Complete MultimodalProteinModel class
├── preprocessing.py # Sequence and structure tokenization utilities
└── training.py      # High-level training routine and entry point
```
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/ayyucedemirbas/multimodal_protein_language_model.git
   cd multimodal_protein_language_model
   ```

2. Create a virtual environment (recommended):

   ```bash
   python3 -m venv venv
   source venv/bin/activate
   ```

3. Install dependencies:

   ```bash
   pip install tensorflow numpy
   ```
## Preprocessing

Two helper functions live in preprocessing.py:

- `preprocess_protein_sequence(sequence: str, max_length: int, vocab: dict) -> tf.Tensor`: converts an amino acid sequence to integer tokens and pads/truncates to `max_length`.
- `preprocess_structure_data(structure_data: List[str], max_length: int, vocab: dict) -> tf.Tensor`: converts structure tokens (e.g., secondary structure labels) to integers, adds start/end tokens, and pads/truncates.
Example:

```python
from preprocessing import preprocess_protein_sequence, preprocess_structure_data

# Sample vocab: special tokens take indices 0-3, amino acids start at 4
aa_vocab = {aa: i + 4 for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY")}
aa_vocab.update({"<PAD>": 0, "<START>": 1, "<END>": 2, "<UNK>": 3})
seq_tensor = preprocess_protein_sequence("ACDIPK", max_length=10, vocab=aa_vocab)
```
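Structure labels are tokenized the same way. A short sketch for `preprocess_structure_data`, using an illustrative 3-state vocabulary (the actual vocab construction in training.py may differ):

```python
# Illustrative 3-state secondary-structure vocab (H/E/C plus specials)
struct_vocab = {"<PAD>": 0, "<START>": 1, "<END>": 2, "<UNK>": 3,
                "H": 4, "E": 5, "C": 6}
struct_tensor = preprocess_structure_data(["H", "E", "C", "C"],
                                          max_length=10, vocab=struct_vocab)
```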
## Encoder

- Layers: Embedding, positional encoding, `num_layers` of `EncoderLayer`.
- `EncoderLayer`: multi-head self-attention (with dropout and layer norm) + mixture-of-experts feed-forward.
```python
from encoder import ProteinEncoder

encoder = ProteinEncoder(
    num_layers=6, d_model=512, num_heads=8,
    d_ff=2048, num_experts=8, k=2,
    amino_acid_vocab_size=24, max_position=1024,
    dropout_rate=0.1
)
# input_seq_tensor: a batch of token IDs, e.g. tf.expand_dims(seq_tensor, 0)
enc_output = encoder(input_seq_tensor)
```
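The mixture-of-experts feed-forward routes each token to its `k` highest-scoring experts. A minimal sketch of this top-k routing, with assumed layer shapes (an illustration, not the code in layers.py):

```python
import tensorflow as tf

num_experts, k, d_model, d_ff = 8, 2, 512, 2048
# One small feed-forward network per expert
experts = [
    tf.keras.Sequential([
        tf.keras.layers.Dense(d_ff, activation="relu"),
        tf.keras.layers.Dense(d_model),
    ])
    for _ in range(num_experts)
]
gate = tf.keras.layers.Dense(num_experts)  # produces routing logits

def moe_forward(x):
    logits = gate(x)                               # (batch, seq, num_experts)
    weights, indices = tf.math.top_k(logits, k=k)  # keep the k best experts
    weights = tf.nn.softmax(weights, axis=-1)      # renormalize their scores
    out = tf.zeros_like(x)
    for i, expert in enumerate(experts):
        # Gate weight for expert i at each position (0 where not selected)
        w = tf.reduce_sum(tf.where(indices == i, weights, 0.0), axis=-1)
        out += w[..., None] * expert(x)
    return out

y = moe_forward(tf.random.normal([2, 16, d_model]))  # (2, 16, 512)
```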
## Decoder

- Layers: Embedding, positional encoding, `num_layers` of `DecoderLayer`.
- `DecoderLayer`: masked self-attention + encoder-decoder cross-attention + MoE feed-forward.
```python
from decoder import ProteinDecoder

decoder = ProteinDecoder(
    num_layers=6, d_model=512, num_heads=8,
    d_ff=2048, num_experts=8, k=2,
    target_vocab_size=structure_vocab_size,
    max_position=1024
)
# target_tokens: <START>-prefixed structure tokens (teacher forcing)
logits, attn_weights = decoder(target_tokens, enc_output)
```
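At inference time there is no ground-truth structure sequence, so tokens must be generated step by step. A minimal greedy-decoding sketch, assuming the two-argument `decoder` call above and the illustrative `struct_vocab` from the preprocessing example (the repository's exact inference path may differ):

```python
import tensorflow as tf

start_id, end_id = struct_vocab["<START>"], struct_vocab["<END>"]
tokens = tf.constant([[start_id]])  # (batch=1, length=1)
for _ in range(10):  # cap on generated length
    logits, _ = decoder(tokens, enc_output)
    next_id = tf.argmax(logits[:, -1, :], axis=-1, output_type=tf.int32)
    tokens = tf.concat([tokens, next_id[:, None]], axis=-1)
    if int(next_id[0]) == end_id:  # stop once <END> is produced
        break
```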
## Image Encoder & Fusion

- Image Encoder: three Conv2D + MaxPool blocks, Flatten, Dense to `d_model`.
- Fusion: concatenate sequence features with repeated image features, then project via `Dense(d_model)`.
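A sketch of an encoder of that shape; the filter counts, kernel sizes, and image size here are assumptions (see model.py for the actual definition):

```python
import tensorflow as tf

d_model = 512

# Three Conv2D + MaxPool blocks, then Flatten and Dense to d_model
# (filter counts are assumptions, not the values in model.py)
image_encoder = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", padding="same"),
    tf.keras.layers.MaxPool2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu", padding="same"),
    tf.keras.layers.MaxPool2D(),
    tf.keras.layers.Conv2D(128, 3, activation="relu", padding="same"),
    tf.keras.layers.MaxPool2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(d_model),
])

# Fusion: tile the image vector across sequence positions, concatenate
# with the sequence features, and project back down to d_model.
fusion_proj = tf.keras.layers.Dense(d_model)

def fuse(seq_features, image_vector):
    seq_len = tf.shape(seq_features)[1]
    tiled = tf.repeat(image_vector[:, None, :], seq_len, axis=1)
    return fusion_proj(tf.concat([seq_features, tiled], axis=-1))
```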
## Learning Rate Scheduler

```python
from model import CustomLearningRateScheduler

lr_schedule = CustomLearningRateScheduler(d_model=512, warmup_steps=4000)
optimizer = tf.keras.optimizers.Adam(lr_schedule)
```
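As noted in the overview, this is the warmup schedule from "Attention Is All You Need": the rate grows linearly for the first `warmup_steps` steps, then decays with the inverse square root of the step number:

```
lrate = d_model**-0.5 * min(step**-0.5, step * warmup_steps**-1.5)
```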
## Training

`train_multimodal_protein_model(...)` orchestrates preprocessing, dataset creation, model compilation, and training. Its main arguments:

- `protein_seqs`: list of strings (amino acid sequences).
- `structure_data`: list of lists/strings of structure labels.
- `structural_images`: optional array of image tensors.
- `batch_size`, `epochs`, model hyperparameters, and `checkpoint_path`.
Example Usage:
```python
from training import train_multimodal_protein_model

# Dummy data
protein_seqs = ["ACDEFGHIKLMNPQRS"]
structure_data = [["H","E","C","C"]]
# Train
model, history, aa_vocab, struct_vocab = train_multimodal_protein_model(
protein_seqs, structure_data, epochs=5, batch_size=2
)
```
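To exercise the optional image modality, pass `structural_images` as well; the image shape below is an assumption (check training.py for the expected format):

```python
import numpy as np

# One dummy grayscale structural image per sequence (shape assumed)
structural_images = np.random.rand(1, 64, 64, 1).astype("float32")

model, history, aa_vocab, struct_vocab = train_multimodal_protein_model(
    protein_seqs, structure_data, structural_images=structural_images,
    epochs=5, batch_size=2,
)
```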
## API Reference

### layers.py

- `MultiheadAttention`: `call([q, k, v], mask=None, training=None)` → `(output, attn_weights)`
- `ExpertLayer`: feed-forward expert sub-layer.
- `MixtureOfExperts`: `call(x, training=None)` → gated MoE output.
- `positional_encoding(position, d_model)` → tensor of shape `(1, position, d_model)`
### model.py

- `MultimodalProteinModel`:
  - `call((protein_seq, structure_targets, structural_image), training)` → `(logits, attention_weights)`
  - `train_step(data)` → dict with `'loss'` and `'accuracy'`
  - `create_masks(inp, tar)` → `(enc_padding_mask, combined_mask, dec_padding_mask)`
  - `metrics` property → `[loss_tracker, accuracy_metric]`
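A minimal forward pass per the `call` signature above; `protein_batch`, `structure_batch`, and `image_batch` are placeholder names for preprocessed, batched tensors:

```python
# training=False disables dropout for inference
logits, attention_weights = model(
    (protein_batch, structure_batch, image_batch), training=False
)
```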
### training.py

- `train_multimodal_protein_model(...)` → `(model, history, amino_acid_vocab, structure_vocab)`
## License

This project is licensed under the GNU General Public License, Version 3. Feel free to use and modify it.