Classification of contacts in protein structures

This software uses machine learning algorithms to classify residue-residue contacts within protein structures into categories as defined by the RING software (HBOND, VDW, SBOND, etc.) using structural and physicochemical features.

Project structure

BioinformaticsProject-main/
├── 3di_model/             # FoldSeek model files
├── config/                # Configuration files
├── data/                  # Input data and PDB structures
├── models/                # Trained models
├── output_3di/            # 3Di encoding output
├── output_features/       # Structural feature output
├── report/                # Figures and plots for analysis
├── scripts/               # Main scripts for feature extraction, training, evaluation
├── requirements.txt       # Python dependencies
└── README.md              # Project documentation

Description of features and data

Features used

Each contact is described using:

s_ss8 : Secondary structure from DSSP
s_rsa : Relative Solvent Accessibility
s_phi : Phi torsion angle
s_psi : Psi torsion angle
s_a1 : Atchley factor 1 - Hydrophilicity vs. hydrophobicity
s_a2 : Atchley factor 2 - Secondary structure propensity
s_a3 : Atchley factor 3 - Molecular size
s_a4 : Atchley factor 4 - Codon usage / Polarizability
s_a5 : Atchley factor 5 - Electrostatic charge
s_3di_state : 3Di FoldSeek numeric state ID
s_3di_letter : 3Di FoldSeek letter

These are computed for both the source and target residues.
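As an illustration of how the per-residue features pair up into one row per contact, the sketch below merges two residue feature dictionaries under a source and a target prefix. The `s_` prefix comes from the list above; the `t_` prefix for the target residue and the helper name `contact_row` are assumptions made for this example, not taken from the project's scripts.

```python
# Feature names without prefix, as listed in "Features used".
FEATURES = ["ss8", "rsa", "phi", "psi", "a1", "a2", "a3", "a4", "a5",
            "3di_state", "3di_letter"]


def contact_row(source, target):
    """Merge two per-residue feature dicts into one flat contact row.

    Missing features are stored as None so that every row has the same keys.
    """
    row = {}
    for name in FEATURES:
        row[f"s_{name}"] = source.get(name)
        row[f"t_{name}"] = target.get(name)  # "t_" prefix is an assumption
    return row


example = contact_row({"ss8": "H", "rsa": 0.2}, {"ss8": "E", "rsa": 0.5})
```

Each contact then carries 22 columns: the 11 features once for the source residue and once for the target residue.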

Prediction classes

The following are the supported contact types, matching RING nomenclature:

  • Hydrogen bond (HBOND)
  • Van der Waals (VDW)
  • Disulfide bridge (SBOND)
  • Salt bridge (IONIC)
  • π–π stacking (PIPISTACK)
  • π–cation (PICATION)
  • π–hydrogen bond (PIHBOND)
  • Metal ion coordination (METAL_ION)
  • Halogen bond (HALOGEN)
  • Unclassified
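Classifiers generally need these labels as integers. A minimal sketch of one possible encoding follows; the integer order and the fallback of unknown or missing labels to `Unclassified` are assumptions for illustration, not the project's documented behaviour.

```python
# Contact types in the order listed above (the order itself is arbitrary).
CONTACT_TYPES = [
    "HBOND", "VDW", "SBOND", "IONIC", "PIPISTACK",
    "PICATION", "PIHBOND", "METAL_ION", "HALOGEN", "Unclassified",
]

LABEL_TO_ID = {label: index for index, label in enumerate(CONTACT_TYPES)}
ID_TO_LABEL = {index: label for label, index in LABEL_TO_ID.items()}


def encode_label(label):
    """Map a contact-type string to its integer ID.

    Labels not in the list (including None) map to "Unclassified";
    this fallback is an assumption.
    """
    return LABEL_TO_ID.get(label, LABEL_TO_ID["Unclassified"])
```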

Instructions for software setup and usage

1. Install dependencies and DSSP

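To install all Python packages used: pip install -r requirements.txt

DSSP and FoldSeek are external feature extraction tools and must be installed and configured separately.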

2. Configuration

All paths and parameters are stored in config/config.json. Update this file to match your system setup.
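A small sketch of how such a configuration file could be loaded and checked before anything else runs. The key names `training_data` and `model_path` are hypothetical placeholders; use whatever keys your copy of config/config.json actually defines.

```python
import json
from pathlib import Path


def load_config(path="config/config.json",
                required=("training_data", "model_path")):
    """Load the JSON configuration and fail early if a key is missing.

    The default `required` keys are hypothetical examples.
    """
    config_path = Path(path)
    with config_path.open(encoding="utf-8") as handle:
        config = json.load(handle)
    missing = [key for key in required if key not in config]
    if missing:
        raise KeyError(f"Missing keys in {config_path}: {missing}")
    return config
```

Failing early on a missing key tends to be easier to debug than a path error raised halfway through feature extraction.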

3. Execution

a. Models and model retraining

Model training is performed in the following Jupyter notebook: train_model.ipynb

This notebook performs the following steps:

  1. Loads training data from the path specified in config/config.json
  2. Preprocesses the features
  3. Visualizes feature distributions and correlations
  4. Trains and evaluates multiple models, including:
    • Naive Bayes
    • Random Forest
    • LightGBM
  5. Saves trained models as .pkl files into the models/ directory
  6. Generates evaluation metrics such as:
    • Classification report
    • Confusion matrix
    • Matthews Correlation Coefficient (MCC)
    • Balanced Accuracy
    • Average Precision Score
    • Area Under ROC Curve (AUC)
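To make two of these metrics concrete, here is a plain-Python sketch of balanced accuracy and the multiclass Matthews Correlation Coefficient. It only illustrates the definitions; the notebook presumably relies on a library implementation, and the function names here are made up for this example.

```python
from collections import Counter
from math import sqrt


def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall over the classes that occur in y_true."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[label] / totals[label] for label in totals) / len(totals)


def multiclass_mcc(y_true, y_pred):
    """Multiclass MCC computed from label counts.

    Returns 0.0 when the denominator is zero (e.g. a single predicted class).
    """
    n = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    labels = set(true_counts) | set(pred_counts)
    cov_tp = correct * n - sum(true_counts[k] * pred_counts[k] for k in labels)
    cov_tt = n * n - sum(true_counts[k] ** 2 for k in labels)
    cov_pp = n * n - sum(pred_counts[k] ** 2 for k in labels)
    denominator = sqrt(cov_tt * cov_pp)
    return cov_tp / denominator if denominator else 0.0
```

Both metrics are less sensitive to class imbalance than plain accuracy, which matters when a few contact types are far more frequent than the rest.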

To run the full training and evaluation process, open the notebook with the following command in your terminal and execute all cells:

jupyter notebook train_model.ipynb

b. Applying a model to a new .pdb structure

Prediction is performed using the notebook: model_prediction.ipynb

This notebook performs the following steps:

  1. Extracts structural and 3Di features from the input PDB file using:
    • scripts/calc_features.py
    • scripts/calc_3di.py (within run_feature_extraction from scripts/runFeatureExtraction.py)
  2. Visualizes extracted features using:
    • scripts/data_visualization.py
  3. Loads a pre-trained model from the path defined in config/config.json (e.g., a Random Forest or Naive Bayes model in the models/ directory)
  4. Applies the model to the newly extracted contact features
  5. Outputs predicted contact types and their classification scores
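Steps 3 to 5 can be sketched roughly as follows. This assumes the .pkl files were written with Python's `pickle` module and that the loaded model exposes `predict_proba` and `classes_` in the scikit-learn style; neither is confirmed here, and the helper names are hypothetical.

```python
import pickle


def load_model(path):
    """Load a pickled model. Only unpickle files from a trusted source."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def predict_contacts(model, feature_rows):
    """Return one (label, score) pair per contact.

    Assumes `model.predict_proba` returns one probability per class for
    each row and `model.classes_` lists the labels in the same order.
    """
    results = []
    for row in model.predict_proba(feature_rows):
        best = max(range(len(row)), key=lambda index: row[index])
        results.append((model.classes_[best], float(row[best])))
    return results
```

The score reported for each contact is the probability of the winning class, so a low value flags a contact the model is unsure about.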

To apply a model to a new protein structure, open the notebook using:

jupyter notebook model_prediction.ipynb

Ensure that:

  • The input file exists in the data/pdbs/ directory
  • Paths in config/config.json are set correctly
  • Feature extraction tools (DSSP, FoldSeek) are installed and configured
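The checklist above can be turned into a quick pre-flight check. This is a sketch under assumptions: the executable names `mkdssp` and `foldseek` are the usual names for these tools but may differ on your system, and the function name `preflight` is made up for this example.

```python
import shutil
from pathlib import Path


def preflight(pdb_name, pdb_dir="data/pdbs",
              config_path="config/config.json",
              tools=("mkdssp", "foldseek")):
    """Return a list of human-readable problems; an empty list means OK.

    The default tool names are assumptions; adjust them to your install.
    """
    problems = []
    pdb_path = Path(pdb_dir) / pdb_name
    if not pdb_path.is_file():
        problems.append(f"Input structure not found: {pdb_path}")
    if not Path(config_path).is_file():
        problems.append(f"Configuration file not found: {config_path}")
    for tool in tools:
        if shutil.which(tool) is None:
            problems.append(f"Executable not on PATH: {tool}")
    return problems
```

Calling `preflight("my_structure.pdb")` from the project root (the file name is a placeholder) and printing the returned list shows what still needs fixing before the notebook is run.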

Authors

Christina Caporale
Natalya Lavrenchuk