Considerations for the inclusion of MS data in our studies (with sources) #82

@heleensev

Description

This issue is about mass spectrometry (MS) data; we have not yet decided whether to include it. We want our affinity predictor to have no bias towards peptides that are processed in the cell (antigen processing, AP), because we want our predictor (DeepRank) to focus solely on modelling the pMHC interaction based on physicochemical features. MS data carries this bias because the pMHC complexes (HLA bound to peptide) were eluted from human cells. Furthermore, the MS technique has the limitation that highly hydrophobic peptides cannot be detected, and cysteine-containing peptides are also harder to measure.

Here are some sources (and conclusions drawn from them) that will help us decide whether to include MS data in our experiments:

  • MHCflurry 2.0 uses both BA (binding affinity) data and MS (mass spectrometry) data for its affinity predictor, in a 31/69 ratio, in order not to bias the predictor too much towards antigen processing (MS pMHC samples have been processed by the cell; BA pMHC samples are not necessarily processed by the cell, but might be)
    • MHCflurry 2.0 only uses decoys as MS negatives for its BA predictor training data: "random sequences sampled according to the same amino acid distribution as the hits". The ratio of negative peptides to MS hits is not stated.
      • The MS hits are assigned a measurement value of "< 100 nM" so that they do not influence the training loss too much: "..MS training data does not guide the relative ranking (exact IC50) of strong binders"
      • The section "Considerations of using MS data" also mentions that using random negative peptides instead of real MS negatives will not bias their BA predictor towards peptides that have been filtered by AP. Possible caveat: how do we know that the random negative peptides are not actually binders?
    • The cysteine depletion in MS data is addressed when discussing the AP predictor: in one AP predictor experiment, cysteine-containing peptides were removed from both training and test sets, which showed no significant decrease in AUC. Caveat: why not score cysteine-containing peptides and check whether they are disproportionately classified as negative? Their check does not seem robust.
    • And in the discussion: "An important limitation of this work is that we apply datasets of MHC class I ligands detected by MS both to train and to benchmark our predictors. Assay biases, which we expect are modeled by the AP predictor, have the potential to erroneously inflate our accuracy scores. Although the main known bias, depletion of cysteine, does not seem to have a dramatic effect on AP predictive accuracy, we cannot rule out the contributions of other kinds of bias"
    • The BA training set comprises 493,473 MS entries and 219,596 affinity measurements. Including MS would greatly increase the amount of training data for our experiments.
  • Citations in the paper also address biases or artefacts in MS data:
    • MHCflurry 2.0 refers to Abelin et al. 2017, which is also the source of 4,808 mono-allelic MS hits. In that paper a small experiment shows a lack of peptide-distribution bias between BA data (from IEDB) and MS data (from their own paper). For this experiment they use a predictor (ESP) from Fusaro et al. 2019, which was trained to identify "highly responding" peptides for detection by MS. The density plot shows that BA samples are almost as likely to be detected by MS as the MS samples themselves.
  • We should take a look at the protocols from MHCflurry's (and consequently our) main MS data sources (papers), and maybe find a subselection of studies that we believe will not bias the peptide distribution:
    • Immune Epitope Database (IEDB) (Vita et al., 2019), downloaded on April 27, 2020 (estimate: 305,043 hits)
    • SysteMHC Atlas project (Shao et al., 2018), downloaded on May 6, 2019 (46,880 MS hits)
    • Sarkizova et al. (Sarkizova et al., 2020) (136,742 MS hits)
    • Abelin et al. (Abelin et al., 2019) (4,808 MS hits)
  • Lastly, we should also consider the biases that BA data may contain. The MHCflurry 2.0 dataset contains samples from many different protocols and measurement types (competitive/direct/..) from many different studies.
    • Abelin et al. 2017 also mentions affinity biases: "Meanwhile, other unintentional forms of bias, such as pre-existing notions of the length distribution or limitations on peptide synthesis and solubility, are difficult to avoid"
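For reference, the decoy scheme MHCflurry describes ("random sequences sampled according to the same amino acid distribution as the hits") can be sketched as below. This is our own minimal sketch, not MHCflurry's actual code: the function name, the choice to also match the hits' length distribution, and the example peptides are assumptions.

```python
import random
from collections import Counter

def decoys_from_hits(hits, n_decoys, seed=0):
    """Sample random peptides whose amino-acid (and length) distributions
    match the MS hits -- a decoy-negative sketch, not MHCflurry's code."""
    rng = random.Random(seed)
    counts = Counter(aa for pep in hits for aa in pep)  # empirical AA frequencies
    aas = sorted(counts)
    weights = [counts[a] for a in aas]
    lengths = [len(p) for p in hits]  # draw decoy lengths from the hit lengths
    return ["".join(rng.choices(aas, weights=weights, k=rng.choice(lengths)))
            for _ in range(n_decoys)]

hits = ["SIINFEKL", "GILGFVFTL", "NLVPMVATV"]  # illustrative 8/9-mer hits
print(decoys_from_hits(hits, 5))
```

If we adopt decoys ourselves, the open questions from above carry over: the hit-to-decoy ratio, and the chance that a sampled "negative" is in fact a binder.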
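The "< 100 nM" trick for MS hits can be understood as a censored (one-sided) loss: the model is only penalized when it predicts weaker binding than the threshold, so these labels never constrain the exact ranking of strong binders. A minimal sketch, assuming the standard 1 − log50k affinity transform (the 50,000 nM constant and the function names are ours):

```python
import math

MAX_IC50 = 50000.0  # upper bound for the standard 1 - log50k affinity transform

def to_unit(ic50_nm):
    """Map an IC50 in nM to [0, 1]; stronger binders (low IC50) map near 1."""
    return max(0.0, min(1.0, 1.0 - math.log(ic50_nm) / math.log(MAX_IC50)))

def censored_sq_loss(pred, threshold_nm=100.0):
    """One-sided squared loss for a '< threshold' measurement: zero whenever
    the prediction already satisfies the inequality, so the label does not
    influence the relative ranking among strong binders."""
    t = to_unit(threshold_nm)
    return 0.0 if pred >= t else (t - pred) ** 2
```

If we mix MS hits into our BA training data, a loss of this shape would let them contribute a binder/non-binder signal without pretending we measured an exact IC50.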
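The more robust cysteine check proposed in the caveat above — score cysteine-containing peptides with the trained model and ask whether they are disproportionately classified as negative — could look like this sketch. Everything here (function names, the 0.5 cutoff, stdlib-only AUC) is illustrative, not an existing tool:

```python
def roc_auc(labels, scores):
    """Mann-Whitney estimate of ROC AUC (no external dependencies)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def negative_rate_by_cysteine(peptides, scores, cutoff=0.5):
    """Fraction of peptides scored below `cutoff`, split by cysteine content.
    A much higher rate for the Cys group would suggest the model inherited
    the MS cysteine-depletion bias."""
    def rate(group):
        return sum(s < cutoff for s in group) / len(group) if group else float("nan")
    cys = [s for p, s in zip(peptides, scores) if "C" in p]
    non = [s for p, s in zip(peptides, scores) if "C" not in p]
    return rate(cys), rate(non)
```

Comparing AUC on the cysteine subset versus the full set, together with these negative rates, would be a stronger test than simply removing cysteine peptides from both training and test.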
