Mapping Large Set of UniProt proteins to OpenProteinSet Uniclust30 MSAs & ColabFold server usage #556

@slee-ai

I’m trying to get MSAs for ~3,500 human proteins (given as UniProt accessions), and I’d like to reuse precomputed MSAs wherever possible. I have the following questions about using the ColabFold server and the OpenProteinSet database.

A) ColabFold server usage

  1. When I call the ColabFold MSA server from a pipeline, does it retrieve cached alignments (if available) or compute a new MSA for each query?
  2. Are there any throughput/capacity limits per user I should respect if I submit thousands of single-chain queries (e.g., for 3.5k proteins)?
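
To make question A.2 concrete, here is a minimal sketch of client-side throttling for a batch of single-chain queries. It deliberately does not call any real ColabFold endpoint: `submit` is a hypothetical placeholder for whatever function actually sends one query, and `min_interval_s` is an arbitrary illustrative value, not a documented server limit.

```python
import time
from typing import Callable, Dict


def submit_throttled(
    queries: Dict[str, str],
    submit: Callable[[str, str], object],
    min_interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, object]:
    """Send queries one at a time, leaving at least min_interval_s between
    the start of consecutive submissions.

    `queries` maps an accession to its sequence. `submit` is a hypothetical
    callable standing in for the real request function.
    """
    results: Dict[str, object] = {}
    last_start = None
    for accession, sequence in queries.items():
        if last_start is not None:
            # Wait out whatever remains of the minimum interval.
            remaining = min_interval_s - (clock() - last_start)
            if remaining > 0:
                sleep(remaining)
        last_start = clock()
        results[accession] = submit(accession, sequence)
    return results
```

The appropriate interval (and whether batching is preferred over serial submission) is exactly what the question above asks.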

B) OpenProteinSet (OPS) mapping questions

  1. I’ve read the OPS paper and understand that the MSAs were generated with Uniclust30 (downloaded Dec 28, 2021), and that OPS contains ~16M cluster MSAs (one per cluster) plus a filtered 270k subset. On the Uniclust site, the latest release as of December 2021 appears to be 2021_06. Is this the version that was used for OpenProteinSet?
  2. The Uniclust site only provides `uniref_mapping.tsv` and `UniRef30_2021_03.tar.gz` (https://wwwuser.gwdguser.de/~compbiol/uniclust/2021_03/UniRef30_2021_03.tar.gz), and I realize UniRef30 is different from Uniclust30. The `uniref_mapping.tsv` for 2021_06 has ~29.9M rows, which looks like representatives/members, but the row count doesn’t match the 16M clusters. Where can I get the list of all protein IDs for the 16M MSAs that were generated?
  3. If there’s a reference script/snippet to resolve UniProt accession → Uniclust30 cluster ID or Uniclust30 representative protein ID (whichever one is used to name the a3m files provided in the OPS `uniclust30_unfiltered`), that would be ideal. Thank you very much!
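
A minimal sketch of the kind of lookup question B.3 describes. It assumes the mapping file is a two-column, tab-separated table with the cluster representative in the first column and a member accession in the second. That layout is an unverified assumption (the column order may well be different), so `rep_col` and `member_col` are parameters, and both function names are hypothetical.

```python
import csv
from typing import Dict, Iterable, Optional


def load_member_to_rep(
    lines: Iterable[str], rep_col: int = 0, member_col: int = 1
) -> Dict[str, str]:
    """Build a member-accession -> representative-ID lookup from TSV lines.

    Column positions are assumptions about the file layout, not confirmed.
    Rows that are too short to contain both columns are skipped.
    """
    mapping: Dict[str, str] = {}
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) <= max(rep_col, member_col):
            continue
        mapping[row[member_col]] = row[rep_col]
    return mapping


def resolve(
    accessions: Iterable[str], member_to_rep: Dict[str, str]
) -> Dict[str, Optional[str]]:
    """Map each query accession to its representative, or None if absent."""
    return {acc: member_to_rep.get(acc) for acc in accessions}
```

With a real file, this would be used as `load_member_to_rep(open(path, newline=""))`, where `path` points at the downloaded mapping table. Accessions that resolve to `None` would be the ones still needing a freshly computed MSA.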
