Mapping a Large Set of UniProt proteins to OpenProteinSet Uniclust30 MSAs & ColabFold server usage #556

I’m trying to get MSAs for ~3,500 human proteins (given as UniProt accessions), and I’d like to reuse precomputed MSAs wherever possible. I have the following questions about using the ColabFold server and the OpenProteinSet database:


**A) ColabFold server usage**

1. When I call the ColabFold MSA server from a pipeline, does it retrieve cached alignments (if available) or compute a new MSA for each query?
2. Are there any throughput/capacity limits per user I should respect if I submit thousands of single-chain queries (e.g., for 3.5k proteins)?

**B) OpenProteinSet (OPS) mapping questions**

1. I’ve read the OPS paper and understand that the MSAs were generated with Uniclust30 (downloaded Dec 28, 2021) and that OPS contains ~16M cluster MSAs (one per cluster) plus a filtered 270k subset. On the Uniclust site, the latest release as of December 2021 appears to be 2021_06. Is this the version used for OpenProteinSet?
2. The Uniclust site only provides `uniref_mapping.tsv` and `UniRef30_2021_03.tar.gz` (https://wwwuser.gwdguser.de/~compbiol/uniclust/2021_03/UniRef30_2021_03.tar.gz), and I realize UniRef30 is different from Uniclust30. The `uniref_mapping.tsv` for 2021_06 has ~29.9M rows, which looks like a representative/member listing, but the row count doesn’t match the 16M clusters. Where can I get the list of all protein IDs for the 16M MSAs that were generated?
3. Is there a reference script/snippet to resolve UniProt accession → Uniclust30 cluster ID or Uniclust30 representative protein ID (whichever one is used to name the a3m files provided in the OPS `uniclust30_unfiltered`)? That would be ideal. Thank you very much!
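
To make the accession → representative lookup concrete, here is a minimal Python sketch of what such a resolver might look like. It rests on assumptions that are not confirmed anywhere above: that `uniref_mapping.tsv` is tab-separated with the cluster representative in the first column and the member accession in the second, and that the representative ID is what names the a3m files. The function name `map_accessions_to_representatives` and the `rep_col`/`member_col` parameters are hypothetical; adjust the column indices once the real layout is known.

```python
import csv


def map_accessions_to_representatives(mapping_path, accessions,
                                      rep_col=0, member_col=1):
    """Stream a tab-separated mapping file and resolve each wanted accession.

    ASSUMPTION: each row holds a cluster representative ID (rep_col) and a
    member accession (member_col). The real column layout of
    uniref_mapping.tsv is not confirmed here.
    """
    wanted = set(accessions)
    found = {}
    with open(mapping_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            # Skip blank or short rows instead of raising IndexError.
            if len(row) <= max(rep_col, member_col):
                continue
            member = row[member_col]
            if member in wanted:
                found[member] = row[rep_col]
    # Accessions with no row in the mapping need a fallback (e.g. a fresh MSA).
    missing = sorted(wanted.difference(found))
    return found, missing
```

The file is streamed rather than loaded whole, so memory stays proportional to the ~3,500 query accessions rather than the ~29.9M mapping rows. Accessions returned in `missing` would be the ones to send to the ColabFold server instead.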
