I’m trying to get MSAs for ~3,500 human proteins (given as UniProt accessions), and I’d like to reuse precomputed MSAs wherever possible. I have the following questions about using the ColabFold server and the OpenProteinSet database:
A) ColabFold server usage
- When I call the ColabFold MSA server from a pipeline, does it retrieve cached alignments (if available) or compute a new MSA for each query?
- Are there any throughput/capacity limits per user I should respect if I submit thousands of single-chain queries (e.g., for 3.5k proteins)?
B) OpenProteinSet (OPS) mapping questions
- I’ve read the OPS paper and understand that the MSAs were generated with Uniclust30 (downloaded Dec 28, 2021) and that OPS contains ~16M cluster MSAs (one per cluster) plus a filtered 270k subset. On the Uniclust site, the latest release as of December 2021 appears to be 2021_06. Is this the version that was used for OpenProteinSet?
- The Uniclust site only provides `uniref_mapping.tsv` and `UniRef30_2021_03.tar.gz` (https://wwwuser.gwdguser.de/~compbiol/uniclust/2021_03/UniRef30_2021_03.tar.gz), and I realize UniRef30 is different from Uniclust30. `uniref_mapping.tsv` for 2021_06 has ~29.9M rows, which looks like representatives/members, but the row count doesn’t match 16M clusters. Where can I get the list of all protein IDs for the 16M MSAs that were generated?
- If there’s a reference script/snippet to resolve UniProt accession → Uniclust30 cluster ID or Uniclust30 representative protein ID (whichever one is used to name the a3m files provided in the OPS `uniclust30_unfiltered`), that would be ideal.

Thank you very much!
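The accession → representative lookup requested above could look roughly like the following sketch. It rests on an assumption that is not confirmed by the Uniclust or OPS documentation: that the mapping file is tab-separated with the cluster representative in one column and a member accession in another. The function name `map_accessions_to_representatives` and the `rep_col`/`member_col` parameters are hypothetical and would need adjusting to the real file layout.

```python
import csv
from typing import Dict, Iterable


def map_accessions_to_representatives(
    mapping_path: str,
    accessions: Iterable[str],
    rep_col: int = 0,      # ASSUMPTION: column holding the cluster representative
    member_col: int = 1,   # ASSUMPTION: column holding the member accession
) -> Dict[str, str]:
    """Return {member accession: cluster representative} for the given accessions.

    The file is streamed line by line, so a mapping with tens of millions
    of rows never has to be held in memory at once.
    """
    wanted = set(accessions)
    found: Dict[str, str] = {}
    with open(mapping_path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # Skip blank or short rows that lack one of the expected columns.
            if len(row) <= max(rep_col, member_col):
                continue
            member = row[member_col]
            if member in wanted:
                found[member] = row[rep_col]
    return found
```

Any accession missing from the returned dictionary was not found in the mapping, which would flag proteins that need a freshly computed MSA instead of a precomputed one.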