Description
Describe the bug
The training/validation preprocessing script for templates filters out many of the most relevant templates for some chains. The bug is in check_sequence(), which removes a template if its sequence covers more than 95% of the query; it should first check whether the template sequence is a subsequence of the query. We have found skipped templates to be very common for certain chain types, such as TCR alpha and beta, where the top template alignments may cover >95% of the query positions but with many amino acid substitutions.
The function check_sequence() is called by the pipeline in openfold3/scripts/data_preprocessing/preprocess_template_alignments_train_val_of3.py. Inference has an updated pipeline in openfold3/scripts/data_preprocessing/preprocess_template_alignments_new_of3.py but the training option in that newer pipeline isn't implemented yet. Is there a different script with an accurate preprocessing pipeline for training/validation templates?
To Reproduce
Here is a simple example where a template with 100% coverage but 0% identity is filtered out by check_sequence().
from openfold3.core.data.io.sequence.template import parse_hmmsearch_a3m
from openfold3.core.data.primitives.sequence.template import check_sequence
# create short example template alignment file
with open("/tmp/test.a3m", "w") as f:
    f.write(
        ">query_A/1-20"
        "\n"
        "QQQQQQQQQQQQQQQQQQQQ"
        "\n"
        ">template1_A/1-20 mol:protein"
        "\n"
        "YYYYYYYYYYYYYYYYYYYY"
        "\n",
    )
# parse template alignments
with open(
    "/tmp/test.a3m",
    "r",
) as f:
    a3m_string = f.read()
template_hits = parse_hmmsearch_a3m(a3m_string=a3m_string)
# run relevant portions of create_template_cache_for_query()
query = template_hits[0]
hit = template_hits[1]
hit_pdb_id, hit_chain_id = hit.name.split("_")
# 1. Apply sequence filters: AF3 SI Section 2.4
if check_sequence(query_seq=query.hit_sequence.replace("-", ""), hit=hit):
    print(
        f"Template {hit_pdb_id} sequence does not pass sequence filters. Skipping this template."
    )
Results: Template template1 sequence does not pass sequence filters. Skipping this template.
Expected behavior
From AF3 SI Section 2.4: "We also remove templates that contain the exact query sequence with greater than 95% coverage as well as short templates less than 10 residues or covering less than 10% of the query." I expect this example template1_A to be kept because the aligned sequence is not an exact match to the query, is 20 residues long and covers 100% of the query.
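To make the expected behavior concrete, here is a minimal standalone sketch of the filter as I read the quoted sentence. It is not the current check_sequence() implementation: the function name `should_skip_template`, its parameters, and the choice to measure coverage as ungapped template length divided by query length are all assumptions made for illustration.

```python
def should_skip_template(
    query_seq: str,
    template_seq: str,
    max_exact_coverage: float = 0.95,
    min_template_len: int = 10,
    min_coverage: float = 0.10,
) -> bool:
    """Hypothetical filter; returns True if the template should be dropped.

    Assumes `template_seq` is the ungapped aligned template sequence and
    approximates coverage as len(template_seq) / len(query_seq).
    """
    coverage = len(template_seq) / len(query_seq)

    # Drop short templates and templates covering too little of the query.
    if len(template_seq) < min_template_len or coverage < min_coverage:
        return True

    # Drop near-duplicates only when the template residues match the query
    # verbatim (substring check) AND coverage exceeds the threshold.
    is_exact_match = template_seq in query_seq
    return is_exact_match and coverage > max_exact_coverage


query = "Q" * 20
print(should_skip_template(query, "Y" * 20))  # full coverage, no identity
print(should_skip_template(query, "Q" * 20))  # full coverage, exact match
```

Under these assumptions the all-Y template from the reproduction above is kept, while an exact copy of the query is still removed.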
Please add label: OpenFold Consortium Member. Thank you!