[BUG] Top templates incorrectly filtered out by training preprocessing script #72

@ECalfeeAdaptive

Describe the bug
The training/validation preprocessing script for templates filters out many of the most relevant templates for some chains. The bug is in check_sequence(), which removes a template whenever its sequence covers more than 95% of the query. It should first check whether the template sequence is a subsequence of the query. We have found skipped templates to be very common for certain chain types, such as TCR alpha and beta, where the top template alignments may cover >95% of the query positions but contain many amino acid substitutions.

The function check_sequence() is called by the pipeline in openfold3/scripts/data_preprocessing/preprocess_template_alignments_train_val_of3.py. Inference has an updated pipeline in openfold3/scripts/data_preprocessing/preprocess_template_alignments_new_of3.py but the training option in that newer pipeline isn't implemented yet. Is there a different script with an accurate preprocessing pipeline for training/validation templates?

To Reproduce
Here is a simple example where a template with 100% coverage but 0% identity is filtered out by check_sequence().

from openfold3.core.data.io.sequence.template import parse_hmmsearch_a3m
from openfold3.core.data.primitives.sequence.template import check_sequence

# create short example template alignment file
with open("/tmp/test.a3m", "w") as f:
    f.write(
        ">query_A/1-20"
        "\n"
        "QQQQQQQQQQQQQQQQQQQQ"
        "\n"
        ">template1_A/1-20 mol:protein"
        "\n"
        "YYYYYYYYYYYYYYYYYYYY"
        "\n",
    )

# parse template alignments
with open(
    "/tmp/test.a3m",
    "r",
) as f:
    a3m_string = f.read()
template_hits = parse_hmmsearch_a3m(a3m_string=a3m_string)

# run relevant portions of create_template_cache_for_query()
query = template_hits[0]
hit = template_hits[1]
hit_pdb_id, hit_chain_id = hit.name.split("_")
# 1. Apply sequence filters: AF3 SI Section 2.4
if check_sequence(query_seq=query.hit_sequence.replace("-", ""), hit=hit):
    print(
        f"Template {hit_pdb_id} sequence does not pass sequence filters. Skipping this template."
    )

Results: Template template1 sequence does not pass sequence filters. Skipping this template.

Expected behavior
From AF3 SI Section 2.4: "We also remove templates that contain the exact query sequence with greater than 95% coverage as well as short templates less than 10 residues or covering less than 10% of the query." I expect this example template1_A to be kept because the aligned sequence is not an exact match to the query, is 20 residues long and covers 100% of the query.
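To make the expected behavior concrete, here is a minimal sketch of the filter as I read the quoted sentence. It is not the OpenFold3 implementation: should_skip_template() and aligned_columns() are hypothetical names, the thresholds are taken from the quote, and "contains the exact query sequence" is interpreted as "the template's aligned residues are a contiguous substring of the query", which is an assumption.

```python
def aligned_columns(hit_row: str) -> str:
    """Drop a3m insertions (lowercase letters); keep match columns and gaps."""
    return "".join(c for c in hit_row if not c.islower())


def should_skip_template(
    query_seq: str,
    hit_row: str,
    max_exact_coverage: float = 0.95,  # from AF3 SI Section 2.4
    min_length: int = 10,              # from AF3 SI Section 2.4
    min_coverage: float = 0.10,        # from AF3 SI Section 2.4
) -> bool:
    """Return True if the template should be removed (hypothetical helper)."""
    if not query_seq:
        raise ValueError("query_seq must be non-empty")

    # Residues of the template that appear in the alignment, gaps removed.
    hit_residues = hit_row.replace("-", "").upper()

    # Coverage: fraction of query positions aligned to a template residue.
    n_aligned = sum(c != "-" for c in aligned_columns(hit_row))
    coverage = n_aligned / len(query_seq)

    # Short templates, or templates covering too little of the query.
    if len(hit_residues) < min_length or coverage < min_coverage:
        return True

    # Assumption: "exact" means the template residues occur verbatim in the
    # query. High coverage alone is not enough to remove the template.
    is_exact = hit_residues in query_seq
    return is_exact and coverage > max_exact_coverage


# The template from the reproduction above: 100% coverage, 0% identity.
print(should_skip_template("Q" * 20, "Y" * 20))  # False, template is kept
# An exact copy of the query with 100% coverage is removed.
print(should_skip_template("Q" * 20, "Q" * 20))  # True
```

The key point is the order of the checks: the 95% coverage threshold only applies once the template has been found to match the query exactly.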

Please add label: OpenFold Consortium Member. Thank you!

Metadata

Labels

OpenFold Consortium Member: Use this tag if you are a member of the OpenFold Consortium to receive higher priority
bug: Something isn't working
data preprocessing: Relating to the preprocessing of queries and datasets
