[BUG][Documentation gap] Template alignments skipped in preprocessing. Request details on input .sto requirements.

**Documentation gap**
Could I please get more information on the requirements for the input .sto files for OF3 templates and how these differ from OF2? I have read through the documentation on expected headers for template formats [here](https://openfold-3.readthedocs.io/en/latest/template_how_to.html#template-aligment-file-format). The [detailed template explanation](https://github.com/aqlaboratory/openfold3/blob/main/docs/source/template_explanation.md) part of the OF3 documentation is still missing.

I have been training on OF2 and would like to transition over to training on OF3. I would like to re-use pre-computed alignments from the OF2 precompute alignments pipeline for OF3 training if at all possible but don't understand the current requirements for OF3 inputs.

**Unexpected behavior**
I have tried using unmodified OF2 precomputed alignments for OF3 training. The `hmm_output.sto` template alignment files from OF2 do not have the query sequence as the first entry (only template sequences are included). As a result, these templates are skipped in the OF3 training preprocessing templates pipeline because they don't pass the `match_query_chain_and_sequence()` check [here](https://github.com/aqlaboratory/openfold-3/blob/main/openfold3/core/data/pipelines/preprocessing/template.py#L406). The end result is that OF3 uses no templates at all.
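To check a given `.sto` file for this condition outside of OF3, a minimal sketch along the following lines may help. It does not reproduce `match_query_chain_and_sequence()`. The function names are hypothetical, and it assumes that "query as first entry" means the first sequence in the file, with gap characters removed, equals the query sequence.

```python
def read_stockholm_sequences(text):
    """Return {name: aligned_sequence} in file order, joining interleaved blocks."""
    seqs = {}  # dicts preserve insertion order in Python 3.7+
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and markup (# STOCKHOLM, #=GF, #=GS, ...)
        if line == "//":
            break  # end of the first alignment
        parts = line.split()
        if len(parts) < 2:
            continue  # a name with no sequence data
        name, aligned = parts[0], "".join(parts[1:])
        seqs[name] = seqs.get(name, "") + aligned
    return seqs


def first_entry_matches_query(sto_text, query_seq):
    """Assumed check: the ungapped first sequence equals the query sequence."""
    seqs = read_stockholm_sequences(sto_text)
    if not seqs:
        return False
    first = next(iter(seqs.values()))
    ungapped = first.replace("-", "").replace(".", "").upper()
    return ungapped == query_seq.upper()
```

Calling `first_entry_matches_query(open(path).read(), query_seq)` on each `hmm_output.sto` with the corresponding chain sequence shows which files lack the query as their first entry.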

**To Reproduce**
- Run `scripts/data_preprocessing/preprocess_template_alignments_train_val_of3.py` using `alignments/chain_id/hmm_output.sto` files precomputed by the [OF2 hmmsearch pipeline](https://github.com/aqlaboratory/openfold/blob/main/scripts/precompute_alignments.py#L119). Alignments are all for protein chains.
```
python3 openfold3/scripts/data_preprocessing/preprocess_template_alignments_train_val_of3.py \
    --template_alignment_directory alignments \
    --template_alignment_filename hmm_output.sto \
    --template_structures_directory databases/templates/mmcif_files \
    --template_file_format cif \
    --query_structures_directory input_fasta_files \
    --query_file_format fasta \
    --query_seq_load_logic fasta \
    --dataset_cache_file training_cache_without_templates.json \
    --updated_dataset_cache_file training_cache_with_templates.json \
    --template_cache_directory template_cache \
    --max_templates_construct 10000 \
    --max_templates_filter 20 \
    --is_core_train True \
    --num_workers 48
```

**Expected behavior**
I expected the aligned templates to be kept for all chains.

**Configuration**
 - A100 GPU node (96 CPUs)
 - Installation from repo

**Additional context**
If possible, please also confirm whether the OF2 `.sto` alignments (precomputed with the same script linked above) are valid as MSA inputs for OF3, since it is difficult to check the validity of the MSA features directly.

Label: `OpenFold Consortium Member`
