[BUG][Documentation gap] Template alignments skipped in preprocessing. Request details on input .sto requirements. #42

@ECalfeeAdaptive

Description

Documentation gap
Could I please get more information on the requirements for the input .sto files used for OF3 templates, and on how they differ from OF2? I have read the documentation on expected headers for template formats here, but the detailed template explanation section of the OF3 documentation is still missing.

I have been training on OF2 and would like to transition to training on OF3. If at all possible, I would like to reuse the alignments precomputed by the OF2 precompute alignments pipeline for OF3 training, but I don't understand the current requirements for OF3 inputs.

Unexpected behavior
I have tried using unmodified OF2 precomputed alignments for OF3 training. The hmm_output.sto template alignment files from OF2 do not have the query sequence as the first entry (only template sequences are included). These templates are therefore skipped by the OF3 training template preprocessing pipeline because they fail the match_query_chain_and_sequence() check here. The net result is that OF3 uses no templates.
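To make the symptom concrete, here is a minimal sketch of a standalone pre-check that reports whether the first sequence row of a Stockholm file equals the query sequence once gaps are removed. It is not the logic of match_query_chain_and_sequence(), whose exact criteria are what this issue asks about; the helper names `read_stockholm_sequences` and `first_entry_matches_query` are hypothetical and exist only for illustration.

```python
def read_stockholm_sequences(text):
    """Return (name, aligned_sequence) pairs in order of first appearance.

    Assumption: rows are "<name> <aligned chunk>" lines; markup lines
    starting with "#" and the "//" terminator are ignored. Interleaved
    blocks for the same name are concatenated.
    """
    seqs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or line == "//":
            continue
        parts = line.split()
        if len(parts) != 2:
            continue
        name, chunk = parts
        seqs[name] = seqs.get(name, "") + chunk
    return list(seqs.items())


def first_entry_matches_query(sto_text, query_seq):
    """True if the first row, with '-' and '.' gaps removed, equals the query."""
    entries = read_stockholm_sequences(sto_text)
    if not entries:
        return False
    _, aligned = entries[0]
    ungapped = aligned.replace("-", "").replace(".", "").upper()
    return ungapped == query_seq.upper()
```

Running this over each alignments/chain_id/hmm_output.sto file together with the chain's query sequence would show which chains lack the query as the first row, which is the case for the unmodified OF2 hmmsearch output described above.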

To Reproduce

  • Run scripts/data_preprocessing/preprocess_template_alignments_train_val_of3.py using alignments/chain_id/hmm_output.sto files precomputed by the OF2 hmmsearch pipeline. All alignments are for protein chains.
        python3 openfold3/scripts/data_preprocessing/preprocess_template_alignments_train_val_of3.py \
        --template_alignment_directory alignments \
        --template_alignment_filename hmm_output.sto \
        --template_structures_directory databases/templates/mmcif_files \
        --template_file_format cif \
        --query_structures_directory input_fasta_files \
        --query_file_format fasta \
        --query_seq_load_logic fasta \
        --dataset_cache_file training_cache_without_templates.json \
        --updated_dataset_cache_file training_cache_with_templates.json \
        --template_cache_directory template_cache \
        --max_templates_construct 10000 \
        --max_templates_filter 20 \
        --is_core_train True \
        --num_workers 48

Expected behavior
I expected the aligned templates to be kept for all chains.

Configuration

  • A100 GPU node (96 CPUs)
  • Installation from repo

Additional context
If possible, please also confirm whether the OF2 alignment .sto format (precomputed with the same script linked above) is acceptable for MSA inputs, since it is difficult to check the validity of the resulting MSA features directly.
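For context, the only validation that is easy to do on the user side is a basic well-formedness check of the .sto file itself. The sketch below verifies that every sequence row has the same aligned length. It says nothing about whether OF3 accepts the file, and the helper names `stockholm_row_lengths` and `is_rectangular` are hypothetical.

```python
def stockholm_row_lengths(text):
    """Map each sequence name to its total aligned length.

    Assumption: rows are "<name> <aligned chunk>" lines, and repeated
    names belong to interleaved blocks of the same sequence.
    """
    lengths = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or line == "//":
            continue
        parts = line.split()
        if len(parts) != 2:
            continue
        name, chunk = parts
        lengths[name] = lengths.get(name, 0) + len(chunk)
    return lengths


def is_rectangular(text):
    """True if the file has at least one row and all rows share one length."""
    lengths = stockholm_row_lengths(text)
    return len(lengths) > 0 and len(set(lengths.values())) == 1
```

A file that passes this check is only known to be a rectangular alignment; the feature-level requirements still need confirmation from the maintainers.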

Label: OpenFold Consortium Member

Labels

  • OpenFold Consortium Member: Use this tag if you are a member of the OpenFold Consortium to receive higher priority
  • bug: Something isn't working
  • data preprocessing: Relating to the preprocessing of queries and datasets
  • documentation: Improvements or additions to documentation
  • training: Relating to the training pipeline
