-
Notifications
You must be signed in to change notification settings - Fork 9
feat!(backend): remove fasta header validation from backend and refactor how fasta and metadata are merged #4821
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
backend/src/main/resources/db/migration/V1.16__rename_aux_table_columns.sql
Outdated
Show resolved
Hide resolved
This comment was marked as outdated.
This comment was marked as outdated.
fc76c68 to
997af9c
Compare
| name="accept any prefix for multi-segment", | ||
| input_metadata={}, | ||
| input_sequence={ | ||
| "prefix_ebola-sudan": sequence_with_mutation("ebola-sudan"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in theory we could assert that the prefix is submissionId
backend/src/test/kotlin/org/loculus/backend/controller/submission/SubmitEndpointTest.kt
Show resolved
Hide resolved
integration-tests/tests/specs/features/search/lineage-field.spec.ts
Outdated
Show resolved
Hide resolved
Is this something that's been discussed? Are we now requiring that all segments align to reference? (We didn't before right?) Mightn't that e.g. mean we reject some INSDC stuff we currently accept? |
sorry this should say we will allow the option to assign segments using nextclade sort - this shouldn't change anything tho as we anyways only accept INSDC data that aligns and assign segments in ingest using either nextclade align or nextclade sort |
This comment was marked as outdated.
This comment was marked as outdated.
1ec6036 to
b256f29
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy with removing the validation but I'm hesitant about the breaking change. Since the backend is anyways already doing the parsing of the FASTA header and has to identify what is the submissionId, whether a segmentName is provided and, if yes, what the segmentName is, I think that we should provide the pipeline with the parsed information. The pipeline already receives the submisisonId and putting it into the sequence dictionary keys is redundant.
If the reason is to avoid writing main, we can change it to use an empty string as key if no segment name is provided.
I also still think that _ is not a good separator and that we should change it (#4734).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
|
||
| @Service | ||
| class ParseFastaHeader(private val backendConfig: BackendConfig) { | ||
| fun parse(submissionId: String, organism: Organism): Pair<SubmissionId, SegmentName> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The SegmentName typealias is now unused, we can delete it. (It's declared in the SubmitModel)
| throw UnprocessableEntityException(unmatchedSeqText + unmatchedMetadataText + ambiguousSequenceText) | ||
| } | ||
|
|
||
| transaction { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need transaction in a @Transactional method? I thought that the annotation already wraps the whole method in a transaction?
| } | ||
|
|
||
| @Transactional | ||
| private fun mapMetadataKeysToSequenceKeys( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class is already quite large. What do you think about moving parts of this method to a separate class/function? I think everything except the transaction might be a good candidate for refactoring (i.e. the for loop and the if statement with the validation).
| val seqKeyInMeta = metadataKeysSet.contains(seqKey) | ||
| when { | ||
| seqKeyInMeta -> seqKey | ||
| else -> null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another refactoring suggestion:
| else -> null | |
| else -> { | |
| unmatchedSequenceKeys.add(seqKey) | |
| continue | |
| } |
and similar for the other else case below. Then we would not need the if (matchedMetadataKey != null) { below.
IMO that straightens the control flow of this loop and makes it easier to understand.
| } | ||
| } | ||
|
|
||
| val metadataKeysWithoutSequences = metadataKeysSet.filterNot { metadataKeyToSequences.containsKey(it) } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this does the same but it's more concise:
| val metadataKeysWithoutSequences = metadataKeysSet.filterNot { metadataKeyToSequences.containsKey(it) } | |
| val metadataKeysWithoutSequences = metadataKeysSet.subtract(metadataKeyToSequences.keys) |
| for ((metadataSubmissionId, sequenceSubmissionIds) in metadataKeyToSequences) { | ||
| for (sequenceSubmissionId in sequenceSubmissionIds) { | ||
| SequenceUploadAuxTable.update( | ||
| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick to improve readability:
| { | |
| where = { |
…or how fasta and metadata are merged
b256f29 to
c46e22b
Compare
| const emptyRows = this.emptyRows(segmentNames); | ||
| const existingDataRows = Object.entries(initialData.originalData.unalignedNucleotideSequences).map( | ||
| ([key, value]) => ({ | ||
| const existingDataRows = Object.entries(initialData.processedData.unalignedNucleotideSequences) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@theosanderson this is the change I discussed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I don't think this change makes sense - what users (should) see on this form is their raw data. Already in Pathoplexus we change the unaligned data - we trim terminal Ns - but users should get their original data as they uploaded it here.
And our model for Loculus imagines that the processed unaligned sequences could be completely different from the raw ones - e.g. you could upload raw reads and no FASTA files and those reads would be assembled to create a FASTA in processed data. This would challenge all that so IMO is probably blocking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@theosanderson - ok! I guess as the unaligned sequences could be processed it is good to return the actual original data.
Then I guess we need to update this part of the form to let the users download their original data with the submissionId.
I am actually wondering - as this does create a breaking change in the structure of the processedData response anyways - maybe we should add a map from the submissionId of the original data to the assigned segment/subtype. For example
{
accession,
version,
errors,
warnings,
data: {
metadata,
unalignedNucleotideSequences,
alignedNucleotideSequences,
nucleotideInsertions,
alignedAminoAcidSequences,
aminoAcidInsertions,
files,
submissionIdSegmentMap,
}
}resolves #4847 ### Screenshot Improves #4821, comes after #5398 You can use pathoplexus/dev_example_data#2 for testing. Nextclade sort will be used to assign segments/subtypes for all aligned sequences: ``` minimizer_index: <url_to_minimizer_index_used_by_nextclade_sort> ``` For organisms without a nextclade dataset we still allow the fasta headers to be used to determine the segment/subtype - entries must have the format <submissionId>_<segmentName> (as in current set up). As preprocessing now assigns segments it will return a map from the segment (or subtype) to the fastaHeader in the processedData: `sequenceNameToFastaHeaderMap`. This allows us to surface this assignment on the edit page. ## Prepro config changes Instead of having a dictionary for the nextclade datasets and servers we make `nucleotideSequences` a list of sequences: ``` nextclade_dataset_name: L: nextstrain/cchfv/linked/L M: nextstrain/cchfv/linked/M S: nextstrain/cchfv/linked/S nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output genes: [RdRp, GPC, NP] ``` ``` nucleotideSequences: - name: L nextclade_dataset_name: nextstrain/cchfv/linked/L nextclade_dataset_tag: <optional - was previously incorrectly placed on an organism level> nextclade_dataset_server: <optional overwrites nextclade_dataset_server for this seq> accepted_sort_matches: <optional, used for classify_with_nextclade_sort and require_nextclade_sort_match, if not given nextclade_dataset_name is used> gene_prefix: <optional, prefix to add to genes produced by nextclade run, e.g. nextclade labels genes as `AV1` but we expect `EV1_AV1`, here `EV1` would be the prefix > - name: M nextclade_dataset_name: nextstrain/cchfv/linked/M - name: S nextclade_dataset_name: nextstrain/cchfv/linked/S nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output ``` Note the templates now also generate the genes list from the merged config. ### PR Checklist - [ ] Update values.schema.json - [x] keep tests for alignment NONE case - [x] Create a minimizer for tests using: https://github.com/loculus-project/nextclade-sort-minimizer-creator - [x] Any manual testing that has been done is documented: submission of EVs from test folder were submitted with the same fastaHeader as the submissionId -> this succeeded, additionally the submission of CCHF with a fastaID column in the metadata was tested (also in folder above), additionally revision of a segment was tested - [x] Have preprocessing send back a segment: fastaHeader mapping ## Future Work - [ ] add integration testing for full EV submission user journey - [ ] improve CCHF minimizer (some segments are again not assigned) - [ ] discuss if the originalData dictionary should be migrated (persistent DB has segmentName as key, now we have fastaHeader as key) - [ ] update PPX docs with new multi-segment submission format 🚀 Preview: https://sort-multi-path.loculus.org

resolves #4708
followed by #4783
The backend no longer asserts that multi-segmented pathogens must follow the
<submissionId>_<segmentName>naming convention in the fasta header.Now a metadata entry
submissionIdwill be matched to the fasta headersubmissionIdand<submissionId>_<suffix>where suffix is an arbitrary string (that cannot contain the separator_).If a fasta entry could be matched to multiple metadata entries we throw an error and suggest that users do not use a
_in their metadatasubmissionIds to avoid confusion.Preprocessing now validates the fasta Header submissionId structure, for now (to not introduce a breaking change) prepro asserts that the fasta header for multi-segmented organisms is of the structure
<submissionId>_<segmentName>- but in a later PR preprocessing will also assign the segment using nextclade sort - ignoring the fasta header structure.BREAKING CHANGE
The structure of the
/extract-unprocessed-dataand the/get-data-to-editendpoint response changes.data.unalignedNucleotideSequencesandoriginalData.unalignedNucleotideSequencesare now dictionaries from the fasta_header (fasta submissionId) to the sequence in the header and no longer the segment_name to the sequence.In order to implement this change a small modification to the frontend revision code was required as
originalData.unalignedNucleotideSequencesnow does not have segment assignment - instead we use theprocessedData.unalignedNucleotideSequencesentry which is contained in the same response.PR Checklist
🚀 Preview: Add
previewlabel to enable