feat!(backend): remove fasta header validation from backend and refactor how fasta and metadata are merged #4821

anna-parker · 2025-08-06T14:36:09Z

resolves #4708

followed by #4783

The backend no longer enforces that FASTA headers for multi-segmented pathogens follow the <submissionId>_<segmentName> naming convention.

Instead, a metadata entry with a given submissionId is matched to FASTA headers of the form submissionId or <submissionId>_<suffix>, where suffix is an arbitrary string that cannot contain the separator _.

If a FASTA entry could be matched to multiple metadata entries, we throw an error and suggest that users avoid _ in their metadata submissionIds to prevent ambiguity.
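The matching rule described above can be sketched roughly as follows. This is an illustrative Python sketch, not the backend's Kotlin implementation; the function name `match_headers_to_metadata` and its return shape are assumptions made for this example.

```python
from collections import defaultdict

SEPARATOR = "_"


def match_headers_to_metadata(metadata_ids, fasta_headers):
    """Group FASTA headers under the metadata submissionId they belong to.

    Hypothetical helper: a header matches a submissionId if it equals it,
    or if it has the form <submissionId>_<suffix> with no "_" in the suffix.
    """
    metadata_ids = set(metadata_ids)
    matched = defaultdict(list)
    unmatched, ambiguous = [], []
    for header in fasta_headers:
        candidates = set()
        if header in metadata_ids:
            candidates.add(header)  # header equals a submissionId
        # The suffix cannot contain the separator, so split at the last "_".
        prefix, sep, _suffix = header.rpartition(SEPARATOR)
        if sep and prefix in metadata_ids:
            candidates.add(prefix)  # header is <submissionId>_<suffix>
        if not candidates:
            unmatched.append(header)
            continue
        if len(candidates) > 1:
            ambiguous.append(header)  # would be reported as an error
            continue
        matched[candidates.pop()].append(header)
    without_sequences = sorted(metadata_ids - set(matched))
    return dict(matched), unmatched, ambiguous, without_sequences
```

A header such as `a_b` is ambiguous when both `a` and `a_b` exist as metadata submissionIds, which is the case the error message warns about.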

Preprocessing now validates the structure of the FASTA header submissionId. For now, to avoid introducing a breaking change for submitters, prepro asserts that the FASTA header for multi-segmented organisms has the structure <submissionId>_<segmentName>, but in a later PR preprocessing will also assign the segment using nextclade sort - ignoring the fasta header structure.
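As a rough illustration of the header check that moves to preprocessing, a minimal sketch is shown below. The function name `segment_from_header` is hypothetical and this is not the code in prepro.py; in line with the "accept any prefix for multi-segment" test in this PR, the sketch only checks the segment part and does not require the prefix to equal the submissionId.

```python
def segment_from_header(header: str, segment_names: set[str]) -> str:
    """Return the segment name encoded in a <prefix>_<segmentName> header.

    Hypothetical helper: raises ValueError if the header has no separator
    or if the part after the last "_" is not a configured segment name.
    """
    prefix, sep, segment = header.rpartition("_")
    if not sep or not prefix:
        raise ValueError(f"{header!r} is not of the form <submissionId>_<segmentName>")
    if segment not in segment_names:
        raise ValueError(f"{segment!r} is not one of {sorted(segment_names)}")
    return segment
```

For example, `segment_from_header("prefix_ebola-sudan", {"ebola-sudan", "ebola-zaire"})` yields `"ebola-sudan"`; the second segment name here is made up for the example.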

BREAKING CHANGE

The response structure of the /extract-unprocessed-data and /get-data-to-edit endpoints changes. data.unalignedNucleotideSequences and originalData.unalignedNucleotideSequences are now dictionaries from the fasta_header (the FASTA submissionId) to the sequence under that header, and no longer from the segment_name to the sequence.
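To make the change concrete, here is a small sketch of what a consumer sees before and after. The entries are hypothetical: the metadata values, the submissionId `sample1` and the sequences are made up, and only the `data.unalignedNucleotideSequences` key is taken from the description above.

```python
# Hypothetical response entries; field values are made up for illustration.
old_entry = {
    "data": {
        "metadata": {"country": "Uganda"},
        # Before: keyed by segment name.
        "unalignedNucleotideSequences": {"L": "ACGT", "M": "GGCC"},
    }
}
new_entry = {
    "data": {
        "metadata": {"country": "Uganda"},
        # After: keyed by FASTA header (the FASTA submissionId).
        "unalignedNucleotideSequences": {"sample1_L": "ACGT", "sample1_M": "GGCC"},
    }
}


def sequence_for_segment(entry, segment):
    """Old-style lookup that assumes the keys are segment names."""
    return entry["data"]["unalignedNucleotideSequences"].get(segment)
```

A pipeline that still uses a lookup like `sequence_for_segment(entry, "L")` finds nothing in the new structure, which is why the change is marked as breaking.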

Implementing this change required a small modification to the frontend revision code, because originalData.unalignedNucleotideSequences no longer carries a segment assignment. Instead we use the processedData.unalignedNucleotideSequences entry, which is contained in the same response.

PR Checklist

All necessary documentation has been adapted.
The implemented feature is covered by appropriate, automated tests.
Any manual testing that has been done is documented: checked previews worked and data could be submitted, revised and revoked for ebola and for CCHF

🚀 Preview: Add preview label to enable

backend/src/main/resources/db/migration/V1.16__rename_aux_table_columns.sql

preprocessing/nextclade/src/loculus_preprocessing/prepro.py

backend/src/main/kotlin/org/loculus/backend/model/SubmitModel.kt

anna-parker · 2025-08-11T14:13:08Z

preprocessing/nextclade/tests/test_nextclade_preprocessing.py

+        name="accept any prefix for multi-segment",
+        input_metadata={},
+        input_sequence={
+            "prefix_ebola-sudan": sequence_with_mutation("ebola-sudan"),


in theory we could assert that the prefix is submissionId

backend/src/test/kotlin/org/loculus/backend/controller/submission/SubmitEndpointTest.kt

integration-tests/tests/specs/features/search/lineage-field.spec.ts

theosanderson · 2025-08-12T13:17:07Z

> in a later PR preprocessing will also assign the segment using nextclade sort - ignoring the fasta header structure.

Is this something that's been discussed? Are we now requiring that all segments align to reference? (We didn't before right?) Mightn't that e.g. mean we reject some INSDC stuff we currently accept?

anna-parker · 2025-08-12T13:22:08Z

> Is this something that's been discussed? Are we now requiring that all segments align to reference? (We didn't before right?) Mightn't that e.g. mean we reject some INSDC stuff we currently accept?

sorry this should say we will allow the option to assign segments using nextclade sort - this shouldn't change anything tho as we anyways only accept INSDC data that aligns and assign segments in ingest using either nextclade align or nextclade sort

chaoran-chen · 2025-08-17T19:59:52Z

backend/docs/db/schema.sql

I'm happy with removing the validation but I'm hesitant about the breaking change. Since the backend is already parsing the FASTA header anyway and has to identify the submissionId, whether a segmentName is provided and, if so, what the segmentName is, I think we should provide the pipeline with the parsed information. The pipeline already receives the submissionId, so putting it into the sequence dictionary keys is redundant.

If the reason is to avoid writing main, we can change it to use an empty string as key if no segment name is provided.

I also still think that _ is not a good separator and that we should change it (#4734).
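The alternative proposed in this comment could look roughly like the sketch below: the backend hands the pipeline sequences keyed by the parsed segment name, with an empty string as key when no segment name is given. The function name `key_by_segment` is hypothetical and the code is only an illustration of the idea, not something taken from the PR.

```python
def key_by_segment(submission_id, sequences_by_header):
    """Re-key one submission's sequences by parsed segment name.

    Hypothetical helper: a header equal to the submissionId maps to "",
    a header <submissionId>_<segmentName> maps to segmentName.
    """
    by_segment = {}
    for header, sequence in sequences_by_header.items():
        if header == submission_id:
            segment = ""  # no segment name provided
        elif header.startswith(submission_id + "_"):
            segment = header[len(submission_id) + 1:]
        else:
            raise ValueError(f"{header!r} does not belong to {submission_id!r}")
        if segment in by_segment:
            raise ValueError(f"duplicate segment {segment!r} for {submission_id!r}")
        by_segment[segment] = sequence
    return by_segment
```

With this shape the pipeline would not need to know the submissionId or the separator at all.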

fengelniederhammer

The backend part looks reasonable to me. But it doesn't solve the compression migration problem yet, does it?

I did not check the preprocessing changes.

Is there any user facing documentation that we would need to adapt? In a quick search I just found this:

fengelniederhammer · 2025-08-21T08:06:17Z

backend/src/main/kotlin/org/loculus/backend/utils/ParseFastaHeader.kt

-
-@Service
-class ParseFastaHeader(private val backendConfig: BackendConfig) {
-    fun parse(submissionId: String, organism: Organism): Pair<SubmissionId, SegmentName> {


The SegmentName typealias is now unused, we can delete it. (It's declared in the SubmitModel)

fengelniederhammer · 2025-08-21T08:09:15Z

backend/src/main/kotlin/org/loculus/backend/model/SubmitModel.kt

+            throw UnprocessableEntityException(unmatchedSeqText + unmatchedMetadataText + ambiguousSequenceText)
+        }
+
+        transaction {


Do we really need transaction in a @Transactional method? I thought that the annotation already wraps the whole method in a transaction?

fengelniederhammer · 2025-08-21T08:20:47Z

backend/src/main/kotlin/org/loculus/backend/model/SubmitModel.kt

+    }
+
+    @Transactional
+    private fun mapMetadataKeysToSequenceKeys(


This class is already quite large. What do you think about moving parts of this method to a separate class/function? I think everything except the transaction might be a good candidate for refactoring (i.e. the for loop and the if statement with the validation).

fengelniederhammer · 2025-08-21T08:42:59Z

backend/src/main/kotlin/org/loculus/backend/model/SubmitModel.kt

+                val seqKeyInMeta = metadataKeysSet.contains(seqKey)
+                when {
+                    seqKeyInMeta -> seqKey
+                    else -> null


Another refactoring suggestion:

Suggested change

-                    else -> null
+                    else -> {
+                        unmatchedSequenceKeys.add(seqKey)
+                        continue
+                    }

and similar for the other else case below. Then we would not need the if (matchedMetadataKey != null) { below.
IMO that straightens the control flow of this loop and makes it easier to understand.

fengelniederhammer · 2025-08-21T09:01:58Z

backend/src/main/kotlin/org/loculus/backend/model/SubmitModel.kt

+            }
+        }
+
+        val metadataKeysWithoutSequences = metadataKeysSet.filterNot { metadataKeyToSequences.containsKey(it) }


I think this does the same but it's more concise:

Suggested change

-        val metadataKeysWithoutSequences = metadataKeysSet.filterNot { metadataKeyToSequences.containsKey(it) }
+        val metadataKeysWithoutSequences = metadataKeysSet.subtract(metadataKeyToSequences.keys)

fengelniederhammer · 2025-08-21T09:05:35Z

backend/src/main/kotlin/org/loculus/backend/model/SubmitModel.kt

+            for ((metadataSubmissionId, sequenceSubmissionIds) in metadataKeyToSequences) {
+                for (sequenceSubmissionId in sequenceSubmissionIds) {
+                    SequenceUploadAuxTable.update(
+                        {


Nitpick to improve readability:

Suggested change

-                        {
+                        where = {


anna-parker · 2025-09-29T09:01:53Z

website/src/components/Edit/SequencesForm.tsx

        const emptyRows = this.emptyRows(segmentNames);
-        const existingDataRows = Object.entries(initialData.originalData.unalignedNucleotideSequences).map(
-            ([key, value]) => ({
+        const existingDataRows = Object.entries(initialData.processedData.unalignedNucleotideSequences)


@theosanderson this is the change I discussed

Thanks! I don't think this change makes sense - what users (should) see on this form is their raw data. Already in Pathoplexus we change the unaligned data - we trim terminal Ns - but users should get their original data as they uploaded it here.

And our model for Loculus imagines that the processed unaligned sequences could be completely different from the raw ones - e.g. you could upload raw reads and no FASTA files and those reads would be assembled to create a FASTA in processed data. This would challenge all that so IMO is probably blocking.

@theosanderson - ok! I guess as the unaligned sequences could be processed it is good to return the actual original data.

Then I guess we need to update this part of the form to let the users download their original data with the submissionId.

I am actually wondering - as this does create a breaking change in the structure of the processedData response anyways - maybe we should add a map from the submissionId of the original data to the assigned segment/subtype. For example

{ accession, version, errors, warnings, data: { metadata, unalignedNucleotideSequences, alignedNucleotideSequences, nucleotideInsertions, alignedAminoAcidSequences, aminoAcidInsertions, files, submissionIdSegmentMap, } }
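A minimal sketch of how a client might use such a map is given below, assuming `submissionIdSegmentMap` maps the FASTA submissionId of each original sequence to its assigned segment or subtype, as proposed above. The function name and the output shape are hypothetical.

```python
def label_original_sequences(original_unaligned, submission_id_segment_map):
    """Attach the assigned segment to each originally uploaded sequence.

    Hypothetical helper: `original_unaligned` maps FASTA submissionId to the
    raw sequence, `submission_id_segment_map` maps FASTA submissionId to the
    segment or subtype assigned during processing.
    """
    labelled, unassigned = {}, []
    for fasta_id, sequence in original_unaligned.items():
        segment = submission_id_segment_map.get(fasta_id)
        if segment is None:
            unassigned.append(fasta_id)  # no assignment was returned
        else:
            labelled[segment] = {"fastaId": fasta_id, "sequence": sequence}
    return labelled, unassigned
```

This keeps the raw uploaded sequences on the edit form while still showing which segment each one was assigned to.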

resolves #4847

Improves #4821, comes after #5398

You can use pathoplexus/dev_example_data#2 for testing.

Nextclade sort will be used to assign segments/subtypes for all aligned sequences:

```
minimizer_index: <url_to_minimizer_index_used_by_nextclade_sort>
```

For organisms without a nextclade dataset we still allow the fasta headers to be used to determine the segment/subtype - entries must have the format <submissionId>_<segmentName> (as in current set up).

As preprocessing now assigns segments it will return a map from the segment (or subtype) to the fastaHeader in the processedData: `sequenceNameToFastaHeaderMap`. This allows us to surface this assignment on the edit page.

## Prepro config changes

Instead of having a dictionary for the nextclade datasets and servers we make `nucleotideSequences` a list of sequences:

```
nextclade_dataset_name:
  L: nextstrain/cchfv/linked/L
  M: nextstrain/cchfv/linked/M
  S: nextstrain/cchfv/linked/S
nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output
genes: [RdRp, GPC, NP]
```

```
nucleotideSequences:
  - name: L
    nextclade_dataset_name: nextstrain/cchfv/linked/L
    nextclade_dataset_tag: <optional - was previously incorrectly placed on an organism level>
    nextclade_dataset_server: <optional overwrites nextclade_dataset_server for this seq>
    accepted_sort_matches: <optional, used for classify_with_nextclade_sort and require_nextclade_sort_match, if not given nextclade_dataset_name is used>
    gene_prefix: <optional, prefix to add to genes produced by nextclade run, e.g. nextclade labels genes as `AV1` but we expect `EV1_AV1`, here `EV1` would be the prefix >
  - name: M
    nextclade_dataset_name: nextstrain/cchfv/linked/M
  - name: S
    nextclade_dataset_name: nextstrain/cchfv/linked/S
nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output
```

Note the templates now also generate the genes list from the merged config.

### PR Checklist

- [ ] Update values.schema.json
- [x] keep tests for alignment NONE case
- [x] Create a minimizer for tests using: https://github.com/loculus-project/nextclade-sort-minimizer-creator
- [x] Any manual testing that has been done is documented: submission of EVs from test folder were submitted with the same fastaHeader as the submissionId -> this succeeded, additionally the submission of CCHF with a fastaID column in the metadata was tested (also in folder above), additionally revision of a segment was tested
- [x] Have preprocessing send back a segment: fastaHeader mapping

## Future Work

- [ ] add integration testing for full EV submission user journey
- [ ] improve CCHF minimizer (some segments are again not assigned)
- [ ] discuss if the originalData dictionary should be migrated (persistent DB has segmentName as key, now we have fastaHeader as key)
- [ ] update PPX docs with new multi-segment submission format

🚀 Preview: https://sort-multi-path.loculus.org
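The config change described above (a per-segment dictionary becoming a `nucleotideSequences` list) can be illustrated with a small conversion sketch. This is not part of the PR: the function name `to_nucleotide_sequences` is hypothetical, it assumes the shared `nextclade_dataset_server` stays at the top level, and it leaves out `genes` and the optional per-sequence keys.

```python
def to_nucleotide_sequences(old_config):
    """Convert the old dictionary-style config to the list-style layout.

    Hypothetical helper: only `nextclade_dataset_name` and the shared
    `nextclade_dataset_server` are carried over.
    """
    entries = [
        {"name": segment, "nextclade_dataset_name": dataset}
        for segment, dataset in old_config["nextclade_dataset_name"].items()
    ]
    new_config = {"nucleotideSequences": entries}
    server = old_config.get("nextclade_dataset_server")
    if server is not None:
        # Assumption: the shared server remains an organism-level default.
        new_config["nextclade_dataset_server"] = server
    return new_config
```

Applied to the CCHF example, this produces one list entry each for L, M and S, in the order they appear in the old dictionary.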

anna-parker added the preview and update_db_schema labels Aug 6, 2025

anna-parker commented Aug 6, 2025

backend/src/main/resources/db/migration/V1.16__rename_aux_table_columns.sql (outdated)

anna-parker commented Aug 6, 2025

preprocessing/nextclade/src/loculus_preprocessing/prepro.py

anna-parker commented Aug 6, 2025

preprocessing/nextclade/src/loculus_preprocessing/prepro.py

anna-parker added and removed the preview label Aug 6, 2025


anna-parker commented Aug 6, 2025

backend/src/main/kotlin/org/loculus/backend/model/SubmitModel.kt

anna-parker force-pushed the move_fast_header_validation branch 2 times, most recently from fc76c68 to 997af9c August 11, 2025 12:23

anna-parker commented Aug 11, 2025

backend/src/test/kotlin/org/loculus/backend/controller/submission/SubmitEndpointTest.kt

anna-parker changed the title ~~feat(backend): remove fasta header validation from backend and refactor how fasta and metadata are merged~~ feat!(backend): remove fasta header validation from backend and refactor how fasta and metadata are merged Aug 11, 2025

anna-parker mentioned this pull request Aug 12, 2025

feat!(prepro, config): assign segment with nextclade sort #4783

Closed

3 tasks

anna-parker marked this pull request as ready for review August 12, 2025 10:06

anna-parker commented Aug 12, 2025

integration-tests/tests/specs/features/search/lineage-field.spec.ts (outdated)

anna-parker requested review from corneliusroemer and theosanderson August 12, 2025 11:57


anna-parker force-pushed the move_fast_header_validation branch 2 times, most recently from 1ec6036 to b256f29 August 12, 2025 16:10

chaoran-chen removed the preview label Aug 17, 2025

chaoran-chen reviewed Aug 17, 2025

fengelniederhammer reviewed Aug 21, 2025

anna-parker added 3 commits September 11, 2025 15:34

feat(backend): remove fasta header validation from backend and refactor how fasta and metadata are merged

0aa0149

feat(prepro): move segment validation to prepro

32ab96b

feat(prepro): fix tests

8ab5d14

anna-parker and others added 21 commits September 11, 2025 15:34

fix prepro

9af1826

retry

5cb0014

again

01418ef

feat(prepro,backend): fix merge conflicts

c9c4b67

feat(backend): fix merge conflict

48cbc47

feat(website): try to fix revisions

458ce5f

feat(prepro): clean up more

2a4db34

feat(prepro): add tests

2883291

feat(prepro): increase timeout

fbee233

feat(prepro): add more tests

28c4b46

feat(backend): improve migration

5fd6242

Update schema documentation based on migration changes

af93e19

feat(kotlin): correctly define fields

d182c08

feat(backend): add tests for edge case

4a950cf

double timeout while I investigate

b76d1aa

feat(backend): only change for multi-segmented organisms

477ffa3

for kicks

0c3dff2

feat(prepro): update specification docs

4914069

feat(prepro): try to fix integration tests

e160c4d

feat: revert changes that are not required

278d085

try again

c46e22b

anna-parker force-pushed the move_fast_header_validation branch from b256f29 to c46e22b September 11, 2025 13:35

corneliusroemer mentioned this pull request Sep 22, 2025

refactor(backend): simplify submission by getting rid of aux tables #4915

Closed

3 tasks

anna-parker commented Sep 29, 2025


anna-parker mentioned this pull request Oct 17, 2025

feat!(backend): refactor multi-segment submission #5261

Closed

5 tasks

anna-parker marked this pull request as draft November 7, 2025 09:53

This was referenced Nov 10, 2025

feat!(backend): refactor multi-segment submission (2/n) #5398

Merged

feat(prepro): assign segment/subtype using nextclade sort (3/n) #5402

Merged

anna-parker closed this Dec 8, 2025



6 participants
