Skip to content

Conversation

@anna-parker
Copy link
Contributor

@anna-parker anna-parker commented Aug 6, 2025

resolves #4708

followed by #4783

The backend no longer asserts that multi-segmented pathogens must follow the <submissionId>_<segmentName> naming convention in the fasta header.

Now a metadata entry submissionId will be matched to the fasta header submissionId and <submissionId>_<suffix> where suffix is an arbitrary string (that cannot contain the separator _).

If a fasta entry could be matched to multiple metadata entries we throw an error and suggest that users do not use a _ in their metadata submissionIds to avoid confusion.

Preprocessing now validates the fasta Header submissionId structure, for now (to not introduce a breaking change) prepro asserts that the fasta header for multi-segmented organisms is of the structure <submissionId>_<segmentName> - but in a later PR preprocessing will also assign the segment using nextclade sort - ignoring the fasta header structure.

BREAKING CHANGE

The structure of the /extract-unprocessed-data and the /get-data-to-edit endpoint response changes. data.unalignedNucleotideSequences and originalData.unalignedNucleotideSequences are now dictionaries from the fasta_header (fasta submissionId) to the sequence in the header and no longer the segment_name to the sequence.

In order to implement this change a small modification to the frontend revision code was required as originalData.unalignedNucleotideSequences now does not have segment assignment - instead we use the processedData.unalignedNucleotideSequences entry which is contained in the same response.

PR Checklist

  • All necessary documentation has been adapted.
  • The implemented feature is covered by appropriate, automated tests.
  • Any manual testing that has been done is documented: checked previews worked and data could be submitted, revised and revoked for ebola and for CCHF

🚀 Preview: Add preview label to enable

@anna-parker anna-parker added preview Triggers a deployment to argocd update_db_schema labels Aug 6, 2025
@anna-parker anna-parker added preview Triggers a deployment to argocd and removed preview Triggers a deployment to argocd labels Aug 6, 2025
@anna-parker

This comment was marked as outdated.

@anna-parker anna-parker force-pushed the move_fast_header_validation branch 2 times, most recently from fc76c68 to 997af9c Compare August 11, 2025 12:23
name="accept any prefix for multi-segment",
input_metadata={},
input_sequence={
"prefix_ebola-sudan": sequence_with_mutation("ebola-sudan"),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in theory we could assert that the prefix is submissionId

@anna-parker anna-parker changed the title feat(backend): remove fasta header validation from backend and refactor how fasta and metadata are merged feat!(backend): remove fasta header validation from backend and refactor how fasta and metadata are merged Aug 11, 2025
@anna-parker anna-parker marked this pull request as ready for review August 12, 2025 10:06
@theosanderson
Copy link
Member

in a later PR preprocessing will also assign the segment using nextclade sort - ignoring the fasta header structure.

Is this something that's been discussed? Are we now requiring that all segments align to reference? (We didn't before right?) Mightn't that e.g. mean we reject some INSDC stuff we currently accept?

@anna-parker
Copy link
Contributor Author

Is this something that's been discussed? Are we now requiring that all segments align to reference? (We didn't before right?) Mightn't that e.g. mean we reject some INSDC stuff we currently accept?

sorry this should say we will allow the option to assign segments using nextclade sort - this shouldn't change anything tho as we anyways only accept INSDC data that aligns and assign segments in ingest using either nextclade align or nextclade sort

@anna-parker

This comment was marked as outdated.

@anna-parker anna-parker force-pushed the move_fast_header_validation branch 2 times, most recently from 1ec6036 to b256f29 Compare August 12, 2025 16:10
@chaoran-chen chaoran-chen removed the preview Triggers a deployment to argocd label Aug 17, 2025
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy with removing the validation but I'm hesitant about the breaking change. Since the backend is anyways already doing the parsing of the FASTA header and has to identify what is the submissionId, whether a segmentName is provided and, if yes, what the segmentName is, I think that we should provide the pipeline with the parsed information. The pipeline already receives the submisisonId and putting it into the sequence dictionary keys is redundant.

If the reason is to avoid writing main, we can change it to use an empty string as key if no segment name is provided.

I also still think that _ is not a good separator and that we should change it (#4734).

Copy link
Contributor

@fengelniederhammer fengelniederhammer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The backend part looks reasonable to me. But it doesn't solve the compression migration problem yet, does it?

I did not check the preprocessing changes.

Is there any user facing documentation that we would need to adapt? In a quick search I just found this:
image


@Service
class ParseFastaHeader(private val backendConfig: BackendConfig) {
fun parse(submissionId: String, organism: Organism): Pair<SubmissionId, SegmentName> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SegmentName typealias is now unused, we can delete it. (It's declared in the SubmitModel)

throw UnprocessableEntityException(unmatchedSeqText + unmatchedMetadataText + ambiguousSequenceText)
}

transaction {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need transaction in a @Transactional method? I thought that the annotation already wraps the whole method in a transaction?

}

@Transactional
private fun mapMetadataKeysToSequenceKeys(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class is already quite large. What do you think about moving parts of this method to a separate class/function? I think everything except the transaction might be a good candidate for refactoring (i.e. the for loop and the if statement with the validation).

val seqKeyInMeta = metadataKeysSet.contains(seqKey)
when {
seqKeyInMeta -> seqKey
else -> null
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another refactoring suggestion:

Suggested change
else -> null
else -> {
unmatchedSequenceKeys.add(seqKey)
continue
}

and similar for the other else case below. Then we would not need the if (matchedMetadataKey != null) { below.
IMO that straightens the control flow of this loop and makes it easier to understand.

}
}

val metadataKeysWithoutSequences = metadataKeysSet.filterNot { metadataKeyToSequences.containsKey(it) }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this does the same but it's more concise:

Suggested change
val metadataKeysWithoutSequences = metadataKeysSet.filterNot { metadataKeyToSequences.containsKey(it) }
val metadataKeysWithoutSequences = metadataKeysSet.subtract(metadataKeyToSequences.keys)

for ((metadataSubmissionId, sequenceSubmissionIds) in metadataKeyToSequences) {
for (sequenceSubmissionId in sequenceSubmissionIds) {
SequenceUploadAuxTable.update(
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick to improve readability:

Suggested change
{
where = {

const emptyRows = this.emptyRows(segmentNames);
const existingDataRows = Object.entries(initialData.originalData.unalignedNucleotideSequences).map(
([key, value]) => ({
const existingDataRows = Object.entries(initialData.processedData.unalignedNucleotideSequences)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@theosanderson this is the change I discussed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I don't think this change makes sense - what users (should) see on this form is their raw data. Already in Pathoplexus we change the unaligned data - we trim terminal Ns - but users should get their original data as they uploaded it here.

And our model for Loculus imagines that the processed unaligned sequences could be completely different from the raw ones - e.g. you could upload raw reads and no FASTA files and those reads would be assembled to create a FASTA in processed data. This would challenge all that so IMO is probably blocking.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@theosanderson - ok! I guess as the unaligned sequences could be processed it is good to return the actual original data.

Then I guess we need to update this part of the form to let the users download their original data with the submissionId.

I am actually wondering - as this does create a breaking change in the structure of the processedData response anyways - maybe we should add a map from the submissionId of the original data to the assigned segment/subtype. For example

{
    accession,
    version,
    errors,
    warnings,
    data: {
        metadata,
        unalignedNucleotideSequences,
        alignedNucleotideSequences,
        nucleotideInsertions,
        alignedAminoAcidSequences,
        aminoAcidInsertions,
        files,
        submissionIdSegmentMap,
    }
}

@anna-parker anna-parker marked this pull request as draft November 7, 2025 09:53
anna-parker added a commit that referenced this pull request Nov 20, 2025
resolves #4847

### Screenshot

Improves #4821, comes
after #5398

You can use pathoplexus/dev_example_data#2 for
testing.

Nextclade sort will be used to assign segments/subtypes for all aligned
sequences:
```
minimizer_index: <url_to_minimizer_index_used_by_nextclade_sort>
```
For organisms without a nextclade dataset we still allow the fasta
headers to be used to determine the segment/subtype - entries must have
the format <submissionId>_<segmentName> (as in current set up).

As preprocessing now assigns segments it will return a map from the
segment (or subtype) to the fastaHeader in the processedData:
`sequenceNameToFastaHeaderMap`. This allows us to surface this
assignment on the edit page.

## Prepro config changes

Instead of having a dictionary for the nextclade datasets and servers we
make `nucleotideSequences` a list of sequences:
```
nextclade_dataset_name: 
    L: nextstrain/cchfv/linked/L
    M: nextstrain/cchfv/linked/M
    S: nextstrain/cchfv/linked/S
nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output
genes: [RdRp, GPC, NP]
```

```
nucleotideSequences:
  - name: L
    nextclade_dataset_name: nextstrain/cchfv/linked/L
    nextclade_dataset_tag: <optional - was previously incorrectly placed on an organism level> 
    nextclade_dataset_server: <optional overwrites nextclade_dataset_server for this seq>
    accepted_sort_matches: <optional, used for classify_with_nextclade_sort and require_nextclade_sort_match, if not given nextclade_dataset_name is used> 
    gene_prefix: <optional, prefix to add to genes produced by nextclade run, e.g. nextclade labels genes as `AV1` but we expect `EV1_AV1`, here `EV1` would be the prefix >
  - name: M
    nextclade_dataset_name: nextstrain/cchfv/linked/M
  - name: S
    nextclade_dataset_name: nextstrain/cchfv/linked/S
nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output
```

Note the templates now also generate the genes list from the merged
config.

### PR Checklist
- [ ] Update values.schema.json
- [x] keep tests for alignment NONE case
- [x] Create a minimizer for tests using:
https://github.com/loculus-project/nextclade-sort-minimizer-creator
- [x] Any manual testing that has been done is documented: submission of
EVs from test folder were submitted with the same fastaHeader as the
submissionId -> this succeeded, additionally the submission of CCHF with
a fastaID column in the metadata was tested (also in folder above),
additionally revision of a segment was tested
- [x] Have preprocessing send back a segment: fastaHeader mapping

## Future Work
- [ ] add integration testing for full EV submission user journey
- [ ] improve CCHF minimizer (some segments are again not assigned)
- [ ] discuss if the originalData dictionary should be migrated
(persistent DB has segmentName as key, now we have fastaHeader as key)
- [ ] update PPX docs with new multi-segment submission format

🚀 Preview: https://sort-multi-path.loculus.org
@anna-parker anna-parker closed this Dec 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Multi pathogen support: backend submission

6 participants