@anna-parker anna-parker commented Jan 5, 2026

resolves #5663 and #5664

This adds an example of a multi-reference, multi-segment CCHF organism (only the segments S and M have multiple references).

Architecture

LAPIS has no dedicated way to define multiple references for a segment, so each reference of a segment is stored as its own segment in LAPIS. We use a namespaced naming convention to ensure there are no key clashes:

LAPIS Nucleotide sequence naming

    match (multi_reference, multi_segment):
        case (False, _):
            return segmentName
        case (True, True):
            return f"{segmentName}-{referenceName}"
        case (True, False):
            return referenceName

LAPIS Gene sequence naming

    match (multi_reference, multi_segment):
        case (False, _):
            return GeneName
        case (True, _):
            return f"{GeneName}-{referenceName}"

Config Changes

referenceGenomes:
      singleReference:
        nucleotideSequences:
          - name: "main"
            sequence: "[[URL:https://corneliusroemer.github.io/seqs/artefacts/ebola-sudan/reference.fasta]]"
            insdcAccessionFull: NC_002549.1
        genes:
          - name: NP
            sequence: "[[URL:https://corneliusroemer.github.io/seqs/artefacts/ebola-sudan/NP.fasta]]"
          ....

becomes:

referenceGenomes:
      - name: main
        references:
          - name: singleReference
            sequence: "[[URL:https://corneliusroemer.github.io/seqs/artefacts/ebola-sudan/reference.fasta]]"
            insdcAccessionFull: NC_002549.1
            genes:
              - name: NP
                sequence: "[[URL:https://corneliusroemer.github.io/seqs/artefacts/ebola-sudan/NP.fasta]]"
              - name: VP35
                sequence: "[[URL:https://corneliusroemer.github.io/seqs/artefacts/ebola-sudan/VP35.fasta]]"

PR Checklist

  • Add the lapisName vs displayName logic to the config structure at the start so that it doesn't have to be recalculated again and again in the code
  • Get the existing website tests to work
  • Add integration tests for the multi-segment, multi-reference case
  • Later steps: ensure ENA deposition works

🚀 Preview: https://restructure-anya.loculus.org

@anna-parker anna-parker changed the base branch from restructure to main January 5, 2026 17:00
@anna-parker anna-parker changed the base branch from main to prepro_multipath January 9, 2026 19:09
@anna-parker anna-parker added the preview Triggers a deployment to argocd label Jan 9, 2026
@anna-parker anna-parker added preview Triggers a deployment to argocd and removed preview Triggers a deployment to argocd labels Jan 9, 2026
@anna-parker anna-parker changed the title from "Restructure anya" to "feat!(config, website): flip config and website to segment-reference structure" Jan 12, 2026
anna-parker added a commit that referenced this pull request Jan 12, 2026
…ing (#5800)

partially resolves
#5663 and
#5664

## Overview

I decided to split out the prepro changes required for
#5799 into 1 PR.

This change allows multiple `references` per `segment`. Prepro assigns
each sequence to the correct reference within a segment and returns the
aligned (and unaligned) sequences under the key (of type `SequenceName`)
expected by the backend.

Prepro attempts to align each sequence to one reference per segment. If
a sequence can be aligned to multiple references, it chooses the
reference with the highest `nextclade alignment` or `nextclade sort`
score.

If multiple sequences within a submission align to the same segment
(even if they align to different references of that segment), the
submission will error.
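
The assignment rule described above can be sketched as follows. This is an illustration only, not the prepro implementation: `Hit`, `AssignmentError`, and `assign_references` are hypothetical names, and `score` stands in for whichever of the `nextclade alignment` or `nextclade sort` scores is being compared.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    sequence_id: str  # identifier of the submitted sequence
    segment: str      # segment the candidate reference belongs to
    reference: str    # candidate reference within that segment
    score: float      # stand-in for the nextclade alignment or sort score


class AssignmentError(Exception):
    """Raised when two submitted sequences map to the same segment."""


def assign_references(hits: list[Hit]) -> dict[str, Hit]:
    """Return the chosen hit per segment, or raise if a segment is claimed twice."""
    # Step 1: for each submitted sequence keep only its highest-scoring hit.
    best_per_sequence: dict[str, Hit] = {}
    for hit in hits:
        current = best_per_sequence.get(hit.sequence_id)
        if current is None or hit.score > current.score:
            best_per_sequence[hit.sequence_id] = hit

    # Step 2: each segment may be claimed by at most one submitted sequence,
    # regardless of which reference of that segment was matched.
    assigned: dict[str, Hit] = {}
    for hit in best_per_sequence.values():
        if hit.segment in assigned:
            raise AssignmentError(
                f"{hit.sequence_id} and {assigned[hit.segment].sequence_id} "
                f"both align to segment {hit.segment}"
            )
        assigned[hit.segment] = hit
    return assigned
```

With two hits for one sequence against references `refA` (score 0.9) and `refB` (score 0.7) of segment `S`, the sketch keeps `refA`. A second sequence whose best hit is also in segment `S` raises `AssignmentError`, even if it matched `refB`.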

## Changes
1. The YAML config is changed to reflect the segment-reference
hierarchy (see breaking changes below). This config is used to create a
list of processed `NextcladeSequenceDataset` objects for each reference.
2. Improved typing: the `SequenceName` type (the name of a processed
sequence as expected by the backend) is introduced to distinguish it
from `SegmentName`. For example, if the segment `L` has references `A`
and `B`, then the `SegmentName` is `L` but the `SequenceName` for
reference `A` will be `L_A`.
3. Removal of the `useFirstSegment` config option: `perSegment` metadata
fields will always be assigned to the results of the reference they best
align to.
4. The `ASSIGNED_SEGMENT` field is removed and replaced with
`ASSIGNED_REFERENCE`, which is now a `perSegment` field.
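
To make the `SegmentName` versus `SequenceName` distinction concrete, here is a minimal sketch that flattens a segment-reference config into one entry per reference. `ReferenceDataset` is a hypothetical stand-in and not the real `NextcladeSequenceDataset`. The `L_A` pattern comes from the example above; how segments with a single reference are named is not stated, so applying the same pattern to every reference is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import NewType

SegmentName = NewType("SegmentName", str)
SequenceName = NewType("SequenceName", str)


@dataclass(frozen=True)
class ReferenceDataset:
    """Hypothetical stand-in for one processed dataset entry."""

    segment: SegmentName
    reference: str
    name: SequenceName


def sequence_name(segment: SegmentName, reference: str) -> SequenceName:
    # Follows the `L_A` example; assumed here to apply to every reference.
    return SequenceName(f"{segment}_{reference}")


def datasets_from_config(segments: list[dict]) -> list[ReferenceDataset]:
    """Create one entry per (segment, reference) pair, in config order."""
    datasets = []
    for segment in segments:
        segment_name = SegmentName(segment["name"])
        for reference in segment["references"]:
            datasets.append(
                ReferenceDataset(
                    segment=segment_name,
                    reference=reference["reference"],
                    name=sequence_name(segment_name, reference["reference"]),
                )
            )
    return datasets
```

For a segment `L` with references `A` and `B`, this produces two entries that share the `SegmentName` `L` and have the `SequenceName` values `L_A` and `L_B`.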

Note that over 650 lines of this is just to add multi-reference CCHF as
a test organism for prepro.

### Breaking changes
The prepro config must be changed from 
```
configFile: 
   nextclade_sequence_and_datasets:
    - name: CV-A16 # This does not work yet with multi-segment organisms: #5663
      nextclade_dataset_name: enpen/enterovirus/cv-a16
      accepted_sort_matches: ["community/hodcroftlab/enterovirus/cva16", "community/hodcroftlab/enterovirus/enterovirus/linked/CV-A16"]
      gene_prefix: "CV-A16-"
      genes: ["VP4", "VP2", "VP3", "VP1", "2A", "2B", "2C", "3A", "3B", "3C", "3D"]
    - name: CV-A10
      nextclade_dataset_name: enpen/enterovirus/cv-a10
      accepted_sort_matches: ["community/hodcroftlab/enterovirus/enterovirus/linked/CV-A10"]
      gene_prefix: "CV-A10-"
      genes: ["VP4", "VP2", "VP3", "VP1", "2A", "2B", "2C", "3A", "3B", "3C", "3D"]
```
to (note we can add more segments with a variable number of references):
```
configFile:
  segments:
    - name: main
      references:
        - reference: CV-A16
          nextclade_dataset_name: enpen/enterovirus/cv-a16
          accepted_sort_matches: ["community/hodcroftlab/enterovirus/cva16", "community/hodcroftlab/enterovirus/enterovirus/linked/CV-A16"]
          genes: ["VP4", "VP2", "VP3", "VP1", "2A", "2B", "2C", "3A", "3B", "3C", "3D"]
        - reference: CV-A10
          nextclade_dataset_name: enpen/enterovirus/cv-a10
          accepted_sort_matches: ["community/hodcroftlab/enterovirus/enterovirus/linked/CV-A10"]
          genes: ["VP4", "VP2", "VP3", "VP1", "2A", "2B", "2C", "3A", "3B", "3C", "3D"]
    - name: seg2
      genes: ...
      references:
        - ...
```
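
For a single-segment organism, the change above amounts to wrapping the old entries in one segment. The sketch below shows that transformation on the parsed contents of `configFile`. The function name is hypothetical, the default segment name `main` is taken from the example, and dropping `gene_prefix` is an assumption based only on its absence from the new example.

```python
def migrate_prepro_config(old: dict, segment_name: str = "main") -> dict:
    """Wrap old `nextclade_sequence_and_datasets` entries into one segment."""
    references = []
    for entry in old["nextclade_sequence_and_datasets"]:
        # `name` is renamed to `reference`; `gene_prefix` is dropped (assumption).
        reference = {"reference": entry["name"]}
        for key, value in entry.items():
            if key not in ("name", "gene_prefix"):
                reference[key] = value
        references.append(reference)

    # Keep any other top-level options untouched.
    migrated = {k: v for k, v in old.items() if k != "nextclade_sequence_and_datasets"}
    migrated["segments"] = [{"name": segment_name, "references": references}]
    return migrated
```

Multi-segment organisms need one `segments` entry per segment, which this sketch does not attempt to derive.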

## TODO:
- [x] check dummy pipeline -> confirmed this works without issues
- [x] rename segment to sequence when corresponding to name of
`segment/reference`
- [x] change `ASSIGNED_SEGMENT` to `ASSIGNED_REFERENCE` and make this
per segment
- [x] get rid of `useFirstSegment`
- [x] fix the error for when multiple segments of the same type exist in
the multi-reference case
- [x] All necessary documentation has been adapted.
- [x] The implemented feature is covered by appropriate, automated
tests.
- [x] Any manual testing that has been done is documented (i.e. what
exactly was tested?) -> confirmed that all current organisms, including
EVs, are unaffected by this change; the integration tests also all pass

🚀 Preview: https://prepro-multipath.loculus.org
Base automatically changed from prepro_multipath to main January 12, 2026 14:26
);
};

function getDisplayState(
@anna-parker anna-parker Jan 13, 2026

I really dislike that undefined means that the sequence is visible

);
};

export function getSequenceNames(
this should come straight from the config to avoid duplication

* @param referenceName - The selected reference for this segment (e.g., "CV-A16"), or null
* @returns SegmentInfo with appropriate LAPIS naming
*/
export function getSegmentInfoWithReference(segmentName: string, referenceName: string | null): SegmentInfo {
again this should just come from the config at the start and then we won't have to call these functions later on

Labels

preview Triggers a deployment to argocd

Development

Successfully merging this pull request may close these issues.

Multi Path: Multi Segment - issue with ASSIGNED_SEGMENT

3 participants