feat!(config, website): flip config and website to segment-reference structure #5799
base: main
…ing (#5800)

partially resolves #5663 and #5664

## Overview

I decided to split out the prepro changes required for #5799 into one PR. This change allows multiple `references` per `segment`. Prepro assigns sequences to the correct reference within a segment and returns the aligned (and unaligned) sequences with the key (of type `SequenceName`) expected by the backend. Prepro attempts to align each sequence to one reference per segment; if a sequence can be aligned to multiple references, it chooses the reference with the highest `nextclade alignment` or `nextclade sort` score. If multiple sequences within a submission align to the same segment (including when they align to different references of the same segment), the submission will error.

## Changes

1. The YAML config is changed to reflect the segment-reference hierarchy (see breaking changes below). This config is used to create a list of processed `NextcladeSequenceDataset` objects for each reference.
2. Improved typing: the `SequenceName` type (the name of a processed sequence as expected by the backend) is introduced to distinguish it from `SegmentName` objects. For example, if the segment `L` has references `A` and `B`, then the `SegmentName` is `L` but the `SequenceName` will be `L_A`.
3. Removal of the `useFirstSegment` config option: `perSegment` metadata fields will always be assigned to results of the reference they best align to.
4. The `ASSIGNED_SEGMENT` field is removed and replaced with `ASSIGNED_REFERENCE`; this is now a `perSegment` field.

Note that over 650 lines of this is just to add multi-reference CCHF as a test organism for prepro.

### Breaking changes

The prepro config must be changed from

```
configFile:
  nextclade_sequence_and_datasets:
    - name: CV-A16 # This does not work yet with multi-segment organisms: #5663
      nextclade_dataset_name: enpen/enterovirus/cv-a16
      accepted_sort_matches: ["community/hodcroftlab/enterovirus/cva16", "community/hodcroftlab/enterovirus/enterovirus/linked/CV-A16"]
      gene_prefix: "CV-A16-"
      genes: ["VP4", "VP2", "VP3", "VP1", "2A", "2B", "2C", "3A", "3B", "3C", "3D"]
    - name: CV-A10
      nextclade_dataset_name: enpen/enterovirus/cv-a10
      accepted_sort_matches: ["community/hodcroftlab/enterovirus/enterovirus/linked/CV-A10"]
      gene_prefix: "CV-A10-"
      genes: ["VP4", "VP2", "VP3", "VP1", "2A", "2B", "2C", "3A", "3B", "3C", "3D"]
```

to (note we can add more segments with a variable number of references):

```
configFile:
  segments:
    - name: main
      references:
        - reference: CV-A16
          nextclade_dataset_name: enpen/enterovirus/cv-a16
          accepted_sort_matches: ["community/hodcroftlab/enterovirus/cva16", "community/hodcroftlab/enterovirus/enterovirus/linked/CV-A16"]
          genes: ["VP4", "VP2", "VP3", "VP1", "2A", "2B", "2C", "3A", "3B", "3C", "3D"]
        - reference: CV-A10
          nextclade_dataset_name: enpen/enterovirus/cv-a10
          accepted_sort_matches: ["community/hodcroftlab/enterovirus/enterovirus/linked/CV-A10"]
          genes: ["VP4", "VP2", "VP3", "VP1", "2A", "2B", "2C", "3A", "3B", "3C", "3D"]
    - name: seg2
      genes: ...
      references:
        - ...
```

## TODO:

- [x] check dummy pipeline -> confirmed this works without issues
- [x] rename segment to sequence when corresponding to name of `segment/reference`
- [x] change `ASSIGNED_SEGMENT` to `ASSIGNED_REFERENCE` and make this per segment
- [x] get rid of `useFirstSegment`
- [x] fix error when multiple segments of the same type exist in the multi-reference case
- [x] All necessary documentation has been adapted.
- [x] The implemented feature is covered by appropriate, automated tests.
- [x] Any manual testing that has been done is documented (i.e. what exactly was tested?) -> confirmed that all current organisms are unaffected by this change, including EVs; integration tests also all pass

🚀 Preview: https://prepro-multipath.loculus.org
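The assignment rule described in the overview above (keep the best-scoring reference for each sequence, then reject submissions where two sequences land on the same segment) can be sketched roughly as follows. This is a minimal illustration, not the prepro implementation: `AlignmentHit`, `toSequenceName`, and `assignReferences` are hypothetical names, the score is assumed to be "higher is better", and the `segment_reference` key format is taken from the `L_A` example.

```typescript
// Hypothetical types for illustration; not the prepro's actual API.
type SegmentName = string;
type SequenceName = string;

interface AlignmentHit {
    sequenceId: string; // id of a submitted sequence
    segment: SegmentName; // segment the reference belongs to
    reference: string; // reference name within that segment
    score: number; // assumed: higher means a better match
}

// Builds the backend-facing key, following the `L` + `A` -> `L_A` example.
function toSequenceName(segment: SegmentName, reference: string): SequenceName {
    return `${segment}_${reference}`;
}

// Keeps the best-scoring hit per submitted sequence, then requires that
// no two sequences are assigned to the same segment.
function assignReferences(hits: AlignmentHit[]): Record<SequenceName, string> {
    const bestPerSequence: Record<string, AlignmentHit> = {};
    for (const hit of hits) {
        const current = bestPerSequence[hit.sequenceId];
        if (current === undefined || hit.score > current.score) {
            bestPerSequence[hit.sequenceId] = hit;
        }
    }

    const segmentOwner: Record<SegmentName, string> = {};
    const assigned: Record<SequenceName, string> = {};
    for (const sequenceId of Object.keys(bestPerSequence)) {
        const hit = bestPerSequence[sequenceId];
        const owner = segmentOwner[hit.segment];
        if (owner !== undefined) {
            // Also triggers when the two sequences matched different
            // references of the same segment.
            throw new Error(
                `Sequences ${owner} and ${sequenceId} both align to segment ${hit.segment}`,
            );
        }
        segmentOwner[hit.segment] = sequenceId;
        assigned[toSequenceName(hit.segment, hit.reference)] = sequenceId;
    }
    return assigned;
}
```

In this sketch, a sequence that matches both `L`/`A` and `L`/`B` is keyed only under the higher-scoring one, while two different sequences that both match segment `L` cause the whole submission to fail.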
function getDisplayState(
I really dislike that undefined means that the sequence is visible
export function getSequenceNames(
this should come straight from the config to avoid duplication
 * @param referenceName - The selected reference for this segment (e.g., "CV-A16"), or null
 * @returns SegmentInfo with appropriate LAPIS naming
 */
export function getSegmentInfoWithReference(segmentName: string, referenceName: string | null): SegmentInfo {
again this should just come from the config at the start and then we won't have to call these functions later on
resolves #5663 and #5664
This adds an example of a multi-reference multi-segment CCHF organism (only the segments S and M have multiple references)
Architecture
LAPIS has no specific way to define the references of a segment, so each reference of a segment is stored as a unique segment in LAPIS. We use a namespaced naming convention to ensure there are no key clashes:
LAPIS Nucleotide sequence naming
LAPIS Gene sequence naming
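The concrete LAPIS names are not reproduced in this copy of the description, so the following is only an illustrative sketch of how a namespaced convention avoids key clashes when every reference becomes its own LAPIS segment. It assumes the `segment_reference` pattern from the prepro `SequenceName` example (`L_A`); the separator, the config interfaces, and the helper `lapisSegmentNames` are assumptions, not confirmed LAPIS or Loculus names.

```typescript
// Hypothetical config shape mirroring the segment/reference hierarchy.
interface ReferenceConfig {
    reference: string;
}

interface SegmentConfig {
    name: string;
    references: ReferenceConfig[];
}

// Flattens the hierarchy into one LAPIS segment name per reference.
// Assumed convention: `<segment>_<reference>`.
function lapisSegmentNames(segments: SegmentConfig[]): string[] {
    const names: string[] = [];
    for (const segment of segments) {
        for (const ref of segment.references) {
            const name = `${segment.name}_${ref.reference}`;
            if (names.indexOf(name) !== -1) {
                throw new Error(`Duplicate LAPIS segment name: ${name}`);
            }
            names.push(name);
        }
    }
    return names;
}
```

Because the segment name is part of each key, two segments can reuse the same reference name (for example a placeholder `ref1` under both `S` and `M`) without colliding.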
Config Changes
becomes:
PR Checklist
🚀 Preview: https://restructure-anya.loculus.org