-
Notifications
You must be signed in to change notification settings - Fork 9
demo: flip config to segment-then-reference #5701
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
@claude Write up a summary of changes in this PR |
|
Claude finished @theosanderson's task in 1m 55s —— View job Summary of PR ChangesThis PR implements a major architectural restructuring of how reference genomes are organized and managed across the Loculus platform, moving from a "suborganism-first" to a "segment-first" data model. This is a significant refactoring that touches both infrastructure configuration and the website frontend. Core Architectural Change: Segment-First Reference Genome StructurePrevious Model (Suborganism-First)The old structure organized reference genomes by "suborganism" (essentially a reference name), with segments nested underneath: New Model (Segment-First)The new structure organizes by segment first, with multiple references available per segment: This is a more flexible model that better represents biological reality - different segments of multi-segmented organisms can use different reference strains. Key Changes by ComponentKubernetes Configuration (
|
|
Notes: we will need to add a check in prepro (probably also change the config for this) that we cannot have multiple references of the same segment as this is a segment duplication and we dont want to allow that per sample. (At the moment we would only error if there are multiple sequences with the same reference) |
| sequence: "[[URL:https://corneliusroemer.github.io/seqs/artefacts/sars-cov-2/ORF9b.fasta]]" | ||
| - name: "S" | ||
| sequence: "[[URL:https://corneliusroemer.github.io/seqs/artefacts/sars-cov-2/S.fasta]]" | ||
| main: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like this structure!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we discussed changing main, could singleSegment or genome be a better option?
| "S": | ||
| sequence: "[[URL:https://corneliusroemer.github.io/seqs/artefacts/sars-cov-2/S.fasta]]" | ||
| E: | ||
| singleReference: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is duplicated? I think it can be removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Yeah this looks borked
| preprocessing: | ||
| args: | ||
| segment: CV-A16 | ||
| segment: CV-A16-main |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find it a bit confusing to have a segment/reference structure but use reference-segment in LAPIS. I think we should probably use segment-reference and potentially drop the segment if it is main
| {{- $lapisNucleotideSequences = append $lapisNucleotideSequences (dict | ||
| "name" (printf "%s-%s" $suborganismName $sequence.name) | ||
| "sequence" $sequence.sequence | ||
| {{/* Extract all unique reference names from the first segment */}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
huh? I think this is also an error, we need to do this for every segment not just the first one... also I dont think we should add the reference as a prefix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The goals of the current PR were just to support the current code/set up in a flipped form. There are a bunch of other changes (such as this) that would be needed to fully complete the feature for some organisms.
| nucleotide_insertions: dict[SegmentName, list[NucleotideInsertion]] = {} | ||
| amino_acid_insertions: dict[GeneName, list[AminoAcidInsertion]] = {} | ||
|
|
||
| # For minimizer classification, transform segment names by appending "-main" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again we should remove this
| mappingEntries.push([geneName, geneName]); | ||
| } | ||
| } else { | ||
| // Multiple references: use {reference}-{segment} format |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again lets use {segment}-{reference} in the case there are multiple segments and references
| const { nucleotideSegmentNames, geneNames } = referenceGenomeLightweightSchema[SINGLE_REFERENCE]; | ||
| const segments = Object.keys(referenceGenomeLightweightSchema.segments); | ||
|
|
||
| // Check if single reference mode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is actually a bug I believe - different segments can have different numbers of references and this will not be handled correctly here if the first segment has one reference but others have multiple
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above, the code here is not intended to handle different segments with different numbers of references (but yeah, we'd want to extend to that)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no problem! I have started using this as a base for making the flip and I am adding these comments here - if you prefer I can start on a fully fresh branch but I was actually planning to make a PR with my changes off this one - which would you prefer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have strong feelings on best approach. I guess the concern about using this as a base would be that it is pretty rough and ready, but obvs it does help with speed!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good! I might play around with this branch a bit more by adding my suggestions to see if I can get it working a bit better and then potentially split my changes into smaller, more managable PRs: #5799
| @@ -0,0 +1,2 @@ | |||
| // Re-export for backward compatibility | |||
| export { SuborganismSelector as ReferenceNameSelector } from './SuborganismSelector.tsx'; | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can just delete this actually
7020ab0 to
75e0d94
Compare
resolves #
Screenshot
PR Checklist
🚀 Preview: Add
previewlabel to enable