refactor(backend): simplify submission by getting rid of aux tables #4915
Conversation
Hi @corneliusroemer! Would you like to outline your approach? Just curious because I also found the aux tables a bit complicated but didn't see an obvious solution without them.
@corneliusroemer I see that this PR removes streaming and restricts the payload size to ~500MB and I'm concerned this goes in the wrong direction. While the current approach does have issues (e.g. #4852 and the zip bombs that you mentioned), I think they can be addressed without removing the aux tables or reducing scalability. We only recently improved upload scalability to support submissions over 65k sequences (#1613), and I believe continuing to build towards more scalability is important.
I definitely agree that discussion would be the first step towards moving forward on an area like this (though of course a draft PR can be part of the discussion).
```kotlin
if (stillProcessing) {
    return ResponseEntity.status(HttpStatus.LOCKED).build()
}
// No longer works since we've removed the aux tables
```
I think we might still need to redefine checkIfStillProcessingSubmittedData?
Actually I think we can just remove this
backend/src/main/kotlin/org/loculus/backend/controller/SubmissionControllerDescriptions.kt (resolved)
```kotlin
private val submissionIdFilesMappingPreconditionValidator: SubmissionIdFilesMappingPreconditionValidator,
private val dateProvider: DateProvider,
private val backendConfig: BackendConfig,
private val s3Service: S3Service,
```
oh this wasn't actually used
```kotlin
groupManagementPreconditionValidator.validateUserIsAllowedToModifyGroup(
    submissionParams.groupId,
    submissionParams.authenticatedUser,

private fun parseMetadataFile(submissionParams: SubmissionParams): Map<SubmissionId, MetadataEntry> =
```
I guess we also need to check the size here as metadata can also be compressed (similar to parseSequencesWithSizeLimit)?
```kotlin
log.info {
    "Generated ${submissionIdToAccessionMap.size} new accessions for original upload with UploadId $uploadId:"

fun accessionToSequenceInfo(accessions: Collection<Accession>): Map<Accession, SequenceInfo> =
    accessions.chunked(POSTGRESQL_PARAMETER_LIMIT / 2) { chunk ->
```
I'm confused by this function. We take a list of accessions and batch it, so how are we able to pre-calculate versionColumn.max() when, at this point, we haven't defined the table?
```kotlin
val now = dateProvider.getCurrentDateTime()

val newEntries = metadata.values.map { entry ->
    val accession = submissionIdToAccession[entry.submissionId] ?: throw IllegalStateException(
```
this check should already be handled by submissionIdToAccession
```kotlin
submissionControllerClient.getOriginalMetadata()
    .andExpect(status().isOk)
}
// This test no longer works because we don't use the Aux table anymore
```
Let's delete this.
```kotlin
val smallMetadata = "submissionId\tfirstColumn\nc0\tv0\nc1\tv1\nc2\tv2" // Under 100 bytes
val smallMetadataFile = SubmitFiles.metadataFileWith(content = smallMetadata)

submissionControllerClient.submit(
```
we should do the same with metadata
anna-parker left a comment:
I think this is great - it makes the whole submission process a lot easier to understand and read through, and it fixes the issue of the backend crashing if it receives too much data.
anna-parker left a comment:
Oh, I just realized we should really also delete the aux tables in SQL.
Actually, we should keep the aux tables for now and delete them in a separate PR once we are sure we are no longer using them.
I also wanted to add that the aux table has some issues in its current form: #4852 (comment)
Personally I think I'm in favour of keeping the aux tables.
I think we should aim throughout Loculus to be able to have memory requirements which basically have the option to scale at most with the size of a single sequence entry (I acknowledge there are a number of places, most notably the query engine, where we don't have that option atm, but I aspire for us to do so at some point). Achieving that here requires some version of the auxTable. I think there are improvements we could make to it, e.g. "locking" should be per user rather than global.
Yes, if we merged this we could still revert it in the future, or we could change various other things in the future to improve scalability. But it adds more friction to achieve various things we might hope to do in future, where first we would have to undo this. For example supporting megabase sequences.
I haven't seen the auxTables cause a lot of problems in prod - but acknowledge that the issue Anya saw seems important and worth looking into. The aux tables don't feel fundamentally incompatible with the changes we want to make to change how segments are submitted.
I agree they add complexity to the codebase but IMO that is worthwhile given where I hope Loculus to get to in the coming years.
> I think we should aim throughout Loculus to be able to have memory requirements which basically have the option to scale at most with the size of a single sequence entry

That's the case here - memory is proportional to N_of_sequences_in_batch x length_of_sequence? I might be misunderstanding.

> Achieving that here requires some version of the auxTable.

That's not true?

> Yes, if we merged this we could still revert it in the future, or we could change various other things in the future to improve scalability.

Aux tables complicate things unnecessarily - that hampers scalability.

> I haven't seen the auxTables cause a lot of problems in prod

They are unnecessary complexity - they cause problems when developing/changing code.

> I agree they add complexity to the codebase but IMO that is worthwhile given where I hope Loculus to get to in the coming years.

Why do you think we need aux tables? Which problem do they solve? Why are they worth the complexity? I think there are much better alternatives - not all sequences need to be in memory, that's just the simplest way of doing it.
chaoran-chen left a comment:
Thanks for the PR and summary, @corneliusroemer! I'm still quite hesitant about this approach and raised a few questions in separate threads and would be happy to hear your thoughts!
> Uploads of GB-sized data is a bad idea for many reasons, batching is natural. In particular through web submissions, it's not a good idea anyways to upload GB sized files.

I agree that our current web submission flow is not designed for uploading multiple GBs. However, given the strong compression rate, a compressed file of 100 MB can already decompress to several GB. Furthermore, I don't see it as a general reason against supporting large uploads. With the (extra) file sharing, we are already moving towards supporting large uploads and, although not a priority for now, we could eventually support multi-part uploads also for the normal sequence upload.
We should base the design on current and foreseeable use cases - no Pathoplexus/Loculus submitters upload GBs of uncompressed sequences. Our ingest already batches.

> With the (extra) file sharing, we are already moving towards supporting large uploads and, although not a priority for now, we could eventually support multi-part uploads also for the normal sequence upload.

Why should we support something that's not needed? That's just unnecessary complexity.
What are the actual memory requirements? To process a 2 GB sequence file, I assume that the program will require more than just 2 GB and it would be good to know how much is needed (both for us to evaluate the PR and for maintainers to configure their instance).
In the current implementation, yes, but one can also stream the sequences without aux tables - if you really want to support monolithic 2 GB sequence submissions, there are simple ways to make this PR work without requiring 2 GB of backend memory.
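For illustration only, a streaming approach without aux tables could look roughly like the sketch below: FASTA records are read one at a time and handed off in fixed-size batches, so memory scales with the batch rather than the whole file. The record type, helper names and batch size are made up for this sketch, not code from this PR.

```kotlin
import java.io.BufferedReader

// Illustrative sketch only: stream FASTA records and hand them off in batches,
// so memory scales with the batch size rather than the whole upload.
data class FastaRecord(val submissionId: String, val sequence: String)

fun streamFastaInBatches(reader: BufferedReader, batchSize: Int = 1000, handleBatch: (List<FastaRecord>) -> Unit) {
    val batch = mutableListOf<FastaRecord>()
    var currentId: String? = null
    val currentSequence = StringBuilder()

    fun flushRecord() {
        val id = currentId ?: return
        batch.add(FastaRecord(id, currentSequence.toString()))
        currentSequence.setLength(0)
        if (batch.size >= batchSize) {
            handleBatch(batch.toList()) // e.g. validate and insert this batch, then drop it from memory
            batch.clear()
        }
    }

    reader.forEachLine { line ->
        if (line.startsWith(">")) {
            flushRecord() // previous record is complete
            currentId = line.removePrefix(">").substringBefore(' ').trim()
        } else {
            currentSequence.append(line.trim())
        }
    }
    flushRecord()
    if (batch.isNotEmpty()) handleBatch(batch.toList())
}
```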
While, for the current Pathoplexus organisms, people tend to sequence and submit only a small number of sequences at once, there are many other use cases and situations where uploading larger files would be desired. For example, if someone sets up a new (internal) instance, they might want to import all of their existing data. Also, for bacteria, where the genomes are much larger, a few hundred genomes may already reach GBs.
As Loculus currently cannot serve bacteria due to other limitations (i.e. our query engine is not built to support them), I'm not sure this is a very pressing argument.
A very well-compressed file that is too large when inflated could potentially happen - although, if I'm honest, I find it highly unlikely that a submitter would send us so much data at once; cases like this are more raw-read related, and there we have a method for handling such large files.
So, personally, I'm not sure I really see the importance of keeping this functionality. However, as a compromise, if we change the submission format as discussed in #4734 (comment), I think we can keep the aux table structure relatively simple and make the necessary changes.
Our ingest already submits GBs of uncompressed sequences and it does so fine - simply by batching. If we run into problems we can see how to make changes that support those use cases - that's not a reason for keeping unnecessary complexity now.
If you want to submit large sequences, just submit them as files - that has no impact on this PR.
If we merge this PR, is there a good way forward to support large uploads again, or would merging this mean that we decide against supporting large uploads (at least in the mid term)? We've had several issues (created by ourselves) regarding failing uploads of large files (#1613, #996, #1226) and, to cite you, @corneliusroemer, "it's clearly something we want to address but it's not necessary for pathoplexus to work", so I think it's important to have a plan for how to proceed.
You wrote "reductions of memory usage without reintroducing extra tables are possible but can be done later if considered necessary" – what do you have in mind?
As far as I can see, any really scalable solution would require performing an external-memory join. For that, using the database/aux tables seems like a good approach to me. Alternatively, we could write temporary files to disk, but this seems even more complicated to me.
Yes, you can submit files; you can also submit batches (the natural approach).
Yes, I've changed my mind on the necessity of large uploads - they are not worth the current complexity.
I'd be interested to hear why the current approach is considered overly complex and a hindrance to further development.
Keeping the aux table structure does make the switch to multi-pathogen upload - where segment/subtype assignment is done in the preprocessing pipeline and sequences are just grouped by the backend - more complex.
It is still doable but quite ugly: #4821
I wonder though if we should actually first switch the grouping of sequences in the backend to #4734 (comment) - this will still require aux table changes (and a change in the original-data structure, as the segment will not be known in advance), but I do think this will be slightly less complicated than my current PR proposal.
It's unnecessary complexity - there's a single 100-line SQL command here, and that's not easy to maintain.
It's easy to reduce memory requirements by not streaming the sequences from file - as @chaoran-chen and @theosanderson seem to think it's important to support GB sequence submissions, I can make that change. Are you happy then?
What's actually the status of this PR now? Do we plan to merge it (in one form or another)?
My recollection is that the decision was not to merge this, but to keep the aux tables and adapt them.
Yes - we can close this :-)
Simplify submission in the backend by removing auxiliary upload tables.
Currently, submitted/revised sequences and metadata are first put into auxiliary database tables. Then validation is done, and if everything is successful a complex insert into the sequence_entries table is performed that joins sequences with metadata and adds extra data.
This complex design, involving multiple database reads/writes, multiple tables and multiple SQL statements, makes changes to the submission/revise endpoints complicated. It also has performance implications due to repeated serialization to and from the database and many db operations.
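For readers less familiar with that flow: conceptually it boils down to an INSERT ... SELECT that joins the two staging tables inside the database. A rough sketch of the idea is below; the table and column names are simplified placeholders, not the actual Loculus schema.

```kotlin
import java.sql.Connection

// Rough sketch of the kind of join the aux-table flow performs; table and
// column names are simplified placeholders, not the actual Loculus schema.
fun moveUploadIntoSequenceEntries(connection: Connection, uploadId: String) {
    val insertJoinSql = """
        INSERT INTO sequence_entries
            (accession, version, submission_id, original_metadata, original_sequences, group_id, submitted_at)
        SELECT m.accession, 1, m.submission_id, m.metadata, s.sequences, m.group_id, now()
        FROM metadata_upload_aux m
        JOIN sequence_upload_aux s
            ON s.upload_id = m.upload_id AND s.submission_id = m.submission_id
        WHERE m.upload_id = ?
    """.trimIndent()

    connection.prepareStatement(insertJoinSql).use { statement ->
        statement.setString(1, uploadId)
        statement.executeUpdate() // the join of sequences and metadata happens inside the database
    }

    // The auxiliary rows for this upload are cleaned up afterwards.
    connection.prepareStatement("DELETE FROM metadata_upload_aux WHERE upload_id = ?").use {
        it.setString(1, uploadId)
        it.executeUpdate()
    }
}
```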
Changes to submit/revise are necessary for multi-pathogen support, e.g. this PR #4821 from @anna-parker. While reviewing/discussing with her, I wondered whether we could simplify the change (and future ones) by getting rid of the aux tables.
It turns out it was relatively straightforward to do so. This PR is the result. It focuses on simplicity at the cost of memory usage - reductions of memory usage without reintroducing extra tables are possible but can be done later if considered necessary.
I looked into the history of the submit feature and didn't find very clear arguments for the necessity of aux tables. The main reason seems to have been a concern for memory usage when receiving large submissions. I asked @fengelniederhammer about this and IIRC he didn't disagree.
The main trade-off is potentially increased memory usage, as the new implementation keeps all sequences and metadata in memory. To prevent crashes due to OOM, this PR allows configuration of maximum sizes for a) raw payloads and b) uncompressed payloads (it's possible that someone uploads a 10 kB file that decompresses to multiple GBs).
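As a sketch of what such an uncompressed-size guard can look like (using ZIP as an example; the helper and exception names are illustrative, not the exact code in this PR):

```kotlin
import java.io.ByteArrayOutputStream
import java.io.InputStream
import java.util.zip.ZipInputStream

// Illustrative zip-bomb guard: abort decompression as soon as the uncompressed
// size exceeds a configured limit. Names and the ZIP format are examples only.
class UncompressedSizeLimitExceededException(limitBytes: Long) :
    RuntimeException("Uncompressed upload exceeds the configured limit of $limitBytes bytes")

fun readZipWithSizeLimit(input: InputStream, maxUncompressedBytes: Long): ByteArray {
    val output = ByteArrayOutputStream()
    var totalBytes = 0L
    ZipInputStream(input).use { zip ->
        val buffer = ByteArray(64 * 1024)
        while (zip.nextEntry != null) {
            while (true) {
                val read = zip.read(buffer)
                if (read == -1) break
                totalBytes += read
                if (totalBytes > maxUncompressedBytes) {
                    // Fail fast instead of inflating the whole payload into memory.
                    throw UncompressedSizeLimitExceededException(maxUncompressedBytes)
                }
                output.write(buffer, 0, read)
            }
        }
    }
    return output.toByteArray()
}
```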
Current default values were guesses and can be adjusted:
I consider the downside of limited upload size to be worth it for multiple reasons:
In practice, I don't expect this simplification to have any noticeable impact on users - and it comes with significant maintenance improvements. E.g. this PR becomes much simpler: #4852 (comment)
Screenshot
New zip bomb protection works as well :)

You can test with this test file (use the test organism with the test button, then just add the sequences from large_sequences.fasta.zip).
PR Checklist
🚀 Preview: Add preview label to enable