refactor(backend): simplify submission by getting rid of aux tables #4915
Changes from all commits
6a2c63e
6015ca3
f4960c1
e767ffe
8b2a0dd
03fe9b4
07d782e
54d9ba2
724fb50
c96e9d6
caa6705
Member
I agree that our current web submission flow is not designed for uploading multiple GBs. However, given the strong compression rate, a compressed file of 100 MB can already decompress to several GB. Furthermore, I don't see this as a general argument against supporting large uploads. With the (extra) file sharing, we are already moving towards supporting large uploads and, although not a priority for now, we could eventually support multi-part uploads for the normal sequence upload as well.
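For illustration, the compression concern can be defused by capping decompressed bytes rather than trusting the on-the-wire size. A minimal sketch, assuming a gzip-compressed stream; the helper name and limit are hypothetical and not part of this PR:

```kotlin
import java.io.ByteArrayOutputStream
import java.io.InputStream
import java.util.zip.GZIPInputStream

// Hypothetical guard: decompress a gzip stream but abort once the inflated
// size exceeds a configured limit, so a small "zip bomb" cannot exhaust memory.
fun readWithDecompressionCap(compressed: InputStream, maxDecompressedBytes: Long): ByteArray {
    val buffer = ByteArray(64 * 1024)
    val out = ByteArrayOutputStream()
    var total = 0L
    GZIPInputStream(compressed).use { gzip ->
        while (true) {
            val read = gzip.read(buffer)
            if (read == -1) break
            total += read
            require(total <= maxDecompressedBytes) {
                "Decompressed payload exceeds $maxDecompressedBytes bytes"
            }
            out.write(buffer, 0, read)
        }
    }
    return out.toByteArray()
}
```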
Contributor
Author
We should base the design on current and foreseeable use cases - no Pathoplexus/Loculus submitters upload GBs of uncompressed sequences. Our ingest already batches.
Why should we support something that's not needed? That's just unnecessary complexity.
Member
What are the actual memory requirements? To process a 2 GB sequence file, I assume that the program will require more than just 2 GB, and it would be good to know how much is needed (both for us to evaluate the PR and for maintainers to configure their instance).
Contributor
Author
In the current implementation, yes, but one can also stream the sequences without aux tables - if you really want to support monolithic 2 GB sequence submissions, there are simple ways to make this PR work with that without requiring 2 GB of backend memory.
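As a rough illustration of that streaming idea (not this PR's actual code - the FASTA parsing and the persistence callback are placeholders), memory can be bounded by the batch size rather than the file size:

```kotlin
import java.io.BufferedReader

// Hypothetical streaming submission: parse FASTA records lazily and hand them
// to a persistence callback in fixed-size batches, so memory scales with the
// batch size rather than with the whole upload.
fun submitInBatches(
    reader: BufferedReader,
    batchSize: Int = 1_000,
    persistBatch: (List<Pair<String, String>>) -> Unit, // (header, sequence)
) {
    val batch = mutableListOf<Pair<String, String>>()
    var header: String? = null
    val sequence = StringBuilder()

    fun flushRecord() {
        header?.let { batch.add(it to sequence.toString()) }
        sequence.setLength(0)
        if (batch.size >= batchSize) {
            persistBatch(batch.toList())
            batch.clear()
        }
    }

    reader.forEachLine { line ->
        if (line.startsWith(">")) {
            flushRecord()
            header = line.removePrefix(">").trim()
        } else {
            sequence.append(line.trim())
        }
    }
    flushRecord()
    if (batch.isNotEmpty()) persistBatch(batch.toList())
}
```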
Member
While, for the current Pathoplexus organisms, people tend to sequence and submit only a small number of sequences at once, there are many other use cases and situations where uploading larger files would be desirable. For example, if someone sets up a new (internal) instance, they might want to import all of their existing data. Also, for bacteria, where the genomes are much larger, a few hundred genomes may already reach GBs.
Contributor
As Loculus cannot serve bacteria currently due to other limitations (i.e. our query engine is not built to support them), I'm not sure this is a very pressing argument. Having a very well compressed file that is too large when inflated could happen - although, if I'm honest, I find it highly unlikely that a submitter would send us that much data at once; cases like this are more raw-read related, where we have a method for handling such large files. So, personally, I'm not sure I really see the importance of keeping this functionality. However, as a compromise, if we change the submission format as discussed in #4734 (comment), I think we can keep the aux table structure relatively simple and make the necessary changes.
Contributor
Author
Our ingest already submits GBs of uncompressed sequences and it does so fine - simply by batching. If we run into problems, we can see how to make changes that support those use cases - that's not a reason for keeping unnecessary complexity now. If you want to submit large sequences, just submit them as files - that has no impact on this PR.
Member
If we merge this PR, is there a good way forward to support large uploads again, or would merging this mean that we decide against supporting large uploads (at least in the mid term)? We've had several issues (created by ourselves) regarding failing uploads of large files (#1613, #996, #1226) and, to cite you, @corneliusroemer, "it's clearly something we want to address but it's not necessary for pathoplexus to work", so I think it's important to have a plan for how to proceed. You wrote "reductions of memory usage without reintroducing extra tables are possible but can be done later if considered necessary" - what do you have in mind? As far as I can see, any really scalable solution would require performing an external-memory join. For this, using the database/aux tables seems like a good approach to me. Alternatively, we could write temporary files to disk, but this seems even more complicated to me.
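To make the external-memory-join idea concrete, here is a schematic sketch only - the table and column names are invented and not the actual Loculus schema - of staging both uploads in the database and letting it perform the join, so backend memory stays bounded:

```kotlin
import java.sql.Connection

// Schematic sketch: stage metadata and sequences in temporary tables while
// streaming the upload, then join inside the database instead of in memory.
fun joinViaStagingTables(connection: Connection) {
    connection.createStatement().use { stmt ->
        stmt.execute("CREATE TEMPORARY TABLE metadata_staging (submission_id text PRIMARY KEY, metadata jsonb)")
        stmt.execute("CREATE TEMPORARY TABLE sequence_staging (submission_id text PRIMARY KEY, sequence text)")
    }

    // ... batched INSERTs (or COPY) into the staging tables while streaming the files ...

    connection.createStatement().use { stmt ->
        stmt.execute(
            """
            INSERT INTO sequence_entries (submission_id, metadata, original_sequence)
            SELECT m.submission_id, m.metadata, s.sequence
            FROM metadata_staging m
            JOIN sequence_staging s USING (submission_id)
            """.trimIndent(),
        )
    }
}
```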
Contributor
Author
Yes, you can submit files, and you can also submit batches (the natural approach). And yes, I've changed my mind on the necessity of large uploads - they are not worth the current complexity.
Member
I'd be interested to hear why the current approach is considered overly complex and an obstacle to further development.
Contributor
Keeping the aux table structure does make the switch to multi-pathogen upload - where segment/subtype assignment is done in the preprocessing pipeline and sequences are just grouped by the backend - more complex. It is still doable but quite ugly: #4821. I wonder, though, if we should actually first switch the grouping of sequences in the backend to #4734 (comment) - this will still require aux table changes (and a change in the original-data structure, as the segment will not be known in advance), but I do think this will be slightly less complicated than my current PR proposal.
Contributor
Author
It's unnecessary complexity: there's a single 100-line SQL command here, and that's not easy to maintain.
Member
Hi @corneliusroemer! Would you like to outline your approach? Just curious, because I also found the aux tables a bit complicated but didn't see an obvious solution without them.
Member
@corneliusroemer I see that this PR removes streaming and restricts the payload size to ~500 MB, and I'm concerned this goes in the wrong direction. While the current approach does have issues (e.g. #4852 and the zip bombs that you mentioned), I think they can be addressed without removing the aux tables or reducing scalability. We only recently improved upload scalability to support submissions over 65k sequences (#1613), and I believe continuing to build towards more scalability is important.
Member
I definitely agree that discussion would be the first step towards moving on an area like this (though of course a draft PR can be part of the discussion).
@@ -37,6 +37,7 @@ import org.loculus.backend.api.UnprocessedData
import org.loculus.backend.auth.AuthenticatedUser
import org.loculus.backend.auth.HiddenParam
import org.loculus.backend.config.BackendConfig
import org.loculus.backend.config.FileSizeConfig
import org.loculus.backend.controller.LoculusCustomHeaders.X_TOTAL_RECORDS
import org.loculus.backend.log.REQUEST_ID_MDC_KEY
import org.loculus.backend.log.RequestIdContext
@@ -47,11 +48,13 @@ import org.loculus.backend.model.SubmitModel
import org.loculus.backend.service.submission.SubmissionDatabaseService
import org.loculus.backend.utils.Accession
import org.loculus.backend.utils.IteratorStreamer
import org.loculus.backend.utils.formatBytesHuman
import org.slf4j.MDC
import org.springframework.http.HttpHeaders
import org.springframework.http.HttpStatus
import org.springframework.http.MediaType
import org.springframework.http.ResponseEntity
import org.springframework.util.unit.DataSize
import org.springframework.validation.annotation.Validated
import org.springframework.web.bind.annotation.DeleteMapping
import org.springframework.web.bind.annotation.GetMapping
@@ -65,6 +68,8 @@ import org.springframework.web.bind.annotation.RequestPart
import org.springframework.web.bind.annotation.ResponseStatus
import org.springframework.web.bind.annotation.RestController
import org.springframework.web.multipart.MultipartFile
import org.springframework.web.server.PayloadTooLargeException
import org.springframework.web.server.ResponseStatusException
import org.springframework.web.servlet.mvc.method.annotation.StreamingResponseBody
import java.util.UUID
import io.swagger.v3.oas.annotations.parameters.RequestBody as SwaggerRequestBody
@@ -82,11 +87,13 @@ open class SubmissionController(
    private val iteratorStreamer: IteratorStreamer,
    private val requestIdContext: RequestIdContext,
    private val backendConfig: BackendConfig,
    private val fileSizeConfig: FileSizeConfig,
    private val objectMapper: ObjectMapper,
) {
    @Operation(description = SUBMIT_DESCRIPTION)
    @ApiResponse(responseCode = "200", description = SUBMIT_RESPONSE_DESCRIPTION)
    @ApiResponse(responseCode = "400", description = SUBMIT_ERROR_RESPONSE)
    @ApiResponse(responseCode = "413", description = PAYLOAD_TOO_LARGE_ERROR_RESPONSE)
    @PostMapping("/submit", consumes = ["multipart/form-data"])
    fun submit(
        @PathVariable @Valid organism: Organism,
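The constructor above injects a FileSizeConfig whose definition is not shown in this diff. Purely as an orientation sketch - the property names, prefix, and defaults below are assumptions, not the PR's actual class - it could look roughly like this, using Spring's DataSize:

```kotlin
import org.springframework.boot.context.properties.ConfigurationProperties
import org.springframework.util.unit.DataSize

// Assumed shape of the injected config; only the two limits referenced in this
// diff (maxMetadataFileSize, maxSequenceFileSize) are known, the rest is guessed.
@ConfigurationProperties(prefix = "loculus.submission")
data class FileSizeConfig(
    val maxMetadataFile: DataSize = DataSize.ofMegabytes(100),
    val maxSequenceFile: DataSize = DataSize.ofMegabytes(500),
) {
    val maxMetadataFileSize: Long get() = maxMetadataFile.toBytes()
    val maxSequenceFileSize: Long get() = maxSequenceFile.toBytes()
}
```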
@@ -106,14 +113,16 @@ open class SubmissionController(
        ) @RequestParam restrictedUntil: String?,
        @Parameter(description = FILE_MAPPING_DESCRIPTION) @RequestPart(required = false) fileMapping: String?,
    ): List<SubmissionIdMapping> {
        var innerDataUseTermsType = DataUseTermsType.OPEN
        if (backendConfig.dataUseTerms.enabled) {
            if (dataUseTermsType == null) {
                throw BadRequestException("the 'dataUseTermsType' needs to be provided.")
        metadataFile.requireUnder(fileSizeConfig.maxMetadataFileSize, "metadata")
        sequenceFile?.requireUnder(fileSizeConfig.maxSequenceFileSize, "sequence")
        val innerDataUseTermsType =
            if (backendConfig.dataUseTerms.enabled) {
                dataUseTermsType ?: throw BadRequestException(
                    "the 'dataUseTermsType' needs to be provided.",
                )
            } else {
            innerDataUseTermsType = dataUseTermsType
                DataUseTermsType.OPEN
            }
        }
        val fileMappingParsed = parseFileMapping(fileMapping, organism)

        val params = SubmissionParams.OriginalSubmissionParams(
@@ -130,6 +139,7 @@
    @Operation(description = REVISE_DESCRIPTION)
    @ApiResponse(responseCode = "200", description = REVISE_RESPONSE_DESCRIPTION)
    @ApiResponse(responseCode = "413", description = PAYLOAD_TOO_LARGE_ERROR_RESPONSE)
    @PostMapping("/revise", consumes = ["multipart/form-data"])
    fun revise(
        @PathVariable @Valid organism: Organism,
@@ -138,6 +148,8 @@
        @Parameter(description = SEQUENCE_FILE_DESCRIPTION) @RequestParam sequenceFile: MultipartFile?,
        @Parameter(description = FILE_MAPPING_DESCRIPTION) @RequestPart(required = false) fileMapping: String?,
    ): List<SubmissionIdMapping> {
        metadataFile.requireUnder(fileSizeConfig.maxMetadataFileSize, "metadata")
        sequenceFile?.requireUnder(fileSizeConfig.maxSequenceFileSize, "sequence")
        val fileMappingParsed = parseFileMapping(fileMapping, organism)
        val params = SubmissionParams.RevisionSubmissionParams(
            organism,
@@ -425,10 +437,11 @@ open class SubmissionController(
        @HiddenParam authenticatedUser: AuthenticatedUser,
        @RequestParam compression: CompressionFormat?,
    ): ResponseEntity<StreamingResponseBody> {
        val stillProcessing = submitModel.checkIfStillProcessingSubmittedData()
        if (stillProcessing) {
            return ResponseEntity.status(HttpStatus.LOCKED).build()
        }
        // No longer works since we've removed the aux tables
Contributor
I think we might still need to redefine
Contributor
Actually I think we can just remove this
        // val stillProcessing = submitModel.checkIfStillProcessingSubmittedData()
        // if (stillProcessing) {
        //     return ResponseEntity.status(HttpStatus.LOCKED).build()
        // }

        val headers = HttpHeaders()
        headers.contentType = MediaType.parseMediaType(MediaType.APPLICATION_NDJSON_VALUE)
@@ -544,3 +557,13 @@ open class SubmissionController(
        return fileMappingParsed
    }
}

private fun MultipartFile.requireUnder(limitBytes: Long, name: String) {
    if (size > limitBytes) {
        val maxHuman = formatBytesHuman(limitBytes)
        throw ResponseStatusException(
            HttpStatus.PAYLOAD_TOO_LARGE,
            "$name file is too large. Max $maxHuman.",
        )
    }
}
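As a quick sanity check, the new requireUnder extension above can be exercised with Spring's MockMultipartFile. This snippet is illustrative only: it assumes the extension is visible from the call site, and the exact message wording depends on formatBytesHuman:

```kotlin
import org.springframework.mock.web.MockMultipartFile
import org.springframework.web.server.ResponseStatusException

fun main() {
    // A 2 MiB upload checked against a 1 MiB limit should be rejected with 413.
    val oversized = MockMultipartFile("metadataFile", ByteArray(2 * 1024 * 1024))
    try {
        oversized.requireUnder(limitBytes = 1024L * 1024, name = "metadata")
        error("expected PAYLOAD_TOO_LARGE to be thrown")
    } catch (e: ResponseStatusException) {
        // e.g. 413 PAYLOAD_TOO_LARGE "metadata file is too large. Max 1 MiB."
        println(e.message)
    }
}
```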
Personally I think I'm in favour of keeping the aux tables.
I think we should aim, throughout Loculus, for memory requirements that can scale at most with the size of a single sequence entry (I acknowledge there are a number of places, most notably the query engine, where we don't have that option atm, but I aspire for us to get there at some point). Achieving that here requires some version of the auxTable. I think there are improvements we could make to it, e.g. "locking" should be per user rather than global.
Yes, if we merged this we could still revert it in the future, or we could change various other things in the future to improve scalability. But it adds more friction to achieving various things we might hope to do in future, where first we would have to undo this. For example, supporting megabase sequences.
I haven't seen the auxTables cause a lot of problems in prod - but I acknowledge that the issue Anya saw seems important and worth looking into. The aux tables don't feel fundamentally incompatible with the changes we want to make to how segments are submitted.
I agree they add complexity to the codebase but IMO that is worthwhile given where I hope Loculus to get to in the coming years.
That's the case here - memory is proportional to N_of_sequences_in_batch x length_of_sequence? I might be misunderstanding.
That's not true?
Aux tables complicate things unnecessarily - that hampers scalability
They are unnecessary complexity - they cause problems when developing/changing code
Why do you think we need aux tables? Which problem do they solve? Why are they worth the complexity? I think there are much better alternatives - not all sequences need to be in memory - that's just the simplest way of doing it.