refactor(backend): simplify submission by getting rid of aux tables #4915
Conversation
Hi @corneliusroemer! Would you like to outline your approach? Just curious because I also found the aux tables a bit complicated but didn't see an obvious solution without them.
@corneliusroemer I see that this PR removes streaming and restricts the payload size to ~500MB and I'm concerned this goes in the wrong direction. While the current approach does have issues (e.g. #4852 and the zip bombs that you mentioned), I think they can be addressed without removing the aux tables or reducing scalability. We only recently improved upload scalability to support submissions over 65k sequences (#1613), and I believe continuing to build towards more scalability is important.
I definitely agree that discussion would be the first step towards moving forward on an area like this (though of course a draft PR can be part of the discussion).
```kotlin
if (stillProcessing) {
    return ResponseEntity.status(HttpStatus.LOCKED).build()
}
// No longer works since we've removed the aux tables
```
I think we might still need to redefine checkIfStillProcessingSubmittedData?
Actually I think we can just remove this
backend/src/main/kotlin/org/loculus/backend/controller/SubmissionControllerDescriptions.kt (resolved)
```kotlin
private val submissionIdFilesMappingPreconditionValidator: SubmissionIdFilesMappingPreconditionValidator,
private val dateProvider: DateProvider,
private val backendConfig: BackendConfig,
private val s3Service: S3Service,
```
oh this wasn't actually used
```kotlin
groupManagementPreconditionValidator.validateUserIsAllowedToModifyGroup(
    submissionParams.groupId,
    submissionParams.authenticatedUser,

private fun parseMetadataFile(submissionParams: SubmissionParams): Map<SubmissionId, MetadataEntry> =
```
I guess we also need to check the size here as metadata can also be compressed (similar to parseSequencesWithSizeLimit)?
```kotlin
log.info {
    "Generated ${submissionIdToAccessionMap.size} new accessions for original upload with UploadId $uploadId:"

fun accessionToSequenceInfo(accessions: Collection<Accession>): Map<Accession, SequenceInfo> =
    accessions.chunked(POSTGRESQL_PARAMETER_LIMIT / 2) { chunk ->
```
I'm confused by this function. We take a list of accessions and batch it, so how are we able to pre-calculate versionColumn.max() when, at this point, we haven't defined the table?
```kotlin
val now = dateProvider.getCurrentDateTime()

val newEntries = metadata.values.map { entry ->
    val accession = submissionIdToAccession[entry.submissionId] ?: throw IllegalStateException(
```
this check should already be handled by submissionIdToAccession
```kotlin
submissionControllerClient.getOriginalMetadata()
    .andExpect(status().isOk)
}
// This test no longer works because we don't use the Aux table anymore
```
Let's delete this.
```kotlin
val smallMetadata = "submissionId\tfirstColumn\nc0\tv0\nc1\tv1\nc2\tv2" // Under 100 bytes
val smallMetadataFile = SubmitFiles.metadataFileWith(content = smallMetadata)

submissionControllerClient.submit(
```
we should do the same with metadata
anna-parker left a comment:
I think this is great - it makes the whole submission process a lot easier to understand and read through, and it fixes the issue of the backend crashing if it receives too much data.
anna-parker left a comment:
Oh, I just realized we should really also delete the aux tables in SQL.
Actually, we should keep the aux tables for now and delete them in a separate PR once we are sure we are no longer using them.
I also wanted to add that the aux table has some issues in its current form: #4852 (comment)
Personally I think I'm in favour of keeping the aux tables.
I think we should aim throughout Loculus to be able to have memory requirements which basically have the option to scale at most with the size of a single sequence entry (I acknowledge there are a number of places, most notably the query engine, where we don't have that option atm, but I aspire for us to do so at some point). Achieving that here requires some version of the auxTable. I think there are improvements we could make to it, e.g. "locking" should be per user rather than global.
Yes, if we merged this we could still revert it in the future, or we could change various other things in the future to improve scalability. But it adds more friction to achieve various things we might hope to do in future, where first we would have to undo this. For example supporting megabase sequences.
I haven't seen the auxTables cause a lot of problems in prod - but acknowledge that the issue Anya saw seems important and worth looking into. The aux tables don't feel fundamentally incompatible with the changes we want to make to change how segments are submitted.
I agree they add complexity to the codebase but IMO that is worthwhile given where I hope Loculus to get to in the coming years.
> I think we should aim throughout Loculus to be able to have memory requirements which basically have the option to scale at most with the size of a single sequence entry

That's the case here - memory is proportional to N_of_sequences_in_batch x length_of_sequence? I might be misunderstanding.

> Achieving that here requires some version of the auxTable.

That's not true?

> Yes, if we merged this we could still revert it in the future, or we could change various other things in the future to improve scalability.

Aux tables complicate things unnecessarily - that hampers scalability.

> I haven't seen the auxTables cause a lot of problems in prod

They are unnecessary complexity - they cause problems when developing/changing code.

> I agree they add complexity to the codebase but IMO that is worthwhile given where I hope Loculus to get to in the coming years.

Why do you think we need aux tables? Which problem do they solve? Why are they worth the complexity? I think there are much better alternatives - not all sequences need to be in memory, that's just the simplest way of doing it.
chaoran-chen left a comment:
Thanks for the PR and summary, @corneliusroemer! I'm still quite hesitant about this approach and raised a few questions in separate threads and would be happy to hear your thoughts!
> Uploads of GB-sized data is a bad idea for many reasons, batching is natural. In particular through web submissions, it's not a good idea anyways to upload GB sized files.

I agree that our current web submission flow is not designed for uploading multiple GBs. However, given the strong compression rate, a compressed file of 100 MB can already decompress to several GB. Furthermore, I don't see it as a general reason against supporting large uploads. With the (extra) file sharing, we are already moving towards supporting large uploads and, although not a priority for now, we could eventually support multi-part uploads also for the normal sequence upload.
We should base the design on current and foreseeable use cases - no Pathoplexus/Loculus submitters upload GBs of uncompressed sequences. Our ingest already batches.

> With the (extra) file sharing, we are already moving towards supporting large uploads and, although not a priority for now, we could eventually support multi-part uploads also for the normal sequence upload.

Why should we support something that's not needed? That's just unnecessary complexity.
What are the actual memory requirements? To process a 2 GB sequence file, I assume that the program will require more than just 2 GB and it would be good to know how much is needed (both for us to evaluate the PR and for maintainers to configure their instance).
In the current implementation, yes, but one can also stream the sequences without aux tables - if you really want to support monolithic 2 GB sequence submissions, there are simple ways to make this PR work without requiring 2 GB of backend memory.
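For illustration only, a streaming approach without aux tables could look roughly like the sketch below: FASTA records are read one at a time and handed off in fixed-size batches, so memory scales with the batch rather than the whole file. The record type, helper names and batch size are made up for this sketch, not code from this PR.

```kotlin
import java.io.BufferedReader

// Illustrative sketch only: stream FASTA records and hand them off in batches,
// so memory scales with the batch size rather than the whole upload.
data class FastaRecord(val submissionId: String, val sequence: String)

fun streamFastaInBatches(reader: BufferedReader, batchSize: Int = 1000, handleBatch: (List<FastaRecord>) -> Unit) {
    val batch = mutableListOf<FastaRecord>()
    var currentId: String? = null
    val currentSequence = StringBuilder()

    fun flushRecord() {
        val id = currentId ?: return
        batch.add(FastaRecord(id, currentSequence.toString()))
        currentSequence.setLength(0)
        if (batch.size >= batchSize) {
            handleBatch(batch.toList()) // e.g. validate and insert this batch, then drop it from memory
            batch.clear()
        }
    }

    reader.forEachLine { line ->
        if (line.startsWith(">")) {
            flushRecord() // previous record is complete
            currentId = line.removePrefix(">").substringBefore(' ').trim()
        } else {
            currentSequence.append(line.trim())
        }
    }
    flushRecord()
    if (batch.isNotEmpty()) handleBatch(batch.toList())
}
```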
While, for the current Pathoplexus organisms, people tend to sequence and submit only a small number of sequences at once, there are many other use cases and situations where uploading larger files would be desired. For example, if someone sets up a new (internal) instance, they might want to import all of their existing data. Also, for bacteria, where the genomes are much larger, a few hundred genomes may already reach GBs.
As Loculus currently cannot serve bacteria due to other limitations (i.e. our query engine is not built to support them), I'm not sure this is a very pressing argument.
A very well-compressed file that is too large when inflated could potentially happen - although, if I'm honest, I find it highly unlikely that a submitter would send us so much data at once; cases like this are more raw-read related, and there we have a method for handling such large files.
So, personally, I'm not sure I really see the importance of keeping this functionality. However, as a compromise, if we change the submission format as discussed in #4734 (comment), I think we can keep the aux table structure relatively simple and make the necessary changes.
Our ingest already submits GBs of uncompressed sequences and it does so fine - simply by batching. If we run into problems we can see how to make changes that support those use cases - that's not a reason for keeping unnecessary complexity now.
If you want to submit large sequences, just submit them as files - that has no impact on this PR.
If we merge this PR, is there a good way forward to support large uploads again, or would merging this mean that we decide against supporting large uploads (at least in the mid term)? We've had several issues (created by ourselves) regarding failing uploads of large files (#1613, #996, #1226) and, to cite you, @corneliusroemer, "it's clearly something we want to address but it's not necessary for pathoplexus to work", so I think it's important to have a plan for how to proceed.
You wrote "reductions of memory usage without reintroducing extra tables are possible but can be done later if considered necessary" – what do you have in mind?
As far as I can see, any really scalable solution would require performing an external-memory join. For that, using the database/aux tables seems like a good approach to me. Alternatively, we could write temporary files to disk, but this seems even more complicated to me.
Yes, you can submit files; you can also submit batches (the natural approach).
Yes, I've changed my mind on the necessity of large uploads - they are not worth the current complexity.
I'd be interested to hear why the current approach is considered overly complex and a hindrance to further development.
Keeping the aux table structure does make the switch to multi-pathogen upload - where segment/subtype assignment is done in the preprocessing pipeline and sequences are just grouped by the backend - more complex.
It is still doable but quite ugly: #4821
I wonder though if we should actually first switch the grouping of sequences in the backend to #4734 (comment) - this will still require aux table changes (and a change in the original-data structure, as the segment will not be known in advance), but I do think this will be slightly less complicated than my current PR proposal.
It's unnecessary complexity - there's a single 100-line SQL command here, and that's not easy to maintain.
It's easy to reduce memory requirements by not streaming the sequences from file - as @chaoran-chen and @theosanderson seem to think it's important to support GB sequence submissions, I can make that change. Are you happy then?
What's actually the status of this PR now? Do we plan to merge it (in one form or another)?
My recollection is that the decision was not to merge this, but to keep the aux tables and adapt them.
Yes - we can close this :-)
Simplify submission in the backend by removing auxiliary upload tables.
Currently, submitted/revised sequences and metadata are first put into auxiliary database tables. Then validation is done, and if everything is successful a complex insert into the sequence_entries table is performed that joins sequences with metadata and adds extra data.
This complex design, involving multiple database reads/writes, multiple tables and multiple SQL statements, makes changes to the submission/revise endpoints complicated. It also has performance implications due to repeated serialization to and from the database and many db operations.
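For readers less familiar with that flow: conceptually it boils down to an INSERT ... SELECT that joins the two staging tables inside the database. A rough sketch of the idea is below; the table and column names are simplified placeholders, not the actual Loculus schema.

```kotlin
import java.sql.Connection

// Rough sketch of the kind of join the aux-table flow performs; table and
// column names are simplified placeholders, not the actual Loculus schema.
fun moveUploadIntoSequenceEntries(connection: Connection, uploadId: String) {
    val insertJoinSql = """
        INSERT INTO sequence_entries
            (accession, version, submission_id, original_metadata, original_sequences, group_id, submitted_at)
        SELECT m.accession, 1, m.submission_id, m.metadata, s.sequences, m.group_id, now()
        FROM metadata_upload_aux m
        JOIN sequence_upload_aux s
            ON s.upload_id = m.upload_id AND s.submission_id = m.submission_id
        WHERE m.upload_id = ?
    """.trimIndent()

    connection.prepareStatement(insertJoinSql).use { statement ->
        statement.setString(1, uploadId)
        statement.executeUpdate() // the join of sequences and metadata happens inside the database
    }

    // The auxiliary rows for this upload are cleaned up afterwards.
    connection.prepareStatement("DELETE FROM metadata_upload_aux WHERE upload_id = ?").use {
        it.setString(1, uploadId)
        it.executeUpdate()
    }
}
```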
Changes to submit/revise are necessary for multi-pathogen support, e.g. this PR #4821 from @anna-parker. While reviewing/discussing with her, I wondered whether we could simplify the change (and future ones) by getting rid of the aux tables.
It turns out it was relatively straightforward to do so. This PR is the result. It focuses on simplicity at the cost of memory usage - reductions of memory usage without reintroducing extra tables are possible but can be done later if considered necessary.
I looked into the history of the submit feature and didn't find very clear arguments for the necessity of aux tables. The main reason seems to have been a concern for memory usage when receiving large submissions. I asked @fengelniederhammer about this and IIRC he didn't disagree.
The main trade-off is potentially increased memory usage, as the new implementation keeps all sequences and metadata in memory. To prevent crashes due to OOM, this PR allows configuration of maximum sizes for a) raw payloads and b) uncompressed payloads (it's possible that someone uploads a 10 kB file that decompresses to multiple GBs).
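As a sketch of what such an uncompressed-size guard can look like (using ZIP as an example; the helper and exception names are illustrative, not the exact code in this PR):

```kotlin
import java.io.ByteArrayOutputStream
import java.io.InputStream
import java.util.zip.ZipInputStream

// Illustrative zip-bomb guard: abort decompression as soon as the uncompressed
// size exceeds a configured limit. Names and the ZIP format are examples only.
class UncompressedSizeLimitExceededException(limitBytes: Long) :
    RuntimeException("Uncompressed upload exceeds the configured limit of $limitBytes bytes")

fun readZipWithSizeLimit(input: InputStream, maxUncompressedBytes: Long): ByteArray {
    val output = ByteArrayOutputStream()
    var totalBytes = 0L
    ZipInputStream(input).use { zip ->
        val buffer = ByteArray(64 * 1024)
        while (zip.nextEntry != null) {
            while (true) {
                val read = zip.read(buffer)
                if (read == -1) break
                totalBytes += read
                if (totalBytes > maxUncompressedBytes) {
                    // Fail fast instead of inflating the whole payload into memory.
                    throw UncompressedSizeLimitExceededException(maxUncompressedBytes)
                }
                output.write(buffer, 0, read)
            }
        }
    }
    return output.toByteArray()
}
```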
Current default values were guesses and can be adjusted:
I consider the downside of limited upload size to be worth it for multiple reasons:
In practice, I don't expect this simplification to have any noticeable impact on users - and it comes with significant maintenance improvements. E.g. this PR becomes much simpler: #4852 (comment)
Screenshot
New zip bomb protection works as well :)

You can test with this test file (use the test organism with the test button, then just add the sequences from large_sequences.fasta.zip).
PR Checklist
🚀 Preview: Add preview label to enable