Member @theosanderson (Sep 29, 2025)

Personally I think I'm in favour of keeping the aux tables.

I think we should aim throughout Loculus to be able to have memory requirements which basically have the option to scale at most with the size of a single sequence entry (I acknowledge there are a number of places, most notably the query engine, where we don't have that option atm, but I aspire for us to do so at some point). Achieving that here requires some version of the auxTable. I think there are improvements we could make to it, e.g. "locking" should be per user rather than global.

Yes, if we merged this we could still revert it in the future, or we could change various other things in the future to improve scalability. But it adds more friction to things we might hope to do in the future, where we would first have to undo this, for example supporting megabase sequences.

I haven't seen the auxTables cause a lot of problems in prod - but I acknowledge that the issue Anya saw seems important and worth looking into. The aux tables don't feel fundamentally incompatible with the changes we want to make to how segments are submitted.

I agree they add complexity to the codebase but IMO that is worthwhile given where I hope Loculus to get to in the coming years.

Contributor Author

> I think we should aim throughout Loculus to be able to have memory requirements which basically have the option to scale at most with the size of a single sequence entry

That's the case here - memory is proportional to N_of_sequences_in_batch x length_of_sequence? I might be misunderstanding.

> Achieving that here requires some version of the auxTable.

That's not true?

> Yes, if we merged this we could still revert it in the future, or we could change various other things in the future to improve scalability.

Aux tables complicate things unnecessarily, and that hampers scalability.

> I haven't seen the auxTables cause a lot of problems in prod

They are unnecessary complexity, and they cause problems when developing and changing code.

> I agree they add complexity to the codebase but IMO that is worthwhile given where I hope Loculus to get to in the coming years.

Why do you think we need aux tables? Which problem do they solve? Why are they worth the complexity? I think there are much better alternatives: not all sequences need to be in memory; that's just the simplest way of doing it.

Member

> Uploads of GB-sized data is a bad idea for many reasons, batching is natural. In particular through web submissions, it's not a good idea anyways to upload GB sized files.

I agree that our current web submission is not designed for uploading multiple GBs. However, given the strong compression rate, a compressed file of 100 MB can already decompress to several GB. Furthermore, I don't see it as a general reason against supporting large uploads. With the (extra) file sharing, we are already moving towards supporting large uploads and, although not a priority for now, we could eventually support multi-part uploads also for the normal sequence upload.

Contributor Author

We should base the design on current and foreseeable use cases - no Pathoplexus/Loculus submitters upload GBs of uncompressed sequences. Our ingest batches already.

> With the (extra) file sharing, we are already moving towards supporting large uploads and, although not a priority for now, we could eventually support multi-part uploads also for the normal sequence upload.

Why should we support something that's not needed? That's just unnecessary complexity.

Member

What are the actual memory requirements? To process a 2 GB sequence file, I assume that the program will require more than just 2 GB and it would be good to know how much is needed (both for us to evaluate the PR and for maintainers to configure their instance).

Contributor Author

In the current implementation, yes, but one can also stream the sequences without aux tables. If you really want to support monolithic 2 GB sequence submissions, there are simple ways to make this PR work without requiring 2 GB of backend memory.
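A minimal sketch of what that streaming approach could look like, assuming a plain (already decompressed) FASTA input. `persistBatch` is a hypothetical hook standing in for whatever the backend would do with a parsed batch; this is not the PR's actual implementation:

```kotlin
import java.io.InputStream

// Hypothetical sketch: read a FASTA stream record by record and hand completed batches to a
// persistence callback, so peak memory is roughly one batch plus the record currently being
// read, not the whole upload.
fun streamFastaInBatches(
    input: InputStream,
    batchSizeBytes: Long = 64L * 1024 * 1024, // illustrative cap per batch
    persistBatch: (List<Pair<String, String>>) -> Unit, // (header, sequence) pairs
) {
    val batch = mutableListOf<Pair<String, String>>()
    var batchBytes = 0L
    var header: String? = null
    val sequence = StringBuilder()

    fun flushRecord() {
        val h = header ?: return
        batch += h to sequence.toString()
        batchBytes += sequence.length
        sequence.setLength(0)
        if (batchBytes >= batchSizeBytes) {
            persistBatch(batch.toList())
            batch.clear()
            batchBytes = 0
        }
    }

    input.bufferedReader().forEachLine { line ->
        if (line.startsWith(">")) {
            flushRecord() // finish the previous record before starting a new one
            header = line.removePrefix(">").trim()
        } else {
            sequence.append(line.trim())
        }
    }
    flushRecord()
    if (batch.isNotEmpty()) persistBatch(batch.toList())
}
```

Memory use is then bounded by the batch cap rather than by the total upload size, whether or not aux tables sit behind `persistBatch`.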

Member

While, for the current Pathoplexus organisms, people tend to sequence and submit only a small number of sequences at once, there are many other use cases and situations where uploading larger files would be desirable. For example, if someone sets up a new (internal) instance, they might want to import all of their existing data. Also, for bacteria, where the genomes are much larger, a few hundred genomes may already reach GBs.

Contributor

As Loculus currently cannot serve bacteria due to other limitations (i.e. our query engine is not built to support them), I'm not sure this is a very pressing argument.

A very well-compressed file that is too large when inflated could potentially happen, although, to be honest, I find it highly unlikely that a submitter would send us that much data at once. Cases like this are more related to raw reads, where we already have a method for handling such large files.

So, personally, I'm not sure I really see the importance of keeping this functionality. However, as a compromise, if we change the submission format as discussed in #4734 (comment), I think we can keep the aux table structure relatively simple and make the necessary changes.

Contributor Author

Our ingest already submits GBs of uncompressed sequences and does so fine, simply by batching (see the sketch below). If we run into problems, we can then see how to make changes that support those use cases; that's not a case for keeping unnecessary complexity now.

If you want to submit large sequences, just submit them as files; that has no impact on this PR.
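For reference, the batching pattern mentioned above ("our ingest batches already") can be sketched as below. `submitBatch` is a placeholder for a call to the submit endpoint, and the batch size is purely illustrative:

```kotlin
// Hypothetical sketch of client-side batching: no single request carries the whole dataset.
fun <T> submitInBatches(
    entries: Sequence<T>,
    batchSize: Int = 5_000, // illustrative; the real ingest may use a different size
    submitBatch: (List<T>) -> Unit,
) {
    entries
        .chunked(batchSize) // lazily groups entries into lists of at most batchSize
        .forEach { batch -> submitBatch(batch) }
}
```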

Member

If we merge this PR, is there a good way forward to support large uploads again, or would merging this mean that we decide against supporting large uploads (at least in the mid term)? We've had several issues (created by ourselves) regarding failing uploads of large files (#1613, #996, #1226) and, to cite you, @corneliusroemer, "it's clearly something we want to address but it's not necessary for pathoplexus to work", so I think it's important to have a plan for how to proceed.

You wrote "reductions of memory usage without reintroducing extra tables are possible but can be done later if considered necessary" – what do you have in mind?

As far as I can see, any really scalable solution would require performing an external-memory join. For that, using the database/aux tables seems like a good approach to me. Alternatively, we could write temporary files to disk, but this seems even more complicated to me.
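To make the external-memory join concrete, here is a rough sketch of what joining staged aux tables inside the database could look like. The table and column names are hypothetical stand-ins, not the actual Loculus schema:

```kotlin
import javax.sql.DataSource

// Hypothetical sketch: metadata and sequence rows are staged into two aux tables as they are
// streamed in, and a single INSERT ... SELECT joins them inside Postgres, so backend memory
// stays bounded regardless of how large the staged upload is.
fun joinStagedUpload(dataSource: DataSource, uploadId: String) {
    val sql = """
        INSERT INTO sequence_entries (submission_id, original_metadata, original_sequences)
        SELECT m.submission_id, m.metadata, s.sequences
        FROM metadata_upload_aux_table m
        JOIN sequence_upload_aux_table s
          ON s.upload_id = m.upload_id AND s.submission_id = m.submission_id
        WHERE m.upload_id = ?
    """.trimIndent()
    dataSource.connection.use { conn ->
        conn.prepareStatement(sql).use { stmt ->
            stmt.setString(1, uploadId)
            stmt.executeUpdate() // the join runs entirely in the database
        }
    }
}
```

The trade-off is exactly what this thread is debating: the join itself scales well, but it comes with the extra tables, locking, and cleanup logic around them.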

Contributor Author

Yes, you can submit files, and you can also submit batches (the natural approach).

Yes, I've changed my mind on the necessity of large uploads; they are not worth the current complexity.

Member

I'd be interested to hear why the current approach is considered overly complex and a hindrance to further development.

Contributor

Keeping the aux table structure does make the switch to multi-pathogen upload (where segment/subtype assignment is done in the preprocessing pipeline and sequences are just grouped by the backend) more complex.

It is still doable but quite ugly: #4821

I wonder, though, if we should actually first switch the grouping of sequences in the backend to #4734 (comment). This will still require aux table changes (and a change in the original-data structure, as the segment will not be known in advance), but I do think this will be slightly less complicated than my current PR proposal.

Contributor Author

It's unnecessary complexity: there's a single 100-line SQL command here, and that's not easy to maintain.

@@ -36,6 +36,10 @@ object BackendSpringProperty {
const val DEBUG_MODE = "loculus.debug-mode"
const val ENABLE_SEQSETS = "loculus.enable-seqsets"

const val MAX_METADATA_FILE_SIZE = "loculus.submission.max-metadata-file-size-bytes"
const val MAX_SEQUENCE_FILE_SIZE = "loculus.submission.max-sequence-file-size-bytes"
const val MAX_UNCOMPRESSED_SEQUENCE_SIZE = "loculus.submission.max-uncompressed-sequence-size-bytes"

const val S3_ENABLED = "loculus.s3.enabled"
const val S3_BUCKET_ENDPOINT = "loculus.s3.bucket.endpoint"
const val S3_BUCKET_INTERNAL_ENDPOINT = "loculus.s3.bucket.internal-endpoint"
@@ -109,6 +113,17 @@ class BackendSpringConfig {
@Bean
fun openApi(backendConfig: BackendConfig) = buildOpenApiSchema(backendConfig)

@Bean
fun fileSizeConfig(
@Value("\${${BackendSpringProperty.MAX_METADATA_FILE_SIZE}}") maxMetadataFileSize: Long,
@Value("\${${BackendSpringProperty.MAX_SEQUENCE_FILE_SIZE}}") maxSequenceFileSize: Long,
@Value("\${${BackendSpringProperty.MAX_UNCOMPRESSED_SEQUENCE_SIZE}}") maxUncompressedSequenceSize: Long,
): FileSizeConfig = FileSizeConfig(
maxMetadataFileSize = maxMetadataFileSize,
maxSequenceFileSize = maxSequenceFileSize,
maxUncompressedSequenceSize = maxUncompressedSequenceSize,
)

@Bean
fun s3Config(
@Value("\${${BackendSpringProperty.S3_ENABLED}}") enabled: Boolean = false,
@@ -10,3 +10,9 @@ data class S3BucketConfig(
val accessKey: String,
val secretKey: String,
)

data class FileSizeConfig(
val maxMetadataFileSize: Long,
val maxSequenceFileSize: Long,
val maxUncompressedSequenceSize: Long,
)
Member

Hi @corneliusroemer! Would you like to outline your approach? Just curious because I also found the aux tables a bit complicated but didn't see an obvious solution without them.

Member @chaoran-chen (Sep 9, 2025)

@corneliusroemer I see that this PR removes streaming and restricts the payload size to ~500 MB, and I'm concerned this goes in the wrong direction. While the current approach does have issues (e.g. #4852 and the zip bombs that you mentioned), I think they can be addressed without removing the aux tables or reducing scalability. We only recently improved upload scalability to support submissions of more than 65k sequences (#1613), and I believe continuing to build towards more scalability is important.

Member @theosanderson (Sep 9, 2025)

I definitely agree that discussion would be the first step towards moving on an area like this (though of course a draft PR can be part of the discussion).

@@ -37,6 +37,7 @@ import org.loculus.backend.api.UnprocessedData
import org.loculus.backend.auth.AuthenticatedUser
import org.loculus.backend.auth.HiddenParam
import org.loculus.backend.config.BackendConfig
import org.loculus.backend.config.FileSizeConfig
import org.loculus.backend.controller.LoculusCustomHeaders.X_TOTAL_RECORDS
import org.loculus.backend.log.REQUEST_ID_MDC_KEY
import org.loculus.backend.log.RequestIdContext
@@ -47,11 +48,13 @@ import org.loculus.backend.model.SubmitModel
import org.loculus.backend.service.submission.SubmissionDatabaseService
import org.loculus.backend.utils.Accession
import org.loculus.backend.utils.IteratorStreamer
import org.loculus.backend.utils.formatBytesHuman
import org.slf4j.MDC
import org.springframework.http.HttpHeaders
import org.springframework.http.HttpStatus
import org.springframework.http.MediaType
import org.springframework.http.ResponseEntity
import org.springframework.util.unit.DataSize
import org.springframework.validation.annotation.Validated
import org.springframework.web.bind.annotation.DeleteMapping
import org.springframework.web.bind.annotation.GetMapping
@@ -65,6 +68,8 @@ import org.springframework.web.bind.annotation.RequestPart
import org.springframework.web.bind.annotation.ResponseStatus
import org.springframework.web.bind.annotation.RestController
import org.springframework.web.multipart.MultipartFile
import org.springframework.web.server.PayloadTooLargeException
import org.springframework.web.server.ResponseStatusException
import org.springframework.web.servlet.mvc.method.annotation.StreamingResponseBody
import java.util.UUID
import io.swagger.v3.oas.annotations.parameters.RequestBody as SwaggerRequestBody
@@ -82,11 +87,13 @@ open class SubmissionController(
private val iteratorStreamer: IteratorStreamer,
private val requestIdContext: RequestIdContext,
private val backendConfig: BackendConfig,
private val fileSizeConfig: FileSizeConfig,
private val objectMapper: ObjectMapper,
) {
@Operation(description = SUBMIT_DESCRIPTION)
@ApiResponse(responseCode = "200", description = SUBMIT_RESPONSE_DESCRIPTION)
@ApiResponse(responseCode = "400", description = SUBMIT_ERROR_RESPONSE)
@ApiResponse(responseCode = "413", description = PAYLOAD_TOO_LARGE_ERROR_RESPONSE)
@PostMapping("/submit", consumes = ["multipart/form-data"])
fun submit(
@PathVariable @Valid organism: Organism,
@@ -106,14 +113,16 @@ open class SubmissionController(
) @RequestParam restrictedUntil: String?,
@Parameter(description = FILE_MAPPING_DESCRIPTION) @RequestPart(required = false) fileMapping: String?,
): List<SubmissionIdMapping> {
var innerDataUseTermsType = DataUseTermsType.OPEN
if (backendConfig.dataUseTerms.enabled) {
if (dataUseTermsType == null) {
throw BadRequestException("the 'dataUseTermsType' needs to be provided.")
metadataFile.requireUnder(fileSizeConfig.maxMetadataFileSize, "metadata")
sequenceFile?.requireUnder(fileSizeConfig.maxSequenceFileSize, "sequence")
val innerDataUseTermsType =
if (backendConfig.dataUseTerms.enabled) {
dataUseTermsType ?: throw BadRequestException(
"the 'dataUseTermsType' needs to be provided.",
)
} else {
innerDataUseTermsType = dataUseTermsType
DataUseTermsType.OPEN
}
}
val fileMappingParsed = parseFileMapping(fileMapping, organism)

val params = SubmissionParams.OriginalSubmissionParams(
Expand All @@ -130,6 +139,7 @@ open class SubmissionController(

@Operation(description = REVISE_DESCRIPTION)
@ApiResponse(responseCode = "200", description = REVISE_RESPONSE_DESCRIPTION)
@ApiResponse(responseCode = "413", description = PAYLOAD_TOO_LARGE_ERROR_RESPONSE)
@PostMapping("/revise", consumes = ["multipart/form-data"])
fun revise(
@PathVariable @Valid organism: Organism,
Expand All @@ -138,6 +148,8 @@ open class SubmissionController(
@Parameter(description = SEQUENCE_FILE_DESCRIPTION) @RequestParam sequenceFile: MultipartFile?,
@Parameter(description = FILE_MAPPING_DESCRIPTION) @RequestPart(required = false) fileMapping: String?,
): List<SubmissionIdMapping> {
metadataFile.requireUnder(fileSizeConfig.maxMetadataFileSize, "metadata")
sequenceFile?.requireUnder(fileSizeConfig.maxSequenceFileSize, "sequence")
val fileMappingParsed = parseFileMapping(fileMapping, organism)
val params = SubmissionParams.RevisionSubmissionParams(
organism,
@@ -425,10 +437,11 @@ open class SubmissionController(
@HiddenParam authenticatedUser: AuthenticatedUser,
@RequestParam compression: CompressionFormat?,
): ResponseEntity<StreamingResponseBody> {
val stillProcessing = submitModel.checkIfStillProcessingSubmittedData()
if (stillProcessing) {
return ResponseEntity.status(HttpStatus.LOCKED).build()
}
// No longer works since we've removed the aux tables
Contributor

I think we might still need to redefine checkIfStillProcessingSubmittedData?

Contributor

Actually I think we can just remove this

// val stillProcessing = submitModel.checkIfStillProcessingSubmittedData()
// if (stillProcessing) {
// return ResponseEntity.status(HttpStatus.LOCKED).build()
// }

val headers = HttpHeaders()
headers.contentType = MediaType.parseMediaType(MediaType.APPLICATION_NDJSON_VALUE)
@@ -544,3 +557,13 @@ open class SubmissionController(
return fileMappingParsed
}
}

private fun MultipartFile.requireUnder(limitBytes: Long, name: String) {
if (size > limitBytes) {
val maxHuman = formatBytesHuman(limitBytes)
throw ResponseStatusException(
HttpStatus.PAYLOAD_TOO_LARGE,
"$name file is too large. Max $maxHuman.",
)
}
}
@@ -14,6 +14,10 @@ const val SUBMIT_ERROR_RESPONSE = """
The data use terms type have not been provided, even though they are enabled for this Loculus instance.
"""

const val PAYLOAD_TOO_LARGE_ERROR_RESPONSE = """
Payload too large. File size exceeds configured limits.
"""

const val METADATA_FILE_DESCRIPTION = """
A TSV (tab separated values) file containing the metadata of the submitted sequence entries.
The file may be compressed with zstd, xz, zip, gzip, lzma, bzip2 (with common extensions).
@@ -22,11 +22,9 @@ import org.loculus.backend.service.files.S3Service
import org.loculus.backend.service.groupmanagement.GROUPS_TABLE_NAME
import org.loculus.backend.service.submission.CURRENT_PROCESSING_PIPELINE_TABLE_NAME
import org.loculus.backend.service.submission.EXTERNAL_METADATA_TABLE_NAME
import org.loculus.backend.service.submission.METADATA_UPLOAD_AUX_TABLE_NAME
import org.loculus.backend.service.submission.RawProcessedData
import org.loculus.backend.service.submission.SEQUENCE_ENTRIES_PREPROCESSED_DATA_TABLE_NAME
import org.loculus.backend.service.submission.SEQUENCE_ENTRIES_TABLE_NAME
import org.loculus.backend.service.submission.SEQUENCE_UPLOAD_AUX_TABLE_NAME
import org.loculus.backend.service.submission.SubmissionDatabaseService
import org.loculus.backend.service.submission.UpdateTrackerTable
import org.loculus.backend.utils.Accession
@@ -47,10 +45,8 @@ val RELEASED_DATA_RELATED_TABLES: List<String> =
CURRENT_PROCESSING_PIPELINE_TABLE_NAME,
EXTERNAL_METADATA_TABLE_NAME,
GROUPS_TABLE_NAME,
METADATA_UPLOAD_AUX_TABLE_NAME,
SEQUENCE_ENTRIES_TABLE_NAME,
SEQUENCE_ENTRIES_PREPROCESSED_DATA_TABLE_NAME,
SEQUENCE_UPLOAD_AUX_TABLE_NAME,
DATA_USE_TERMS_TABLE_NAME,
)
