Description
Hello,
I am working with a large dataset (~100M records) using `pir-service-example`.
When running `ConstructDatabase`, the serialized database exceeds 2 GB, which trips the message-size limit enforced by swift-protobuf (a single message must be under 2 GB).
I noticed the following guard in swift-protobuf:
```swift
guard requiredSize < 0x7fff_ffff else {
    throw BinaryEncodingError.missingRequiredFields
}
```
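For reference, the threshold in that guard is `Int32.max` bytes, i.e. one byte short of 2 GiB, which matches Protobuf's per-message limit; a quick shell check:

```shell
# The guard rejects any serialized size >= 0x7fff_ffff bytes (Int32.max).
limit=$((0x7fffffff))
echo "$limit"             # prints 2147483647
echo "$((limit + 1))"     # 2147483648 bytes = 2^31 = exactly 2 GiB
```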
If I comment out this check, `ConstructDatabase` completes successfully.
However, when running `PIRProcessDatabase`, decoding fails with a `malformedProtobuf` error, which appears to be caused by the oversized serialized message.
Since Protobuf has an architectural 2 GB limit per message, it seems the current implementation (which serializes the entire database as one message) cannot handle very large datasets.
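One way to stay under the limit would be to split the input before it is ever serialized. A minimal sketch, assuming the source data is a line-oriented dump (the `records.txt` name and the 5-way split are illustrative, not part of the repo):

```shell
# Hypothetical pre-split of a line-oriented record dump into 5 parts,
# so that each part's serialized database stays under 2 GiB.
seq 1 100 > records.txt                 # stand-in for the real dataset
split -d -l 20 records.txt part-        # GNU split: part-00 ... part-04
wc -l part-0*                           # each part holds 20 records here
# Each part-* file would then be fed to ConstructDatabase separately
# (the exact invocation depends on the tool's input format).
```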
I have a few related questions:
- Is there a recommended approach for handling large datasets (e.g., 100M+ records, where the resulting `identity.binpb` file exceeds 2 GB)?
- Is it possible to split the dataset into parts, run `ConstructDatabase` and `PIRProcessDatabase` on each part, and then rename all the resulting files to consecutive numbers? (Say we have 5 parts and each part produces 50 shards: each part's `identity-x.bin` files would be numbered 0 to 49, and the combined files would be renumbered with x from 0 to 249, with the matching `identity-x.params.txtpb` files kept in the same order.) Finally, could `PIRService` be run to read all the shards?
- Is it possible to run multiple instances of `PIRService` at the same time, each responsible for one part of the dataset?
Thank you.