Description
Hello,
I am working with a large dataset (~100M records) using `pir-service-example`.
When running `ConstructDatabase`, the serialized database exceeds 2 GB, which trips the message-size limit enforced by swift-protobuf (a single message must be under 2 GB).
I noticed the following guard in swift-protobuf:
```swift
guard requiredSize < 0x7fff_ffff else {
    throw BinaryEncodingError.missingRequiredFields
}
```
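For reference, the threshold in that guard is `Int32.max` bytes, i.e. one byte short of 2 GiB, which matches Protobuf's per-message limit; a quick shell check:

```shell
# The guard rejects any serialized size >= 0x7fff_ffff bytes (Int32.max).
limit=$((0x7fffffff))
echo "$limit"             # prints 2147483647
echo "$((limit + 1))"     # 2147483648 bytes = 2^31 = exactly 2 GiB
```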
If I comment out this check, `ConstructDatabase` completes successfully.
However, when running `PIRProcessDatabase`, decoding fails with a `malformedProtobuf` error, which appears to be caused by the oversized serialized message.
Since Protobuf has an architectural 2 GB limit per message, it seems the current implementation (which serializes the entire database as one message) cannot handle very large datasets.
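One way to stay under the limit would be to split the input before it is ever serialized. A minimal sketch, assuming the source data is a line-oriented dump (the `records.txt` name and the 5-way split are illustrative, not part of the repo):

```shell
# Hypothetical pre-split of a line-oriented record dump into 5 parts,
# so that each part's serialized database stays under 2 GiB.
seq 1 100 > records.txt                 # stand-in for the real dataset
split -d -l 20 records.txt part-        # GNU split: part-00 ... part-04
wc -l part-0*                           # each part holds 20 records here
# Each part-* file would then be fed to ConstructDatabase separately
# (the exact invocation depends on the tool's input format).
```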
I have a few related questions:
- Is there a recommended approach for handling large datasets (e.g., 100M+ records, where the resulting `identity.binpb` file exceeds 2 GB)?
- Is it possible to split the dataset into parts, run `ConstructDatabase` and `PIRProcessDatabase` on each part, and then rename all the resulting files to consecutive numbers? (Say we have 5 parts and each part produces 50 shards: each part's `identity-x.bin` files would be numbered 0 to 49, and the combined files would be renumbered with x from 0 to 249, with the matching `identity-x.params.txtpb` files kept in the same order.) Finally, could `PIRService` be run to read all the shards?
- Is it possible to run multiple instances of `PIRService` at the same time, each responsible for one part of the dataset?
Thank you.