
Conversation

MathieuDutSik
Contributor

Motivation

Databases commonly use a partition key, which corresponds to root_key in our code. The partition key is hashed in order to spread the workload over nodes.

An unfortunate feature of the existing schema is that Blobs, BlobStates, Events, and certificates all live in the same partition (the one corresponding to &[]). This causes performance problems. A common recommendation for schema design is to spread the partition keys so that no single partition receives too much data.

Fixes #4807

Proposal

The following proposal is implemented:

  • For all base keys except Event, the root key is determined from the serialization of the base key.
  • For Events, we want to access several events at once, so the root key is built by serializing only the ChainId and StreamId. This led to the introduction of a fn root_key(&self) on the BaseKey type. The function does not return errors, since for the types in question (BlobId, CryptoHash, ChainId) serialization cannot fail. A sketch of the scheme follows this list.
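Here is that sketch, with toy stand-ins for the real types; the tag values and the exact shape of the Event variant are assumptions for illustration, not the actual linera-storage code:

// Minimal sketch of the partitioning scheme; CryptoHash, ChainId, and the
// tag values are hypothetical stand-ins for the real types.
const CHAIN_ID_TAG: u8 = 0;
const CERTIFICATE_TAG: u8 = 1;
const EVENT_TAG: u8 = 4; // hypothetical tag for events

struct CryptoHash([u8; 32]);
struct ChainId([u8; 32]);

enum BaseKey {
    ChainState(ChainId),
    Certificate(CryptoHash),
    // The event index is part of the full key but NOT of the root key.
    Event { chain_id: ChainId, stream_id: Vec<u8>, index: u32 },
}

impl BaseKey {
    // Infallible: each variant only appends fixed bytes, so no
    // serialization error can occur.
    fn root_key(&self) -> Vec<u8> {
        match self {
            BaseKey::ChainState(chain_id) => {
                let mut key = vec![CHAIN_ID_TAG];
                key.extend_from_slice(&chain_id.0);
                key
            }
            BaseKey::Certificate(hash) => {
                let mut key = vec![CERTIFICATE_TAG];
                key.extend_from_slice(&hash.0);
                key
            }
            BaseKey::Event { chain_id, stream_id, .. } => {
                // Only ChainId and StreamId: all events of one stream share
                // a partition, so several can be fetched in one read.
                let mut key = vec![EVENT_TAG];
                key.extend_from_slice(&chain_id.0);
                key.extend_from_slice(stream_id);
                key
            }
        }
    }
}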

Why this change is the right one:

  • There is no limit to the number of partition keys in databases. On the other hand, there is a limit to the size of the data stored under a single partition key, so concentrating all data under one partition key creates potential problems above 100 MB and may fail completely around 2 GB.
  • We already form a root key from the ChainId for the application states, so we have already accepted having very many partition keys.

The Batch of linera-storage is replaced by a MultiPartitionBatch. It is unfortunate that the old name collided with the Batch of linera-views.

This PR does the requested job of changing only linera-storage. However, there is some loss of parallelization for the read_multi_values/contains_keys operations. This is not irremediable (see the sketch after this list):

  • We can add a function read_multi_root_values(_, root_keys: Vec<Vec<u8>>, key: Vec<u8>) to the KeyValueDatabase. It is possible to implement this feature efficiently in ScyllaDb, which is our main database target.
  • We can add a write_multi_partition_batch to the KeyValueDatabase. Note that the existing write_batch in db_storage.rs creates many futures, but the right solution is likely to group the entries by partition. Of course, batch size is an issue, but it has to be addressed by measurement rather than by spreading entries across all partitions.
  • It is a little problematic to see how those features could be implemented in combinators like LruCaching, ValueSplitting, and so on.
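A hedged sketch of what those additions could look like; the trait name, signatures, and field names below are proposals illustrating the list above, not the existing linera-views API:

// Hypothetical extension trait; all names and signatures are assumptions.
type RootKey = Vec<u8>;
type Key = Vec<u8>;
type Value = Vec<u8>;

struct MultiPartitionBatch {
    // (root_key, key, value) triples; a later refinement could group the
    // entries by partition before writing.
    keys_value_bytes: Vec<(RootKey, Key, Value)>,
}

trait KeyValueDatabaseExt {
    type Error;

    // Read the same key across many partitions, e.g. many certificates at
    // once; ScyllaDb can serve this efficiently since the per-partition
    // key is fixed.
    async fn read_multi_root_values(
        &self,
        root_keys: Vec<RootKey>,
        key: Key,
    ) -> Result<Vec<Option<Value>>, Self::Error>;

    // Write one batch spanning several partitions, grouping entries
    // instead of spawning one future per entry as write_batch does today.
    async fn write_multi_partition_batch(
        &self,
        batch: MultiPartitionBatch,
    ) -> Result<(), Self::Error>;
}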

Test Plan

The CI.

Release Plan

Hopefully, merge it into main.

It is possible to write a migration tool that takes the existing storage of TestNet Conway and converts it to the new schema, but only if we really want to do that.

Before that, it would be good to check that scalability behaves as expected in ScyllaDb runs.

Links

None.

MathieuDutSik marked this pull request as ready for review October 17, 2025 13:44
Contributor

ma2bd left a comment


Thanks for the PR. This looks very promising. My main comments are:

  • Go all the way with manual serialization (if that's what we want) and remove BaseKey
  • Objects that can naturally share the same partition should be grouped together

struct Batch {
    key_value_bytes: Vec<(Vec<u8>, Vec<u8>)>,

struct MultiPartitionBatch {
    keys_value_bytes: Vec<(Vec<u8>, Vec<u8>, Vec<u8>)>,
Contributor


We may want to group the entries by partition: Vec<(Vec<u8>, SinglePartitionBatch)>
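A sketch of that grouped shape; SinglePartitionBatch is the name suggested above, the field names are assumptions:

// Entries grouped by partition, so each inner batch maps to one
// single-partition write.
struct SinglePartitionBatch {
    key_value_bytes: Vec<(Vec<u8>, Vec<u8>)>,
}

struct MultiPartitionBatch {
    partitions: Vec<(Vec<u8>, SinglePartitionBatch)>,
}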


fn put_key_value_bytes(&mut self, key: Vec<u8>, value: Vec<u8>) {
    self.key_value_bytes.push((key, value));

fn put_key_value_bytes(&mut self, root_key: Vec<u8>, key: Vec<u8>, value: Vec<u8>) {
Contributor


"root key" is a little weird outside of Views. Can we just say "partition" or "partition_key"?

}

impl BaseKey {
    fn root_key(&self) -> Vec<u8> {
Contributor


fn partition ?

metrics::WRITE_BLOB_COUNTER.with_label_values(&[]).inc();
let blob_key = bcs::to_bytes(&BaseKey::Blob(blob.id()))?;
self.put_key_value_bytes(blob_key.to_vec(), blob.bytes().to_vec());
let root_key = BaseKey::Blob(blob.id()).root_key();
Contributor


It seems like we don't need the enum BaseKey at all.

Contributor


We could define PartitionKey and serialize it, but it seems like you prefer to do things by hand.

Comment on lines +341 to +349
BaseKey::Certificate(hash) => {
    let mut key = vec![INDEX_CERTIFICATE];
    key.extend_from_slice(hash.as_bytes().as_slice());
    key
}
BaseKey::ConfirmedBlock(hash) => {
    let mut key = vec![INDEX_CONFIRMED_BLOCK];
    key.extend_from_slice(hash.as_bytes().as_slice());
    key
Contributor


These should have the same partition actually

Contributor Author


I have a strong argument against that.

What we would like ideally is to be able to read (root_key_i, key_i) with the root_keys all different and the keys all different as well. But we cannot do that with ScyllaDb, which is our target system.

What we can read efficiently are two orthogonal kinds of reads:

  • Reading with the same root_key but varying key. This is what we have now with read_multi_values_bytes.
  • Reading with the same key but varying root_key. These are the functions I would like to introduce in the KeyValueStore trait.

If we put the certificates and the confirmed block on the same partition, then we are no longer able to use the second class of functions for read_certificates.
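To make the argument concrete, here is a sketch of the second class of reads, assuming the read_multi_root_values(root_keys, key) proposed in this PR's description exists; the helper names are hypothetical:

const CERTIFICATE_TAG: u8 = 1;

// Partition key for one certificate: tag byte followed by the hash.
fn certificate_root_key(hash: &[u8; 32]) -> Vec<u8> {
    let mut key = vec![CERTIFICATE_TAG];
    key.extend_from_slice(hash);
    key
}

// read_certificates would vary the root_key while keeping the
// per-partition key fixed, which is exactly the second class of reads.
// If certificates and confirmed blocks shared a partition, the
// per-partition key would have to vary too, and this pattern breaks.
fn certificate_root_keys(hashes: &[[u8; 32]]) -> Vec<Vec<u8>> {
    hashes.iter().map(certificate_root_key).collect()
}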

Contributor


What we would like ideally is to be able to read (root_key_i, key_i) with the root_keys all different and the keys all different as well.

Why?

Contributor


See other comment #4814 (comment)

Comment on lines +351 to +362
BaseKey::Blob(blob_id) => {
    let mut key = vec![INDEX_BLOB_ID];
    key.push(blob_id.blob_type as u8);
    key.extend_from_slice(blob_id.hash.as_bytes().as_slice());
    key
}
BaseKey::BlobState(blob_id) => {
    let mut key = vec![INDEX_BLOB_STATE];
    key.push(blob_id.blob_type as u8);
    key.extend_from_slice(blob_id.hash.as_bytes().as_slice());
    key
}
Contributor


same

    root_key[0] == INDEX_CHAIN_ID
}

const INDEX_CHAIN_ID: u8 = 0;
Contributor


nit: The expected name would be CHAIN_ID_TAG.

Comment on lines +397 to +398
const INDEX_CERTIFICATE: u8 = 1;
const INDEX_CONFIRMED_BLOCK: u8 = 2;
Contributor


BLOCK_HASH_TAG

const INDEX_CHAIN_ID: u8 = 0;
const INDEX_CERTIFICATE: u8 = 1;
const INDEX_CONFIRMED_BLOCK: u8 = 2;
const INDEX_BLOB_ID: u8 = 3;
Contributor


BLOB_ID_TAG

@ma2bd
Contributor

ma2bd commented Oct 17, 2025

I think we can also start thinking about having the two modules co-exist on the testnet branch: the current partitioning and the new one. Then, we will auto-migrate data on DB startup. This will require an extra version number.

@MathieuDutSik
Contributor Author

I think we can also start thinking about having the two modules co-exist on the testnet branch: the current partitioning and the new one. Then, we will auto-migrate data on DB startup. This will require an extra version number.

No problem.

To have the two schemas coexist on TestNet, we first need to have them coexist on main. So, my question is: how do we make the storage aware of which schema it is running on?

Two ideas that come to mind:

  • Passing an argument
  • Reading keys that would be present, for example the NetworkDescription, and concluding from that read whether we are in the old schema or the new one (a rough sketch follows).
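A rough sketch of the second idea; the tag byte, the enum, and the read closure are all hypothetical stand-ins, not existing code:

// Which key layout is the database using?
enum StorageSchema {
    Legacy,      // all base keys under the empty partition &[]
    Partitioned, // per-BaseKey root keys
}

const LEGACY_ROOT_KEY: &[u8] = &[];
const NETWORK_DESCRIPTION_KEY: &[u8] = &[6]; // hypothetical tag byte

// `read(root_key, key)` stands in for whatever low-level point read the
// database exposes.
fn detect_schema(read: impl Fn(&[u8], &[u8]) -> Option<Vec<u8>>) -> StorageSchema {
    if read(LEGACY_ROOT_KEY, NETWORK_DESCRIPTION_KEY).is_some() {
        StorageSchema::Legacy
    } else {
        StorageSchema::Partitioned
    }
}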



Development

Successfully merging this pull request may close these issues.

All blobs live on the same Scylla partition
