
Conversation

MathieuDutSik
Contributor

Motivation

Databases commonly use a partition key, which corresponds to root_key in our code. The partition key is hashed in order to spread the workload over nodes.

An unfortunate feature of the existing schema is that Blobs, BlobStates, Events, and certificates all live in the same partition (the one corresponding to &[]). This causes performance problems. A common recommendation for schema design is to spread the partition keys so that no single partition receives too much data.

Fixes #4807

Proposal

The following proposal is implemented:

  • For all base keys except Event, the root key is determined from the serialization of the base key.
  • For Events, we want to access several events at once, so the root key is built by serializing only the ChainId and StreamId. This led to the introduction of a fn root_key(&self) on the BaseKey type. The function does not return errors, since for the types in question (BlobId, CryptoHash, ChainId) serialization cannot fail. A sketch of the scheme follows this list.
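Here is that sketch, with toy stand-ins for the real types; the tag values and the exact shape of the Event variant are assumptions for illustration, not the actual linera-storage code:

// Minimal sketch of the partitioning scheme; CryptoHash, ChainId, and the
// tag values are hypothetical stand-ins for the real types.
const CHAIN_ID_TAG: u8 = 0;
const CERTIFICATE_TAG: u8 = 1;
const EVENT_TAG: u8 = 4; // hypothetical tag for events

struct CryptoHash([u8; 32]);
struct ChainId([u8; 32]);

enum BaseKey {
    ChainState(ChainId),
    Certificate(CryptoHash),
    // The event index is part of the full key but NOT of the root key.
    Event { chain_id: ChainId, stream_id: Vec<u8>, index: u32 },
}

impl BaseKey {
    // Infallible: each variant only appends fixed bytes, so no
    // serialization error can occur.
    fn root_key(&self) -> Vec<u8> {
        match self {
            BaseKey::ChainState(chain_id) => {
                let mut key = vec![CHAIN_ID_TAG];
                key.extend_from_slice(&chain_id.0);
                key
            }
            BaseKey::Certificate(hash) => {
                let mut key = vec![CERTIFICATE_TAG];
                key.extend_from_slice(&hash.0);
                key
            }
            BaseKey::Event { chain_id, stream_id, .. } => {
                // Only ChainId and StreamId: all events of one stream share
                // a partition, so several can be fetched in one read.
                let mut key = vec![EVENT_TAG];
                key.extend_from_slice(&chain_id.0);
                key.extend_from_slice(stream_id);
                key
            }
        }
    }
}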

Why this change is the right one:

  • There is no limit to the number of partition keys in databases. On the other hand, there is a limit to the size of the data stored under a single partition key, so concentrating all data under one partition key creates potential problems above 100 MB and may fail completely around 2 GB.
  • We already form a root key from the ChainId for the application states, so we have already accepted having very many partition keys.

The Batch of linera-storage is replaced by a MultiPartitionBatch. It is unfortunate that the old name collided with the Batch of linera-views.

This PR does the requested job of changing only linera-storage. However, there is some loss of parallelization for the read_multi_values/contains_keys operations. This is not irremediable (see the sketch after this list):

  • We can add a function read_multi_root_values(_, root_keys: Vec<Vec<u8>>, key: Vec<u8>) to the KeyValueDatabase. It is possible to implement this feature efficiently in ScyllaDb, which is our main database target.
  • We can add a write_multi_partition_batch to the KeyValueDatabase. Note that the existing write_batch in db_storage.rs creates many futures, but the right solution is likely to group the entries by partition. Of course, batch size is an issue, but it has to be addressed by measurement rather than by spreading entries across all partitions.
  • It is a little problematic to see how those features could be implemented in combinators like LruCaching, ValueSplitting, and so on.
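A hedged sketch of what those additions could look like; the trait name, signatures, and field names below are proposals illustrating the list above, not the existing linera-views API:

// Hypothetical extension trait; all names and signatures are assumptions.
type RootKey = Vec<u8>;
type Key = Vec<u8>;
type Value = Vec<u8>;

struct MultiPartitionBatch {
    // (root_key, key, value) triples; a later refinement could group the
    // entries by partition before writing.
    keys_value_bytes: Vec<(RootKey, Key, Value)>,
}

trait KeyValueDatabaseExt {
    type Error;

    // Read the same key across many partitions, e.g. many certificates at
    // once; ScyllaDb can serve this efficiently since the per-partition
    // key is fixed.
    async fn read_multi_root_values(
        &self,
        root_keys: Vec<RootKey>,
        key: Key,
    ) -> Result<Vec<Option<Value>>, Self::Error>;

    // Write one batch spanning several partitions, grouping entries
    // instead of spawning one future per entry as write_batch does today.
    async fn write_multi_partition_batch(
        &self,
        batch: MultiPartitionBatch,
    ) -> Result<(), Self::Error>;
}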

Test Plan

The CI.

Release Plan

Hopefully, merge it into main.

It is possible to write a migration tool that takes the existing storage of TestNet Conway and converts it to the new schema, but only if we really want to do that.

Before that, it would be good to check that scalability behaves as expected in ScyllaDb runs.

Links

None.

MathieuDutSik marked this pull request as ready for review October 17, 2025 13:44
Contributor

ma2bd left a comment


Thanks for the PR. This looks very promising. My main comments are:

  • Go all the way with manual serialization (if that's what we want) and remove BaseKey
  • Objects that can naturally share the same partition should be grouped together

struct Batch {
    key_value_bytes: Vec<(Vec<u8>, Vec<u8>)>,

struct MultiPartitionBatch {
    keys_value_bytes: Vec<(Vec<u8>, Vec<u8>, Vec<u8>)>,
Contributor


We may want to group the entries by partition: Vec<(Vec<u8>, SinglePartitionBatch)>
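A sketch of that grouped shape; SinglePartitionBatch is the name suggested above, the field names are assumptions:

// Entries grouped by partition, so each inner batch maps to one
// single-partition write.
struct SinglePartitionBatch {
    key_value_bytes: Vec<(Vec<u8>, Vec<u8>)>,
}

struct MultiPartitionBatch {
    partitions: Vec<(Vec<u8>, SinglePartitionBatch)>,
}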


fn put_key_value_bytes(&mut self, key: Vec<u8>, value: Vec<u8>) {
    self.key_value_bytes.push((key, value));

fn put_key_value_bytes(&mut self, root_key: Vec<u8>, key: Vec<u8>, value: Vec<u8>) {
Contributor


"root key" is a little weird outside of Views. Can we just say "partition" or "partition_key"?

}

impl BaseKey {
    fn root_key(&self) -> Vec<u8> {
Contributor


fn partition ?

metrics::WRITE_BLOB_COUNTER.with_label_values(&[]).inc();
let blob_key = bcs::to_bytes(&BaseKey::Blob(blob.id()))?;
self.put_key_value_bytes(blob_key.to_vec(), blob.bytes().to_vec());
let root_key = BaseKey::Blob(blob.id()).root_key();
Contributor


It seems like we don't need the enum BaseKey at all.

Contributor


We could define PartitionKey and serialize it, but it seems like you prefer to do things by hand.

Comment on lines +341 to +349
BaseKey::Certificate(hash) => {
    let mut key = vec![INDEX_CERTIFICATE];
    key.extend_from_slice(hash.as_bytes().as_slice());
    key
}
BaseKey::ConfirmedBlock(hash) => {
    let mut key = vec![INDEX_CONFIRMED_BLOCK];
    key.extend_from_slice(hash.as_bytes().as_slice());
    key
Contributor


These should have the same partition actually

Contributor Author


I have a strong argument against that.

What we would like ideally is to be able to read (root_key_i, key_i) with the root_keys all different and the keys all different as well. But we cannot do that with ScyllaDb, which is our target system.

What we can read efficiently are two orthogonal kinds of reads:

  • Reading with the same root_key but varying key. This is what we have now with read_multi_values_bytes.
  • Reading with the same key but varying root_key. These are the functions I would like to introduce in the KeyValueStore trait.

If we put the certificates and the confirmed block on the same partition, then we are no longer able to use the second class of functions for read_certificates.
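To make the argument concrete, here is a sketch of the second class of reads, assuming the read_multi_root_values(root_keys, key) proposed in this PR's description exists; the helper names are hypothetical:

const CERTIFICATE_TAG: u8 = 1;

// Partition key for one certificate: tag byte followed by the hash.
fn certificate_root_key(hash: &[u8; 32]) -> Vec<u8> {
    let mut key = vec![CERTIFICATE_TAG];
    key.extend_from_slice(hash);
    key
}

// read_certificates would vary the root_key while keeping the
// per-partition key fixed, which is exactly the second class of reads.
// If certificates and confirmed blocks shared a partition, the
// per-partition key would have to vary too, and this pattern breaks.
fn certificate_root_keys(hashes: &[[u8; 32]]) -> Vec<Vec<u8>> {
    hashes.iter().map(certificate_root_key).collect()
}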

Contributor


What we would like ideally is to be able to read (root_key_i, key_i) with the root_keys all different and the keys all different as well.

Why?

Contributor


See other comment #4814 (comment)

Comment on lines +351 to +362
BaseKey::Blob(blob_id) => {
    let mut key = vec![INDEX_BLOB_ID];
    key.push(blob_id.blob_type as u8);
    key.extend_from_slice(blob_id.hash.as_bytes().as_slice());
    key
}
BaseKey::BlobState(blob_id) => {
    let mut key = vec![INDEX_BLOB_STATE];
    key.push(blob_id.blob_type as u8);
    key.extend_from_slice(blob_id.hash.as_bytes().as_slice());
    key
}
Contributor


same

    root_key[0] == INDEX_CHAIN_ID
}

const INDEX_CHAIN_ID: u8 = 0;
Contributor


nit: The expected name would be CHAIN_ID_TAG.

Comment on lines +397 to +398
const INDEX_CERTIFICATE: u8 = 1;
const INDEX_CONFIRMED_BLOCK: u8 = 2;
Contributor


BLOCK_HASH_TAG

const INDEX_CHAIN_ID: u8 = 0;
const INDEX_CERTIFICATE: u8 = 1;
const INDEX_CONFIRMED_BLOCK: u8 = 2;
const INDEX_BLOB_ID: u8 = 3;
Contributor


BLOB_ID_TAG

@ma2bd
Contributor

ma2bd commented Oct 17, 2025

I think we can also start thinking about having the two modules co-exist on the testnet branch: the current partitioning and the new one. Then, we will auto-migrate data on DB startup. This will require an extra version number.

@MathieuDutSik
Contributor Author

I think we can also start thinking about having the two modules co-exist on the testnet branch: the current partitioning and the new one. Then, we will auto-migrate data on DB startup. This will require an extra version number.

No problem.

To have the two schemas coexist on TestNet, we first need to have them coexist on main. So, my question is: how do we make the storage aware of which schema it is running on?

Two ideas that come to mind:

  • Passing an argument
  • Reading keys that would be present, for example the NetworkDescription, and concluding from that read whether we are in the old schema or the new one (a rough sketch follows).
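A rough sketch of the second idea; the tag byte, the enum, and the read closure are all hypothetical stand-ins, not existing code:

// Which key layout is the database using?
enum StorageSchema {
    Legacy,      // all base keys under the empty partition &[]
    Partitioned, // per-BaseKey root keys
}

const LEGACY_ROOT_KEY: &[u8] = &[];
const NETWORK_DESCRIPTION_KEY: &[u8] = &[6]; // hypothetical tag byte

// `read(root_key, key)` stands in for whatever low-level point read the
// database exposes.
fn detect_schema(read: impl Fn(&[u8], &[u8]) -> Option<Vec<u8>>) -> StorageSchema {
    if read(LEGACY_ROOT_KEY, NETWORK_DESCRIPTION_KEY).is_some() {
        StorageSchema::Legacy
    } else {
        StorageSchema::Partitioned
    }
}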



Development

Successfully merging this pull request may close these issues.

All blobs live on the same Scylla partition
