Use a new schema for the data storage in Linera. #4814
base: main
Conversation
Thanks for the PR. This looks very promising. My main comments are:

- Go all the way with manual serialization (if that's what we want) and remove `BaseKey` (a sketch follows this list).
- Objects that can naturally share the same partition should be grouped together.
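For the first point, a minimal sketch of what fully manual construction could look like. The free function is hypothetical; the body mirrors the `BaseKey::Blob` arm in this diff, including its tag constant:

```rust
// Hypothetical sketch: build partition keys with free functions instead
// of a BaseKey enum. INDEX_BLOB_ID is the tag constant from this PR.
fn blob_root_key(blob_id: &BlobId) -> Vec<u8> {
    let mut key = vec![INDEX_BLOB_ID];
    key.push(blob_id.blob_type as u8);
    key.extend_from_slice(blob_id.hash.as_bytes().as_slice());
    key
}
```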
```diff
-struct Batch {
-    key_value_bytes: Vec<(Vec<u8>, Vec<u8>)>,
+struct MultiPartitionBatch {
+    keys_value_bytes: Vec<(Vec<u8>, Vec<u8>, Vec<u8>)>,
```
We may want to group the entries by partition: `Vec<(Vec<u8>, SinglePartitionBatch)>`.
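For illustration, the grouped layout could look like this (a sketch; `SinglePartitionBatch` does not exist in the PR):

```rust
// Hypothetical sketch: one sub-batch per partition instead of repeating
// the partition key in every (partition, key, value) triple.
struct SinglePartitionBatch {
    key_value_bytes: Vec<(Vec<u8>, Vec<u8>)>,
}

struct MultiPartitionBatch {
    // (partition_key, batch of entries for that partition)
    partitions: Vec<(Vec<u8>, SinglePartitionBatch)>,
}
```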
```diff
-fn put_key_value_bytes(&mut self, key: Vec<u8>, value: Vec<u8>) {
-    self.key_value_bytes.push((key, value));
+fn put_key_value_bytes(&mut self, root_key: Vec<u8>, key: Vec<u8>, value: Vec<u8>) {
```
"root key" is a little weird outside of View
s. Can we just say "partition" or "partition_key"?
```rust
impl BaseKey {
    fn root_key(&self) -> Vec<u8> {
```
`fn partition`?
```diff
 metrics::WRITE_BLOB_COUNTER.with_label_values(&[]).inc();
-let blob_key = bcs::to_bytes(&BaseKey::Blob(blob.id()))?;
-self.put_key_value_bytes(blob_key.to_vec(), blob.bytes().to_vec());
+let root_key = BaseKey::Blob(blob.id()).root_key();
```
It seems like we don't need the enum `BaseKey` at all.
We could define a `PartitionKey` and serialize it, but it seems like you prefer to do things by hand.
```rust
BaseKey::Certificate(hash) => {
    let mut key = vec![INDEX_CERTIFICATE];
    key.extend_from_slice(hash.as_bytes().as_slice());
    key
}
BaseKey::ConfirmedBlock(hash) => {
    let mut key = vec![INDEX_CONFIRMED_BLOCK];
    key.extend_from_slice(hash.as_bytes().as_slice());
    key
}
```
These should have the same partition, actually.
I have a strong argument against that.

What we would like ideally is to be able to read pairs `(root_key_i, key_i)` with the root_keys all different and the keys all different as well. But we cannot do that with ScyllaDb, which is our target system. What we can read efficiently are two orthogonal kinds of key reads:

- Reading with the same `root_key` but varying `key`. This is what we have now with `read_multi_values_bytes`.
- Reading with the same `key` but varying `root_key`. These are the functions I would like to introduce in the `KeyValueStore` trait.

If we put the certificates and the confirmed blocks in the same partition, then we are no longer able to use the second class of functions for `read_certificates`.
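To make the two patterns concrete, here is a sketch of the corresponding signatures. This is simplified, not the actual linera-views trait; `read_multi_root_values` follows the shape proposed in the PR description:

```rust
// Sketch only: simplified signatures, not the real trait definitions.
trait KeyValueReads {
    type Error;

    // Pattern 1 (exists today): within one partition, read many keys.
    async fn read_multi_values_bytes(
        &self,
        keys: Vec<Vec<u8>>,
    ) -> Result<Vec<Option<Vec<u8>>>, Self::Error>;

    // Pattern 2 (proposed): across many partitions, read one key.
    async fn read_multi_root_values(
        &self,
        root_keys: Vec<Vec<u8>>,
        key: Vec<u8>,
    ) -> Result<Vec<Option<Vec<u8>>>, Self::Error>;
}
```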
> What we would like ideally is to be able to read (root_key_i, key_i) with the root_keys all different and the keys all different as well.

Why?
See other comment #4814 (comment)
```rust
BaseKey::Blob(blob_id) => {
    let mut key = vec![INDEX_BLOB_ID];
    key.push(blob_id.blob_type as u8);
    key.extend_from_slice(blob_id.hash.as_bytes().as_slice());
    key
}
BaseKey::BlobState(blob_id) => {
    let mut key = vec![INDEX_BLOB_STATE];
    key.push(blob_id.blob_type as u8);
    key.extend_from_slice(blob_id.hash.as_bytes().as_slice());
    key
}
```
same
```rust
    root_key[0] == INDEX_CHAIN_ID
}

const INDEX_CHAIN_ID: u8 = 0;
```
nit: The expected name would be `CHAIN_ID_TAG`.
```rust
const INDEX_CERTIFICATE: u8 = 1;
const INDEX_CONFIRMED_BLOCK: u8 = 2;
```
`BLOCK_HASH_TAG`
```rust
const INDEX_CHAIN_ID: u8 = 0;
const INDEX_CERTIFICATE: u8 = 1;
const INDEX_CONFIRMED_BLOCK: u8 = 2;
const INDEX_BLOB_ID: u8 = 3;
```
`BLOB_ID_TAG`
I think we can also start thinking about having two modules co-exist on the testnet branch: the current partitioning and the new one. Then, we will auto-migrate data on DB startup. This will require an extra version number.
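Purely to illustrate the version-number idea, a sketch of a marker checked on startup (none of these names come from the codebase):

```rust
// Hypothetical sketch: a schema-version marker checked on DB startup.
const SCHEMA_VERSION_KEY: &[u8] = b"schema_version";
const NEW_SCHEMA_VERSION: u8 = 1;

trait RawStore {
    fn read(&self, key: &[u8]) -> Option<Vec<u8>>;
    fn write(&mut self, key: &[u8], value: &[u8]);
}

// If the marker is absent or stale, migrate, then record the version.
fn ensure_schema(store: &mut impl RawStore) {
    match store.read(SCHEMA_VERSION_KEY).as_deref() {
        Some([NEW_SCHEMA_VERSION]) => {} // already on the new partitioning
        _ => {
            // migrate_partitions(store); // migration elided from the sketch
            store.write(SCHEMA_VERSION_KEY, &[NEW_SCHEMA_VERSION]);
        }
    }
}
```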
No problem. To have the two schemas coexist on TestNet, we first need to have them coexist on main. So, my question is: how do we make the storage aware of which schema it is running? Two ideas come to mind:
Motivation
Databases commonly use a `partition_key`, which corresponds to `root_key` in our code. The partition key is used for hashing purposes in order to spread the workload over nodes.

An unfortunate feature of the existing schema is that Blobs, BlobStates, Events, and certificates are all in the same partition (the one corresponding to `&[]`). This causes some performance problems. A common recommendation for schema design is that the partition key should be spread out so that one bin does not receive too much data.
Fixes #4807
Proposal
The following proposal is implemented:

- Partition keys are now derived from the stored data, e.g. the `ChainId` or `StreamId`. This led to the introduction of a `fn root_key(&self)` for the `BaseKey` type. The function does not return errors since, for the types in question (`BlobId`, `CryptoHash`, `ChainId`), returning an error is impossible.
- Why this change is the right one: we already use one partition per `ChainId` for the application states, so we already accept having very many partition keys.
- The `Batch` of `linera-storage` is replaced by a `MultiPartitionBatch`. It is unfortunate that we had the name collision with the `Batch` of `linera-views`.
- This PR does the requested job of changing only `linera-storage`. However, there are some losses of parallelization for the `read_multi_values`/`contains_keys` operations. This is not irremediable:
  - Add a `read_multi_root_values(_, root_keys: Vec<Vec<u8>>, key: Vec<u8>)` to the `KeyValueDatabase`. It is possible to implement this feature efficiently in `ScyllaDb`, which is our main database target.
  - Add a `write_multi_partition_batch` to the `KeyValueDatabase`. Note that the existing `write_batch` in `db_storage.rs` creates many futures, but the right solution is likely to group the entries (see the sketch below). Of course, batch size is an issue, but it has to be addressed by measuring, not by spreading entries over all partitions.
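A minimal sketch of that grouping idea. The flat-triple input matches the `MultiPartitionBatch` layout in this PR; the function itself is hypothetical:

```rust
use std::collections::BTreeMap;

// Hypothetical sketch: group flat (partition, key, value) triples by
// partition so that each partition is written with one batch, instead of
// spawning one future per entry.
fn group_by_partition(
    triples: Vec<(Vec<u8>, Vec<u8>, Vec<u8>)>,
) -> BTreeMap<Vec<u8>, Vec<(Vec<u8>, Vec<u8>)>> {
    let mut groups: BTreeMap<Vec<u8>, Vec<(Vec<u8>, Vec<u8>)>> = BTreeMap::new();
    for (partition, key, value) in triples {
        groups.entry(partition).or_default().push((key, value));
    }
    groups
}
```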
Test Plan
The CI.
Release Plan
Hopefully, to merge it into main.
It is possible to write a migration tool that takes the existing storage of TestNet Conway and converts it to the new schema, but only if we really want to do that.
Before that, it would be good to see whether the scalability works as expected in ScyllaDb runs.
Links
None.