Skip to content

sstable: fix estimation of empty sstable writer#5633

Open
annrpom wants to merge 1 commit intocockroachdb:masterfrom
annrpom:estimated-size
Open

sstable: fix estimation of empty sstable writer#5633
annrpom wants to merge 1 commit intocockroachdb:masterfrom
annrpom:estimated-size

Conversation

@annrpom
Copy link
Contributor

@annrpom annrpom commented Dec 8, 2025

This patch adds an estimation of the properties and filter blocks during the calculation of the estimated size of an sstable being written.

This patch adds an estimation of the properties and filter blocks
during the calculation of the estimated size of an sstable being
written.
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@annrpom
Copy link
Contributor Author

annrpom commented Dec 8, 2025

@jbowens how do u feel about this one

@annrpom annrpom marked this pull request as ready for review December 8, 2025 22:02
@annrpom annrpom requested a review from a team as a code owner December 8, 2025 22:02
@annrpom annrpom requested a review from sumeerbhola December 8, 2025 22:02
Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sumeerbhola reviewed 11 of 11 files at r1.
Reviewable status: all files reviewed (commit messages unreviewed), 2 unresolved discussions (waiting on @annrpom)


bloom/bloom.go line 167 at r1 (raw file):

// FilterSize returns the size in bytes of a bloom filter with the given number
// of keys and bits per key. This can be used to estimate filter size before
// building it.

nit: the code comment only mentions one return value.


sstable/properties.go line 209 at r1 (raw file):

// overhead (e.g., varint lengths, column headers). The key and value sizes
// are the dominant factors, so this is typically close to the actual size.
func (p *Properties) EstimatedSize(tblFormat TableFormat) uint64 {

isn't this estimation being called after adding each key-value pair during a compaction/flush? If yes, it needs to be very cheap, and accumulateProps doesn't seem cheap. One possibility would be to do this every 1000 keys or so, but we would need to measure the performance impact, in which case best to leave it out of this PR and do it separately.

I don't remember -- are the properties up to date while writing the sstable, or are some of them made up-to-date when finishing the sstable? If they are up-to-date, is the increase in size mainly because of the increasing size of the varints?

@RaduBerinde
Copy link
Member

Can you provide some background? I don't immediately see the value in predicting the filter and prop blocks which should he small compared to the rest of the file. I think it's fine if the target file size doesn't cover them.

BTW I am working on adaptive bloom filter which might make it harder to estimate the final size

@annrpom
Copy link
Contributor Author

annrpom commented Dec 15, 2025

Can you provide some background? I don't immediately see the value in predicting the filter and prop blocks which should he small compared to the rest of the file. I think it's fine if the target file size doesn't cover them.

Yeah that's reasonable; I also didn't realize (at the time of putting up this PR) that we would call on estimate size for every key during a compaction. Instead of what we currently do for accumulateProps, we can get a sampling of properties size and use a constant (Jackson mentioned this).

I'll leave this PR here for now, though (I'd also be down to close it) - since this change isn't imminently necessary; I just wanted to try taking a stab at a TODO I saw in passing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants