db: add ApproximateMidKey for efficient range bisection#5778
dt wants to merge 1 commit into cockroachdb:master
Conversation
RaduBerinde
left a comment
Previously, finding a key that bisects a range by byte size required iterating all keys to locate the midpoint — an O(n) scan that is expensive for large ranges (~512MB).
Sounds like the old way splits the range in terms of logical bytes, while the new way splits in terms of current LSM usage. These could be very different for any particular range; is this not problematic?
@RaduBerinde made 1 comment.
Reviewable status: 0 of 5 files reviewed, all discussions resolved (waiting on @miraradeva).
I'm going to play with it on the CRDB side; after we pick a split key we still need to compute the logical MVCC stats for the LHS, and those can tell us whether, for a particular range, the two are in fact "very different". At that point KV can decide to go back and do it the slow way instead. It costs pebble almost nothing to do the fast way, though, and if its mid-key is "close enough" for KV according to the computed stats, great.
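The fallback dt describes might look roughly like this on the KV side (a hypothetical sketch; `closeEnough` and its parameters are illustrative names, not CRDB code):

```go
package main

import "fmt"

// closeEnough reports whether a candidate split key, whose left-hand side
// holds leftLogicalBytes of a totalLogicalBytes range, is within an
// acceptable skew of a perfect bisection. Hypothetical helper, not CRDB code.
func closeEnough(leftLogicalBytes, totalLogicalBytes int64, acceptableSkew float64) bool {
	if totalLogicalBytes == 0 {
		return false
	}
	frac := float64(leftLogicalBytes) / float64(totalLogicalBytes)
	return frac >= 0.5-acceptableSkew && frac <= 0.5+acceptableSkew
}

func main() {
	// A 512MiB range where the fast mid-key left 240MiB on the LHS: within
	// a 10% skew budget, so KV keeps the cheap split point.
	fmt.Println(closeEnough(240<<20, 512<<20, 0.10)) // true
	// A badly skewed result (100MiB of 512MiB) forces the slow scan instead.
	fmt.Println(closeEnough(100<<20, 512<<20, 0.10)) // false
}
```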
RaduBerinde
left a comment
Nice, that's a good idea. It should be close in the common case.
@RaduBerinde made 5 comments.
Reviewable status: 0 of 5 files reviewed, 4 unresolved discussions (waiting on @dt and @miraradeva).
mid_key.go line 31 at r1 (raw file):
// bisected, ok is false and midKey is nil.
func (d *DB) ApproximateMidKey(
	ctx context.Context, start, end []byte, minSize uint64,
[nit] use KeyRange
mid_key.go line 34 at r1 (raw file):
) (midKey []byte, ok bool, _ error) {
	if err := d.closed.Load(); err != nil {
		panic(err)
[nit] Consider returning the err in non-test builds, we've seen otherwise innocuous races during shutdown before.
mid_key.go line 61 at r1 (raw file):
ub := m.Largest().UserKey
if d.cmp(ub, end) >= 0 {
	ub = end
Doesn't seem right that we're taking the entire file size even when it contains end. Upper level files can be very wide in terms of key space.
mid_key.go line 91 at r1 (raw file):
}
// Sort SSTs by upper bound key. L0 SSTs may overlap in key space; we
This approximation is also sketchy.. L0 files can cover large swaths of keyspace.
A solution that is a bit slower but more accurate (and which reuses more code): we already have EstimateDiskUsage which estimates disk usage of a span. We could binary search among all sstable start and end keys in the span and find the two consecutive keys around the mid point. Then we could gather all the separator keys in the files that overlap this range and binary search again. This also only uses index blocks and they would be cached so I doubt the log factor would make a huge difference.
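The binary-search idea above can be sketched in isolation (a standalone sketch: `estimate` is a stand-in callback for pebble's `DB.EstimateDiskUsage`, and the boundary keys and sizes are made up):

```go
package main

import (
	"fmt"
	"sort"
)

// midBoundary binary-searches sorted sstable boundary keys for the first
// boundary at which the estimated disk usage of [start, boundary) reaches
// half of the span's total. A second, finer search over the separator keys
// of the straddling files would follow in the real proposal.
func midBoundary(boundaries []string, estimate func(start, end string) uint64, start, end string) string {
	total := estimate(start, end)
	i := sort.Search(len(boundaries), func(i int) bool {
		return estimate(start, boundaries[i]) >= total/2
	})
	if i == len(boundaries) {
		return end
	}
	return boundaries[i]
}

func main() {
	// Fake cumulative sizes keyed by boundary, monotone in key order.
	sizes := map[string]uint64{"a": 0, "b": 100, "c": 250, "d": 400, "e": 512}
	estimate := func(start, end string) uint64 { return sizes[end] - sizes[start] }
	fmt.Println(midBoundary([]string{"b", "c", "d"}, estimate, "a", "e")) // d
}
```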
dt
left a comment
@dt made 1 comment.
Reviewable status: 0 of 5 files reviewed, 4 unresolved discussions (waiting on @miraradeva and @RaduBerinde).
mid_key.go line 91 at r1 (raw file):
Previously, RaduBerinde wrote…
This approximation is also sketchy.. L0 files can cover large swaths of keyspace.
A solution that is a bit slower but more accurate (and which reuses more code): we already have EstimateDiskUsage which estimates disk usage of a span. We could binary search among all sstable start and end keys in the span and find the two consecutive keys around the mid point. Then we could gather all the separator keys in the files that overlap this range and binary search again. This also only uses index blocks and they would be cached so I doubt the log factor would make a huge difference.
Heh, so, this was actually Claude's first design and I stepped in and told it not to use EstimateDiskUsage to open all the indexes of all the SSTs that overhang the edges of the span of interest, and instead just count the full size of any SST that intersects the span as in the span for which we want a mid-point.
We're calling this on a 512mb+ range. SSTs are generally much smaller than that, right? like 32mb? so even if an SST only intersects our actual span by a tiny amount -- say a single key even -- over-counting its whole size as in our span is only going to shift our mid-point, at worst, by its size away from the "true" mid-point. If we end up 32mb off the true center of a 512mb range that's fine. I'd be more inclined to take the hit of fetching and opening the index blocks for estimation if they'd be reused, but if an SST is only partially intersecting our span, it is pretty unlikely to be near the mid-point so we're not going to reuse that index block for refining center either.
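dt's worst-case bound can be checked with a line of arithmetic (the 32MiB SST size and 512MiB range size are his illustrative figures, not fixed constants):

```go
package main

import "fmt"

func main() {
	const sstSize = uint64(32 << 20)    // an SST that barely intersects the span
	const rangeSize = uint64(512 << 20) // the range being bisected
	// Over-counting the whole SST shifts the computed midpoint by at most
	// sstSize away from the true midpoint: a bounded relative error.
	fmt.Printf("worst-case shift: %.2f%% of the range\n", 100*float64(sstSize)/float64(rangeSize))
	// worst-case shift: 6.25% of the range
}
```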
I convinced Claude of this, but I guess it didn't reflect it in its comments.
L6 SSTs are 128MB (L5 are 64MB, and so on). Note that with value separation, index blocks tend to be very small and easily fit in the cache. You can operate under the assumption that loading index blocks is cheap.
My concern isn’t their size, it’s their time to first byte if I open them at all. I’ll pay a round trip to get the index for the one that looks to span my mid-point, but doing, say, 40 round trips to get the indexes for all the SSTs that intersect my span starts to hurt, particularly when I have lots of ranges all in the split queues at once. I’m pretty sure these splits don’t need to be precise at all: if one side is a little bigger than the other, maybe it’ll just split again a tad sooner. As long as we’re picking a point that leaves some real room for growth on both sides, it’s a decent split point.
We expect a straddling file on every level, not just one. The code looks inside a (more or less) arbitrary one. There is little point in looking inside a 2MB L0 file to get data block granularity while you're doing all-or-nothing on a straddling 128MB L6 file. There are so many hand-wavy approximations here that I have no idea how useful this function is in practice. If you want to merge it to play with it, that's fine by me, but add an Experimental suffix and document it as a rough approximation.
Let me see if I can rework this to have an "err bar target" so it pays for index block loads as needed, but not beyond, to hit that target. That seems like it could tie the amount of hand-waving to what we really care about.
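The "error bar target" idea could be sketched as follows: each partially overlapping file contributes an uncertainty up to its full size, and we pay for an index-block read (here simulated by simply removing the file's uncertainty) only while the summed uncertainty still exceeds the budget. The names and numbers are illustrative, not the PR's actual code:

```go
package main

import (
	"fmt"
	"sort"
)

// resolveUntilBudget resolves partially-overlapping files from largest to
// smallest (each resolution standing in for an index-block read) until the
// uncertainty contributed by the still-unresolved files fits in the budget.
// It returns how many index reads were paid for.
func resolveUntilBudget(fileSizes []uint64, budget uint64) int {
	sorted := append([]uint64(nil), fileSizes...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })
	var uncertainty uint64
	for _, s := range sorted {
		uncertainty += s
	}
	reads := 0
	for _, s := range sorted {
		if uncertainty <= budget {
			break
		}
		uncertainty -= s // reading this file's index pins down its overlap
		reads++
	}
	return reads
}

func main() {
	// Edge files of 2MiB, 128MiB, and 32MiB against a 20MiB budget: only the
	// two largest need their indexes opened; the 2MiB tail fits the budget.
	fmt.Println(resolveUntilBudget([]uint64{2 << 20, 128 << 20, 32 << 20}, 20<<20)) // 2
}
```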
Force-pushed from e51dc40 to f546e68.
dt
left a comment
@dt made 1 comment.
Reviewable status: 0 of 5 files reviewed, 4 unresolved discussions (waiting on @miraradeva and @RaduBerinde).
mid_key.go line 61 at r1 (raw file):
Previously, RaduBerinde wrote…
Doesn't seem right that we're taking the entire file size even when it contains
end. Upper level files can be very wide in terms of key space.
I updated the logic to check the actual overlap of these partially included files until the uncertainty of not doing so falls within the passed error-bar budget.
RaduBerinde
left a comment
I'm not too happy with the tedious code that's very hard to follow.
@RaduBerinde made 3 comments.
Reviewable status: 0 of 5 files reviewed, 6 unresolved discussions (waiting on @dt and @miraradeva).
sstable/reader.go line 979 at r2 (raw file):
// and the properties block. No data blocks are read.
func (r *Reader) CollectBlockEntries(
	ctx context.Context, start []byte, env ReadEnv, transforms IterTransforms,
This API is strange in that it takes a start key but not an end key. It should either take a key range, or return all blocks.
mid_key_test.go line 300 at r2 (raw file):
rightSize, err := d.EstimateDiskUsage(midKey, []byte("key999"))
require.NoError(t, err)
t.Logf("tight epsilon: mid=%s left=%d right=%d", midKey, leftSize, rightSize)
This is not really a test if you're supposed to check the log...
In general, I find Claude Code very eager to generate tons of unit tests, but they are pretty hard to look over and will become a maintenance burden. Please make a pass over them and make sure they're reasonable.
Previously, finding a key that bisects a range by byte size required
iterating all keys to locate the midpoint — an O(n) scan that is
expensive for large ranges (~512MB).
Add DB.ApproximateMidKey(ctx, start, end, minSize, errBar) which finds
an approximate midpoint using only SST metadata and index blocks,
avoiding data block I/O. The errBar parameter controls precision vs I/O:
larger values need fewer index block reads, smaller values give tighter
results.
The algorithm works in three phases:
Phase 1 — Determine total size: SSTs fully contained in [start, end)
contribute exact sizes. Partially-overlapping edge SSTs are resolved
from largest to smallest by reading their index blocks until the total
size uncertainty is within 2×errBar (since target = totalSize/2, this
ensures the target is known to within errBar).
Phase 2 — Coarse walk: SSTs are walked in upper-bound key order,
accumulating sizes. If the cumulative total lands within errBar of
totalSize/2, the SST boundary where this happens is a good-enough mid
key with no further I/O.
Phase 3 — Merge-walk straddlers: SSTs that individually overshoot the
target ± errBar window must be partially consumed. Their index blocks
are read and block entries are merge-walked in key order across all
straddling SSTs (at most one per LSM level), accumulating sizes until
the deficit is reached. The set of indexes opened is provably minimal:
if a file's size were ≤ errBar it could have been fully added or
skipped, so its index read would never be needed.
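The Phase 2 coarse walk described above can be sketched as standalone logic (a sketch only: the `sst` struct, key names, and sizes are made up, and the real code operates on pebble file metadata):

```go
package main

import "fmt"

type sst struct {
	upperBound string
	size       uint64
}

// coarseMidKey mirrors the Phase 2 walk from the commit message: accumulate
// SST sizes in upper-bound key order and, if the running total lands within
// errBar of target, that SST's upper bound is a good-enough mid key with no
// index-block I/O. ok is false when every step overshoots the window, meaning
// Phase 3 (merge-walking the straddlers' index blocks) would be needed.
func coarseMidKey(ssts []sst, target, errBar uint64) (key string, ok bool) {
	var sum uint64
	for _, s := range ssts {
		sum += s.size
		if sum+errBar >= target && sum <= target+errBar {
			return s.upperBound, true
		}
		if sum > target+errBar {
			break // this SST straddles the window; must look inside it
		}
	}
	return "", false
}

func main() {
	ssts := []sst{{"c", 100}, {"f", 120}, {"k", 90}, {"p", 110}}
	// target 210 with errBar 20: after "f" the running total is 220, which
	// lies inside the [190, 230] window, so "f" is accepted without I/O.
	key, ok := coarseMidKey(ssts, 210, 20)
	fmt.Println(key, ok) // f true
}
```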