hash: harden HRANDFIELD against expired-heavy hashes #3558
charsyam wants to merge 5 commits into valkey-io:unstable from …
Conversation
Handle expired-heavy hashes without letting stale entries break HRANDFIELD. Keep random probing as the fast path, do conservative bounded upfront cleanup, and fall back to a single validated pass when needed. The bounded cleanup may also reclaim expired hash fields from this read path, so the fix can trigger field deletion propagation when stale entries are encountered, but only on primaries. Use a small default prune limit of 32 to bound read-path cleanup work, and keep the expired-heavy regression coverage in the hash test file under slow tags while streamlining the fallback paths to avoid unnecessary collection or duplicate lookups. Signed-off-by: DaeMyung Kang <charsyam@gmail.com>
Codecov Report ❌ Patch coverage is …

```
@@ Coverage Diff @@
## unstable #3558 +/- ##
============================================
+ Coverage 76.46% 76.70% +0.24%
============================================
Files 159 160 +1
Lines 81675 80558 -1117
============================================
- Hits 62454 61795 -659
+ Misses 19221 18763 -458
```
|
Thanks for tackling this. A few thoughts on direction:
Background. The 100 retry cap in hashTypeRandomElement is a known and intentional limitation from the hash-field-expiration work. At the time we accepted that if the physical hash becomes dominated by expired entries, HRANDFIELD may return fewer results than requested — with the reasoning that in realistic workloads the active expire cycle keeps the live/stale
ratio bounded well above the threshold where the 100-probe cap actually fails.
Concretely, the probability of all 100 probes missing is (1 - p)^100 where p is the live fraction. Bumping the cap to 1000 helps dramatically on realistic pathological cases:
| live fraction | P(all 100 miss) | P(all 1000 miss) |
|---|---|---|
| 10% | 2.7 × 10⁻⁵ | 1.7 × 10⁻⁴⁶ |
| 5% | 5.9 × 10⁻³ | 5.3 × 10⁻²³ |
| 1% | 0.366 | 4.3 × 10⁻⁵ |
| 0.5% | 0.606 | 6.6 × 10⁻³ |
| 0.1% | 0.905 | 0.368 |
| 0.01% (1 in 10k) | 0.990 | 0.905 |
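As a sanity check, the table above can be reproduced directly from the `(1 - p)^k` model (a quick illustrative sketch, not code from the PR; it assumes independent uniform probes over the physical slots):

```python
# Reproduce the miss-probability table above: with live fraction p, each
# uniform random probe misses (lands on an expired slot) with probability
# 1 - p, so k independent probes all miss with probability (1 - p) ** k.
def p_all_miss(live_fraction: float, probes: int) -> float:
    return (1.0 - live_fraction) ** probes

for live in (0.10, 0.05, 0.01, 0.005, 0.001, 0.0001):
    print(f"live={live:7.2%}  P(100 miss)={p_all_miss(live, 100):.2g}"
          f"  P(1000 miss)={p_all_miss(live, 1000):.2g}")
```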
So raising the cap alone eliminates the bug for any hash down to ~0.5% live, which covers essentially every realistic workload where active expire is merely lagging (rather than broken). 1000 random probes are still much cheaper than one O(N) validated scan for any non-tiny hash, and the happy path still hits on probe 1 or 2.
That said, hardening the tail behavior is fine and I'm supportive. A couple of design points I'd want addressed before merging:
- Do not reclaim expired fields on a read path. hrandfieldMaybePruneExpiredFields calls dbReclaimExpiredFields, which deletes fields, propagates HDEL/DEL to replicas and AOF, and emits keyspace notifications. That's lazy expiration on a read path, and it's something we deliberately do not do for hash fields — hash field expiration is handled by the active expire cycle only. Read paths skip expired fields via the hashtable validateEntry callback without touching them. Please drop this whole mitigation (and the HRANDFIELD_EXPIRED_PRUNE_LIMIT, the iAmPrimary/import_mode/lazy_expire_disabled/isPausedActionsWithUpdate guard, and the re-lookup dance).
- Prefer a much simpler fix inside hashTypeRandomElement:
- Raise maxtries from 100 to 1000. This one-line change covers the realistic failure regime per the probability table above.
- On top of that, if all 1000 probes miss, do a single-pass reservoir-of-one scan over the live entries (the hashtable iterator filters expired via validateEntry, so this naturally walks only live entries) and return that one. The caller loops as it does today with no changes.
This keeps the fix contained to hashTypeRandomElement and eliminates the need for hrandfieldMaybePruneExpiredFields, hrandfieldCollectLiveHashtableEntries, hrandfieldSelectRandomLiveHashtableEntry, hrandfieldReplyFromCollectedEntries, and hrandfieldReplyFromReservoirSample.
Worst-case cost for HRANDFIELD key on an expired-heavy hash becomes O(count × N_physical) in principle instead of this PR's O(N_physical), but with the retry cap at 1000 the fallback is essentially dead code for any workload short of adversarial. The degenerate >99%-expired case means active expire is already broken and HRANDFIELD latency isn't the
primary problem.
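The reservoir-of-one fallback proposed above could look roughly like the following (an illustrative Python sketch, not the C implementation; `reservoir_of_one` and `is_live` are my names, with `is_live` standing in for the hashtable iterator's `validateEntry` filtering):

```python
import random

def reservoir_of_one(entries, is_live):
    """Single validated pass: keep the i-th live entry seen with
    probability 1/i, which yields a uniformly random live entry
    without knowing the live count in advance."""
    chosen, seen = None, 0
    for entry in entries:
        if not is_live(entry):           # validateEntry-style skip
            continue
        seen += 1
        if random.randrange(seen) == 0:  # true with probability 1/seen
            chosen = entry
    return chosen                        # None only if nothing is live

# Even at ~0.02% live (4 live among 20k expired), one pass always succeeds.
fields = [("live", i) for i in range(4)] + [("expired", i) for i in range(20000)]
random.shuffle(fields)
print(reservoir_of_one(fields, lambda e: e[0] == "live"))
```

The single pass is O(N_physical), but it runs only after all random probes have missed, so the happy path is untouched.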
|
@ranshid Thanks for your review. After the review, I tried two contained approaches inside hashTypeRandomElement.

Attempt 1: maxtries 100→1000 with reservoir-of-one fallback
After 1000 random probes miss, do a single validated pass and pick one live entry via reservoir sampling. No …
Benchmark on 20k expired / 4 live (~0.02% live): …
The hang in …

Attempt 2: only bump maxtries 100→1000 (no fallback)
Cost stays bounded at 1000 probes per call and the hang is gone. But the originally reported 0.02% live case …

Where 100→1000 actually helps
The bump only meaningfully covers the 1–5% live transient regime. Above 5%, the existing 100 cap already … |
@charsyam You are right. I did not think through the 'case 4' potential hang. I would still like to give it a try though. We could do your original proposal while only removing the prune steps. |
|
@ranshid Thanks for your advice. I will try it soon. |
Bound HRANDFIELD on expired-heavy hashes without letting the read path delete expired fields. Per @ranshid's review, hash field expiration is handled by the active expire cycle only — read paths must skip expired fields via validateEntry without reclaiming them.

Drop the upfront prune step introduced in abdd7dc (hrandfieldMaybePruneExpiredFields and the HRANDFIELD_EXPIRED_PRUNE_LIMIT guard) and instead bound CASE 4 by detecting wasted random probing in the caller. Switch to a single validated reservoir-sample pass when either:

- cumulative duplicates reach clamp(count / 2, 32, 256), or
- 8 consecutive duplicates occur (early signal for expired-heavy hashes).

Placing the guard in the caller (rather than in hashTypeRandomElement) avoids the CASE 4 hang that an always-succeeding helper would trigger: the unique-collection loop would otherwise spin once the live pool is exhausted. The cumulative budget is the main bound; the consecutive-duplicate guard adds tail stability and does not affect the common path.

Unify the CASE 4 fallback and the negative-count fallback through a new hrandfieldReplyFromValidatedHashtableEntries helper.

Add a regression test that exercises the case where live fields are fewer than the requested count under paused expiration, forcing CASE 4 through the new fallback path.

Signed-off-by: DaeMyung Kang <charsyam@gmail.com>
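The loop-exit condition this commit describes can be sketched as follows (an illustrative Python stand-in; `duplicate_budget` and `should_fall_back` are my names, and the constants correspond to the PR's `HRANDFIELD_RANDOM_DUPLICATE_{MIN,MAX,STREAK}_LIMIT`):

```python
# Sketch of the CASE 4 duplicate-budget guard: switch to the single
# validated reservoir pass once random probing mostly yields entries
# that were already collected.
DUP_MIN, DUP_MAX, DUP_STREAK = 32, 256, 8

def duplicate_budget(count: int) -> int:
    """clamp(count / 2, 32, 256): the cumulative duplicate budget."""
    return max(DUP_MIN, min(count // 2, DUP_MAX))

def should_fall_back(cumulative_dups: int, consecutive_dups: int,
                     count: int) -> bool:
    return (cumulative_dups >= duplicate_budget(count)
            or consecutive_dups >= DUP_STREAK)

print(duplicate_budget(10), duplicate_budget(100), duplicate_budget(1000))
# -> 32 50 256: the floor protects small counts, the cap bounds large ones
```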
|
I removed the prune step and tried a few variants for bounding CASE 4 before settling on the current version. The current approach keeps …

I measured the pathological CASE 4 shape with …

So the main latency improvement came from moving from a consecutive duplicate streak to a cumulative duplicate …

I also reran the normal/non-pathological hashtable benchmark. This is a 3-run average with …

This does not show a noticeable regression in the normal hashtable cases, while the pathological CASE 4 path is … |
|
@charsyam thank you! I just want to clean it up a bit (before reviewing the tests). …

Then CASE 4 becomes just a C_ERR fallback — no duplicate budget, no new constants: …

The ht that CASE 4 already maintains for dedup serves double duty as the ignore_entries set for the reservoir fallback — no extra data structure needed. |
Commit 348e946 simplified CASE 4 to rely solely on hashTypeRandomElement's C_ERR signal for the fallback trigger. That signal only fires after 100 *consecutive* expired probes (maxtries=100), which leaves a hang window when the expired ratio is roughly 67~95% AND count > live_count: every random pick resolves to a live but already-collected entry, the duplicate loop continues indefinitely, and C_ERR has probability ~1e-10 per call. Reproduced on a 900-field hash (700 expired + 200 live, 78% expired) with HRANDFIELD key 250: server thread spun at 98.5% CPU for 17+ minutes, PING from another client timed out — the entire instance was unresponsive.

Restore the cumulative + consecutive duplicate budget on top of the helper structure introduced in 348e946 (the helper signature, ht double-duty as ignore_entries, and the reply/sampling decoupling are all preserved). Only CASE 4's loop exit condition changes. The constants are count-proportional with bounded floor and cap so neither small nor large counts misbehave:

- MIN floor (32): small counts have natural coupon-collector dups above 4 (e.g. count=4, live=4 expects ~4 dups)
- MAX cap (256): large counts approach the cost of one O(size) iteration once cumulative dups reach this many
- STREAK (8): early signal for very-expired-heavy hashes where dup streaks appear before the cumulative budget

Benchmark on the patched build (same hash shapes, 50-200 iterations):

- healthy hashes (count up to 1000): 2.1-2.3 ms/op (no regression)
- mildly expired (count << live): 2.2-2.3 ms/op
- edge (heavy expired, count <= live): 2.6-3.3 ms/op
- hang zone (count > live, 67-95%): 5-7 ms/op (was: server-blocking)
- safe extreme (>99% expired): 3.6 ms/op (C_ERR path)

Add a regression test exercising the hang zone (700 expired + 200 live, count=250) with a wall-clock assertion that completes in under one second — a regression would block until the test runner's overall timeout fires.

Signed-off-by: DaeMyung Kang <charsyam@gmail.com>
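A quick back-of-envelope check of why the C_ERR-only trigger cannot bound the hang zone (assuming independent uniform probes, as in the probability table earlier in the thread):

```python
# C_ERR fires only if all 100 probes inside a single hashTypeRandomElement
# call land on expired slots. At 78% expired (700 expired + 200 live),
# the per-call probability of that streak is:
p_c_err = 0.78 ** 100
print(f"{p_c_err:.1e}")  # -> 1.6e-11 per call, i.e. effectively never
```

Meanwhile every successful probe returns a live but already-collected entry, so the caller's dedup loop spins without making progress — matching the observed 17+ minute hang.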
|
Thanks, I kept the reservoir helper as a pure data-collection helper, as suggested. I initially tried to follow the C_ERR-only fallback shape, but while testing the …

When the hash is moderately expired and …

The duplicate budget is not a performance optimization and is not part of the …

I also reproduced this with the same hash shapes on the same machine …

The key point is not the timing improvement. The important rows are the … |
@charsyam, true, I was more focused on the first part of CASE 4. I would still rather choose a simpler solution than complicate this (already very complicated) function. I think it is fine to do a reservoir-of-one in CASE 1; for CASE 4 I think I might even prefer to just do a …

The cost is O(N_physical) when volatile fields exist, but that's unavoidable — any correct approach needs to find the live entries. As a further refinement, you could narrow the check using … |
Per @ranshid's review, replace the duplicate-budget probing loop with a much simpler split: when the hash carries any volatile (TTL'd) fields, walk it once via hrandfieldCollectLiveEntries (Fisher-Yates shuffled) and emit the first count entries. When the hash has no volatile fields the original random-probing loop is preserved verbatim, since no expired slots can ever be encountered.

This eliminates the budget machinery and its three constants (HRANDFIELD_RANDOM_DUPLICATE_{MIN,MAX,STREAK}_LIMIT), the duplicate counters, and the hrandfieldSampleLiveEntries helper. The hang window addressed in 011b08a is structurally impossible in the new shape since the volatile path does not loop on hashTypeRandomElement.

Trade-off: hashes with all-TTL fields whose expirations are still in the future no longer use random probing. Measured cost on a non-expired all-TTL hash:

- total=1K count=100: collect=2.48ms vs probing=2.58ms
- total=10K count=100: collect=2.74ms vs probing=2.52ms
- total=100K count=100: collect=5.60ms vs probing=2.57ms

The optional vsetEstimatedEarliestExpiry refinement that would gate the collect path on actual expired entries cannot be used here: that function is documented as approximate, and empirically returns false negatives for our workload (the test suite's HRANDFIELD-on-expired-heavy cases regressed when gated on it), which would silently fall through to the probing path and lose correctness.

Test surface unchanged (91/91 pass), including the count-greater-than-live regression test added in 011b08a.
Bench (same harness, 50-200 iters per row):

| scenario | expired% | count | result |
| --- | ---: | ---: | ---: |
| healthy 1K live | 0% | 100 | 2.23 ms |
| healthy 10K live | 0% | 1000 | 2.20 ms |
| 30% expired, count<<live | 30% | 10 | 2.18 ms |
| 78% expired, count=live | 78% | 200 | 2.28 ms |
| 78% expired, count>live | 78% | 201 | 2.10 ms |
| 90% expired, count>live | 90% | 250 | 2.20 ms |
| 99% expired, count>live | 99% | 250 | 2.20 ms |

Signed-off-by: DaeMyung Kang <charsyam@gmail.com>
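The collect-then-shuffle path this commit describes can be sketched in Python (illustrative only; `collect_live_and_sample` is my stand-in for the PR's `hrandfieldCollectLiveEntries`, and the list comprehension stands in for the `validateEntry`-filtered hashtable walk):

```python
import random

def collect_live_and_sample(entries, is_live, count):
    """One validated walk gathers every live entry; a Fisher-Yates
    shuffle then lets the reply emit the first `count` of them as a
    uniform sample without replacement."""
    live = [e for e in entries if is_live(e)]  # validateEntry-style filter
    random.shuffle(live)                       # Fisher-Yates under the hood
    return live[:count]                        # shorter than count if live < count

# The 78%-expired hang-zone shape: 700 expired + 200 live, count=250.
fields = [(f"f{i}", i < 200) for i in range(900)]
sample = collect_live_and_sample(fields, lambda e: e[1], 250)
print(len(sample))  # -> 200: capped at the live population, no loop to hang
```

Because the volatile path never loops on random probing, the worst case is one O(N_physical) walk per call, regardless of the expired ratio.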
|
Adopted your simpler shape — CASE 4 now splits on whether the hash carries volatile fields. When it does, …

The duplicate-budget machinery from …

About the optional …
|
| total fields | collect path (this PR) | probing (no TTL baseline) |
|---|---|---|
| 1,000 | 2.48 ms | 2.58 ms |
| 10,000 | 2.74 ms | 2.52 ms |
| 100,000 | 5.60 ms | 2.57 ms |
Below ~10K total fields the difference is in the noise; at 100K it's roughly 2× but still single-digit milliseconds.
Benchmark across the design
Same harness as before, 50–200 iters/row.
| scenario | expired % | count | result |
|---|---|---|---|
| healthy 1K live | 0 % | 10 | 2.15 ms |
| healthy 1K live | 0 % | 100 | 2.23 ms |
| healthy 10K live | 0 % | 100 | 2.09 ms |
| healthy 10K live | 0 % | 1000 | 2.20 ms |
| 30 % expired, count << live | 30 % | 10 | 2.18 ms |
| 50 % expired, count << live | 50 % | 10 | 2.14 ms |
| 78 % expired, count = live | 78 % | 200 | 2.28 ms |
| 99 % expired, count ≤ live | 99 % | 100 | 2.44 ms |
| 78 % expired, count > live | 78 % | 201 | 2.10 ms (was: hang on 348e946d) |
| 78 % expired, count > live | 78 % | 250 | 2.30 ms (was: hang) |
| 67 % expired, count > live | 67 % | 250 | 2.25 ms (was: hang) |
| 90 % expired, count > live | 90 % | 250 | 2.20 ms (was: hang) |
| 99 % expired, count > live (probing) | 99 % | 250 | 2.20 ms |
Tests
tests/unit/type/hash.tcl keeps the regression test from the previous push (700 expired + 200 live, count=250, wall-clock < 1 s). All 91 tests in unit/type/hash pass, including the five expired-heavy slow tests (~4.3 s each) and the new hang regression (185 ms).
Before this change, `HRANDFIELD` could return `NULL` or too few results even when live hash fields still existed, if random probing kept landing on stale expired entries in a volatile hashtable-encoded hash.
This change fixes that correctness issue while keeping the common random path fast. The new behavior has two parts:

1. Before `HRANDFIELD` chooses its sampling strategy, it opportunistically prunes a small bounded number of expired hash fields from a volatile hashtable hash.
2. When needed, `HRANDFIELD` switches to a bounded validated live-entry fallback for the rest of the command, instead of treating the hash as empty or continuing to rely on repeated random misses.
The important property of the fallback is that it avoids degenerating into repeated per-sample full scans. Depending on the affected `HRANDFIELD` variant, the command pays for only a small number of validated bulk passes before replying, which keeps the worst-case recovery path predictable while still guaranteeing correct results when the physical hash size is dominated by stale entries.
The bounded cleanup is intentionally conservative. The default prune limit is 32, which keeps read-path cleanup work small and predictable while still allowing the validated fallback to guarantee correctness when stale entries continue to dominate the backing hashtable after pruning.

Because this runs on a read path, it can also reclaim expired hash fields and propagate the resulting field deletions when stale entries are encountered. With the current default, a single `HRANDFIELD` call can reclaim and propagate up to 32 expired hash fields before switching to the validated fallback path.
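A minimal sketch of the bounded upfront prune described above (illustrative Python; `prune_expired_bounded` is my stand-in for this revision's `hrandfieldMaybePruneExpiredFields`, and in the server the pruned fields would actually be deleted and propagated, on primaries only):

```python
# Reclaim at most PRUNE_LIMIT expired entries per call, so read-path
# cleanup work stays small and predictable even on expired-heavy hashes.
PRUNE_LIMIT = 32

def prune_expired_bounded(entries, is_live, limit=PRUNE_LIMIT):
    pruned, kept = 0, []
    for e in entries:
        if pruned < limit and not is_live(e):
            pruned += 1        # would reclaim + propagate the deletion here
        else:
            kept.append(e)
    return kept, pruned

entries = ["expired"] * 100 + ["live"] * 4
kept, pruned = prune_expired_bounded(entries, lambda e: e == "live")
print(pruned, len(kept))  # -> 32 72: work is bounded, stale entries may remain
```

Since stale entries can remain after the bounded prune, the validated fallback is still required for correctness.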
This change also adds expired-heavy regression coverage for all four affected `HRANDFIELD` paths, including `WITHVALUES`.

I also compared several prune limits (32, 64, 128, 256, 512) on the expired-heavy benchmark case. There was no large difference overall, but 32 stayed competitive while also keeping the reclaim side effect bounded to at most 32 field deletions per call, so this change keeps the default conservative.
| Limit | `HRANDFIELD key` | `HRANDFIELD key -16` | `HRANDFIELD key 4` | `HRANDFIELD key 16` | `HRANDFIELD key 4 WITHVALUES` |
|---|---:|---:|---:|---:|---:|
| 32 | 0.244 ms | 0.732 ms | 0.300 ms | 0.415 ms | 0.456 ms |
| 64 | 0.246 ms | 0.755 ms | 0.429 ms | 0.415 ms | 0.490 ms |
| 128 | 0.256 ms | 0.714 ms | 0.414 ms | 0.419 ms | 0.466 ms |
| 256 | 0.265 ms | 0.756 ms | 0.337 ms | 0.428 ms | 0.469 ms |
| 512 | 0.258 ms | 0.714 ms | 0.402 ms | 0.454 ms | 0.471 ms |

I also reran the benchmark comparison against `valkey/unstable` after clean rebuilds of both trees (`make distclean && make`) and 5 repeated runs per case.

| Case | Current Mean ms | Current SD ms | Unstable Mean ms | Unstable SD ms | Delta ms | Delta % |
|---|---:|---:|---:|---:|---:|---:|
| `HRANDFIELD baseline_hash` | 0.234 | 0.002 | 0.234 | 0.003 | -0.001 | -0.3% |
| `HRANDFIELD expired_heavy_hash` | 0.238 | 0.004 | 0.233 | 0.006 | +0.005 | +2.2% |
| `HRANDFIELD baseline_hash -16` | 0.672 | 0.004 | 0.675 | 0.006 | -0.003 | -0.4% |
| `HRANDFIELD expired_heavy_hash -16` | 0.249 | 0.004 | 0.248 | 0.008 | +0.001 | +0.5% |
| `HRANDFIELD baseline_hash 4` | 0.434 | 0.001 | 0.435 | 0.004 | -0.001 | -0.3% |
| `HRANDFIELD expired_heavy_hash 4` | 0.235 | 0.003 | 0.234 | 0.004 | +0.001 | +0.3% |
| `HRANDFIELD baseline_hash 16` | 0.715 | 0.004 | 0.720 | 0.005 | -0.005 | -0.7% |
| `HRANDFIELD expired_heavy_hash 16` | 0.238 | 0.004 | 0.234 | 0.001 | +0.004 | +1.6% |
| `HRANDFIELD baseline_hash 4 WITHVALUES` | 0.444 | 0.005 | 0.448 | 0.001 | -0.005 | -1.0% |
| `HRANDFIELD expired_heavy_hash 4 WITHVALUES` | 0.241 | 0.003 | 0.240 | 0.007 | +0.002 | +0.7% |

Across those repeated runs, the current branch stayed within about -1.0% to +2.2% of `unstable`, which suggests no meaningful performance regression while correcting the expired-heavy behavior.