hash: harden HRANDFIELD against expired-heavy hashes #3558

Open

charsyam wants to merge 5 commits into valkey-io:unstable from
charsyam:feature/hrandfield-bounded-fallback

Conversation

@charsyam
Contributor

Before this change, HRANDFIELD could return NULL or too few results even when live hash fields
still existed, if random probing kept landing on stale expired entries in a volatile hashtable-
encoded hash.

This change fixes that correctness issue while keeping the common random path fast.

The new behavior has two parts:

  1. Before HRANDFIELD chooses its sampling strategy, it opportunistically prunes a small bounded
    number of expired hash fields from a volatile hashtable hash.
  2. If random probing still fails after that, HRANDFIELD switches to a bounded validated live-
    entry fallback for the rest of the command, instead of treating the hash as empty or continuing to
    rely on repeated random misses.

The important property of the fallback is that it avoids degenerating into repeated per-sample
full scans. Depending on the affected HRANDFIELD variant, the command pays for only a small
number of validated bulk passes before replying, which keeps the worst-case recovery path
predictable while still guaranteeing correct results when the physical hash size is dominated by
stale entries.

The bounded cleanup is intentionally conservative. The default prune limit is 32, which keeps
read-path cleanup work small and predictable while still allowing the validated fallback to
guarantee correctness when stale entries continue to dominate the backing hashtable after pruning.

Because this runs on a read path, it can also reclaim expired hash fields and propagate the
resulting field deletions when stale entries are encountered. With the current default, a single
HRANDFIELD call can reclaim and propagate up to 32 expired hash fields before switching to the
validated fallback path.

This change also adds expired-heavy regression coverage for all four affected HRANDFIELD paths:

  • single-field
  • negative-count
  • unique count
  • WITHVALUES

I also compared several prune limits (32, 64, 128, 256, 512) on the expired-heavy
benchmark case. There was no large difference overall, but 32 stayed competitive while also
keeping the reclaim side effect bounded to at most 32 field deletions per call, so this change
keeps the default conservative.

| Limit | HRANDFIELD key | HRANDFIELD key -16 | HRANDFIELD key 4 | HRANDFIELD key 16 | HRANDFIELD key 4 WITHVALUES |
|---|---:|---:|---:|---:|---:|
| 32 | 0.244 ms | 0.732 ms | 0.300 ms | 0.415 ms | 0.456 ms |
| 64 | 0.246 ms | 0.755 ms | 0.429 ms | 0.415 ms | 0.490 ms |
| 128 | 0.256 ms | 0.714 ms | 0.414 ms | 0.419 ms | 0.466 ms |
| 256 | 0.265 ms | 0.756 ms | 0.337 ms | 0.428 ms | 0.469 ms |
| 512 | 0.258 ms | 0.714 ms | 0.402 ms | 0.454 ms | 0.471 ms |

I also reran the benchmark comparison against valkey/unstable after clean rebuilds of both trees
(make distclean && make) and 5 repeated runs per case.

| Case | Current Mean ms | Current SD ms | Unstable Mean ms | Unstable SD ms | Delta ms | Delta % |
|---|---:|---:|---:|---:|---:|---:|
| HRANDFIELD baseline_hash | 0.234 | 0.002 | 0.234 | 0.003 | -0.001 | -0.3% |
| HRANDFIELD expired_heavy_hash | 0.238 | 0.004 | 0.233 | 0.006 | +0.005 | +2.2% |
| HRANDFIELD baseline_hash -16 | 0.672 | 0.004 | 0.675 | 0.006 | -0.003 | -0.4% |
| HRANDFIELD expired_heavy_hash -16 | 0.249 | 0.004 | 0.248 | 0.008 | +0.001 | +0.5% |
| HRANDFIELD baseline_hash 4 | 0.434 | 0.001 | 0.435 | 0.004 | -0.001 | -0.3% |
| HRANDFIELD expired_heavy_hash 4 | 0.235 | 0.003 | 0.234 | 0.004 | +0.001 | +0.3% |
| HRANDFIELD baseline_hash 16 | 0.715 | 0.004 | 0.720 | 0.005 | -0.005 | -0.7% |
| HRANDFIELD expired_heavy_hash 16 | 0.238 | 0.004 | 0.234 | 0.001 | +0.004 | +1.6% |
| HRANDFIELD baseline_hash 4 WITHVALUES | 0.444 | 0.005 | 0.448 | 0.001 | -0.005 | -1.0% |
| HRANDFIELD expired_heavy_hash 4 WITHVALUES | 0.241 | 0.003 | 0.240 | 0.007 | +0.002 | +0.7% |

Across those repeated runs, the current branch stayed within about -1.0% to +2.2% of
unstable, which suggests no meaningful performance regression while correcting the expired-heavy
behavior.

Handle expired-heavy hashes without letting stale entries break HRANDFIELD.

Keep random probing as the fast path, do conservative bounded upfront cleanup, and fall back to a single validated pass when needed.

The bounded cleanup may also reclaim expired hash fields from this read path, so the fix can trigger field-deletion propagation when stale entries are encountered, but only on primaries.

Use a small default prune limit of 32 to bound read-path cleanup work, and keep the expired-heavy regression coverage in the hash test file under slow tags while streamlining the fallback paths to avoid unnecessary collection or duplicate lookups.

Signed-off-by: DaeMyung Kang <charsyam@gmail.com>
@codecov

codecov Bot commented Apr 24, 2026

Codecov Report

❌ Patch coverage is 81.81818% with 18 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.70%. Comparing base (9709843) to head (67ed6e9).
⚠️ Report is 27 commits behind head on unstable.

| Files with missing lines | Patch % | Lines |
|---|---:|---:|
| src/t_hash.c | 81.81% | 18 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #3558      +/-   ##
============================================
+ Coverage     76.46%   76.70%   +0.24%     
============================================
  Files           159      160       +1     
  Lines         81675    80558    -1117     
============================================
- Hits          62454    61795     -659     
+ Misses        19221    18763     -458     
| Files with missing lines | Coverage Δ |
|---|---:|
| src/t_hash.c | 94.65% <81.81%> (-0.83%) ⬇️ |

... and 127 files with indirect coverage changes


@hpatro hpatro requested a review from ranshid April 24, 2026 20:11
Member

@ranshid ranshid left a comment


Thanks for tackling this. A few thoughts on direction:

Background. The 100 retry cap in hashTypeRandomElement is a known and intentional limitation from the hash-field-expiration work. At the time we accepted that if the physical hash becomes dominated by expired entries, HRANDFIELD may return fewer results than requested — with the reasoning that in realistic workloads the active expire cycle keeps the live/stale
ratio bounded well above the threshold where the 100-probe cap actually fails.

Concretely, the probability of all 100 probes missing is (1 - p)^100 where p is the live fraction. Bumping the cap to 1000 helps dramatically on realistic pathological cases:

| live fraction | P(all 100 miss) | P(all 1000 miss) |
|---|---:|---:|
| 10% | 2.7 × 10⁻⁵ | 1.7 × 10⁻⁴⁶ |
| 5% | 5.9 × 10⁻³ | 5.3 × 10⁻²³ |
| 1% | 0.366 | 4.3 × 10⁻⁵ |
| 0.5% | 0.606 | 6.6 × 10⁻³ |
| 0.1% | 0.905 | 0.368 |
| 0.01% (1 in 10k) | 0.990 | 0.905 |

So raising the cap alone eliminates the bug for any hash down to ~0.5% live, which covers essentially every realistic workload where active expire is merely lagging (rather than broken). 1000 random probes are still much cheaper than one O(N) validated scan for any non-tiny hash, and the happy path still hits on probe 1 or 2.
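
For anyone who wants to reproduce the table, a standalone back-of-envelope check (my own, not server code):

#include <math.h>
#include <stdio.h>

int main(void) {
    /* With live fraction p and k independent uniform probes, all k
     * probes miss the live entries with probability (1 - p)^k. */
    const double live[] = {0.10, 0.05, 0.01, 0.005, 0.001, 0.0001};
    for (size_t i = 0; i < sizeof(live) / sizeof(live[0]); i++) {
        double p = live[i];
        printf("live %6.2f%%  P(100 miss) = %.3g  P(1000 miss) = %.3g\n",
               p * 100.0, pow(1.0 - p, 100), pow(1.0 - p, 1000));
    }
    return 0;
}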

That said, hardening the tail behavior is fine and I'm supportive. A couple of design points I'd want addressed before merging:

  1. Do not reclaim expired fields on a read path. hrandfieldMaybePruneExpiredFields calls dbReclaimExpiredFields, which deletes fields, propagates HDEL/DEL to replicas and AOF, and emits keyspace notifications. That's lazy expiration on a read path, and it's something we deliberately do not do for hash fields — hash field expiration is handled by the active
    expire cycle only. Read paths skip expired fields via the hashtable validateEntry callback without touching them. Please drop this whole mitigation (and the HRANDFIELD_EXPIRED_PRUNE_LIMIT, the iAmPrimary/import_mode/lazy_expire_disabled/isPausedActionsWithUpdate guard, and the re-lookup dance).

  2. Prefer a much simpler fix inside hashTypeRandomElement:

  • Raise maxtries from 100 to 1000. This one-line change covers the realistic failure regime per the probability table above.
  • On top of that, if all 1000 probes miss, do a single-pass reservoir-of-one scan over the live entries (the hashtable iterator filters expired via validateEntry, so this naturally walks only live entries) and return that one. The caller loops as it does today with no changes. A minimal sketch follows below.
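
A minimal sketch of that reservoir-of-one pass (illustrative only — it borrows the iterator shape from the helper quoted later in this thread, not the actual patch):

static entry *hashTypeRandomLiveEntry(robj *hash) {
    entry *picked = NULL;
    unsigned long seen = 0;
    hashTypeIterator hi;
    hashTypeInitIterator(hash, &hi);
    /* hashTypeNext() skips expired fields via validateEntry, so the
     * loop body only ever sees live entries. */
    while (hashTypeNext(&hi) != C_ERR) {
        seen++;
        /* Replace the candidate with probability 1/seen; the final
         * pick is uniform over all live entries (Algorithm R, k=1). */
        if (random() % seen == 0) picked = hi.next;
    }
    hashTypeResetIterator(&hi);
    return picked; /* NULL if every field is expired. */
}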

This keeps the fix contained to hashTypeRandomElement and eliminates the need for hrandfieldMaybePruneExpiredFields, hrandfieldCollectLiveHashtableEntries, hrandfieldSelectRandomLiveHashtableEntry, hrandfieldReplyFromCollectedEntries, and hrandfieldReplyFromReservoirSample.

Worst-case cost for HRANDFIELD key on an expired-heavy hash becomes O(count × N_physical) in principle instead of this PR's O(N_physical), but with the retry cap at 1000 the fallback is essentially dead code for any workload short of adversarial. The degenerate >99%-expired case means active expire is already broken and HRANDFIELD latency isn't the
primary problem.

@charsyam
Contributor Author

@ranshid Thanks for your review.
I think it is better to close this PR — it can't deliver the original intent of hardening the tail behavior. What do you think?

After the review, I tried two contained approaches inside hashTypeRandomElement and benchmarked them. The
numbers showed neither delivers the intent, so closing makes more sense than merging.

Attempt 1: maxtries 100→1000 with reservoir-of-one fallback

After 1000 random probes miss, do a single validated pass and pick one live entry via reservoir sampling. No
caller changes.

Benchmark on 20k expired / 4 live (~0.02% live):

| command | original | reservoir fallback |
|---|---:|---:|
| HRANDFIELD key | 21.3k rps | 2.27k rps (9.4x slower) |
| HRANDFIELD -16 | 20.4k rps | 152 rps |
| HRANDFIELD 4 | 20.8k rps | 285 rps |
| HRANDFIELD 16 | returned 1 | 3s timeout (hang) <-- this is critical |

The hang in HRANDFIELD 16 is because the new fallback always returns a live entry, so CASE 4's while (added < count) loop adds the 4 live entries and then receives only duplicates forever. Both correctness
and performance regress.

Attempt 2: only bump maxtries 100→1000 (no fallback)

| command | original | 1000-only |
|---|---:|---:|
| HRANDFIELD key | 19.2k rps | 2.97k rps |
| HRANDFIELD -16 | 20.8k rps | 2.51k rps |
| HRANDFIELD 4 | 20.0k rps | 2.49k rps |
| HRANDFIELD 16 | returned 1 | returned 1, no hang |

Cost stays bounded at 1000 probes per call and the hang is gone. But the originally reported 0.02% live case
is still ~90% miss — the bump does not fix it.

Where 100→1000 actually helps

| live ratio | P(100 miss) | P(1000 miss) |
|---|---:|---:|
| 5% | 0.6% | 5.3e-23 |
| 1% | 36.6% | 4.3e-5 |
| 0.5% | 60.6% | 0.66% |
| 0.1% | 90.5% | 36.8% |
| 0.02% (reported) | 99% | 90% |

The bump only meaningfully covers the 1–5% live transient regime. Above 5%, the existing 100 cap already
suffices. Below 0.5%, 1000 also fails. The reported case sits well below the regime where the bump helps.

@ranshid
Member

ranshid commented Apr 28, 2026


@charsyam You are right. I did not think through the CASE 4 potential hang. I would still like to give it a try, though: we could do your original proposal while only removing the prune steps.

@daemyung-lablup

@ranshid Thanks for your advice. I will try it soon.

Bound HRANDFIELD on expired-heavy hashes without letting the read
path delete expired fields.

Per @ranshid's review, hash field expiration is handled by the
active expire cycle only — read paths must skip expired fields via
validateEntry without reclaiming them. Drop the upfront prune step
introduced in abdd7dc (hrandfieldMaybePruneExpiredFields and the
HRANDFIELD_EXPIRED_PRUNE_LIMIT guard) and instead bound CASE 4 by
detecting wasted random probing in the caller.

Switch to a single validated reservoir-sample pass when either:
  - cumulative duplicates reach clamp(count / 2, 32, 256), or
  - 8 consecutive duplicates occur (early signal for expired-heavy
    hashes).

Placing the guard in the caller (rather than in
hashTypeRandomElement) avoids the CASE 4 hang that an
always-succeeding helper would trigger: the unique-collection loop
would otherwise spin once the live pool is exhausted.

The cumulative budget is the main bound; the consecutive-duplicate
guard adds tail stability and does not affect the common path.

Unify the CASE 4 fallback and the negative-count fallback through a
new hrandfieldReplyFromValidatedHashtableEntries helper.

Add a regression test that exercises the case where live fields are
fewer than the requested count under paused expiration, forcing
CASE 4 through the new fallback path.

Signed-off-by: DaeMyung Kang <charsyam@gmail.com>
@charsyam
Contributor Author

charsyam commented Apr 28, 2026

@ranshid

I removed the prune step and tried a few variants for bounding CASE 4 before settling on the current version.

The current approach keeps HRANDFIELD read-only with respect to expired-field cleanup, and only changes the
CASE 4 no-progress guard. The guard now switches to the validated reservoir fallback when either:

  • cumulative duplicates reach clamp(count / 2, 32, 256), or
  • consecutive duplicates reach 8 (the guard is sketched below).
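
Schematically, the guard looks like this (condensed and illustrative — the names are hypothetical, only the thresholds come from the patch discussion):

#include <stdbool.h>

/* count/2 clamped to [32, 256] is the cumulative duplicate budget;
 * a streak of 8 consecutive duplicates is the early exit signal. */
typedef struct {
    unsigned long budget;     /* cumulative duplicate budget */
    unsigned long dup_total;  /* duplicates seen so far */
    unsigned long dup_streak; /* consecutive duplicates */
} dupGuard;

static void dupGuardInit(dupGuard *g, unsigned long count) {
    unsigned long b = count / 2;
    if (b < 32) b = 32;   /* floor: small counts naturally collide */
    if (b > 256) b = 256; /* cap: bound work for large counts */
    g->budget = b;
    g->dup_total = 0;
    g->dup_streak = 0;
}

/* Record one probe result; returns true once random probing should
 * stop and the validated reservoir fallback should take over. */
static bool dupGuardShouldFallBack(dupGuard *g, bool was_duplicate) {
    if (!was_duplicate) {
        g->dup_streak = 0; /* progress: only the streak resets */
        return false;
    }
    g->dup_total++;
    g->dup_streak++;
    return g->dup_total >= g->budget || g->dup_streak >= 8;
}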

I measured the pathological CASE 4 shape with HLEN=300, 201 expired fields still physically present, 99 live
fields, and HRANDFIELD key 100. This forces CASE 4 even though only 99 live fields can be returned.

| Variant | Clients | Avg Latency | P50 | P99 | Max | Notes |
|---|---:|---:|---:|---:|---:|---|
| Consecutive duplicate guard only | 50 | ~26 ms | ~26.3 ms | ~30 ms | ~32 ms | Bounded, but fallback happens too late. |
| Cumulative duplicate budget, max(32, count / 2) | 50 | ~7.0 ms | ~7.05 ms | ~8.4-8.7 ms | ~10-12 ms | Main improvement; avoids waiting for a long duplicate streak. |
| Current: cumulative budget capped + short streak guard | 50 | 6.547 ms | 6.607 ms | 7.623 ms | 10.375 ms | Modest extra tail improvement; prevents large-count budgets from growing unboundedly. |
| Current: cumulative budget capped + short streak guard | 1 | 0.153 ms | 0.159 ms | 0.223 ms | 1.151 ms | Single-client command cost. |

So the main latency improvement came from moving from a consecutive duplicate streak to a cumulative duplicate
budget. The cap and short-streak guard are smaller improvements, mostly for tail stability and large-count
safety.

I also reran the normal/non-pathological hashtable benchmark. This is a 3-run average with requests=20000,
clients=50, pipeline=1, and baseline_hash with 20k live fields:

| Command | Avg RPS | Avg Latency | P50 | P99 |
|---|---:|---:|---:|---:|
| HRANDFIELD key | 121704.85 | 0.213 ms | 0.207 ms | 0.364 ms |
| HRANDFIELD key -16 | 72640.71 | 0.637 ms | 0.610 ms | 1.034 ms |
| HRANDFIELD key 4 | 111320.31 | 0.407 ms | 0.388 ms | 0.708 ms |
| HRANDFIELD key 16 | 68494.75 | 0.678 ms | 0.660 ms | 1.068 ms |
| HRANDFIELD key 4 WITHVALUES | 108122.75 | 0.415 ms | 0.399 ms | 0.703 ms |

This does not show a noticeable regression in the normal hashtable cases, while the pathological CASE 4 path is
now bounded much earlier.

@ranshid
Member

ranshid commented Apr 30, 2026

@charsyam thank you! I just want to clean it up a bit (before reviewing the tests).
Several comments:

  1. All the static functions need descriptive comments: what they do, what parameters they accept, etc.
  2. Let's try to decouple the reply logic from the reservoir sampling (Algorithm R). I would rather the new helper functions just fill up arrays and/or sort them, and keep the reply logic as part of the command handling.
  3. CASE 4 handling is correct, but I am trying to think if we can clean it up a bit.
    I was thinking that maybe the only thing CASE 4 needs is: on the first C_ERR from hashTypeRandomElement, collect the remaining entries via reservoir sampling and reply with them. No duplicate budget, no constants.
    You can introduce a pure data-collection helper that returns an array of entry pointers without touching the client output buffer:
/* Returns up to `count` unique live entries not in `ignore_entries`, via
 * reservoir sampling. Caller owns the returned array (zfree it).
 * Actual count stored in *out_count. */
static entry **hrandfieldSampleLiveEntries(robj *hash, unsigned long count,
                                            hashtable *ignore_entries,
                                            unsigned long *out_count) {
    serverAssert(hash->encoding == OBJ_ENCODING_HASHTABLE);
    entry **reservoir = zmalloc(sizeof(*reservoir) * count);
    unsigned long seen = 0, kept = 0;
    hashTypeIterator hi;
    hashTypeInitIterator(hash, &hi);

    while (hashTypeNext(&hi) != C_ERR) {
        entry *e = hi.next;
        if (ignore_entries && hashtableFind(ignore_entries, entryGetField(e), NULL))
            continue;
        seen++;
        if (kept < count) {
            reservoir[kept++] = e;
        } else {
            unsigned long j = random() % seen;
            if (j < count)
                reservoir[j] = e;
        }
    }
    hashTypeResetIterator(&hi);

    /* Shuffle to randomize output order */
    for (unsigned long i = 0; i < kept; i++) {
        unsigned long j = i + (random() % (kept - i));
        entry *tmp = reservoir[i];
        reservoir[i] = reservoir[j];
        reservoir[j] = tmp;
    }

    *out_count = kept;
    return reservoir;
}

Then CASE 4 becomes just a C_ERR fallback — no duplicate budget, no new constants:

while (added < count) {
    if (hashTypeRandomElement(hash, size, &field, withvalues ? &value : NULL) != C_OK) {
        unsigned long kept;
        entry **entries = hrandfieldSampleLiveEntries(hash, count - added, ht, &kept);
        for (unsigned long i = 0; i < kept; i++) {
            sds f = entryGetField(entries[i]);
            size_t vlen;
            char *v = entryGetValue(entries[i], &vlen);
            hashtableAdd(ht, sdsdup(f));
            if (withvalues && c->resp > 2) addWritePreparedReplyArrayLen(wpc, 2);
            addWritePreparedReplyBulkCBuffer(wpc, f, sdslen(f));
            if (withvalues) addWritePreparedReplyBulkCBuffer(wpc, v, vlen);
            added++;
        }
        zfree(entries);
        break;
    }

    /* ... existing dedup + reply code unchanged ... */
}

The ht that CASE 4 already maintains for dedup serves double duty as the ignore_entries set for the reservoir fallback — no extra data structure needed.

charsyam added 2 commits May 1, 2026 00:34
Commit 348e946 simplified CASE 4 to rely solely on hashTypeRandomElement's
C_ERR signal for the fallback trigger. That signal only fires after 100
*consecutive* expired probes (maxtries=100), which leaves a hang window
when the expired ratio is roughly 67~95% AND count > live_count: every
random pick resolves to a live but already-collected entry, the duplicate
loop continues indefinitely, and C_ERR has probability ~1e-10 per call.

Reproduced on a 900-field hash (700 expired + 200 live, 78% expired) with
HRANDFIELD key 250: server thread spun at 98.5% CPU for 17+ minutes,
PING from another client timed out — the entire instance was unresponsive.

Restore the cumulative + consecutive duplicate budget on top of the
helper structure introduced in 348e946 (the helper signature, ht
double-duty as ignore_entries, and the reply/sampling decoupling are all
preserved). Only CASE 4's loop exit condition changes.

The constants are count-proportional with bounded floor and cap so
neither small nor large counts misbehave:
  - MIN floor (32):  small counts have natural coupon-collector dups
                     above 4 (e.g. count=4, live=4 expects ~4 dups)
  - MAX cap  (256):  large counts approach the cost of one O(size)
                     iteration once cumulative dups reach this many
  - STREAK    (8):   early signal for very-expired-heavy hashes where
                     dup streaks appear before the cumulative budget

Benchmark on the patched build (same hash shapes, 50-200 iterations):
  healthy hashes (count up to 1000):  2.1-2.3 ms/op (no regression)
  mildly expired (count << live):     2.2-2.3 ms/op
  edge (heavy expired, count <= live): 2.6-3.3 ms/op
  hang zone (count > live, 67-95%):    5-7 ms/op (was: server-blocking)
  safe extreme (>99% expired):         3.6 ms/op (C_ERR path)

Add a regression test exercising the hang zone (700 expired + 200 live,
count=250) with a wall-clock assertion that completes in under one
second — a regression would block until the test runner's overall
timeout fires.

Signed-off-by: DaeMyung Kang <charsyam@gmail.com>
@charsyam
Contributor Author

@ranshid

Thanks, I kept the reservoir helper as a pure data-collection helper, as suggested.
The command handler still owns the reply logic, and the helper only returns live
entries excluding the already-selected fields.

I initially tried to follow the C_ERR-only fallback shape, but while testing the
CASE 4 edge cases I found one liveness issue. CASE 4 is the path that keeps
random-probing until it collects count unique fields. A C_ERR-only fallback is
not sufficient there because hashTypeRandomElement() returns C_ERR only after
100 consecutive expired probes.

When the hash is moderately expired and count is larger than the number of
live fields, CASE 4 can collect all live fields and then keep sampling
already-selected live fields. Those duplicate live hits prevent the
100-consecutive-expired condition from being reached, so the loop can spin
indefinitely while added < count. I reproduced this as a server-thread hang,
so I kept the duplicate budget only as a CASE 4 progress bound.
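
To put a number on why that exit never fires in the hang zone, here is a standalone back-of-envelope check (mine, not part of the patch): with expired ratio q, one hashTypeRandomElement call returns C_ERR only when all 100 consecutive probes land on expired entries, i.e. with probability q^100.

#include <math.h>
#include <stdio.h>

int main(void) {
    const double expired[] = {0.67, 0.78, 0.90, 0.99};
    /* Prints roughly: 67%: 4.1e-18, 78%: 1.6e-11, 90%: 2.7e-05,
     * 99%: 0.37 — so only the ~99% row exits via C_ERR quickly. */
    for (size_t i = 0; i < sizeof(expired) / sizeof(expired[0]); i++)
        printf("expired %2.0f%%  P(C_ERR per call) = %.2g\n",
               expired[i] * 100.0, pow(expired[i], 100));
    return 0;
}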

The duplicate budget is not a performance optimization and is not part of the
reservoir sampling algorithm. It only decides when CASE 4 should stop random
probing and switch to the pure reservoir helper. Once CASE 4 stops making
progress because of duplicates, it collects the remaining live entries through
the helper and exits.

I also reproduced this with the same hash shapes on the same machine
(50-200 iterations per row):

| scenario | expired % | count | C_ERR-only fallback (348e946d) | patched |
|---|---:|---:|---:|---:|
| healthy 1K live | 0% | 10 | 3.55 ms | 2.15 ms |
| healthy 1K live | 0% | 100 | 4.18 ms | 2.29 ms |
| healthy 10K live | 0% | 100 | 3.27 ms | 2.13 ms |
| healthy 10K live | 0% | 1000 | 3.34 ms | 2.24 ms |
| 30% expired, count << live | 30% | 10 | 4.25 ms | 2.17 ms |
| 50% expired, count << live | 50% | 10 | 4.58 ms | 2.18 ms |
| 78% expired, count = live | 78% | 200 | 8.38 ms | 3.28 ms |
| 99% expired, count <= live | 99% | 100 | 5.55 ms | 2.55 ms |
| 78% expired, count > live | 78% | 201 | HANG | 6.00 ms |
| 78% expired, count > live | 78% | 250 | HANG | 7.00 ms |
| 67% expired, count > live | 67% | 250 | HANG | 6.00 ms |
| 90% expired, count > live | 90% | 250 | HANG | 6.00 ms |
| 99% expired, count > live (C_ERR path) | 99% | 250 | 3.60 ms | 3.60 ms |

The key point is not the timing improvement. The important rows are the
moderately-expired count > live cases: with a C_ERR-only fallback the server
thread can hang, while the patched CASE 4 exits through the bounded duplicate
fallback. The 99% expired row also matches the explanation, because C_ERR is
likely to fire quickly only when almost every probe is expired.

@ranshid
Member

ranshid commented Apr 30, 2026


@charsyam, true, I was more focused on the first part of CASE 4. I would still rather choose a simpler solution than complicate this (already very complicated) function. I think it is fine to do a reservoir-of-one in CASE 1; for CASE 4 I might even prefer to just use a hrandfieldCollectLiveEntries helper — a pure data helper that collects all live entries into an array and Fisher-Yates shuffles them, like:

else {
    if (hashTypeHasVolatileFields(hash)) {
        unsigned long out_nr;
        entry **entries = hrandfieldCollectLiveEntries(hash, &out_nr);
        unsigned long to_add = count < out_nr ? count : out_nr;
        for (unsigned long i = 0; i < to_add; i++) {
            sds f = entryGetField(entries[i]);
            size_t vlen;
            char *v = entryGetValue(entries[i], &vlen);
            if (withvalues && c->resp > 2) addWritePreparedReplyArrayLen(wpc, 2);
            addWritePreparedReplyBulkCBuffer(wpc, f, sdslen(f));
            if (withvalues) addWritePreparedReplyBulkCBuffer(wpc, v, vlen);
        }
        reply_size = to_add;
        zfree(entries);
    } else {
        /* Original CASE 4 — no volatile fields */
        ...
    }
}

The cost is O(N_physical) when volatile fields exist, but that's unavoidable — any correct approach needs to find the live entries. As a further refinement, you could narrow the check using
vsetEstimatedEarliestExpiry to only take this path when expired entries actually exist (not just future TTLs), but that's optional.

Per @ranshid's review, replace the duplicate-budget probing loop with a
much simpler split: when the hash carries any volatile (TTL'd) fields,
walk it once via hrandfieldCollectLiveEntries (Fisher-Yates shuffled)
and emit the first count entries. When the hash has no volatile fields
the original random-probing loop is preserved verbatim, since no
expired slots can ever be encountered.

This eliminates the budget machinery and its three constants
(HRANDFIELD_RANDOM_DUPLICATE_{MIN,MAX,STREAK}_LIMIT), the duplicate
counters, and the hrandfieldSampleLiveEntries helper. The hang
window addressed in 011b08a is structurally impossible in the new
shape since the volatile path does not loop on hashTypeRandomElement.

Trade-off: hashes with all-TTL fields whose expirations are still in
the future no longer use random probing. Measured cost on a non-
expired all-TTL hash:
  total=1K   count=100  collect=2.48ms vs probing=2.58ms
  total=10K  count=100  collect=2.74ms vs probing=2.52ms
  total=100K count=100  collect=5.60ms vs probing=2.57ms
The optional vsetEstimatedEarliestExpiry refinement that would gate
the collect path on actual expired entries cannot be used here: that
function is documented as approximate, and empirically returns false
negatives for our workload (the test suite's
HRANDFIELD-on-expired-heavy cases regressed when gated on it),
which would silently fall through to the probing path and lose
correctness.

Test surface unchanged (91/91 pass), including the count-greater-
than-live regression test added in 011b08a.

Bench (same harness, 50-200 iters per row):

| scenario                     | expired% | count | result   |
| ---                          |   ---:   |  ---: |    ---:  |
| healthy 1K live              |    0%    |  100  |  2.23 ms |
| healthy 10K live             |    0%    | 1000  |  2.20 ms |
| 30% expired, count<<live     |   30%    |   10  |  2.18 ms |
| 78% expired, count=live      |   78%    |  200  |  2.28 ms |
| 78% expired, count>live      |   78%    |  201  |  2.10 ms |
| 90% expired, count>live      |   90%    |  250  |  2.20 ms |
| 99% expired, count>live      |   99%    |  250  |  2.20 ms |

Signed-off-by: DaeMyung Kang <charsyam@gmail.com>
@charsyam
Contributor Author

charsyam commented May 1, 2026

@ranshid

Adopted your simpler shape — CASE 4 now splits on whether the hash carries volatile fields. When it does,
hrandfieldCollectLiveEntries walks the hash once, Fisher-Yates shuffles the live entries, and the caller emits the first
count. When it doesn't, the original random-probing loop is kept verbatim (no expired slots can ever appear, so the probing
loop has no termination concern).
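
Roughly, the collect helper has this shape (a sketch assuming the iterator API from the earlier snippets in this thread, not the literal patch):

static entry **hrandfieldCollectLiveEntries(robj *hash, unsigned long *out_nr) {
    unsigned long capacity = 16, nr = 0;
    entry **entries = zmalloc(sizeof(*entries) * capacity);
    hashTypeIterator hi;
    hashTypeInitIterator(hash, &hi);
    /* One validated walk: hashTypeNext() yields live entries only. */
    while (hashTypeNext(&hi) != C_ERR) {
        if (nr == capacity) {
            capacity *= 2;
            entries = zrealloc(entries, sizeof(*entries) * capacity);
        }
        entries[nr++] = hi.next;
    }
    hashTypeResetIterator(&hi);
    /* Fisher-Yates: every permutation is equally likely, so any prefix
     * of length `count` is a uniform sample without replacement. */
    for (unsigned long i = 0; nr > 1 && i < nr - 1; i++) {
        unsigned long j = i + random() % (nr - i);
        entry *tmp = entries[i];
        entries[i] = entries[j];
        entries[j] = tmp;
    }
    *out_nr = nr;
    return entries;
}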

The duplicate-budget machinery from 011b08ac is gone — three constants, the dup counters, and hrandfieldSampleLiveEntries
all removed. The hang window is structurally impossible in this shape since the volatile path doesn't loop on
hashTypeRandomElement.

About the optional vsetEstimatedEarliestExpiry refinement

I tried it first to keep probing on hashes whose TTLs are all in the future. The test suite's expired-heavy cases regressed:

[err]: HRANDFIELD WITHVALUES returns all live fields on an expired-heavy hash
  Expected '8' to be equal to '0'
[err]: HRANDFIELD returns unique live fields on an expired-heavy hash
  Expected '4' to be equal to '0'

A direct sweep showed vsetEstimatedEarliestExpiry is approximate in a way that bites here: it sometimes reports an expiry
beyond commandTimeSnapshot() even when many fields are already past their expiry. The gate then routed the call to the
probing path on an effectively all-expired hash, where hashTypeRandomElement returns C_ERR before finding any of the few
live entries — yielding a fast but wrong answer (a small count, or zero):

expired=100   live=4: hrandfield 4 -> 4 entries
expired=500   live=4: hrandfield 4 -> 3 entries     # under-returned
...
expired=15000 live=4: hrandfield 4 -> 1 entry       # under-returned
expired=20000 live=4: hrandfield 4 -> 4 entries

Since the failure mode is silent correctness loss rather than hang, and the function's docstring already declares itself
approximate, I dropped the refinement. The gate is now just hashTypeHasVolatileFields, which is an exact emptiness check on
the volatile set.

Trade-off

A hash whose fields all carry future-only TTLs now goes through the collect path instead of probing. Measured cost on a
non-expired all-TTL hash, count=100, 30 iters/row:

| total fields | collect path (this PR) | probing (no TTL baseline) |
|---:|---:|---:|
| 1,000 | 2.48 ms | 2.58 ms |
| 10,000 | 2.74 ms | 2.52 ms |
| 100,000 | 5.60 ms | 2.57 ms |

Below ~10K total fields the difference is in the noise; at 100K it's roughly 2× but still single-digit milliseconds.

Benchmark across the design

Same harness as before, 50–200 iters/row.

| scenario | expired % | count | result |
|---|---:|---:|---|
| healthy 1K live | 0% | 10 | 2.15 ms |
| healthy 1K live | 0% | 100 | 2.23 ms |
| healthy 10K live | 0% | 100 | 2.09 ms |
| healthy 10K live | 0% | 1000 | 2.20 ms |
| 30% expired, count << live | 30% | 10 | 2.18 ms |
| 50% expired, count << live | 50% | 10 | 2.14 ms |
| 78% expired, count = live | 78% | 200 | 2.28 ms |
| 99% expired, count ≤ live | 99% | 100 | 2.44 ms |
| 78% expired, count > live | 78% | 201 | 2.10 ms (was: hang on 348e946d) |
| 78% expired, count > live | 78% | 250 | 2.30 ms (was: hang) |
| 67% expired, count > live | 67% | 250 | 2.25 ms (was: hang) |
| 90% expired, count > live | 90% | 250 | 2.20 ms (was: hang) |
| 99% expired, count > live (probing) | 99% | 250 | 2.20 ms |

Tests

tests/unit/type/hash.tcl keeps the regression test from the previous push (700 expired + 200 live, count=250, wall-clock <
1 s). All 91 tests in unit/type/hash pass, including the five expired-heavy slow tests (~4.3 s each) and the new hang
regression (185 ms).
