Fix fsmeta chaos and nightly correctness checks by feichai0017 · Pull Request #225 · feichai0017/NoKV

feichai0017 · 2026-05-05T02:46:13Z

Summary

make fsmeta concurrent history checking keep bounded alternative serializations instead of committing to the first legal order
model Docker chaos availability errors as indeterminate mutation outcomes where the request may have crossed the service boundary
isolate external fsmeta history seeds by shifting generated inode IDs into the per-seed scope
fix dirpage cache identity so ReadDirPlus pages are keyed by mount, parent, StartAfter, and Limit while invalidation remains directory-scoped
validate storage-backed coordinator root events against the durable root storage snapshot rather than a potentially stale local view

Tests

go test ./coordinator/server ./coordinator/catalog ./fsmeta/contract ./fsmeta/client ./cmd/nokv-fsmeta-history ./cmd/nokv-fsmeta-soak -count=1
NOKV_CONTRACT_HISTORY_SEEDS=64 NOKV_CONTRACT_HISTORY_STEPS=240 NOKV_CONTRACT_HISTORY_BATCH=3 go test ./fsmeta/contract -run TestFSMetaExecutorConcurrentHistoryContract -count=1
go test ./engine/slab/dirpage ./fsmeta/exec -count=1
go test ./fsmeta/... ./engine/slab/dirpage ./cmd/nokv-fsmeta-history -count=1
NOKV_DOCKER_CHAOS_SEEDS=2 NOKV_DOCKER_CHAOS_STEPS=32 NOKV_DOCKER_CHAOS_BATCH=3 NOKV_DOCKER_CHAOS_TIMEOUT=120s NOKV_DOCKER_CHAOS_INTERVAL=2 NOKV_DOCKER_CHAOS_DOWN=1 ./scripts/chaos/docker_fsmeta_history.sh

Workflow Verification

Public Docker Chaos: passed on 957ea6c5
Private Docker Chaos: passed on 957ea6c5
Public Nightly Correctness: passed on 957ea6c5
Private Nightly Correctness: passed on 957ea6c5

Summary by CodeRabbit

New Features
- Added snapshot-based validation for root events
- Enhanced error handling for indeterminate scenarios in concurrent operations
Bug Fixes
- Improved error code mapping for retry exhaustion conditions
Tests
- Added validation tests for snapshot-backed event verification
- Expanded cache pagination behavior verification
Chores
- Optimized Docker build context
- Updated chaos testing configuration

github-actions

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.15.

Benchmark suite	Current: `55c9e3a`	Previous: `3b73389`	Ratio
`BenchmarkARTGet (github.com/feichai0017/NoKV/engine/index)`	`344.7` ns/op 0 B/op 0 allocs/op	`292.1` ns/op 0 B/op 0 allocs/op	`1.18`
`BenchmarkARTGet (github.com/feichai0017/NoKV/engine/index) - ns/op`	`344.7` ns/op	`292.1` ns/op	`1.18`
`BenchmarkARTIteratorNext (github.com/feichai0017/NoKV/engine/index)`	`94.91` ns/op 0 B/op 0 allocs/op	`56.29` ns/op 0 B/op 0 allocs/op	`1.69`
`BenchmarkARTIteratorNext (github.com/feichai0017/NoKV/engine/index) - ns/op`	`94.91` ns/op	`56.29` ns/op	`1.69`
`BenchmarkSkiplistInsert (github.com/feichai0017/NoKV/engine/index)`	`1900` ns/op 33.69 MB/s 161 B/op 1 allocs/op	`1047` ns/op 61.12 MB/s 159 B/op 1 allocs/op	`1.81`
`BenchmarkSkiplistInsert (github.com/feichai0017/NoKV/engine/index) - ns/op`	`1900` ns/op	`1047` ns/op	`1.81`
`BenchmarkL0SelectTablesForKeyLinear (github.com/feichai0017/NoKV/engine/lsm)`	`702139895` ns/op	`1067` ns/op	`658050.51`
`BenchmarkDirPageMaterializeAsync/entries=10000 (github.com/feichai0017/NoKV/engine/slab/dirpage)`	`3928460` ns/op 6766991 B/op 2074 allocs/op	`3396147` ns/op 6766466 B/op 2072 allocs/op	`1.16`
`BenchmarkDirPageMaterializeAsync/entries=10000 (github.com/feichai0017/NoKV/engine/slab/dirpage) - ns/op`	`3928460` ns/op	`3396147` ns/op	`1.16`
`BenchmarkDirPageInvalidate (github.com/feichai0017/NoKV/engine/slab/dirpage)`	`55.59` ns/op 0 B/op 0 allocs/op	`46.94` ns/op 0 B/op 0 allocs/op	`1.18`
`BenchmarkDirPageInvalidate (github.com/feichai0017/NoKV/engine/slab/dirpage) - ns/op`	`55.59` ns/op	`46.94` ns/op	`1.18`
`BenchmarkWALReplay (github.com/feichai0017/NoKV/engine/wal)`	`42084047` ns/op 5991621 B/op 83377 allocs/op	`35406397` ns/op 5991546 B/op 83376 allocs/op	`1.19`
`BenchmarkWALReplay (github.com/feichai0017/NoKV/engine/wal) - ns/op`	`42084047` ns/op	`35406397` ns/op	`1.19`
`BenchmarkDBBatchSet/NoSync (github.com/feichai0017/NoKV/local) - MB/s`	`156.6` MB/s	`133.25` MB/s	`1.18`
`BenchmarkDBBatchSet/SyncInline (github.com/feichai0017/NoKV/local) - MB/s`	`36.61` MB/s	`29.96` MB/s	`1.22`
`BenchmarkDBBatchSet/SyncPipeline (github.com/feichai0017/NoKV/local) - MB/s`	`36.8` MB/s	`31.3` MB/s	`1.18`
`BenchmarkDBCommitInlineValueSizes/Inline_64B (github.com/feichai0017/NoKV/local) - MB/s`	`45.31` MB/s	`36.76` MB/s	`1.23`
`BenchmarkDBCommitInlineValueSizes/Inline_4KB (github.com/feichai0017/NoKV/local) - MB/s`	`866.84` MB/s	`732.9` MB/s	`1.18`
`BenchmarkDBIteratorScan (github.com/feichai0017/NoKV/local/internal/iterator) - B/op`	`93` B/op	`33` B/op	`2.82`
`BenchmarkMPSCQueuePushPop/producers=4 (github.com/feichai0017/NoKV/utils)`	`189.7` ns/op	`155.3` ns/op	`1.22`
`BenchmarkMPSCQueuePushPop/producers=8 (github.com/feichai0017/NoKV/utils)`	`221.5` ns/op	`173.7` ns/op	`1.28`
`BenchmarkMPSCQueuePushPop/producers=16 (github.com/feichai0017/NoKV/utils)`	`239.1` ns/op	`194.2` ns/op	`1.23`
`BenchmarkMPSCQueueConsumerSessionPushPop/producers=8 (github.com/feichai0017/NoKV/utils)`	`160.1` ns/op	`130.2` ns/op	`1.23`
`BenchmarkMPSCQueuePushOnlyContention/producers=8 (github.com/feichai0017/NoKV/utils)`	`218.3` ns/op	`183.6` ns/op	`1.19`
`BenchmarkMPSCQueuePushOnlyContention/producers=16 (github.com/feichai0017/NoKV/utils)`	`232.1` ns/op	`195.4` ns/op	`1.19`
`BenchmarkRingPushPop/producers=4 (github.com/feichai0017/NoKV/utils)`	`89.86` ns/op	`76.15` ns/op	`1.18`
`BenchmarkRingPushPop/producers=8 (github.com/feichai0017/NoKV/utils)`	`88.16` ns/op	`74.04` ns/op	`1.19`
`BenchmarkRingPushPop/producers=16 (github.com/feichai0017/NoKV/utils)`	`87.85` ns/op	`75.35` ns/op	`1.17`

This comment was automatically generated by workflow using github-action-benchmark.

Copilot

Pull request overview

This PR improves fsmeta chaos/nightly correctness validation by making the concurrent history checker more robust to ambiguous outcomes, fixing dirpage cache identity for paginated ReadDirPlus, and ensuring storage-backed coordinator root-event validation uses the same durable snapshot authority used for lifecycle assessment.

Changes:

Extend the fsmeta concurrent history contract runner to retain a bounded set of alternative valid serializations and optionally treat certain availability failures as indeterminate commit outcomes.
Fix dirpage cache keying so ReadDirPlus pages are cached per (mount, parent, StartAfter, Limit) while invalidation remains directory-scoped.
Validate coordinator root events against the durable storage snapshot when running in storage-backed mode (to avoid stale in-memory cache validation).

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
scripts/chaos/docker_fsmeta_history.sh	Adds a toggle to allow indeterminate error handling for Docker chaos history runs.
fsmeta/server/service_test.go	Adds a gRPC error-mapping test case for retry-exhausted errors.
fsmeta/server/errors.go	Maps `KindRetryExhausted` to `codes.Unavailable`.
fsmeta/integration/history_contract_test.go	Updates concurrent history runner invocation to pass the new options struct.
fsmeta/exec/runner.go	Fixes dirpage cache identity to include pagination; invalidation now uses a directory-scoped key.
fsmeta/exec/runner_test.go	Strengthens cache-hit assertions and adds pagination-keying coverage for ReadDirPlus caching.
fsmeta/contract/model.go	Updates operation string formatting to include StartAfter/Limit for read operations.
fsmeta/contract/history.go	Adds HistoryOptions, bounded multi-candidate linearization, and optional indeterminate-error handling.
fsmeta/contract/history_test.go	Updates concurrent history runner invocation to pass HistoryOptions.
engine/slab/dirpage/codec.go	Extends the on-disk page format and key identity to include StartAfter/Limit; adds DirectoryKey abstraction.
engine/slab/dirpage/cache.go	Changes epochs to be directory-scoped and ensures lookups validate pagination identity.
engine/slab/dirpage/cache_test.go	Adds tests for pagination keying and directory-wide invalidation behavior.
engine/slab/dirpage/cache_bench_test.go	Updates invalidate benchmark to match directory-scoped invalidation.
coordinator/server/transition_service.go	Adjusts AssessRootEvent to match new lifecycle assessment return signature.
coordinator/server/service_test.go	Ensures fake storage snapshot cloning includes additional rooted state (subtrees/quotas/snapshot epochs).
coordinator/server/service_gateway.go	Validates storage-backed root events against the storage snapshot used for lifecycle assessment.
coordinator/catalog/cluster.go	Exposes a public `ValidateRootEventAgainstSnapshot` helper for storage-backed validation.
cmd/nokv-fsmeta-soak/main.go	Updates concurrent history runner invocation to pass HistoryOptions.
cmd/nokv-fsmeta-history/main.go	Adds allow-indeterminate flag, adds scope creation retry barrier, and isolates per-seed inode IDs.
cmd/nokv-fsmeta-history/main_test.go	Updates tests to match per-seed inode scoping and new scope creation operation helper.
.dockerignore	Ignores benchmark/ycsb/data in Docker builds.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

coderabbitai · 2026-05-05T02:52:52Z

Warning

Rate limit exceeded

@feichai0017 has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 41 minutes and 55 seconds before requesting another review.

To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f24a11ce-e651-4d35-9b6e-b33efe5cd255

📥 Commits

Reviewing files that changed from the base of the PR and between 957ea6c and 55c9e3a.

📒 Files selected for processing (9)

cmd/nokv-fsmeta-history/main.go
engine/slab/dirpage/cache.go
engine/slab/dirpage/cache_bench_test.go
engine/slab/dirpage/cache_test.go
engine/slab/dirpage/codec.go
fsmeta/contract/history.go
fsmeta/contract/history_test.go
fsmeta/exec/runner.go
fsmeta/exec/runner_test.go

📝 Walkthrough

Walkthrough

The PR introduces pagination-aware dirpage caching, multi-candidate filesystem history execution for improved test coverage, storage-backed root event validation, and per-seed scope isolation for history testing alongside error mapping and Docker build optimization.

Changes

Pagination-aware Dirpage Cache

Layer / File(s)	Summary
Data Shape `engine/slab/dirpage/codec.go`	`PageKey` expanded from `(Mount, Parent)` to `(Mount, Parent, StartAfter, Limit)`. New `DirectoryKey` type represents `(Mount, Parent)` for invalidation scope. Page headers now encode/decode `StartAfter` and `Limit`.
Cache Semantics `engine/slab/dirpage/cache.go`	Epoch tracking and invalidation now operate per-`DirectoryKey` instead of per-`PageKey`. Lookup validates full `PageKey` header match including pagination. `Invalidate` drops all pages for a directory; materialization re-checks epoch and splits pages using `StartAfter`.
Page Partitioning `engine/slab/dirpage/codec.go`	`splitIntoPages` updated to accept `startAfter` and adjust page sizing and boundary calculations to reflect pagination identity.
Executor Integration `fsmeta/exec/runner.go`	`DirPageCache` interface invalidation changes to accept `DirectoryKey`. Cache lookup and materialization now use cursor-specific `dirPageKey` with `StartAfter`/`Limit`. Decoding reconstructs parent context from `PageKey`.
Tests & Documentation `engine/slab/dirpage/cache_test.go`, `engine/slab/dirpage/cache_bench_test.go`, `fsmeta/exec/runner_test.go`	New tests verify pagination is part of cache identity and that directory-scoped invalidation drops all pages. Updated benchmarks and existing tests to use `Directory()` for epoch/invalidation operations. Added integration test for cursor-aware caching.

Storage-backed Root Event Validation

Layer / File(s)	Summary
API Exposure `coordinator/catalog/cluster.go`	New exported method `ValidateRootEventAgainstSnapshot` delegates to existing unexported helper, enabling snapshot-based validation.
Lifecycle Assessment `coordinator/server/service_gateway.go`	`assessRootEventLifecycle` now loads durable snapshot state and returns `(assessment, snapshot, storageBacked, error)`. `PublishRootEvent` branches validation: uses `ValidateRootEventAgainstSnapshot` when storage-backed, otherwise `ValidateRootEvent`.
Mock Storage `coordinator/server/service_test.go`	`fakeStorage.AppendRootEvent` now clones/preserves `SnapshotEpochs`, `Subtrees`, and `Quotas` alongside `Mounts` when building intermediate snapshot.
Transition Service `coordinator/server/transition_service.go`	`AssessRootEvent` updated to destructure additional return values from `assessRootEventLifecycle`; validation logic unchanged.
Tests `coordinator/server/service_test.go`	New test `TestServicePublishRootEventValidatesAgainstStorageSnapshot` verifies snapshot-backed validation for `SubtreeHandoffCompleted` events.

Multi-candidate Filesystem History Execution

Layer / File(s)	Summary
Contract Engine `fsmeta/contract/history.go`	`RunConcurrentBatches` now accepts `HistoryOptions` with `AllowIndeterminateErrors` and `MaxCandidates`. Refactored to maintain/advance multiple candidate model states instead of a single linearized model. Added candidate deduplication via `modelFingerprint` and new linearization helpers (`linearizeCandidateBatch`, `advanceOneCandidate`) that treat selected errors as indeterminate when enabled.
History CLI `cmd/nokv-fsmeta-history/main.go`	Per-seed scope creation is now explicit: generates scope operation, creates via `createScopeWithRetry` (with exponential backoff), applies to local model, then runs history with `HistoryOptions`. External operations no longer include scope creation; instead inode IDs are shifted via `scopeGeneratedInode` per seed to avoid cross-seed namespace pollution. New `--allow-indeterminate-errors` flag controls indeterminate error handling.
Test & Integration `cmd/nokv-fsmeta-history/main_test.go`, `fsmeta/integration/history_contract_test.go`	Updated tests to call `RunConcurrentBatches` with explicit `HistoryOptions{}`. Adjusted expectations for scope creation placement and inode remapping.
Soak & Chaos `cmd/nokv-fsmeta-soak/main.go`, `scripts/chaos/docker_fsmeta_history.sh`	Soak test and chaos script updated to pass `HistoryOptions{}` to `RunConcurrentBatches`. Chaos script reads `NOKV_DOCKER_CHAOS_ALLOW_INDETERMINATE` (default `1`) and conditionally appends `--allow-indeterminate-errors` flag via constructed argument array.
Logging `fsmeta/contract/model.go`	`Operation.String()` now includes `start_after`, `limit`, and `advance_ns` fields in formatted output.

Error Mapping for Retry Exhaustion

Layer / File(s)	Summary
Error Mapping `fsmeta/server/errors.go`	`rpcCodeForKind` now maps `nokverrors.KindRetryExhausted` to `codes.Unavailable` alongside other retry/availability errors.
Tests `fsmeta/server/service_test.go`	Added test case in `TestGRPCServiceErrorMapping` to verify `KindRetryExhausted` maps to `codes.Unavailable`.

Docker Build Optimization

Layer / File(s)	Summary
Build Context `.dockerignore`	Added `benchmark/ycsb/data` to ignore patterns, excluding benchmark data from Docker build context.

Sequence Diagram(s)

sequenceDiagram
    participant Executor
    participant CandidateSet as Candidate Models
    participant Linearizer
    participant Barrier as History Barrier

    Executor->>Executor: Initialize candidates = [base_model]
    
    loop For each batch of operations
        Executor->>Executor: Execute ops once (real execution)
        
        Executor->>Linearizer: linearizeCandidateBatch(ops, candidates, constraints)
        
        loop For each candidate model
            Linearizer->>Linearizer: Try op against candidate
            alt Success
                Linearizer->>CandidateSet: Advance candidate to next model
            else Indeterminate Error (if allowed)
                Linearizer->>CandidateSet: Branch candidate (produce alternatives)
            else Hard Error
                Linearizer->>CandidateSet: Discard candidate
            end
        end
        
        Linearizer->>CandidateSet: Deduplicate via modelFingerprint
        Linearizer->>CandidateSet: Truncate to MaxCandidates
        Linearizer->>Executor: Return next candidate set
        
        Executor->>CandidateSet: Replace working models
        Executor->>Executor: Log batch with first_candidate_order
    end
    
    alt Observed History Barrier
        Barrier->>Executor: Report observed result
        Executor->>Barrier: runSequentialObservedCandidates(candidates, barrier)
        loop Advance all candidates
            Barrier->>CandidateSet: Try barrier against each candidate
        end
        alt Any candidate matches
            Barrier->>Executor: Return matching candidates
        else No match
            Barrier->>Executor: Error: no candidate accepts barrier
        end
    end
    
    Executor->>Executor: Return final history (first candidate linearization)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

feichai0017/NoKV#195: Introduces matching dirpage cache keying and pagination-aware page key architecture alongside executor/runner cache integration.
feichai0017/NoKV#150: Introduces rooted control-plane protocol and storage-backed snapshot validation that this PR directly extends in coordinator lifecycle validation.
feichai0017/NoKV#206: Foundational changes to concurrent history contract and multi-candidate model architecture that this PR refines with indeterminate error handling and per-seed scope isolation.

Poem

🐰 A rabbit hops through history's trees,
With candidates dancing in harmony—
Pagination cached with directory key,
Snapshots now validate what used to be,
And scopes are born with retry's spree! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 29.63% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and concisely summarizes the main purpose of the changeset: fixing fsmeta chaos and nightly correctness checks through multiple related improvements.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/chaos-nightly-correctness

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)

fsmeta/exec/runner.go (1)

565-578: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t publish a partial cache page when one entry fails to encode.

Skipping a single pair here turns a valid ReadDirPlus result into a truncated cached listing, and later cache hits will keep serving the shorter page as if it were complete. If any entry cannot be encoded, abort materialization for the whole page instead of dropping that entry.

Possible fix

-func encodeDirPageEntries(pairs []fsmeta.DentryAttrPair) []dirpage.Entry {
+func encodeDirPageEntries(pairs []fsmeta.DentryAttrPair) ([]dirpage.Entry, error) {
 	out := make([]dirpage.Entry, 0, len(pairs))
 	for _, p := range pairs {
 		blob, err := fsmeta.EncodeInodeValue(p.Inode)
 		if err != nil {
-			continue
+			return nil, err
 		}
 		out = append(out, dirpage.Entry{
 			Name:     []byte(p.Dentry.Name),
 			Inode:    uint64(p.Dentry.Inode),
 			AttrBlob: blob,
 		})
 	}
-	return out
+	return out, nil
 }

And in ReadDirPlus, only call MaterializeAsync when that encoding step succeeds.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@fsmeta/exec/runner.go` around lines 565 - 578, The encoder currently drops
entries on EncodeInodeValue errors which produces partial cached pages; change
encodeDirPageEntries to fail fast and propagate the error (e.g., change
signature to return ([]dirpage.Entry, error)), stopping and returning an error
if any fsmeta.EncodeInodeValue fails instead of continuing, and update callers
(notably ReadDirPlus and any MaterializeAsync invocation) to check that error
and only call MaterializeAsync when encodeDirPageEntries succeeds.

fsmeta/contract/history.go (1)

64-66: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Honor AllowIndeterminateErrors when batchSize == 1.

This fast path skips the new candidate-based logic entirely. A run like --allow-indeterminate-errors --batch 1 still falls back to the strict Run(...) path, so retryable Unavailable/Aborted results will be reported as mismatches even though the option says to treat them as indeterminate.
Proposed fix
-	if batchSize <= 1 {
+	if batchSize <= 1 && !opts.AllowIndeterminateErrors {
 		return Run(ctx, exec, model, ops)
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@fsmeta/contract/history.go` around lines 64 - 66, The fast-path "if batchSize
<= 1 { return Run(ctx, exec, model, ops) }" bypasses the candidate-based logic
and ignores the AllowIndeterminateErrors option; change the condition so the
shortcut only happens when AllowIndeterminateErrors is false. In other words,
only call Run(...) immediately when batchSize <= 1 AND AllowIndeterminateErrors
is false; otherwise fall through into the candidate-based handling so
AllowIndeterminateErrors is honored. Ensure you reference the
AllowIndeterminateErrors flag (options struct) alongside batchSize and keep
Run(ctx, exec, model, ops) as the fallback.

cmd/nokv-fsmeta-history/main.go (1)

132-145: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Remap parent inode references for nested operations too.

op.Inode is shifted into the per-seed namespace, but Parent, FromParent, and ToParent are only rewritten when they equal RootInode. After the script creates a nested directory, later ops that target that directory still use the old generated inode id, so external runs can address the wrong parent and break seed isolation.

Proposed fix

 			op.Mount = mount
 			// The generated inodes are unique only within one in-memory script.
 			// Docker chaos runs multiple seeds against the same mounted system,
 			// so external histories must shift inode ids into the per-seed scope
 			// to avoid cross-seed namespace pollution.
 			op.Inode = scopeGeneratedInode(inodeBase, op.Inode)
 			if op.Parent == fsmeta.RootInode {
 				op.Parent = scopeInode
+			} else {
+				op.Parent = scopeGeneratedInode(inodeBase, op.Parent)
 			}
 			if op.FromParent == fsmeta.RootInode {
 				op.FromParent = scopeInode
+			} else {
+				op.FromParent = scopeGeneratedInode(inodeBase, op.FromParent)
 			}
 			if op.ToParent == fsmeta.RootInode {
 				op.ToParent = scopeInode
+			} else {
+				op.ToParent = scopeGeneratedInode(inodeBase, op.ToParent)
 			}
 			out = append(out, op)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/nokv-fsmeta-history/main.go` around lines 132 - 145, The code shifts
op.Inode into the per-seed namespace but only rewrites
Parent/FromParent/ToParent when they equal fsmeta.RootInode which leaves nested
targets unscoped; update the block that sets op.Inode (and uses
scopeGeneratedInode and scopeInode) to also remap op.Parent, op.FromParent and
op.ToParent by calling scopeGeneratedInode(inodeBase, <field>) whenever the
field is not fsmeta.RootInode (or simply always) and keep the existing
RootInode->scopeInode substitution for those fields; use the same helper
scopeGeneratedInode and fsmeta.RootInode/scopeInode symbols so nested directory
targets use the per-seed scoped inode IDs.

🧹 Nitpick comments (1)

engine/slab/dirpage/cache_bench_test.go (1)
121-132: ⚡ Quick win

This benchmark no longer measures the expensive part of invalidation.

The timer starts before any pages are materialized, so the loop only benchmarks epoch bookkeeping on an empty index. That misses the new page-deletion sweep described in the updated comment. Preload at least one cached page per directory before ResetTimer() if you want numbers for the current implementation.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@engine/slab/dirpage/cache_bench_test.go` around lines 121 - 132,
BenchmarkDirPageInvalidate currently starts the timer before any pages are
materialized so it only measures lightweight epoch bookkeeping; before calling
b.ResetTimer() preload at least one cached page per directory by using
newBenchCache to create and populate entries for the PageKey values (e.g.
materialize/cache a DirPage for each keys[i].Directory()) so that
c.Invalidate(key.Directory()) triggers the full page-deletion sweep; ensure the
preloading loop runs before b.ResetTimer()/b.ReportAllocs() and uses the same
PageKey slice used in the benchmark.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@engine/slab/dirpage/cache.go`:
- Around line 269-277: The Invalidate implementation on Cache currently scans
the whole c.pages map under c.pageMu to remove pages for a single DirectoryKey,
causing O(total cached pages) cost and global lock contention; add a secondary
index (e.g., map[DirectoryKey]map[PageKey]struct{} or
map[DirectoryKey][]PageKey) maintained alongside c.pages so that
Cache.Invalidate(key DirectoryKey) can look up only that directory’s PageKeys
and remove them without scanning all entries, update mutations (e.g., in methods
that insert/remove pages/MaterializeAsync/Lookup) to keep both the primary
c.pages and the new directoryIndex consistent under appropriate locking (or use
a fine-grained per-directory lock) and ensure the epoch/next logic using
epochFor/ep.Add(1) remains applied to the targeted PageKeys only.

In `@engine/slab/dirpage/codec.go`:
- Around line 38-41: The on-disk dirpage wire format changed but dirPageVersion
remains 1, so update the code to enforce an upgrade boundary: either increment
the constant dirPageVersion to a new value (e.g. 2) to reflect the incompatible
layout, or add explicit invalidation logic in reloadSegment that reads the
segment header/version and drops/renames the segment when it detects an older
version before attempting page decoding; modify the code paths that rely on
dirPageVersion and the segment header check in reloadSegment so segments with
the previous format are never scanned/partially decoded.

In `@fsmeta/contract/history.go`:
- Around line 190-203: linearizeCandidateBatch currently stops collecting once
next.len >= maxCandidates before deduping, which lets duplicate fingerprints
consume the budget; instead track unique candidate fingerprints while collecting
by maintaining a seen map (keyed on the historyCandidate fingerprint/ID) and
only append a historyCandidate when its fingerprint is not in seen, incrementing
the unique count until reaching maxCandidates; use the same seen-based
collection fix for the other collector block mentioned (the analogous code
around lines 305-326) and you can optionally reuse a shared seen map to avoid
repeated fingerprinting across calls to linearizeBatchCandidates and the second
collector.

In `@fsmeta/exec/runner.go`:
- Around line 503-505: The cache-hit path currently returns whatever
decodeDirPageEntries(pageKey, entries) produces (including errors), which lets
corrupt cached dirpage decode failures bubble up; change the logic in the
dirPages.Lookup handling so you call decodeDirPageEntries(pageKey, entries) and
only return the cached result when decode succeeds (no error), otherwise treat
it as a cache miss and continue down the authoritative scan path (i.e., do not
return the error from decodeDirPageEntries; fall through to the
ReadDirPlus/runner scan). Ensure you reference the existing
dirPages.Lookup(pageKey, frontier) branch and the decodeDirPageEntries call when
implementing this conditional return behavior.

---

Outside diff comments:
In `@cmd/nokv-fsmeta-history/main.go`:
- Around line 132-145: The code shifts op.Inode into the per-seed namespace but
only rewrites Parent/FromParent/ToParent when they equal fsmeta.RootInode which
leaves nested targets unscoped; update the block that sets op.Inode (and uses
scopeGeneratedInode and scopeInode) to also remap op.Parent, op.FromParent and
op.ToParent by calling scopeGeneratedInode(inodeBase, <field>) whenever the
field is not fsmeta.RootInode (or simply always) and keep the existing
RootInode->scopeInode substitution for those fields; use the same helper
scopeGeneratedInode and fsmeta.RootInode/scopeInode symbols so nested directory
targets use the per-seed scoped inode IDs.

In `@fsmeta/contract/history.go`:
- Around line 64-66: The fast-path "if batchSize <= 1 { return Run(ctx, exec,
model, ops) }" bypasses the candidate-based logic and ignores the
AllowIndeterminateErrors option; change the condition so the shortcut only
happens when AllowIndeterminateErrors is false. In other words, only call
Run(...) immediately when batchSize <= 1 AND AllowIndeterminateErrors is false;
otherwise fall through into the candidate-based handling so
AllowIndeterminateErrors is honored. Ensure you reference the
AllowIndeterminateErrors flag (options struct) alongside batchSize and keep
Run(ctx, exec, model, ops) as the fallback.

In `@fsmeta/exec/runner.go`:
- Around line 565-578: The encoder currently drops entries on EncodeInodeValue
errors which produces partial cached pages; change encodeDirPageEntries to fail
fast and propagate the error (e.g., change signature to return ([]dirpage.Entry,
error)), stopping and returning an error if any fsmeta.EncodeInodeValue fails
instead of continuing, and update callers (notably ReadDirPlus and any
MaterializeAsync invocation) to check that error and only call MaterializeAsync
when encodeDirPageEntries succeeds.

---

Nitpick comments:
In `@engine/slab/dirpage/cache_bench_test.go`:
- Around line 121-132: BenchmarkDirPageInvalidate currently starts the timer
before any pages are materialized so it only measures lightweight epoch
bookkeeping; before calling b.ResetTimer() preload at least one cached page per
directory by using newBenchCache to create and populate entries for the PageKey
values (e.g. materialize/cache a DirPage for each keys[i].Directory()) so that
c.Invalidate(key.Directory()) triggers the full page-deletion sweep; ensure the
preloading loop runs before b.ResetTimer()/b.ReportAllocs() and uses the same
PageKey slice used in the benchmark.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3eded6c8-51e1-4c02-9d76-9453f92a9d3f

📥 Commits

Reviewing files that changed from the base of the PR and between 3b73389 and 957ea6c.

📒 Files selected for processing (21)

.dockerignore
cmd/nokv-fsmeta-history/main.go
cmd/nokv-fsmeta-history/main_test.go
cmd/nokv-fsmeta-soak/main.go
coordinator/catalog/cluster.go
coordinator/server/service_gateway.go
coordinator/server/service_test.go
coordinator/server/transition_service.go
engine/slab/dirpage/cache.go
engine/slab/dirpage/cache_bench_test.go
engine/slab/dirpage/cache_test.go
engine/slab/dirpage/codec.go
fsmeta/contract/history.go
fsmeta/contract/history_test.go
fsmeta/contract/model.go
fsmeta/exec/runner.go
fsmeta/exec/runner_test.go
fsmeta/integration/history_contract_test.go
fsmeta/server/errors.go
fsmeta/server/service_test.go
scripts/chaos/docker_fsmeta_history.sh

feichai0017 added 2 commits May 5, 2026 12:13

Fix chaos and nightly correctness checks

58cafd2

Fix fsmeta chaos history cache isolation

957ea6c

Copilot AI review requested due to automatic review settings May 5, 2026 02:46

feichai0017 moved this to In Progress in NoKV Delivery May 5, 2026

feichai0017 added this to NoKV Delivery May 5, 2026

Copilot started reviewing on behalf of feichai0017 May 5, 2026 02:46 View session

github-actions Bot reviewed May 5, 2026

View reviewed changes

Copilot AI reviewed May 5, 2026

View reviewed changes

Comment thread fsmeta/contract/history.go Outdated

Comment thread engine/slab/dirpage/codec.go Outdated

Apply go fix formatting

3b6b388

coderabbitai Bot reviewed May 5, 2026

View reviewed changes

Comment thread engine/slab/dirpage/cache.go

Comment thread engine/slab/dirpage/codec.go

Comment thread fsmeta/contract/history.go Outdated

Comment thread fsmeta/exec/runner.go

Address fsmeta chaos review comments

55c9e3a

feichai0017 merged commit d51cae7 into main May 5, 2026
12 checks passed

feichai0017 deleted the fix/chaos-nightly-correctness branch May 5, 2026 03:23

github-project-automation Bot moved this from In Progress to Done in NoKV Delivery May 5, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix fsmeta chaos and nightly correctness checks#225

Fix fsmeta chaos and nightly correctness checks#225
feichai0017 merged 4 commits intomainfrom
fix/chaos-nightly-correctness

feichai0017 commented May 5, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

github-actions Bot left a comment •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot commented May 5, 2026 •

edited

Loading

Rate limit exceeded

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

feichai0017 commented May 5, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Tests

Workflow Verification

Summary by CodeRabbit

Uh oh!

github-actions Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

⚠️ Performance Alert ⚠️

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

feichai0017 commented May 5, 2026 •

edited by coderabbitai Bot

Loading

github-actions Bot left a comment •

edited

Loading

coderabbitai Bot commented May 5, 2026 •

edited

Loading