Skip to content

Fix fsmeta chaos and nightly correctness checks#225

Merged
feichai0017 merged 4 commits intomainfrom
fix/chaos-nightly-correctness
May 5, 2026
Merged

Fix fsmeta chaos and nightly correctness checks#225
feichai0017 merged 4 commits intomainfrom
fix/chaos-nightly-correctness

Conversation

@feichai0017
Copy link
Copy Markdown
Owner

@feichai0017 feichai0017 commented May 5, 2026

Summary

  • make fsmeta concurrent history checking keep bounded alternative serializations instead of committing to the first legal order
  • model Docker chaos availability errors as indeterminate mutation outcomes where the request may have crossed the service boundary
  • isolate external fsmeta history seeds by shifting generated inode IDs into the per-seed scope
  • fix dirpage cache identity so ReadDirPlus pages are keyed by mount, parent, StartAfter, and Limit while invalidation remains directory-scoped
  • validate storage-backed coordinator root events against the durable root storage snapshot rather than a potentially stale local view

Tests

  • go test ./coordinator/server ./coordinator/catalog ./fsmeta/contract ./fsmeta/client ./cmd/nokv-fsmeta-history ./cmd/nokv-fsmeta-soak -count=1
  • NOKV_CONTRACT_HISTORY_SEEDS=64 NOKV_CONTRACT_HISTORY_STEPS=240 NOKV_CONTRACT_HISTORY_BATCH=3 go test ./fsmeta/contract -run TestFSMetaExecutorConcurrentHistoryContract -count=1
  • go test ./engine/slab/dirpage ./fsmeta/exec -count=1
  • go test ./fsmeta/... ./engine/slab/dirpage ./cmd/nokv-fsmeta-history -count=1
  • NOKV_DOCKER_CHAOS_SEEDS=2 NOKV_DOCKER_CHAOS_STEPS=32 NOKV_DOCKER_CHAOS_BATCH=3 NOKV_DOCKER_CHAOS_TIMEOUT=120s NOKV_DOCKER_CHAOS_INTERVAL=2 NOKV_DOCKER_CHAOS_DOWN=1 ./scripts/chaos/docker_fsmeta_history.sh

Workflow Verification

  • Public Docker Chaos: passed on 957ea6c5
  • Private Docker Chaos: passed on 957ea6c5
  • Public Nightly Correctness: passed on 957ea6c5
  • Private Nightly Correctness: passed on 957ea6c5

Summary by CodeRabbit

  • New Features

    • Added snapshot-based validation for root events
    • Enhanced error handling for indeterminate scenarios in concurrent operations
  • Bug Fixes

    • Improved error code mapping for retry exhaustion conditions
  • Tests

    • Added validation tests for snapshot-backed event verification
    • Expanded cache pagination behavior verification
  • Chores

    • Optimized Docker build context
    • Updated chaos testing configuration

Copilot AI review requested due to automatic review settings May 5, 2026 02:46
@feichai0017 feichai0017 moved this to In Progress in NoKV Delivery May 5, 2026
Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.15.

Benchmark suite Current: 55c9e3a Previous: 3b73389 Ratio
BenchmarkARTGet (github.com/feichai0017/NoKV/engine/index) 344.7 ns/op 0 B/op 0 allocs/op 292.1 ns/op 0 B/op 0 allocs/op 1.18
BenchmarkARTGet (github.com/feichai0017/NoKV/engine/index) - ns/op 344.7 ns/op 292.1 ns/op 1.18
BenchmarkARTIteratorNext (github.com/feichai0017/NoKV/engine/index) 94.91 ns/op 0 B/op 0 allocs/op 56.29 ns/op 0 B/op 0 allocs/op 1.69
BenchmarkARTIteratorNext (github.com/feichai0017/NoKV/engine/index) - ns/op 94.91 ns/op 56.29 ns/op 1.69
BenchmarkSkiplistInsert (github.com/feichai0017/NoKV/engine/index) 1900 ns/op 33.69 MB/s 161 B/op 1 allocs/op 1047 ns/op 61.12 MB/s 159 B/op 1 allocs/op 1.81
BenchmarkSkiplistInsert (github.com/feichai0017/NoKV/engine/index) - ns/op 1900 ns/op 1047 ns/op 1.81
BenchmarkL0SelectTablesForKeyLinear (github.com/feichai0017/NoKV/engine/lsm) 702139895 ns/op 1067 ns/op 658050.51
BenchmarkDirPageMaterializeAsync/entries=10000 (github.com/feichai0017/NoKV/engine/slab/dirpage) 3928460 ns/op 6766991 B/op 2074 allocs/op 3396147 ns/op 6766466 B/op 2072 allocs/op 1.16
BenchmarkDirPageMaterializeAsync/entries=10000 (github.com/feichai0017/NoKV/engine/slab/dirpage) - ns/op 3928460 ns/op 3396147 ns/op 1.16
BenchmarkDirPageInvalidate (github.com/feichai0017/NoKV/engine/slab/dirpage) 55.59 ns/op 0 B/op 0 allocs/op 46.94 ns/op 0 B/op 0 allocs/op 1.18
BenchmarkDirPageInvalidate (github.com/feichai0017/NoKV/engine/slab/dirpage) - ns/op 55.59 ns/op 46.94 ns/op 1.18
BenchmarkWALReplay (github.com/feichai0017/NoKV/engine/wal) 42084047 ns/op 5991621 B/op 83377 allocs/op 35406397 ns/op 5991546 B/op 83376 allocs/op 1.19
BenchmarkWALReplay (github.com/feichai0017/NoKV/engine/wal) - ns/op 42084047 ns/op 35406397 ns/op 1.19
BenchmarkDBBatchSet/NoSync (github.com/feichai0017/NoKV/local) - MB/s 156.6 MB/s 133.25 MB/s 1.18
BenchmarkDBBatchSet/SyncInline (github.com/feichai0017/NoKV/local) - MB/s 36.61 MB/s 29.96 MB/s 1.22
BenchmarkDBBatchSet/SyncPipeline (github.com/feichai0017/NoKV/local) - MB/s 36.8 MB/s 31.3 MB/s 1.18
BenchmarkDBCommitInlineValueSizes/Inline_64B (github.com/feichai0017/NoKV/local) - MB/s 45.31 MB/s 36.76 MB/s 1.23
BenchmarkDBCommitInlineValueSizes/Inline_4KB (github.com/feichai0017/NoKV/local) - MB/s 866.84 MB/s 732.9 MB/s 1.18
BenchmarkDBIteratorScan (github.com/feichai0017/NoKV/local/internal/iterator) - B/op 93 B/op 33 B/op 2.82
BenchmarkMPSCQueuePushPop/producers=4 (github.com/feichai0017/NoKV/utils) 189.7 ns/op 155.3 ns/op 1.22
BenchmarkMPSCQueuePushPop/producers=8 (github.com/feichai0017/NoKV/utils) 221.5 ns/op 173.7 ns/op 1.28
BenchmarkMPSCQueuePushPop/producers=16 (github.com/feichai0017/NoKV/utils) 239.1 ns/op 194.2 ns/op 1.23
BenchmarkMPSCQueueConsumerSessionPushPop/producers=8 (github.com/feichai0017/NoKV/utils) 160.1 ns/op 130.2 ns/op 1.23
BenchmarkMPSCQueuePushOnlyContention/producers=8 (github.com/feichai0017/NoKV/utils) 218.3 ns/op 183.6 ns/op 1.19
BenchmarkMPSCQueuePushOnlyContention/producers=16 (github.com/feichai0017/NoKV/utils) 232.1 ns/op 195.4 ns/op 1.19
BenchmarkRingPushPop/producers=4 (github.com/feichai0017/NoKV/utils) 89.86 ns/op 76.15 ns/op 1.18
BenchmarkRingPushPop/producers=8 (github.com/feichai0017/NoKV/utils) 88.16 ns/op 74.04 ns/op 1.19
BenchmarkRingPushPop/producers=16 (github.com/feichai0017/NoKV/utils) 87.85 ns/op 75.35 ns/op 1.17

This comment was automatically generated by workflow using github-action-benchmark.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves fsmeta chaos/nightly correctness validation by making the concurrent history checker more robust to ambiguous outcomes, fixing dirpage cache identity for paginated ReadDirPlus, and ensuring storage-backed coordinator root-event validation uses the same durable snapshot authority used for lifecycle assessment.

Changes:

  • Extend the fsmeta concurrent history contract runner to retain a bounded set of alternative valid serializations and optionally treat certain availability failures as indeterminate commit outcomes.
  • Fix dirpage cache keying so ReadDirPlus pages are cached per (mount, parent, StartAfter, Limit) while invalidation remains directory-scoped.
  • Validate coordinator root events against the durable storage snapshot when running in storage-backed mode (to avoid stale in-memory cache validation).

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
scripts/chaos/docker_fsmeta_history.sh Adds a toggle to allow indeterminate error handling for Docker chaos history runs.
fsmeta/server/service_test.go Adds a gRPC error-mapping test case for retry-exhausted errors.
fsmeta/server/errors.go Maps KindRetryExhausted to codes.Unavailable.
fsmeta/integration/history_contract_test.go Updates concurrent history runner invocation to pass the new options struct.
fsmeta/exec/runner.go Fixes dirpage cache identity to include pagination; invalidation now uses a directory-scoped key.
fsmeta/exec/runner_test.go Strengthens cache-hit assertions and adds pagination-keying coverage for ReadDirPlus caching.
fsmeta/contract/model.go Updates operation string formatting to include StartAfter/Limit for read operations.
fsmeta/contract/history.go Adds HistoryOptions, bounded multi-candidate linearization, and optional indeterminate-error handling.
fsmeta/contract/history_test.go Updates concurrent history runner invocation to pass HistoryOptions.
engine/slab/dirpage/codec.go Extends the on-disk page format and key identity to include StartAfter/Limit; adds DirectoryKey abstraction.
engine/slab/dirpage/cache.go Changes epochs to be directory-scoped and ensures lookups validate pagination identity.
engine/slab/dirpage/cache_test.go Adds tests for pagination keying and directory-wide invalidation behavior.
engine/slab/dirpage/cache_bench_test.go Updates invalidate benchmark to match directory-scoped invalidation.
coordinator/server/transition_service.go Adjusts AssessRootEvent to match new lifecycle assessment return signature.
coordinator/server/service_test.go Ensures fake storage snapshot cloning includes additional rooted state (subtrees/quotas/snapshot epochs).
coordinator/server/service_gateway.go Validates storage-backed root events against the storage snapshot used for lifecycle assessment.
coordinator/catalog/cluster.go Exposes a public ValidateRootEventAgainstSnapshot helper for storage-backed validation.
cmd/nokv-fsmeta-soak/main.go Updates concurrent history runner invocation to pass HistoryOptions.
cmd/nokv-fsmeta-history/main.go Adds allow-indeterminate flag, adds scope creation retry barrier, and isolates per-seed inode IDs.
cmd/nokv-fsmeta-history/main_test.go Updates tests to match per-seed inode scoping and new scope creation operation helper.
.dockerignore Ignores benchmark/ycsb/data in Docker builds.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread fsmeta/contract/history.go Outdated
Comment thread engine/slab/dirpage/codec.go Outdated
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 5, 2026

Warning

Rate limit exceeded

@feichai0017 has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 41 minutes and 55 seconds before requesting another review.

To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f24a11ce-e651-4d35-9b6e-b33efe5cd255

📥 Commits

Reviewing files that changed from the base of the PR and between 957ea6c and 55c9e3a.

📒 Files selected for processing (9)
  • cmd/nokv-fsmeta-history/main.go
  • engine/slab/dirpage/cache.go
  • engine/slab/dirpage/cache_bench_test.go
  • engine/slab/dirpage/cache_test.go
  • engine/slab/dirpage/codec.go
  • fsmeta/contract/history.go
  • fsmeta/contract/history_test.go
  • fsmeta/exec/runner.go
  • fsmeta/exec/runner_test.go
📝 Walkthrough

Walkthrough

The PR introduces pagination-aware dirpage caching, multi-candidate filesystem history execution for improved test coverage, storage-backed root event validation, and per-seed scope isolation for history testing alongside error mapping and Docker build optimization.

Changes

Pagination-aware Dirpage Cache

Layer / File(s) Summary
Data Shape
engine/slab/dirpage/codec.go
PageKey expanded from (Mount, Parent) to (Mount, Parent, StartAfter, Limit). New DirectoryKey type represents (Mount, Parent) for invalidation scope. Page headers now encode/decode StartAfter and Limit.
Cache Semantics
engine/slab/dirpage/cache.go
Epoch tracking and invalidation now operate per-DirectoryKey instead of per-PageKey. Lookup validates full PageKey header match including pagination. Invalidate drops all pages for a directory; materialization re-checks epoch and splits pages using StartAfter.
Page Partitioning
engine/slab/dirpage/codec.go
splitIntoPages updated to accept startAfter and adjust page sizing and boundary calculations to reflect pagination identity.
Executor Integration
fsmeta/exec/runner.go
DirPageCache interface invalidation changes to accept DirectoryKey. Cache lookup and materialization now use cursor-specific dirPageKey with StartAfter/Limit. Decoding reconstructs parent context from PageKey.
Tests & Documentation
engine/slab/dirpage/cache_test.go, engine/slab/dirpage/cache_bench_test.go, fsmeta/exec/runner_test.go
New tests verify pagination is part of cache identity and that directory-scoped invalidation drops all pages. Updated benchmarks and existing tests to use Directory() for epoch/invalidation operations. Added integration test for cursor-aware caching.

Storage-backed Root Event Validation

Layer / File(s) Summary
API Exposure
coordinator/catalog/cluster.go
New exported method ValidateRootEventAgainstSnapshot delegates to existing unexported helper, enabling snapshot-based validation.
Lifecycle Assessment
coordinator/server/service_gateway.go
assessRootEventLifecycle now loads durable snapshot state and returns (assessment, snapshot, storageBacked, error). PublishRootEvent branches validation: uses ValidateRootEventAgainstSnapshot when storage-backed, otherwise ValidateRootEvent.
Mock Storage
coordinator/server/service_test.go
fakeStorage.AppendRootEvent now clones/preserves SnapshotEpochs, Subtrees, and Quotas alongside Mounts when building intermediate snapshot.
Transition Service
coordinator/server/transition_service.go
AssessRootEvent updated to destructure additional return values from assessRootEventLifecycle; validation logic unchanged.
Tests
coordinator/server/service_test.go
New test TestServicePublishRootEventValidatesAgainstStorageSnapshot verifies snapshot-backed validation for SubtreeHandoffCompleted events.

Multi-candidate Filesystem History Execution

Layer / File(s) Summary
Contract Engine
fsmeta/contract/history.go
RunConcurrentBatches now accepts HistoryOptions with AllowIndeterminateErrors and MaxCandidates. Refactored to maintain/advance multiple candidate model states instead of a single linearized model. Added candidate deduplication via modelFingerprint and new linearization helpers (linearizeCandidateBatch, advanceOneCandidate) that treat selected errors as indeterminate when enabled.
History CLI
cmd/nokv-fsmeta-history/main.go
Per-seed scope creation is now explicit: generates scope operation, creates via createScopeWithRetry (with exponential backoff), applies to local model, then runs history with HistoryOptions. External operations no longer include scope creation; instead inode IDs are shifted via scopeGeneratedInode per seed to avoid cross-seed namespace pollution. New --allow-indeterminate-errors flag controls indeterminate error handling.
Test & Integration
cmd/nokv-fsmeta-history/main_test.go, fsmeta/integration/history_contract_test.go
Updated tests to call RunConcurrentBatches with explicit HistoryOptions{}. Adjusted expectations for scope creation placement and inode remapping.
Soak & Chaos
cmd/nokv-fsmeta-soak/main.go, scripts/chaos/docker_fsmeta_history.sh
Soak test and chaos script updated to pass HistoryOptions{} to RunConcurrentBatches. Chaos script reads NOKV_DOCKER_CHAOS_ALLOW_INDETERMINATE (default 1) and conditionally appends --allow-indeterminate-errors flag via constructed argument array.
Logging
fsmeta/contract/model.go
Operation.String() now includes start_after, limit, and advance_ns fields in formatted output.

Error Mapping for Retry Exhaustion

Layer / File(s) Summary
Error Mapping
fsmeta/server/errors.go
rpcCodeForKind now maps nokverrors.KindRetryExhausted to codes.Unavailable alongside other retry/availability errors.
Tests
fsmeta/server/service_test.go
Added test case in TestGRPCServiceErrorMapping to verify KindRetryExhausted maps to codes.Unavailable.

Docker Build Optimization

Layer / File(s) Summary
Build Context
.dockerignore
Added benchmark/ycsb/data to ignore patterns, excluding benchmark data from Docker build context.

Sequence Diagram(s)

sequenceDiagram
    participant Executor
    participant CandidateSet as Candidate Models
    participant Linearizer
    participant Barrier as History Barrier

    Executor->>Executor: Initialize candidates = [base_model]
    
    loop For each batch of operations
        Executor->>Executor: Execute ops once (real execution)
        
        Executor->>Linearizer: linearizeCandidateBatch(ops, candidates, constraints)
        
        loop For each candidate model
            Linearizer->>Linearizer: Try op against candidate
            alt Success
                Linearizer->>CandidateSet: Advance candidate to next model
            else Indeterminate Error (if allowed)
                Linearizer->>CandidateSet: Branch candidate (produce alternatives)
            else Hard Error
                Linearizer->>CandidateSet: Discard candidate
            end
        end
        
        Linearizer->>CandidateSet: Deduplicate via modelFingerprint
        Linearizer->>CandidateSet: Truncate to MaxCandidates
        Linearizer->>Executor: Return next candidate set
        
        Executor->>CandidateSet: Replace working models
        Executor->>Executor: Log batch with first_candidate_order
    end
    
    alt Observed History Barrier
        Barrier->>Executor: Report observed result
        Executor->>Barrier: runSequentialObservedCandidates(candidates, barrier)
        loop Advance all candidates
            Barrier->>CandidateSet: Try barrier against each candidate
        end
        alt Any candidate matches
            Barrier->>Executor: Return matching candidates
        else No match
            Barrier->>Executor: Error: no candidate accepts barrier
        end
    end
    
    Executor->>Executor: Return final history (first candidate linearization)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • feichai0017/NoKV#195: Introduces matching dirpage cache keying and pagination-aware page key architecture alongside executor/runner cache integration.
  • feichai0017/NoKV#150: Introduces rooted control-plane protocol and storage-backed snapshot validation that this PR directly extends in coordinator lifecycle validation.
  • feichai0017/NoKV#206: Foundational changes to concurrent history contract and multi-candidate model architecture that this PR refines with indeterminate error handling and per-seed scope isolation.

Poem

🐰 A rabbit hops through history's trees,
With candidates dancing in harmony—
Pagination cached with directory key,
Snapshots now validate what used to be,
And scopes are born with retry's spree!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 29.63% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely summarizes the main purpose of the changeset: fixing fsmeta chaos and nightly correctness checks through multiple related improvements.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/chaos-nightly-correctness

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
fsmeta/exec/runner.go (1)

565-578: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t publish a partial cache page when one entry fails to encode.

Skipping a single pair here turns a valid ReadDirPlus result into a truncated cached listing, and later cache hits will keep serving the shorter page as if it were complete. If any entry cannot be encoded, abort materialization for the whole page instead of dropping that entry.

Possible fix
-func encodeDirPageEntries(pairs []fsmeta.DentryAttrPair) []dirpage.Entry {
+func encodeDirPageEntries(pairs []fsmeta.DentryAttrPair) ([]dirpage.Entry, error) {
 	out := make([]dirpage.Entry, 0, len(pairs))
 	for _, p := range pairs {
 		blob, err := fsmeta.EncodeInodeValue(p.Inode)
 		if err != nil {
-			continue
+			return nil, err
 		}
 		out = append(out, dirpage.Entry{
 			Name:     []byte(p.Dentry.Name),
 			Inode:    uint64(p.Dentry.Inode),
 			AttrBlob: blob,
 		})
 	}
-	return out
+	return out, nil
 }

And in ReadDirPlus, only call MaterializeAsync when that encoding step succeeds.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@fsmeta/exec/runner.go` around lines 565 - 578, The encoder currently drops
entries on EncodeInodeValue errors which produces partial cached pages; change
encodeDirPageEntries to fail fast and propagate the error (e.g., change
signature to return ([]dirpage.Entry, error)), stopping and returning an error
if any fsmeta.EncodeInodeValue fails instead of continuing, and update callers
(notably ReadDirPlus and any MaterializeAsync invocation) to check that error
and only call MaterializeAsync when encodeDirPageEntries succeeds.
fsmeta/contract/history.go (1)

64-66: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Honor AllowIndeterminateErrors when batchSize == 1.

This fast path skips the new candidate-based logic entirely. A run like --allow-indeterminate-errors --batch 1 still falls back to the strict Run(...) path, so retryable Unavailable/Aborted results will be reported as mismatches even though the option says to treat them as indeterminate.

Proposed fix
-	if batchSize <= 1 {
+	if batchSize <= 1 && !opts.AllowIndeterminateErrors {
 		return Run(ctx, exec, model, ops)
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@fsmeta/contract/history.go` around lines 64 - 66, The fast-path "if batchSize
<= 1 { return Run(ctx, exec, model, ops) }" bypasses the candidate-based logic
and ignores the AllowIndeterminateErrors option; change the condition so the
shortcut only happens when AllowIndeterminateErrors is false. In other words,
only call Run(...) immediately when batchSize <= 1 AND AllowIndeterminateErrors
is false; otherwise fall through into the candidate-based handling so
AllowIndeterminateErrors is honored. Ensure you reference the
AllowIndeterminateErrors flag (options struct) alongside batchSize and keep
Run(ctx, exec, model, ops) as the fallback.
cmd/nokv-fsmeta-history/main.go (1)

132-145: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Remap parent inode references for nested operations too.

op.Inode is shifted into the per-seed namespace, but Parent, FromParent, and ToParent are only rewritten when they equal RootInode. After the script creates a nested directory, later ops that target that directory still use the old generated inode id, so external runs can address the wrong parent and break seed isolation.

Proposed fix
 			op.Mount = mount
 			// The generated inodes are unique only within one in-memory script.
 			// Docker chaos runs multiple seeds against the same mounted system,
 			// so external histories must shift inode ids into the per-seed scope
 			// to avoid cross-seed namespace pollution.
 			op.Inode = scopeGeneratedInode(inodeBase, op.Inode)
 			if op.Parent == fsmeta.RootInode {
 				op.Parent = scopeInode
+			} else {
+				op.Parent = scopeGeneratedInode(inodeBase, op.Parent)
 			}
 			if op.FromParent == fsmeta.RootInode {
 				op.FromParent = scopeInode
+			} else {
+				op.FromParent = scopeGeneratedInode(inodeBase, op.FromParent)
 			}
 			if op.ToParent == fsmeta.RootInode {
 				op.ToParent = scopeInode
+			} else {
+				op.ToParent = scopeGeneratedInode(inodeBase, op.ToParent)
 			}
 			out = append(out, op)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/nokv-fsmeta-history/main.go` around lines 132 - 145, The code shifts
op.Inode into the per-seed namespace but only rewrites
Parent/FromParent/ToParent when they equal fsmeta.RootInode which leaves nested
targets unscoped; update the block that sets op.Inode (and uses
scopeGeneratedInode and scopeInode) to also remap op.Parent, op.FromParent and
op.ToParent by calling scopeGeneratedInode(inodeBase, <field>) whenever the
field is not fsmeta.RootInode (or simply always) and keep the existing
RootInode->scopeInode substitution for those fields; use the same helper
scopeGeneratedInode and fsmeta.RootInode/scopeInode symbols so nested directory
targets use the per-seed scoped inode IDs.
🧹 Nitpick comments (1)
engine/slab/dirpage/cache_bench_test.go (1)

121-132: ⚡ Quick win

This benchmark no longer measures the expensive part of invalidation.

The timer starts before any pages are materialized, so the loop only benchmarks epoch bookkeeping on an empty index. That misses the new page-deletion sweep described in the updated comment. Preload at least one cached page per directory before ResetTimer() if you want numbers for the current implementation.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@engine/slab/dirpage/cache_bench_test.go` around lines 121 - 132,
BenchmarkDirPageInvalidate currently starts the timer before any pages are
materialized so it only measures lightweight epoch bookkeeping; before calling
b.ResetTimer() preload at least one cached page per directory by using
newBenchCache to create and populate entries for the PageKey values (e.g.
materialize/cache a DirPage for each keys[i].Directory()) so that
c.Invalidate(key.Directory()) triggers the full page-deletion sweep; ensure the
preloading loop runs before b.ResetTimer()/b.ReportAllocs() and uses the same
PageKey slice used in the benchmark.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@engine/slab/dirpage/cache.go`:
- Around line 269-277: The Invalidate implementation on Cache currently scans
the whole c.pages map under c.pageMu to remove pages for a single DirectoryKey,
causing O(total cached pages) cost and global lock contention; add a secondary
index (e.g., map[DirectoryKey]map[PageKey]struct{} or
map[DirectoryKey][]PageKey) maintained alongside c.pages so that
Cache.Invalidate(key DirectoryKey) can look up only that directory’s PageKeys
and remove them without scanning all entries, update mutations (e.g., in methods
that insert/remove pages/MaterializeAsync/Lookup) to keep both the primary
c.pages and the new directoryIndex consistent under appropriate locking (or use
a fine-grained per-directory lock) and ensure the epoch/next logic using
epochFor/ep.Add(1) remains applied to the targeted PageKeys only.

In `@engine/slab/dirpage/codec.go`:
- Around line 38-41: The on-disk dirpage wire format changed but dirPageVersion
remains 1, so update the code to enforce an upgrade boundary: either increment
the constant dirPageVersion to a new value (e.g. 2) to reflect the incompatible
layout, or add explicit invalidation logic in reloadSegment that reads the
segment header/version and drops/renames the segment when it detects an older
version before attempting page decoding; modify the code paths that rely on
dirPageVersion and the segment header check in reloadSegment so segments with
the previous format are never scanned/partially decoded.

In `@fsmeta/contract/history.go`:
- Around line 190-203: linearizeCandidateBatch currently stops collecting once
next.len >= maxCandidates before deduping, which lets duplicate fingerprints
consume the budget; instead track unique candidate fingerprints while collecting
by maintaining a seen map (keyed on the historyCandidate fingerprint/ID) and
only append a historyCandidate when its fingerprint is not in seen, incrementing
the unique count until reaching maxCandidates; use the same seen-based
collection fix for the other collector block mentioned (the analogous code
around lines 305-326) and you can optionally reuse a shared seen map to avoid
repeated fingerprinting across calls to linearizeBatchCandidates and the second
collector.

In `@fsmeta/exec/runner.go`:
- Around line 503-505: The cache-hit path currently returns whatever
decodeDirPageEntries(pageKey, entries) produces (including errors), which lets
corrupt cached dirpage decode failures bubble up; change the logic in the
dirPages.Lookup handling so you call decodeDirPageEntries(pageKey, entries) and
only return the cached result when decode succeeds (no error), otherwise treat
it as a cache miss and continue down the authoritative scan path (i.e., do not
return the error from decodeDirPageEntries; fall through to the
ReadDirPlus/runner scan). Ensure you reference the existing
dirPages.Lookup(pageKey, frontier) branch and the decodeDirPageEntries call when
implementing this conditional return behavior.

---

Outside diff comments:
In `@cmd/nokv-fsmeta-history/main.go`:
- Around line 132-145: The code shifts op.Inode into the per-seed namespace but
only rewrites Parent/FromParent/ToParent when they equal fsmeta.RootInode which
leaves nested targets unscoped; update the block that sets op.Inode (and uses
scopeGeneratedInode and scopeInode) to also remap op.Parent, op.FromParent and
op.ToParent by calling scopeGeneratedInode(inodeBase, <field>) whenever the
field is not fsmeta.RootInode (or simply always) and keep the existing
RootInode->scopeInode substitution for those fields; use the same helper
scopeGeneratedInode and fsmeta.RootInode/scopeInode symbols so nested directory
targets use the per-seed scoped inode IDs.

In `@fsmeta/contract/history.go`:
- Around line 64-66: The fast-path "if batchSize <= 1 { return Run(ctx, exec,
model, ops) }" bypasses the candidate-based logic and ignores the
AllowIndeterminateErrors option; change the condition so the shortcut only
happens when AllowIndeterminateErrors is false. In other words, only call
Run(...) immediately when batchSize <= 1 AND AllowIndeterminateErrors is false;
otherwise fall through into the candidate-based handling so
AllowIndeterminateErrors is honored. Ensure you reference the
AllowIndeterminateErrors flag (options struct) alongside batchSize and keep
Run(ctx, exec, model, ops) as the fallback.

In `@fsmeta/exec/runner.go`:
- Around line 565-578: The encoder currently drops entries on EncodeInodeValue
errors which produces partial cached pages; change encodeDirPageEntries to fail
fast and propagate the error (e.g., change signature to return ([]dirpage.Entry,
error)), stopping and returning an error if any fsmeta.EncodeInodeValue fails
instead of continuing, and update callers (notably ReadDirPlus and any
MaterializeAsync invocation) to check that error and only call MaterializeAsync
when encodeDirPageEntries succeeds.

---

Nitpick comments:
In `@engine/slab/dirpage/cache_bench_test.go`:
- Around line 121-132: BenchmarkDirPageInvalidate currently starts the timer
before any pages are materialized so it only measures lightweight epoch
bookkeeping; before calling b.ResetTimer() preload at least one cached page per
directory by using newBenchCache to create and populate entries for the PageKey
values (e.g. materialize/cache a DirPage for each keys[i].Directory()) so that
c.Invalidate(key.Directory()) triggers the full page-deletion sweep; ensure the
preloading loop runs before b.ResetTimer()/b.ReportAllocs() and uses the same
PageKey slice used in the benchmark.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3eded6c8-51e1-4c02-9d76-9453f92a9d3f

📥 Commits

Reviewing files that changed from the base of the PR and between 3b73389 and 957ea6c.

📒 Files selected for processing (21)
  • .dockerignore
  • cmd/nokv-fsmeta-history/main.go
  • cmd/nokv-fsmeta-history/main_test.go
  • cmd/nokv-fsmeta-soak/main.go
  • coordinator/catalog/cluster.go
  • coordinator/server/service_gateway.go
  • coordinator/server/service_test.go
  • coordinator/server/transition_service.go
  • engine/slab/dirpage/cache.go
  • engine/slab/dirpage/cache_bench_test.go
  • engine/slab/dirpage/cache_test.go
  • engine/slab/dirpage/codec.go
  • fsmeta/contract/history.go
  • fsmeta/contract/history_test.go
  • fsmeta/contract/model.go
  • fsmeta/exec/runner.go
  • fsmeta/exec/runner_test.go
  • fsmeta/integration/history_contract_test.go
  • fsmeta/server/errors.go
  • fsmeta/server/service_test.go
  • scripts/chaos/docker_fsmeta_history.sh

Comment thread engine/slab/dirpage/cache.go
Comment thread engine/slab/dirpage/codec.go
Comment thread fsmeta/contract/history.go Outdated
Comment thread fsmeta/exec/runner.go
@feichai0017 feichai0017 merged commit d51cae7 into main May 5, 2026
12 checks passed
@feichai0017 feichai0017 deleted the fix/chaos-nightly-correctness branch May 5, 2026 03:23
@github-project-automation github-project-automation Bot moved this from In Progress to Done in NoKV Delivery May 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

2 participants