Fix leveldb / Chromium kernel-cache coherency on RAM drive#9
Merged
Conversation
…mium) Chromium's leveldb (used for Local Storage, Sync Data, GCM Store, ...) does an atomic-rename + immediate read on the CURRENT file every DB open. With WinFsp's FileInfoTimeout set to uint.MaxValue (the broken default), the kernel kept the negative-cached "CURRENT does not exist" entry forever, so the read after MoveFile(dbtmp -> CURRENT) returned 0 bytes. Leveldb reported "Corruption: CURRENT does not end with newline" and Chromium crashed STATUS_BREAKPOINT on launch with --user-data-dir on the RAM drive. Three coordinated changes: 1. Adapter (WinFspRamAdapter.cs): every path-mutating callback now sends an FspFileSystemNotify after the user-mode mutation commits. Full notification matrix documented in the class XML doc and design.md. - CreateFile (file/dir): ChangeFileName/ChangeDirName, ActionAdded - OverwriteFile: ChangeSize|ChangeLastWrite, ActionModified - MoveFile: ChangeFileName, ActionRenamedOldName + ActionRenamedNewName - Cleanup(Delete) (file/dir): ChangeFileName/ChangeDirName, ActionRemoved - SetFileSize: ChangeSize|ChangeLastWrite, ActionModified - SetFileAttributes: ChangeAttributes|ChangeLastWrite, ActionModified Notification failures are logged at Trace and never fail the originating IRP. 2. Config (RamDriveOptions.FileInfoTimeoutMs, default 1000): replaces the unconditional uint.MaxValue and acts as defence in depth for any path that escapes the notification matrix. EnableKernelCache=false still forces 0. 3. Tests (LevelDbReproTests + CacheCoherencyTests): exercises the exact Win32 sequence captured from Chromium's leveldb env (cached WriteFile + Flush + MoveFileEx + Open + ReadFile). Fixture pinned at FileInfoTimeout=uint.MaxValue so any missing notification fails CI rather than only against real Chromium with the production default. All 28 integration tests pass. Requires WinFsp.Native 0.1.2-pre.2 (FileSystemHost.Notify API). See docs/leveldb-cache-coherency-postmortem.md for full diagnostic history, procmon usage, smoking-gun trace, and TLA+ modeling extension plan (follow-up). Note: a separate pre-existing RamDrive bug surfaces under chrome --remote-debugging-pipe + EnableKernelCache=true; documented in the postmortem §9.1 and out of scope for this change. The leveldb fix is verified correct by hex-dumping CURRENT files post-rename and by the integration tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
6.3 binding committed and tagged (v0.1.2-pre.2 — version bumped twice during dev iterations to invalidate nuget local cache). 6.4 RamDrive change committed in 148fb0e. 6.5 archive deferred until user reviews. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5.4 (timeout=uint.MaxValue): chrome launches cleanly via --remote-debugging-pipe.
A new lower-severity warning surfaces ("Corruption: no meta-nextfile entry in
descriptor") on the leveldb MANIFEST, but chrome stays alive. Logged for
follow-up; not a regression of this change.
5.6 (benchmark spot-check): captured to benchmark-onread.md. Read ~9.2 ms,
write ~17 ms steady-state (block >= 64 KB). Allocated=0 across all rows —
zero-alloc hot path preserved by the new Notify matrix.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Directory.Packages.props: 0.1.2-pre.2 -> 0.1.2-pre.3 (now on nuget.org) - Drop NuGet.config (no longer need the local artifacts feed) - Move openspec/changes/fix-leveldb-cache-coherency/ -> archive/2026-05-03-... - Sync delta specs into openspec/specs/cache-invalidation and openspec/specs/file-info-timeout-config All 28 integration tests pass against the published nuget package. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI on the RamDrive PR hung on TortureTests.DirectoryTreeStress (5 concurrent
threads doing recursive directory delete). Local 32-core box passed every
time; the GitHub windows-latest runner has fewer cores, which exposes a
dispatcher-pool deadlock:
- Cleanup(Delete) for a dir runs on a WinFsp dispatcher thread.
- The adapter calls FspFileSystemNotify synchronously from that callback.
- FspFsctlNotify is a kernel IOCTL that can block on rename-in-progress.
- Concurrent recursive deletes saturate the dispatcher pool: every thread
is in Notify waiting for kernel state that another (already-blocked)
dispatcher would release.
Fix: Notify now fire-and-forgets via ThreadPool.UnsafeQueueUserWorkItem with
preferLocal:false so the IOCTL runs on a worker outside the dispatcher pool.
The IRP completes immediately. Notifications can be reordered relative to
the originating IRP, but the matrix is path-scoped and the kernel
revalidates on the next open, so ordering does not affect correctness.
Same fix applied to TestAdapter.Notify in the integration fixture.
Documented in docs/leveldb-cache-coherency-postmortem.md §9.0.1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hooyao
added a commit
that referenced
this pull request
May 3, 2026
…POINT) (#12) Self-contained context dump so a fresh Claude Code session post-compaction can pick up the diagnosis without re-deriving anything. Captures: - Status of bugs #1 (shipped #9), #2 (shipped #10), and #3 (open, this doc) - Symptoms (STATUS_BREAKPOINT 0x80000003 + early death; degraded "Profile error" dialog variant) - Repro recipe (5-flag minimum, deterministic to ~80% on H:\) - 8 already-falsified hypotheses (don't redo) - Procmon evidence captured in F:\procmon_chrome2.csv (gitignored) - Four ranked working hypotheses with concrete next-test-steps - Cheat sheet of mount/repro/bisect commands - Pointers to all relevant code, specs, archived changes, and external refs (winfsp source paths) This file is meant to be read first by any session continuing this work. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
WinFspRamAdapter: every path-mutating callback now sendsFspFileSystemNotifyafter the user-mode mutation commits (full matrix in the class XML doc andarchive/.../design.mdDecision 4)RamDriveOptions.FileInfoTimeoutMs(default1000) replaces the unconditionaluint.MaxValue; defence in depth on top of the notification matrix.EnableKernelCache=falsestill forces0— backout switchLevelDbReproTests(3) +CacheCoherencyTests(7) — fixture pinned atFileInfoTimeout=uint.MaxValueso missing notifications fail CI rather than only against real ChromiumWinFsp.Native 0.1.2-pre.3(https://www.nuget.org/packages/WinFsp.Native/0.1.2-pre.3)Why
Chromium's leveldb does atomic-rename + immediate read on the
CURRENTfile every DB open. WithFileInfoTimeout=uint.MaxValue, the kernel's negative cache forCURRENTsurvives the rename and the post-rename read returns 0 bytes. Leveldb reportsCorruption: CURRENT does not end with newlineand Chromium crashesSTATUS_BREAKPOINT (0x80000003)on launch with--user-data-diron the RAM drive.Full diagnostic history, procmon usage, smoking-gun trace, and TLA+ modeling extension plan:
docs/leveldb-cache-coherency-postmortem.md.OpenSpec change:
openspec/changes/archive/2026-05-03-fix-leveldb-cache-coherency/(proposal, design, specs, tasks). Specs synced toopenspec/specs/cache-invalidation/andopenspec/specs/file-info-timeout-config/.Test plan
dotnet test tests/RamDrive.IntegrationTests— 28/28 pass against the publishedWinFsp.Native 0.1.2-pre.3(no local feed)FileInfoTimeoutMs=1000): chrome reachesabout:blank, leveldbCURRENTfiles hex-dump-verified intactFileInfoTimeoutMs=4294967295: chrome stays alive — proves notifications alone are sufficientEnableKernelCache=false: chrome runs cleanly — backout config worksdotnet run --project tests/RamDrive.Benchmarks -c Release -- onread: read ~9.2 ms / write ~17 ms / Allocated=0 — zero-alloc hot path preserved (results inarchive/.../benchmark-onread.md)Known follow-ups (out of scope, documented in postmortem §9.1 and §10)
chrome --remote-debugging-pipe+EnableKernelCache=trueproduces a different early crash. Independent of this fix; will be filed as a new openspec change.🤖 Generated with Claude Code