Fix leveldb / Chromium kernel-cache coherency on RAM drive #9

Merged
hooyao merged 5 commits into main from fix/leveldb-cache-coherency
May 3, 2026


@hooyao hooyao commented May 3, 2026

Summary

  • WinFspRamAdapter: every path-mutating callback now sends FspFileSystemNotify after the user-mode mutation commits (full matrix in the class XML doc and archive/.../design.md Decision 4)
  • New RamDriveOptions.FileInfoTimeoutMs (default 1000) replaces the unconditional uint.MaxValue; defence in depth on top of the notification matrix. EnableKernelCache=false still forces 0 — backout switch
  • New integration tests LevelDbReproTests (3) + CacheCoherencyTests (7). The fixture is pinned at FileInfoTimeout=uint.MaxValue so that missing notifications fail in CI instead of surfacing only against real Chromium
  • Consumes WinFsp.Native 0.1.2-pre.3 (https://www.nuget.org/packages/WinFsp.Native/0.1.2-pre.3)
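
How the two options combine can be sketched as a small function. This is an illustrative Python model, not the adapter's C# code: the function name `effective_file_info_timeout_ms` and the range check are assumptions, and only the option names, the default of 1000, and the rule that EnableKernelCache=false forces 0 come from the summary above.

```python
UINT32_MAX = 0xFFFFFFFF  # uint.MaxValue, the previous unconditional setting


def effective_file_info_timeout_ms(enable_kernel_cache: bool,
                                   file_info_timeout_ms: int = 1000) -> int:
    """Return the timeout (in ms) a mount would use under this model.

    Hypothetical helper: it mirrors the rules stated in the PR summary only.
    """
    if not 0 <= file_info_timeout_ms <= UINT32_MAX:
        raise ValueError("FileInfoTimeoutMs must fit in an unsigned 32-bit value")
    if not enable_kernel_cache:
        # Backout switch: disabling the kernel cache overrides the timeout.
        return 0
    return file_info_timeout_ms


if __name__ == "__main__":
    print(effective_file_info_timeout_ms(True))               # default config
    print(effective_file_info_timeout_ms(True, UINT32_MAX))   # test-fixture setting
    print(effective_file_info_timeout_ms(False, UINT32_MAX))  # backout config
```

In this model the configured value only matters while the kernel cache is enabled, which is why EnableKernelCache=false works as a backout regardless of FileInfoTimeoutMs.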

Why

Chromium's leveldb does an atomic rename followed by an immediate read of the CURRENT file on every DB open. With FileInfoTimeout=uint.MaxValue, the kernel's negative cache entry for CURRENT survives the rename, and the post-rename read returns 0 bytes. Leveldb reports Corruption: CURRENT does not end with newline, and Chromium crashes with STATUS_BREAKPOINT (0x80000003) on launch with --user-data-dir on the RAM drive.
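
The failure mode can be illustrated with a toy model of a per-path cache that remembers "file absent" until the entry expires or is invalidated. This is a simplified sketch, not WinFsp's actual cache: `CachedView`, `open_db`, the manifest name, and the assumption that CURRENT is probed once before it exists are all hypothetical and chosen only to mirror the description above.

```python
import math


class CachedView:
    """Toy stand-in for a kernel-side per-path file-info cache (hypothetical)."""

    def __init__(self, files, timeout_s):
        self.files = files          # backing store: path -> bytes
        self.timeout_s = timeout_s  # math.inf models FileInfoTimeout=uint.MaxValue
        self.cache = {}             # path -> (expires_at, size, or None if absent)

    def lookup(self, path, now):
        entry = self.cache.get(path)
        if entry is not None and now < entry[0]:
            return entry[1]         # served from cache, possibly stale
        size = len(self.files[path]) if path in self.files else None
        self.cache[path] = (now + self.timeout_s, size)
        return size

    def read(self, path, now):
        size = self.lookup(path, now)
        if size is None:
            return b""              # stale negative entry: reader sees 0 bytes
        return self.files.get(path, b"")[:size]

    def notify(self, path):
        # Models a change notification: drop the cached entry for this path.
        self.cache.pop(path, None)


def open_db(timeout_s, send_notify, read_delay_s=0.001):
    files = {}
    view = CachedView(files, timeout_s)
    view.lookup("CURRENT", now=0.0)            # assumed probe before the file exists
    files["dbtmp"] = b"MANIFEST-000001\n"      # write the temp file
    files["CURRENT"] = files.pop("dbtmp")      # rename dbtmp -> CURRENT
    if send_notify:
        view.notify("dbtmp")
        view.notify("CURRENT")
    return view.read("CURRENT", now=read_delay_s)


if __name__ == "__main__":
    print(open_db(math.inf, send_notify=False))           # stale: empty read
    print(open_db(math.inf, send_notify=True))            # notification fixes it
    print(open_db(1.0, send_notify=False, read_delay_s=2.0))  # timeout bounds staleness
```

Under this model, an infinite timeout without notifications yields the empty read, notifications alone restore the correct contents, and a finite timeout only bounds how long a missed notification can keep the entry stale, which matches its role as defence in depth.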

Full diagnostic history, procmon usage, smoking-gun trace, and TLA+ modeling extension plan: docs/leveldb-cache-coherency-postmortem.md.

OpenSpec change: openspec/changes/archive/2026-05-03-fix-leveldb-cache-coherency/ (proposal, design, specs, tasks). Specs synced to openspec/specs/cache-invalidation/ and openspec/specs/file-info-timeout-config/.

Test plan

  • dotnet test tests/RamDrive.IntegrationTests — 28/28 pass against the published WinFsp.Native 0.1.2-pre.3 (no local feed)
  • Manual e2e on H: with default config (FileInfoTimeoutMs=1000): chrome reaches about:blank, leveldb CURRENT files hex-dump-verified intact
  • Manual e2e with FileInfoTimeoutMs=4294967295: chrome stays alive — proves notifications alone are sufficient
  • Manual e2e with EnableKernelCache=false: chrome runs cleanly — backout config works
  • dotnet run --project tests/RamDrive.Benchmarks -c Release -- onread: read ~9.2 ms / write ~17 ms / Allocated=0 — zero-alloc hot path preserved (results in archive/.../benchmark-onread.md)

Known follow-ups (out of scope, documented in postmortem §9.1 and §10)

  • Separate pre-existing RamDrive bug under chrome --remote-debugging-pipe + EnableKernelCache=true produces a different early crash. Independent of this fix; will be filed as a new openspec change.
  • TLA+ modeling extension for the kernel cache (postmortem §10): deferred; the user will pick it up on a different machine.

🤖 Generated with Claude Code

hooyao and others added 5 commits May 3, 2026 12:43
…mium)

Chromium's leveldb (used for Local Storage, Sync Data, GCM Store, ...) does an
atomic rename followed by an immediate read of the CURRENT file on every DB
open. With WinFsp's FileInfoTimeout set to uint.MaxValue (the broken default),
the kernel kept the negative-cached "CURRENT does not exist" entry forever, so
the read after MoveFile(dbtmp -> CURRENT) returned 0 bytes. Leveldb reported
"Corruption: CURRENT does not end with newline" and Chromium crashed with
STATUS_BREAKPOINT on launch with --user-data-dir on the RAM drive.

Three coordinated changes:

1. Adapter (WinFspRamAdapter.cs): every path-mutating callback now sends an
   FspFileSystemNotify after the user-mode mutation commits. Full notification
   matrix documented in the class XML doc and design.md.
   - CreateFile (file/dir): ChangeFileName/ChangeDirName, ActionAdded
   - OverwriteFile: ChangeSize|ChangeLastWrite, ActionModified
   - MoveFile: ChangeFileName, ActionRenamedOldName + ActionRenamedNewName
   - Cleanup(Delete) (file/dir): ChangeFileName/ChangeDirName, ActionRemoved
   - SetFileSize: ChangeSize|ChangeLastWrite, ActionModified
   - SetFileAttributes: ChangeAttributes|ChangeLastWrite, ActionModified
   Notification failures are logged at Trace and never fail the originating IRP.

2. Config (RamDriveOptions.FileInfoTimeoutMs, default 1000): replaces the
   unconditional uint.MaxValue and acts as defence in depth for any path that
   escapes the notification matrix. EnableKernelCache=false still forces 0.

3. Tests (LevelDbReproTests + CacheCoherencyTests): exercise the exact Win32
   sequence captured from Chromium's leveldb env (cached WriteFile + Flush +
   MoveFileEx + Open + ReadFile). The fixture is pinned at
   FileInfoTimeout=uint.MaxValue so that any missing notification fails in CI
   instead of surfacing only against real Chromium with the production
   default. All 28 integration tests pass.
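
The write, flush, rename, open, read sequence named in item 3 can be approximated portably as below. This is a hedged sketch in Python rather than the C# integration tests: `set_current_file`, `check_current`, and the manifest name are illustrative, and the standard-library calls stand in for the Win32 calls listed above.

```python
import os
import tempfile


def set_current_file(db_dir: str, manifest_name: str) -> bytes:
    """Write dbtmp, rename it over CURRENT, then read CURRENT straight back."""
    tmp_path = os.path.join(db_dir, "dbtmp")
    current_path = os.path.join(db_dir, "CURRENT")

    with open(tmp_path, "wb") as tmp:
        tmp.write(manifest_name.encode("ascii") + b"\n")
        tmp.flush()                # push Python's buffer to the OS
        os.fsync(tmp.fileno())     # ask the OS to flush the file

    os.replace(tmp_path, current_path)   # atomic rename, replacing any old CURRENT

    with open(current_path, "rb") as current:   # immediate open + read
        return current.read()


def check_current(contents: bytes) -> None:
    """Mimic the validation that produced the error quoted in this PR."""
    if not contents.endswith(b"\n"):
        raise ValueError("Corruption: CURRENT does not end with newline")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as db_dir:
        data = set_current_file(db_dir, "MANIFEST-000001")
        check_current(data)
        print(data)
```

Pointing `db_dir` at a directory on the mounted RAM drive would exercise the same path ordering; on a coherent file system the read returns the full contents and `check_current` does not raise.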

Requires WinFsp.Native 0.1.2-pre.2 (FileSystemHost.Notify API). See
docs/leveldb-cache-coherency-postmortem.md for full diagnostic history,
procmon usage, smoking-gun trace, and TLA+ modeling extension plan
(follow-up).

Note: a separate pre-existing RamDrive bug surfaces under
chrome --remote-debugging-pipe + EnableKernelCache=true; documented in the
postmortem §9.1 and out of scope for this change. The leveldb fix is verified
correct by hex-dumping CURRENT files post-rename and by the integration tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
6.3 binding committed and tagged (v0.1.2-pre.2 — version bumped twice during
dev iterations to invalidate nuget local cache).
6.4 RamDrive change committed in 148fb0e.
6.5 archive deferred until user reviews.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5.4 (timeout=uint.MaxValue): chrome launches cleanly via --remote-debugging-pipe.
A new lower-severity warning surfaces ("Corruption: no meta-nextfile entry in
descriptor") on the leveldb MANIFEST, but chrome stays alive. Logged for
follow-up; not a regression of this change.

5.6 (benchmark spot-check): captured to benchmark-onread.md. Read ~9.2 ms,
write ~17 ms steady-state (block >= 64 KB). Allocated=0 across all rows —
zero-alloc hot path preserved by the new Notify matrix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Directory.Packages.props: 0.1.2-pre.2 -> 0.1.2-pre.3 (now on nuget.org)
- Drop NuGet.config (no longer need the local artifacts feed)
- Move openspec/changes/fix-leveldb-cache-coherency/ -> archive/2026-05-03-...
- Sync delta specs into openspec/specs/cache-invalidation and openspec/specs/file-info-timeout-config

All 28 integration tests pass against the published nuget package.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI on the RamDrive PR hung on TortureTests.DirectoryTreeStress (5 concurrent
threads doing recursive directory delete). Local 32-core box passed every
time; the GitHub windows-latest runner has fewer cores, which exposes a
dispatcher-pool deadlock:

  - Cleanup(Delete) for a dir runs on a WinFsp dispatcher thread.
  - The adapter calls FspFileSystemNotify synchronously from that callback.
  - FspFsctlNotify is a kernel IOCTL that can block on rename-in-progress.
  - Concurrent recursive deletes saturate the dispatcher pool: every thread
    is in Notify waiting for kernel state that another (already-blocked)
    dispatcher would release.

Fix: Notify now fire-and-forgets via ThreadPool.UnsafeQueueUserWorkItem with
preferLocal:false so the IOCTL runs on a worker outside the dispatcher pool.
The IRP completes immediately. Notifications can be reordered relative to
the originating IRP, but the matrix is path-scoped and the kernel
revalidates on the next open, so ordering does not affect correctness.
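
The shape of the fix can be modeled with two thread pools. This is a Python stand-in, not the adapter code: the executors play the roles of the WinFsp dispatcher pool and the .NET thread pool, `kernel_ready` is a hypothetical event standing in for the kernel state the blocked IOCTL waits on, and all names are illustrative.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_deletes(num_deletes=8, dispatcher_threads=2):
    """Run delete callbacks whose notifications block until all callbacks finish."""
    kernel_ready = threading.Event()   # set only after every callback has returned
    delivered = []
    delivered_lock = threading.Lock()

    def notify(path):
        # Stand-in for the blocking notification IOCTL. If this ran inline on
        # a dispatcher thread, no callback could return and the event would
        # never be set. The timeout keeps the sketch from hanging.
        kernel_ready.wait(timeout=5)
        with delivered_lock:
            delivered.append(path)

    paths = [f"dir{i}" for i in range(num_deletes)]
    with ThreadPoolExecutor(max_workers=dispatcher_threads) as dispatchers:
        with ThreadPoolExecutor(max_workers=num_deletes) as notifiers:

            def cleanup_delete(path):
                notifiers.submit(notify, path)   # fire and forget, off-dispatcher
                return path                      # callback completes immediately

            completed = list(dispatchers.map(cleanup_delete, paths))
            kernel_ready.set()                   # callbacks done: unblock notifies
        # Leaving the inner block waits for all queued notifications to finish.
    return completed, sorted(delivered)


if __name__ == "__main__":
    completed, delivered = run_deletes()
    print(len(completed), len(delivered))
```

With only two dispatcher threads, every callback still returns because none of them waits on its own notification; the notifications complete later on the separate pool, possibly in a different order than the callbacks.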

Same fix applied to TestAdapter.Notify in the integration fixture.

Documented in docs/leveldb-cache-coherency-postmortem.md §9.0.1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hooyao hooyao merged commit 2c3823d into main May 3, 2026
2 checks passed
@hooyao hooyao deleted the fix/leveldb-cache-coherency branch May 3, 2026 07:19
hooyao added a commit that referenced this pull request May 3, 2026
…POINT) (#12)

Self-contained context dump so a fresh Claude Code session post-compaction
can pick up the diagnosis without re-deriving anything. Captures:

- Status of bugs #1 (shipped #9), #2 (shipped #10), and #3 (open, this doc)
- Symptoms (STATUS_BREAKPOINT 0x80000003 + early death; degraded "Profile
  error" dialog variant)
- Repro recipe (5-flag minimum, deterministic to ~80% on H:\)
- 8 already-falsified hypotheses (don't redo)
- Procmon evidence captured in F:\procmon_chrome2.csv (gitignored)
- Four ranked working hypotheses with concrete next-test-steps
- Cheat sheet of mount/repro/bisect commands
- Pointers to all relevant code, specs, archived changes, and external refs
  (winfsp source paths)

This file is meant to be read first by any session continuing this work.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>