Whitespace Project Plan

Vision

Build the fastest, safest, most developer-friendly whitespace normalisation toolkit as a tightly scoped, one-off experiment destined for a deep-dive blog post. Deliver a Rust core library (whitespace-core) with reusable primitives, a high-performance CLI (whitespace-cli), and thorough documentation. Focus on deterministic behaviour, precise whitespace handling, and exceptional DX so teams like DocSpring can adopt it confidently.

Core Capabilities

Trim trailing whitespace at line ends.
Guarantee a trailing newline at EOF.
Convert line endings between LF and CRLF (bidirectional, configurable default).
Convert indentation between tabs and spaces with configurable widths.
Detect binary files quickly and skip them.
Respect project ignore rules (.gitignore, .git/info/exclude).
Provide a JSON-friendly API surface (via the core crate) that higher level tools can bind to.
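
As a rough illustration of the first two capabilities, here is a minimal sketch of trimming trailing spaces and tabs and guaranteeing a final newline on raw bytes. The function name `trim_and_terminate` is hypothetical and is not the whitespace-core API. The sketch also assumes that only spaces and tabs count as trailing whitespace, and that a missing final newline is added as LF.

```rust
/// Trim trailing spaces and tabs from every line and guarantee a final newline.
/// Hypothetical helper: the name and signature are illustrative only.
pub fn trim_and_terminate(input: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(input.len() + 1);
    for line in input.split_inclusive(|&b| b == b'\n') {
        // Separate the line ending (CRLF, LF, or none) from the line body.
        let (body, ending): (&[u8], &[u8]) = if line.ends_with(b"\r\n") {
            (&line[..line.len() - 2], &b"\r\n"[..])
        } else if line.ends_with(b"\n") {
            (&line[..line.len() - 1], &b"\n"[..])
        } else {
            (line, &b""[..])
        };
        // Keep everything up to the last byte that is not a space or tab.
        let keep = body
            .iter()
            .rposition(|&b| b != b' ' && b != b'\t')
            .map_or(0, |i| i + 1);
        out.extend_from_slice(&body[..keep]);
        out.extend_from_slice(ending);
    }
    // Assumption: a missing final newline is added as LF; empty output stays empty.
    if !out.is_empty() && !out.ends_with(b"\n") {
        out.push(b'\n');
    }
    out
}
```

Running the function on its own output returns the same bytes, which is the idempotence property the plan asks of every transform.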

Architecture Overview

Core Library (whitespace-core)
- Pure Rust transformation engine with SIMD-accelerated byte scanning (build dedicated packed-compare routines on top of memchr/memrchr primitives).
- Case/encoding agnostic: operate on raw bytes while preserving UTF-8 correctness.
- Deterministic transforms with idempotent operations (re-running should produce identical output).
CLI (whitespace-cli)
- Clap-based command surface with subcommands for trim/apply/check.
- Option parsing maps directly to TransformOptions in the core crate.
- Gitignore-aware walker and safety confirmations before writes.
- --check mode for CI hooks.
Cache + Metadata Layer
- Repo-specific cache of known-clean files keyed by exact stat/info signature.
- OS-specific incremental change tracking (FSEvents, USN Journal, fallback walk on Linux/others).
Tooling + Docs
- Taskfile-based workflows (task lint, task test, task build).
- Starlight docs covering installation, guides, CLI reference, cache internals.

Performance Strategy

One-pass scanning: Compile ignore globs once, walk the repo a single time, perform transformations streaming per file.
SIMD and byte-search primitives: Build dedicated SIMD routines (e.g., byteset masks for whitespace, newline detection) backed by memchr/memrchr fallbacks.
Zero-copy unchanged paths: Skip writes when transformations produce identical buffers.
Temporary file staging: Only rewrite when necessary; leverage atomic rename for durability.
Parallel directory traversal: Use ignore + rayon for candidate enumeration; keep per-file IO single-threaded to leverage OS caches.
Cache warm-up: Combine stat-based fingerprinting with OS journals to avoid scanning unchanged files.
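
The "skip unchanged" and "temporary file staging" points can be sketched together using only the standard library. The helper name `write_if_changed` and the `.ws-tmp` suffix are assumptions made for illustration. A real implementation would likely also preserve file permissions, which this sketch does not.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::Path;

/// Rewrite `path` only when the transformed bytes differ from the original.
/// Returns Ok(true) if the file was rewritten. Hypothetical helper.
pub fn write_if_changed(path: &Path, original: &[u8], new_bytes: &[u8]) -> io::Result<bool> {
    // Identical buffers: no write at all.
    if original == new_bytes {
        return Ok(false);
    }
    // Stage next to the target so the rename stays on one filesystem.
    let mut tmp_name = path
        .file_name()
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "path has no file name"))?
        .to_os_string();
    tmp_name.push(".ws-tmp"); // assumed suffix, not specified by the plan
    let tmp = path.with_file_name(tmp_name);

    let mut file = fs::File::create(&tmp)?;
    file.write_all(new_bytes)?;
    file.sync_all()?; // flush to disk before the rename makes it visible
    drop(file);

    // Replace the original in one step.
    fs::rename(&tmp, path)?;
    Ok(true)
}
```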

Cache Model (`whitespace-cache`)

Locations

Linux: ${XDG_CACHE_HOME:-$HOME/.cache}/whitespace/<repo_id>/
macOS: $HOME/Library/Caches/whitespace/<repo_id>/
Windows: %LOCALAPPDATA%\whitespace\Cache\<repo_id>\

<repo_id> = blake3-128(canonical_repo_root_path || volume_id || fs_type)
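
A minimal sketch of resolving these locations follows. It takes `repo_id` as an already computed string and reads environment variables through a closure so the logic can be exercised without touching the real environment. The `Platform` enum and `cache_dir` function are hypothetical names.

```rust
use std::path::PathBuf;

/// Target platform for cache placement (hypothetical type).
pub enum Platform {
    Linux,
    MacOs,
    Windows,
}

/// Resolve the per-repo cache directory from the table above.
/// `env` looks up an environment variable; returns None if a required one is missing.
pub fn cache_dir(
    platform: Platform,
    repo_id: &str,
    env: impl Fn(&str) -> Option<String>,
) -> Option<PathBuf> {
    let base = match platform {
        // ${XDG_CACHE_HOME:-$HOME/.cache}: fall back when unset or empty.
        Platform::Linux => match env("XDG_CACHE_HOME").filter(|v| !v.is_empty()) {
            Some(xdg) => PathBuf::from(xdg).join("whitespace"),
            None => PathBuf::from(env("HOME")?).join(".cache").join("whitespace"),
        },
        Platform::MacOs => PathBuf::from(env("HOME")?)
            .join("Library")
            .join("Caches")
            .join("whitespace"),
        Platform::Windows => PathBuf::from(env("LOCALAPPDATA")?)
            .join("whitespace")
            .join("Cache"),
    };
    Some(base.join(repo_id))
}
```

In a real binary the closure would typically be `|k| std::env::var(k).ok()`.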

Files

keys.bin – sorted u128 array of known-clean keys.
meta.bin – structured header (see below) plus write cursors.
lock – single-writer guard file (advisory lock).

Key Definition

key = blake3-128(path || dev || ino || size || mtime_ns || ctime_ns || mode)

Represents a file that is verified clean for that exact stat snapshot.
Lookup is a binary search over keys.bin.
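
The lookup side can be sketched as below. The plan does not state the byte order of keys.bin, so little-endian is an assumption here, and `decode_keys` and `is_known_clean` are hypothetical names. Computing the key itself needs a BLAKE3 implementation and is left out.

```rust
/// Decode keys.bin contents into u128 keys.
/// Assumption: keys are stored as consecutive 16-byte little-endian values.
pub fn decode_keys(bytes: &[u8]) -> Option<Vec<u128>> {
    if bytes.len() % 16 != 0 {
        return None; // truncated or corrupt file
    }
    let keys = bytes
        .chunks_exact(16)
        .map(|chunk| {
            let mut raw = [0u8; 16];
            raw.copy_from_slice(chunk);
            u128::from_le_bytes(raw)
        })
        .collect();
    Some(keys)
}

/// Binary search over the sorted key array: a hit means "verified clean".
pub fn is_known_clean(sorted_keys: &[u128], key: u128) -> bool {
    sorted_keys.binary_search(&key).is_ok()
}
```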

Racy Rule

meta.index_write_time_sec stores the last flush timestamp.
If file.mtime_sec == index_write_time_sec, treat as potentially racy → force content rescan regardless of key hit.

Change Discovery (No Resident Daemons)

macOS: FSEvents with kFSEventStreamCreateFlagFileEvents.
- Persist {last_event_id, volume_uuid} in meta.bin.
- On start: request events since last_event_id; if history pruned or IDs wrap, do a full walk once, then resume streaming.
Windows: NTFS USN Change Journal.
- Persist {volume_serial, journal_id_low/high, last_usn}.
- On start: query deltas since last_usn; if journal pruned or mismatch detected, perform full walk then resume.
Linux / others: Gitignore-aware pull walk only (no native journal integration). Use fast directory traversal with cached globset.

Per-file Flow

From OS journal
- Content-change flag → scan & fix → restat → recompute key → stage for write.
- Metadata-only change → recompute key if any stat fields differ → stage.
- Rename/move → recompute key using new path → stage.
- Skip cache lookup (journal already tells us it’s dirty).
From pull walk
- If racy (mtime == index_write_time_sec) → scan/fix.
- Else compute key → binary search keys.bin.
- Hit → skip.
- Miss → scan/fix → recompute key → stage.
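
The pull-walk branch above reduces to a small decision function. This is a sketch only: `Action` and `pull_walk_action` are hypothetical names, and the key is assumed to have been computed from the file's stat snapshot already.

```rust
/// What to do with a file found during a pull walk (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Action {
    Skip,
    ScanAndFix,
}

/// Decide per file: the racy rule comes first, then the cache lookup.
pub fn pull_walk_action(
    mtime_sec: u64,
    index_write_time_sec: u64,
    key: u128,
    sorted_keys: &[u128],
) -> Action {
    // Racy: modified in the same second as the last cache flush, so a key hit
    // cannot be trusted.
    if mtime_sec == index_write_time_sec {
        return Action::ScanAndFix;
    }
    // Hit means the exact stat snapshot was already verified clean.
    if sorted_keys.binary_search(&key).is_ok() {
        Action::Skip
    } else {
        Action::ScanAndFix
    }
}
```

After a `ScanAndFix`, the caller would restat the file, recompute the key, and stage it for the next write-out.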

Write-out Procedure

Merge staged keys into existing array, sort once.
Write keys.tmp, then atomic rename → keys.bin.
Update meta.bin, set index_write_time_sec = now().
Single-writer guard ensures only one process writes at a time.
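
A minimal sketch of the merge and atomic replace steps, assuming the caller already holds the single-writer lock and that keys are serialised as little-endian u128 values (the byte order is not specified in this plan). The function names are hypothetical.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::Path;

/// Merge staged keys into the existing array and sort once.
pub fn merge_keys(existing: &[u128], staged: &[u128]) -> Vec<u128> {
    let mut merged = Vec::with_capacity(existing.len() + staged.len());
    merged.extend_from_slice(existing);
    merged.extend_from_slice(staged);
    merged.sort_unstable();
    merged.dedup(); // a key staged twice is stored once
    merged
}

/// Write keys.tmp, then rename it over keys.bin.
pub fn write_keys(cache_dir: &Path, keys: &[u128]) -> io::Result<()> {
    let tmp = cache_dir.join("keys.tmp");
    // Serialise in memory first so the file is written in one call.
    let mut buf = Vec::with_capacity(keys.len() * 16);
    for key in keys {
        buf.extend_from_slice(&key.to_le_bytes()); // assumed byte order
    }
    let mut file = fs::File::create(&tmp)?;
    file.write_all(&buf)?;
    file.sync_all()?;
    drop(file);
    fs::rename(&tmp, cache_dir.join("keys.bin"))
}
```

Readers that open keys.bin during a flush see either the old array or the new one, never a partial file.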

Concurrency Model

Acquire repo-level lock before modifying cache.
Parallel candidate scanning allowed; per-file IO remains sequential within threads to limit contention.
Final write is single-threaded and atomic.

Gitignore Handling

Respect .gitignore and .git/info/exclude (plus Git global excludes when available).
Monitor their mtimes; rebuild the compiled matcher when any of them changes.
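
One way to detect ignore-file changes is to snapshot the mtimes and compare on the next run. This is a sketch using only the standard library. `Snapshot`, `snapshot_mtimes`, and `needs_rebuild` are hypothetical names, and a missing file is recorded as `None` so that creation and deletion also trigger a rebuild.

```rust
use std::fs;
use std::path::PathBuf;
use std::time::SystemTime;

/// Path plus its last-modified time, or None if the file does not exist.
pub type Snapshot = Vec<(PathBuf, Option<SystemTime>)>;

/// Record the current mtime of each ignore file.
pub fn snapshot_mtimes(paths: &[PathBuf]) -> Snapshot {
    paths
        .iter()
        .map(|p| {
            let mtime = fs::metadata(p).and_then(|m| m.modified()).ok();
            (p.clone(), mtime)
        })
        .collect()
}

/// True when any tracked file was modified, created, or removed since `previous`.
pub fn needs_rebuild(previous: &Snapshot, paths: &[PathBuf]) -> bool {
    *previous != snapshot_mtimes(paths)
}
```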

`meta.bin` Layout (v1)

struct Meta {
  u32 magic;              // "WSMT"
  u32 version;            // 1
  u64 index_write_time_sec;

  u32 platform_flags;     // bit0=macOS, bit1=Windows, bit2=Linux

  // macOS
  u64 fsev_last_event_id;
  u8  fsev_volume_uuid[16];

  // Windows
  u64 usn_last_usn;
  u64 usn_journal_id_low;
  u64 usn_journal_id_high;
  u32 volume_serial;

  // Repo identity snapshot
  u64 repo_root_dev;
  u64 repo_root_ino;

  u128 keys_checksum;     // blake3-128 of current keys.bin
}
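
To make the start of the layout concrete, here is a sketch that encodes and validates only the first three fields (magic, version, index_write_time_sec). It assumes the magic is stored as the four ASCII bytes "WSMT" and that integers are little-endian, neither of which the plan specifies. `MetaHeader`, `encode_header`, and `decode_header` are hypothetical names.

```rust
pub const MAGIC: [u8; 4] = *b"WSMT"; // assumed on-disk form of the magic
pub const VERSION: u32 = 1;

/// The leading fields of meta.bin (partial, for illustration).
#[derive(Debug, PartialEq, Eq)]
pub struct MetaHeader {
    pub version: u32,
    pub index_write_time_sec: u64,
}

/// Encode magic (4 bytes), version (4 bytes), index_write_time_sec (8 bytes).
pub fn encode_header(header: &MetaHeader) -> [u8; 16] {
    let mut out = [0u8; 16];
    out[0..4].copy_from_slice(&MAGIC);
    out[4..8].copy_from_slice(&header.version.to_le_bytes());
    out[8..16].copy_from_slice(&header.index_write_time_sec.to_le_bytes());
    out
}

/// Decode and validate; None means "treat the cache as absent".
pub fn decode_header(bytes: &[u8]) -> Option<MetaHeader> {
    if bytes.len() < 16 || bytes[0..4] != MAGIC {
        return None;
    }
    let mut v = [0u8; 4];
    v.copy_from_slice(&bytes[4..8]);
    let version = u32::from_le_bytes(v);
    if version != VERSION {
        return None; // unknown layout version
    }
    let mut t = [0u8; 8];
    t.copy_from_slice(&bytes[8..16]);
    Some(MetaHeader {
        version,
        index_write_time_sec: u64::from_le_bytes(t),
    })
}
```

Rejecting an unknown magic or version and falling back to a full walk keeps a stale or foreign cache from being trusted.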

Roadmap (High Level)

Core MVP
- Finalise transform_bytes correctness (CRLF handling, tab conversions, newline guarantee).
- Binary heuristics and configuration knobs.
- Unit/property tests and fuzzers for core transforms.
CLI MVP
- Implement ignore-aware traversal with ignore crate.
- --check, --stdin, and --dry-run support.
- Graceful output, logging, exit codes.
Cache System
- Implement key storage, repo detection, lock handling.
- macOS FSEvents + Windows USN integrations.
- Linux pull-walk fallback with racy guard.
Performance Iterations
- Build an intentionally naive baseline (byte-by-byte scans, no cache) to establish ground truth timings.
- Add compile-time feature flags to toggle each optimisation tier (SIMD scans, cache usage, parallel walk, mmap) so we can benchmark deltas.
- Benchmark each tier with hyperfine on the DocSpring repository (/Users/ndbroadbent/code/docspring) for cold and warm cache scenarios, charting the progression from naive to fully optimised.
- Investigate zero-copy writes (mmap) and fallocate collapse-range viability (likely avoid if portability suffers).
Docs & Developer UX
- Update docs with new CLI command list, cache explanation, configuration examples.
- Taskfile targets for lint/test/format.
- Integrate with lefthook for commit-time trimming.

Quality & Testing

Rustfmt + Clippy clean (warnings as errors).
cargo test, property tests, fuzz targets (particularly for whitespace detection and newline edge cases).
Snapshot tests for CLI outputs.
Cross-platform CI (macOS, Linux, Windows) including path edge cases.
End-to-end tests using temporary repos with ignore files and cache warm/cold scenarios.

Open Questions / Research Tracks

Evaluate ignore crate customisation for ignoring .gitmodules and repo submodules efficiently.
Determine optimal SIMD strategy (packed compare vs packed_simd/std::simd).
Explore OS-specific file change collapse operations (e.g., Linux fallocate(FALLOC_FL_COLLAPSE_RANGE)) but confirm portability.
Investigate incremental per-file chunk hashing (content-defined chunking) for future optional accelerations.

This plan captures the current intent: a safety-first, ultra-fast whitespace normaliser with a portable cache model that exploits OS-level notifications where available.
