Build the fastest, safest, developer-centric whitespace normalisation toolkit as a tightly scoped, one-off experiment destined for a deep-dive blog post. Deliver a Rust core library (whitespace-core) with reusable primitives, a high-performance CLI (whitespace-cli), and thorough documentation. Focus on deterministic behaviour, precise whitespace handling, and exceptional DX so teams like DocSpring can adopt it confidently.
- Trim trailing whitespace at line ends.
- Guarantee a trailing newline at EOF.
- Convert line endings between LF and CRLF (bidirectional, configurable default).
- Convert indentation between tabs and spaces with configurable widths.
- Detect binary files quickly and skip them.
- Respect project ignore rules (`.gitignore`, `.git/info/exclude`).
- Provide a JSON-friendly API surface (via the core crate) that higher-level tools can bind to.
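A minimal sketch of the first two behaviours in the list above (trailing-whitespace trimming and the EOF newline guarantee), operating on raw bytes. The function name `trim_and_terminate` is hypothetical and not part of the planned API. It assumes only spaces and tabs count as trailing whitespace, and it appends a bare LF when the last line has no terminator.

```rust
/// Hypothetical helper: strips trailing spaces and tabs from every line and
/// guarantees that non-empty output ends with a newline. Existing line
/// endings (LF or CRLF) are preserved as found.
pub fn trim_and_terminate(input: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(input.len() + 1);
    for line in input.split_inclusive(|&b| b == b'\n') {
        // Separate the line ending (if any) from the line content.
        let (content, ending): (&[u8], &[u8]) = if line.ends_with(b"\r\n") {
            (&line[..line.len() - 2], &b"\r\n"[..])
        } else if line.ends_with(b"\n") {
            (&line[..line.len() - 1], &b"\n"[..])
        } else {
            (line, &b""[..])
        };
        // Keep everything up to and including the last non-blank byte.
        let kept = content
            .iter()
            .rposition(|&b| b != b' ' && b != b'\t')
            .map_or(0, |i| i + 1);
        out.extend_from_slice(&content[..kept]);
        out.extend_from_slice(ending);
    }
    // Guarantee a trailing newline at EOF (empty input stays empty).
    if !out.is_empty() && !out.ends_with(b"\n") {
        out.push(b'\n');
    }
    out
}
```

Because the function works on bytes and only ever removes ASCII spaces and tabs or appends an ASCII newline, valid UTF-8 input stays valid UTF-8.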
- Core Library (`whitespace-core`)
  - Pure Rust transformation engine with SIMD-accelerated byte scanning (build dedicated packed-compare routines on top of memchr/memrchr primitives).
  - Case/encoding agnostic: operate on raw bytes while preserving UTF-8 correctness.
  - Deterministic transforms with idempotent operations (re-running should produce identical output).
- CLI (`whitespace-cli`)
  - Clap-based command surface with subcommands for trim/apply/check.
  - Option parsing maps directly to `TransformOptions` in the core crate.
  - Gitignore-aware walker and safety confirmations before writes.
  - `--check` mode for CI hooks.
- Cache + Metadata Layer
  - Repo-specific cache of known-clean files keyed by exact stat/info signature.
  - OS-specific incremental change tracking (FSEvents, USN Journal, fallback walk on Linux/others).
- Tooling + Docs
  - Taskfile-based workflows (`task lint`, `task test`, `task build`).
  - Starlight docs covering installation, guides, CLI reference, cache internals.
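To illustrate what "deterministic transforms with idempotent operations" could look like for line-ending conversion, here is a small sketch. The `LineEnding` enum and `convert_line_endings` function are hypothetical names. As a simplifying assumption, any run of CR bytes directly before an LF is treated as part of one line break, so that re-running the conversion cannot change the output again. Lone CR bytes elsewhere pass through untouched.

```rust
/// Hypothetical target selector for line-ending conversion.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum LineEnding {
    Lf,
    Crlf,
}

/// Rewrites every line break in `input` to `target`.
pub fn convert_line_endings(input: &[u8], target: LineEnding) -> Vec<u8> {
    let mut out = Vec::with_capacity(input.len());
    let mut i = 0;
    while i < input.len() {
        match input[i] {
            b'\n' => {
                push_ending(&mut out, target);
                i += 1;
            }
            b'\r' => {
                // Find the end of the run of CR bytes starting at `i`.
                let run_end = input[i..]
                    .iter()
                    .position(|&b| b != b'\r')
                    .map_or(input.len(), |p| i + p);
                if input.get(run_end) == Some(&b'\n') {
                    // CR+ LF counts as a single line break.
                    push_ending(&mut out, target);
                    i = run_end + 1;
                } else {
                    // CR bytes not followed by LF are copied as-is.
                    out.extend_from_slice(&input[i..run_end]);
                    i = run_end;
                }
            }
            byte => {
                out.push(byte);
                i += 1;
            }
        }
    }
    out
}

fn push_ending(out: &mut Vec<u8>, target: LineEnding) {
    match target {
        LineEnding::Lf => out.push(b'\n'),
        LineEnding::Crlf => out.extend_from_slice(b"\r\n"),
    }
}
```

Applying the function twice with the same target yields the same bytes as applying it once, which is the property a `--check` run relies on.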
- One-pass scanning: Compile ignore globs once, walk the repo a single time, perform transformations streaming per file.
- SIMD and byte-search primitives: Build dedicated SIMD routines (e.g., byteset masks for whitespace, newline detection) backed by memchr/memrchr fallbacks.
- Zero-copy unchanged paths: Skip writes when transformations produce identical buffers.
- Temporary file staging: Only rewrite when necessary; leverage atomic rename for durability.
- Parallel directory traversal: Use `ignore` + `rayon` for candidate enumeration; keep per-file IO single-threaded to leverage OS caches.
- Cache warm-up: Combine stat-based fingerprinting with OS journals to avoid scanning unchanged files.
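The "zero-copy unchanged paths" and "temporary file staging" ideas can be sketched with `std::borrow::Cow`: the transform borrows the input when nothing changes, and the caller only stages and renames a file when it receives an owned buffer. The names `ensure_final_newline` and `rewrite_if_changed`, and the `.ws-tmp` suffix, are hypothetical. The sketch omits `fsync`, permission preservation, and temp-file cleanup on error, all of which a real implementation would likely need.

```rust
use std::borrow::Cow;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical transform: returns `Cow::Borrowed` when the input is already clean.
pub fn ensure_final_newline(input: &[u8]) -> Cow<'_, [u8]> {
    if input.is_empty() || input.ends_with(b"\n") {
        Cow::Borrowed(input)
    } else {
        let mut owned = input.to_vec();
        owned.push(b'\n');
        Cow::Owned(owned)
    }
}

/// Rewrites `path` only when the transform changed the bytes.
/// Returns `Ok(true)` if a write happened.
pub fn rewrite_if_changed(path: &Path) -> io::Result<bool> {
    let original = fs::read(path)?;
    match ensure_final_newline(&original) {
        // Unchanged: no allocation for the output and no write at all.
        Cow::Borrowed(_) => Ok(false),
        Cow::Owned(fixed) => {
            // Stage next to the target so the rename stays on one filesystem.
            let mut tmp_name = path.as_os_str().to_owned();
            tmp_name.push(".ws-tmp");
            let tmp = PathBuf::from(tmp_name);
            fs::write(&tmp, &fixed)?;
            // Replace the original in a single rename step.
            fs::rename(&tmp, path)?;
            Ok(true)
        }
    }
}
```

Running `rewrite_if_changed` a second time on the same file returns `Ok(false)` and leaves the file untouched.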
- Linux: `${XDG_CACHE_HOME:-$HOME/.cache}/whitespace/<repo_id>/`
- macOS: `$HOME/Library/Caches/whitespace/<repo_id>/`
- Windows: `%LOCALAPPDATA%\whitespace\Cache\<repo_id>\`
<repo_id> = blake3-128(canonical_repo_root_path || volume_id || fs_type)
- `keys.bin` – sorted `u128` array of known-clean keys.
- `meta.bin` – structured header (see below) plus write cursors.
- `lock` – single-writer guard file (advisory lock).
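The per-platform cache locations can be resolved with a small pure function. This is a sketch under assumptions: the `Platform` enum and `cache_dir` function are hypothetical names, the environment lookup is injected as a closure so the logic stays testable, and `repo_id` is taken as an already-computed string (the blake3 hashing step is not shown).

```rust
use std::path::PathBuf;

/// Hypothetical platform selector; a real build would likely use `cfg(target_os)`.
#[derive(Clone, Copy, Debug)]
pub enum Platform {
    Linux,
    MacOs,
    Windows,
}

/// Resolves the cache directory for `repo_id`, or `None` when the required
/// environment variable is missing. `env` looks up an environment variable.
pub fn cache_dir(
    platform: Platform,
    repo_id: &str,
    env: impl Fn(&str) -> Option<String>,
) -> Option<PathBuf> {
    let mut dir = match platform {
        // Mirrors `${XDG_CACHE_HOME:-$HOME/.cache}`: unset or empty falls back.
        Platform::Linux => match env("XDG_CACHE_HOME").filter(|v| !v.is_empty()) {
            Some(xdg) => PathBuf::from(xdg),
            None => PathBuf::from(env("HOME")?).join(".cache"),
        },
        Platform::MacOs => PathBuf::from(env("HOME")?).join("Library").join("Caches"),
        Platform::Windows => PathBuf::from(env("LOCALAPPDATA")?),
    };
    dir.push("whitespace");
    if let Platform::Windows = platform {
        dir.push("Cache");
    }
    dir.push(repo_id);
    Some(dir)
}
```

In production code the closure would typically be `|key| std::env::var(key).ok()`.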
key = blake3-128(path || dev || ino || size || mtime_ns || ctime_ns || mode)
- Represents a file that is verified clean for that exact stat snapshot.
- Lookup is a binary search over `keys.bin`.
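A sketch of the lookup side, assuming `keys.bin` is a flat sequence of 16-byte little-endian records (the byte order is an assumption; the plan only says "sorted `u128` array"). The function names `decode_keys` and `is_known_clean` are hypothetical, and computing the key itself with blake3 is left out.

```rust
/// Decodes the raw contents of `keys.bin` into keys.
/// Returns `None` if the length is not a multiple of 16 bytes.
/// Assumes little-endian records, which the plan does not specify.
pub fn decode_keys(bytes: &[u8]) -> Option<Vec<u128>> {
    if bytes.len() % 16 != 0 {
        return None;
    }
    let keys = bytes
        .chunks_exact(16)
        .map(|chunk| {
            let mut buf = [0u8; 16];
            buf.copy_from_slice(chunk);
            u128::from_le_bytes(buf)
        })
        .collect();
    Some(keys)
}

/// Binary search over the sorted key array; a hit means the file was
/// verified clean for exactly this stat snapshot.
pub fn is_known_clean(sorted_keys: &[u128], key: u128) -> bool {
    sorted_keys.binary_search(&key).is_ok()
}
```

A truncated or corrupt file makes `decode_keys` return `None`, in which case a caller could fall back to treating every file as unknown.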
- `meta.index_write_time_sec` stores the last flush timestamp.
- If `file.mtime_sec == index_write_time_sec`, treat as potentially racy → force content rescan regardless of key hit.
- macOS: FSEvents with `kFSEventStreamCreateFlagFileEvents`.
  - Persist `{last_event_id, volume_uuid}` in `meta.bin`.
  - On start: request events since `last_event_id`; if history pruned or IDs wrap, do a full walk once, then resume streaming.
- Windows: NTFS USN Change Journal.
  - Persist `{volume_serial, journal_id_low/high, last_usn}`.
  - On start: query deltas since `last_usn`; if journal pruned or mismatch detected, perform full walk then resume.
- Linux / others: Gitignore-aware pull walk only (no native journal integration). Use fast directory traversal with cached globset.
- From OS journal
  - Content-change flag → scan & fix → restat → recompute key → stage for write.
  - Metadata-only change → recompute key if any stat fields differ → stage.
  - Rename/move → recompute key using new path → stage.
  - Skip cache lookup (journal already tells us it’s dirty).
- From pull walk
  - If racy (mtime == `index_write_time_sec`) → scan/fix.
  - Else compute key → binary search `keys.bin`.
    - Hit → skip.
    - Miss → scan/fix → recompute key → stage.
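The pull-walk branch, including the racy-timestamp guard, reduces to a small decision function. This is a sketch with hypothetical names (`Action`, `pull_walk_action`); the key is assumed to be precomputed from the file's stat fields, and the journal-driven branch is not modelled.

```rust
/// What the walker should do with a candidate file.
#[derive(Debug, PartialEq, Eq)]
pub enum Action {
    /// Read the content, fix it if needed, then recompute and stage the key.
    ScanAndFix,
    /// The file is known clean for this exact stat snapshot.
    Skip,
}

/// Decides how to handle a file found by the pull walk.
pub fn pull_walk_action(
    file_mtime_sec: u64,
    index_write_time_sec: u64,
    key: u128,
    sorted_keys: &[u128],
) -> Action {
    // Racy guard: a file modified in the same second as the last flush may
    // have changed after its key was recorded, so a key hit is not trusted.
    if file_mtime_sec == index_write_time_sec {
        return Action::ScanAndFix;
    }
    if sorted_keys.binary_search(&key).is_ok() {
        Action::Skip
    } else {
        Action::ScanAndFix
    }
}
```

The racy check runs before the lookup on purpose: it must override a cache hit, not merely supplement a miss.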
- Merge staged keys into existing array, sort once.
- Write `keys.tmp`, then atomic rename → `keys.bin`.
- Update `meta.bin`, set `index_write_time_sec = now()`.
- Single-writer guard ensures only one process writes at a time.
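The merge-and-flush steps above could look roughly like this. The function names `merge_keys` and `flush_keys` are hypothetical, deduplication after sorting and little-endian encoding are assumptions the plan does not state, and both lock acquisition and the `meta.bin` update are omitted.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Merges staged keys into the existing array with a single sort.
/// Duplicates are dropped (an assumption) so lookups stay unambiguous.
pub fn merge_keys(existing: &[u128], staged: &[u128]) -> Vec<u128> {
    let mut merged = Vec::with_capacity(existing.len() + staged.len());
    merged.extend_from_slice(existing);
    merged.extend_from_slice(staged);
    merged.sort_unstable();
    merged.dedup();
    merged
}

/// Writes `keys` to `keys.tmp` inside `cache_dir`, then renames it over
/// `keys.bin` so readers never observe a partially written file.
pub fn flush_keys(cache_dir: &Path, keys: &[u128]) -> io::Result<()> {
    let mut bytes = Vec::with_capacity(keys.len() * 16);
    for key in keys {
        // Assumed encoding: 16-byte little-endian records.
        bytes.extend_from_slice(&key.to_le_bytes());
    }
    let tmp = cache_dir.join("keys.tmp");
    fs::write(&tmp, &bytes)?;
    fs::rename(&tmp, cache_dir.join("keys.bin"))
}
```

A caller holding the repo-level lock would run `merge_keys`, then `flush_keys`, and only afterwards record the new flush timestamp.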
- Acquire repo-level lock before modifying cache.
- Parallel candidate scanning allowed; per-file IO remains sequential within threads to limit contention.
- Final write is single-threaded and atomic.
- Respect `.gitignore` and `.git/info/exclude` (plus Git global excludes when available).
- Monitor their mtimes; rebuild the compiled matcher when any of them changes.
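One way to monitor the ignore files is to snapshot their modification times and compare on each run. The sketch below uses only the standard library; `IgnoreFileWatch` is a hypothetical name, a missing file is recorded as `None` so that creation and deletion also count as changes, and rebuilding the matcher itself is left to the caller.

```rust
use std::fs;
use std::path::PathBuf;
use std::time::SystemTime;

/// Tracks the mtimes of the ignore files that feed the compiled matcher.
pub struct IgnoreFileWatch {
    paths: Vec<PathBuf>,
    mtimes: Vec<Option<SystemTime>>,
}

impl IgnoreFileWatch {
    /// Records the current mtime of every path (`None` if unreadable or missing).
    pub fn new(paths: Vec<PathBuf>) -> Self {
        let mtimes = Self::read_mtimes(&paths);
        Self { paths, mtimes }
    }

    fn read_mtimes(paths: &[PathBuf]) -> Vec<Option<SystemTime>> {
        paths
            .iter()
            .map(|p| fs::metadata(p).and_then(|m| m.modified()).ok())
            .collect()
    }

    /// Returns true when any tracked file changed since the last call,
    /// and refreshes the snapshot so the next call compares against now.
    pub fn changed(&mut self) -> bool {
        let current = Self::read_mtimes(&self.paths);
        let changed = current != self.mtimes;
        self.mtimes = current;
        changed
    }
}
```

When `changed()` returns true, the caller would recompile the glob set before walking. Note that mtime granularity varies by filesystem, so two edits within the same tick may go unnoticed.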
```rust
/// Header stored in `meta.bin`.
struct Meta {
    magic: u32,   // "WSMT"
    version: u32, // 1
    index_write_time_sec: u64,
    platform_flags: u32, // bit0=macOS, bit1=Windows, bit2=Linux
    // macOS
    fsev_last_event_id: u64,
    fsev_volume_uuid: [u8; 16],
    // Windows
    usn_last_usn: u64,
    usn_journal_id_low: u64,
    usn_journal_id_high: u64,
    volume_serial: u32,
    // Repo identity snapshot
    repo_root_dev: u64,
    repo_root_ino: u64,
    keys_checksum: u128, // blake3-128 of current keys.bin
}
```
- Core MVP
  - Finalise `transform_bytes` correctness (CRLF handling, tab conversions, newline guarantee).
  - Binary heuristics and configuration knobs.
  - Unit/property tests and fuzzers for core transforms.
- CLI MVP
  - Implement ignore-aware traversal with `ignore` crate.
  - `--check`, `--stdin`, and `--dry-run` support.
  - Graceful output, logging, exit codes.
- Cache System
  - Implement key storage, repo detection, lock handling.
  - macOS FSEvents + Windows USN integrations.
  - Linux pull-walk fallback with racy guard.
- Performance Iterations
  - Build an intentionally naive baseline (byte-by-byte scans, no cache) to establish ground truth timings.
  - Add compile-time feature flags to toggle each optimisation tier (SIMD scans, cache usage, parallel walk, mmap) so we can benchmark deltas.
  - Benchmark each tier with `hyperfine` on the DocSpring repository (`/Users/ndbroadbent/code/docspring`) for cold and warm cache scenarios, charting the progression from naive to fully optimised.
  - Investigate zero-copy writes (mmap) and `fallocate` collapse-range viability (likely avoid if portability suffers).
- Docs & Developer UX
  - Update docs with new CLI command list, cache explanation, configuration examples.
  - Taskfile targets for lint/test/format.
  - Integrate with `lefthook` for commit-time trimming.
- Rustfmt + Clippy clean (warnings as errors).
- `cargo test`, property tests, fuzz targets (particularly for whitespace detection and newline edge cases).
- Snapshot tests for CLI outputs.
- Cross-platform CI (macOS, Linux, Windows) including path edge cases.
- End-to-end tests using temporary repos with ignore files and cache warm/cold scenarios.
- Evaluate `ignore` crate customisation for ignoring `.gitmodules` and repo submodules efficiently.
- Determine optimal SIMD strategy (packed compare vs `packed_simd`/`std::simd`).
- Explore OS-specific file change collapse operations (e.g., Linux `fallocate(FALLOC_FL_COLLAPSE_RANGE)`) but confirm portability.
- Investigate incremental per-file chunk hashing (content-defined chunking) for future optional accelerations.
This plan captures the current intent: a safety-first, ultra-fast whitespace normaliser with a portable cache model that exploits OS-level notifications where available.