Whitespace Project Plan

Vision

Build the fastest, safest, developer-centric whitespace normalisation toolkit as a tightly scoped, one-off experiment destined for a deep-dive blog post. Deliver a Rust core library (whitespace-core) with reusable primitives, a high-performance CLI (whitespace-cli), and thorough documentation. Focus on deterministic behaviour, precise whitespace handling, and exceptional DX so teams like DocSpring can adopt it confidently.

Core Capabilities

  • Trim trailing whitespace at line ends.
  • Guarantee a trailing newline at EOF.
  • Convert line endings between LF and CRLF (bidirectional, configurable default).
  • Convert indentation between tabs and spaces with configurable widths.
  • Detect binary files quickly and skip them.
  • Respect project ignore rules (.gitignore, .git/info/exclude).
  • Provide a JSON-friendly API surface (via the core crate) that higher level tools can bind to.
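
The first two capabilities can be illustrated with a minimal sketch over raw bytes. The function name `trim_trailing` is hypothetical (the plan only names `transform_bytes`), and this version assumes that only spaces and tabs count as trailing whitespace and that a missing final newline is filled in with LF.

```rust
/// Strip trailing spaces and tabs from every line and make sure non-empty
/// output ends with a newline. Existing terminators (LF or CRLF) are kept.
/// Hypothetical helper; not the crate's actual API.
pub fn trim_trailing(input: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(input.len() + 1);
    for line in input.split_inclusive(|&b| b == b'\n') {
        // Work out how long the terminator is: "\r\n", "\n", or none at EOF.
        let eol_len = if line.ends_with(b"\r\n") {
            2
        } else if line.ends_with(b"\n") {
            1
        } else {
            0
        };
        let (body, eol) = line.split_at(line.len() - eol_len);
        // Keep everything up to and including the last non-blank byte.
        let keep = body
            .iter()
            .rposition(|&b| b != b' ' && b != b'\t')
            .map_or(0, |i| i + 1);
        out.extend_from_slice(&body[..keep]);
        out.extend_from_slice(eol);
    }
    // Guarantee a trailing newline at EOF (assumption: LF is appended).
    if !out.is_empty() && !out.ends_with(b"\n") {
        out.push(b'\n');
    }
    out
}
```

Running the function on its own output returns the same bytes, which is the idempotence property the plan asks for.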

Architecture Overview

  1. Core Library (whitespace-core)
    • Pure Rust transformation engine with SIMD-accelerated byte scanning (build dedicated packed-compare routines on top of memchr/memrchr primitives).
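    • Case/encoding agnostic: operate on raw bytes while preserving UTF-8 correctness (only ASCII whitespace bytes are touched, and those never occur inside a multi-byte UTF-8 sequence).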
    • Deterministic transforms with idempotent operations (re-running should produce identical output).
  2. CLI (whitespace-cli)
    • Clap-based command surface with subcommands for trim/apply/check.
    • Option parsing maps directly to TransformOptions in the core crate.
    • Gitignore-aware walker and safety confirmations before writes.
    • --check mode for CI hooks.
  3. Cache + Metadata Layer
    • Repo-specific cache of known-clean files keyed by exact stat/info signature.
    • OS-specific incremental change tracking (FSEvents, USN Journal, fallback walk on Linux/others).
  4. Tooling + Docs
    • Taskfile-based workflows (task lint, task test, task build).
    • Starlight docs covering installation, guides, CLI reference, cache internals.
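
As an illustration of a deterministic, idempotent transform on raw bytes, here is a sketch of line-ending conversion. The plan does not spell out the fields of TransformOptions, so the `LineEnding` enum and `convert_line_endings` function are assumptions made for this example. Lone `\r` bytes are left untouched.

```rust
/// Target line ending. Hypothetical type; the real option lives in
/// TransformOptions, whose shape is not described in this plan.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum LineEnding {
    Lf,
    Crlf,
}

fn push_eol(out: &mut Vec<u8>, target: LineEnding) {
    match target {
        LineEnding::Lf => out.push(b'\n'),
        LineEnding::Crlf => out.extend_from_slice(b"\r\n"),
    }
}

/// Rewrite every "\n" and "\r\n" terminator to `target`.
pub fn convert_line_endings(input: &[u8], target: LineEnding) -> Vec<u8> {
    let mut out = Vec::with_capacity(input.len());
    let mut i = 0;
    while i < input.len() {
        match input[i] {
            // A CRLF pair counts as one terminator.
            b'\r' if input.get(i + 1) == Some(&b'\n') => {
                push_eol(&mut out, target);
                i += 2;
            }
            b'\n' => {
                push_eol(&mut out, target);
                i += 1;
            }
            other => {
                out.push(other);
                i += 1;
            }
        }
    }
    out
}
```

Because a CRLF pair is consumed as a unit, converting to CRLF twice does not double the `\r`, so re-running yields identical output.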

Performance Strategy

  • One-pass scanning: Compile ignore globs once, walk the repo a single time, perform transformations streaming per file.
  • SIMD and byte-search primitives: Build dedicated SIMD routines (e.g., byteset masks for whitespace, newline detection) backed by memchr/memrchr fallbacks.
  • Zero-copy unchanged paths: Skip writes when transformations produce identical buffers.
  • Temporary file staging: Only rewrite when necessary; leverage atomic rename for durability.
  • Parallel directory traversal: Use ignore + rayon for candidate enumeration; keep per-file IO single-threaded to leverage OS caches.
  • Cache warm-up: Combine stat-based fingerprinting with OS journals to avoid scanning unchanged files.
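
The "skip unchanged, stage in a temporary file, then rename" idea might look roughly like the sketch below. It uses only the standard library. The `write_if_changed` name and the `.ws-tmp` suffix are hypothetical, and the sketch does not preserve the original file's permissions, which a real implementation would likely need to handle.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Returns Ok(true) if the file was rewritten, Ok(false) if it was left alone.
pub fn write_if_changed(path: &Path, original: &[u8], transformed: &[u8]) -> io::Result<bool> {
    // Unchanged buffers never touch the disk.
    if original == transformed {
        return Ok(false);
    }
    // Stage next to the target so the rename stays on one filesystem.
    // The ".ws-tmp" suffix is an assumption made for this sketch.
    let mut tmp_name = path.as_os_str().to_owned();
    tmp_name.push(".ws-tmp");
    let tmp = PathBuf::from(tmp_name);
    {
        let mut file = fs::File::create(&tmp)?;
        file.write_all(transformed)?;
        // Flush data to disk before the rename makes it visible.
        file.sync_all()?;
    }
    // Replace the original in a single step.
    fs::rename(&tmp, path)?;
    Ok(true)
}
```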

Cache Model (whitespace-cache)

Locations

  • Linux: ${XDG_CACHE_HOME:-$HOME/.cache}/whitespace/<repo_id>/
  • macOS: $HOME/Library/Caches/whitespace/<repo_id>/
  • Windows: %LOCALAPPDATA%\whitespace\Cache\<repo_id>\
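
A sketch of how these locations could be resolved. The `Platform` enum and `cache_dir` function are hypothetical. The environment lookup is passed in as a closure so the logic can be tested without touching the real environment, and `repo_id` is treated as an opaque string.

```rust
use std::path::PathBuf;

/// Hypothetical platform selector for this sketch.
#[derive(Clone, Copy, Debug)]
pub enum Platform {
    Linux,
    MacOs,
    Windows,
}

/// Resolve the per-repo cache directory. `env` looks up an environment
/// variable. Returns None when a required variable is missing.
pub fn cache_dir(
    platform: Platform,
    repo_id: &str,
    env: impl Fn(&str) -> Option<String>,
) -> Option<PathBuf> {
    let root = match platform {
        Platform::Linux => {
            // Mirrors ${XDG_CACHE_HOME:-$HOME/.cache}: the shell `:-` form
            // also falls back when the variable is set but empty.
            let base = match env("XDG_CACHE_HOME").filter(|v| !v.is_empty()) {
                Some(xdg) => PathBuf::from(xdg),
                None => PathBuf::from(env("HOME")?).join(".cache"),
            };
            base.join("whitespace")
        }
        Platform::MacOs => PathBuf::from(env("HOME")?)
            .join("Library")
            .join("Caches")
            .join("whitespace"),
        Platform::Windows => PathBuf::from(env("LOCALAPPDATA")?)
            .join("whitespace")
            .join("Cache"),
    };
    Some(root.join(repo_id))
}
```

In real use the closure would typically be `|k| std::env::var(k).ok()`.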

<repo_id> = blake3-128(canonical_repo_root_path || volume_id || fs_type)

Files

  • keys.bin – sorted u128 array of known-clean keys.
  • meta.bin – structured header (see below) plus write cursors.
  • lock – single-writer guard file (advisory lock).

Key Definition

key = blake3-128(path || dev || ino || size || mtime_ns || ctime_ns || mode)

  • Represents a file that is verified clean for that exact stat snapshot.
  • Lookup is a binary search over keys.bin.
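
The lookup side can be sketched as follows. The plan does not specify the on-disk encoding of keys.bin, so this example assumes densely packed 16-byte little-endian values. Computing the key itself (the blake3-128 hash over the stat fields) is out of scope here. `decode_keys` and `is_known_clean` are hypothetical names.

```rust
/// Decode the contents of keys.bin into a key array.
/// Assumption: keys are packed 16-byte little-endian u128 values.
/// Returns None if the length is not a multiple of 16.
pub fn decode_keys(bytes: &[u8]) -> Option<Vec<u128>> {
    if bytes.len() % 16 != 0 {
        return None;
    }
    let keys = bytes
        .chunks_exact(16)
        .map(|chunk| {
            let mut buf = [0u8; 16];
            buf.copy_from_slice(chunk);
            u128::from_le_bytes(buf)
        })
        .collect();
    Some(keys)
}

/// Binary search over the sorted key array: a hit means the file was
/// verified clean for that exact stat snapshot.
pub fn is_known_clean(sorted_keys: &[u128], key: u128) -> bool {
    sorted_keys.binary_search(&key).is_ok()
}
```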

Racy Rule

  • meta.index_write_time_sec stores the last flush timestamp.
  • If file.mtime_sec == index_write_time_sec, treat as potentially racy → force content rescan regardless of key hit.

Change Discovery (No Resident Daemons)

  • macOS: FSEvents with kFSEventStreamCreateFlagFileEvents.
    • Persist {last_event_id, volume_uuid} in meta.bin.
    • On start: request events since last_event_id; if history pruned or IDs wrap, do a full walk once, then resume streaming.
  • Windows: NTFS USN Change Journal.
    • Persist {volume_serial, journal_id_low/high, last_usn}.
    • On start: query deltas since last_usn; if journal pruned or mismatch detected, perform full walk then resume.
  • Linux / others: Gitignore-aware pull walk only (no native journal integration). Use fast directory traversal with cached globset.

Per-file Flow

  1. From OS journal
    • Content-change flag → scan & fix → restat → recompute key → stage for write.
    • Metadata-only change → recompute key if any stat fields differ → stage.
    • Rename/move → recompute key using new path → stage.
    • Skip cache lookup (journal already tells us it’s dirty).
  2. From pull walk
    • If racy (file.mtime_sec == index_write_time_sec) → scan/fix.
    • Else compute key → binary search keys.bin.
    • Hit → skip.
    • Miss → scan/fix → recompute key → stage.
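
The two flows above reduce to a small decision table, sketched here as pure functions. All type and function names are hypothetical. The metadata-only case is simplified to always re-key, whereas the plan only re-keys when a stat field actually differs.

```rust
/// Kinds of change reported by an OS journal (hypothetical classification).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum JournalEvent {
    ContentChanged,
    MetadataOnly,
    Renamed,
}

/// What to do with a candidate file.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Action {
    /// Known clean: nothing to do.
    Skip,
    /// Recompute the key from fresh stat data and stage it.
    RekeyAndStage,
    /// Scan and fix contents, restat, recompute the key, stage it.
    ScanFixRekeyAndStage,
}

/// Journal path: no cache lookup, the journal already says the file is dirty.
pub fn journal_action(event: JournalEvent) -> Action {
    match event {
        JournalEvent::ContentChanged => Action::ScanFixRekeyAndStage,
        JournalEvent::MetadataOnly | JournalEvent::Renamed => Action::RekeyAndStage,
    }
}

/// Pull-walk path: apply the racy rule first, then consult the key array.
pub fn pull_walk_action(
    file_mtime_sec: u64,
    index_write_time_sec: u64,
    key: u128,
    sorted_clean_keys: &[u128],
) -> Action {
    // Racy: the file may have changed within the second the cache was flushed.
    if file_mtime_sec == index_write_time_sec {
        return Action::ScanFixRekeyAndStage;
    }
    if sorted_clean_keys.binary_search(&key).is_ok() {
        Action::Skip
    } else {
        Action::ScanFixRekeyAndStage
    }
}
```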

Write-out Procedure

  • Merge staged keys into existing array, sort once.
  • Write keys.tmp, then atomic rename → keys.bin.
  • Update meta.bin, set index_write_time_sec = now().
  • Single-writer guard ensures only one process writes at a time.
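
The merge step can be sketched as a pure function that produces the bytes destined for keys.tmp. Dropping duplicate keys and the little-endian encoding are both assumptions, since the plan specifies neither. `merge_keys` and `encode_keys` are hypothetical names.

```rust
/// Merge staged keys into the existing array and sort once.
/// Assumption: duplicates are dropped so binary search sees unique keys.
pub fn merge_keys(existing: &[u128], staged: &[u128]) -> Vec<u128> {
    let mut merged = Vec::with_capacity(existing.len() + staged.len());
    merged.extend_from_slice(existing);
    merged.extend_from_slice(staged);
    merged.sort_unstable();
    merged.dedup();
    merged
}

/// Serialise keys as packed 16-byte little-endian values (assumed encoding).
pub fn encode_keys(keys: &[u128]) -> Vec<u8> {
    let mut out = Vec::with_capacity(keys.len() * 16);
    for key in keys {
        out.extend_from_slice(&key.to_le_bytes());
    }
    out
}
```

The resulting buffer would then be written to keys.tmp and renamed over keys.bin while the lock is held.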

Concurrency Model

  • Acquire repo-level lock before modifying cache.
  • Parallel candidate scanning allowed; per-file IO remains sequential within threads to limit contention.
  • Final write is single-threaded and atomic.

Gitignore Handling

  • Respect .gitignore and .git/info/exclude (plus Git global excludes when available).
  • Monitor their mtimes; rebuild the compiled matcher whenever any of them changes.
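
One way to detect that the compiled matcher needs rebuilding is to snapshot the mtimes of the ignore files and compare them later. This is a sketch using only the standard library. `IgnoreWatch` is a hypothetical type, and a file that does not exist is recorded as `None` so that creating or deleting it also counts as a change.

```rust
use std::fs;
use std::path::{Path, PathBuf};
use std::time::SystemTime;

/// Modification time of `path`, or None if it cannot be read.
fn mtime(path: &Path) -> Option<SystemTime> {
    fs::metadata(path).and_then(|m| m.modified()).ok()
}

/// Remembers the mtimes of the ignore files seen when the matcher was built.
pub struct IgnoreWatch {
    stamps: Vec<(PathBuf, Option<SystemTime>)>,
}

impl IgnoreWatch {
    /// Record the current mtime of every ignore file.
    pub fn snapshot(paths: &[PathBuf]) -> Self {
        let stamps = paths.iter().map(|p| (p.clone(), mtime(p))).collect();
        IgnoreWatch { stamps }
    }

    /// True if any file's mtime differs from the snapshot, which means the
    /// compiled matcher should be rebuilt.
    pub fn is_stale(&self) -> bool {
        self.stamps.iter().any(|(path, old)| mtime(path) != *old)
    }
}
```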

meta.bin Layout (v1)

struct Meta {
  u32 magic;              // "WSMT"
  u32 version;            // 1
  u64 index_write_time_sec;

  u32 platform_flags;     // bit0=macOS, bit1=Windows, bit2=Linux

  // macOS
  u64 fsev_last_event_id;
  u8  fsev_volume_uuid[16];

  // Windows
  u64 usn_last_usn;
  u64 usn_journal_id_low;
  u64 usn_journal_id_high;
  u32 volume_serial;

  // Repo identity snapshot
  u64 repo_root_dev;
  u64 repo_root_ino;

  u128 keys_checksum;     // blake3-128 of current keys.bin
}
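
To make the layout concrete, here is a sketch that encodes and decodes only the first three fields (magic, version, index_write_time_sec). The byte order is not stated in the plan, so little-endian is assumed, and the magic is treated as the four raw bytes "WSMT" to sidestep endianness. `MetaHeader` and both functions are hypothetical.

```rust
pub const MAGIC: [u8; 4] = *b"WSMT";
pub const VERSION: u32 = 1;

/// Leading fields of meta.bin only; the platform sections are omitted.
#[derive(Debug, PartialEq, Eq)]
pub struct MetaHeader {
    pub version: u32,
    pub index_write_time_sec: u64,
}

/// Encode magic + version + index_write_time_sec (assumed little-endian).
pub fn encode_header(header: &MetaHeader) -> [u8; 16] {
    let mut out = [0u8; 16];
    out[0..4].copy_from_slice(&MAGIC);
    out[4..8].copy_from_slice(&header.version.to_le_bytes());
    out[8..16].copy_from_slice(&header.index_write_time_sec.to_le_bytes());
    out
}

/// Decode the leading fields, rejecting short input, a wrong magic, or an
/// unknown version.
pub fn decode_header(bytes: &[u8]) -> Option<MetaHeader> {
    if bytes.len() < 16 || bytes[..4] != MAGIC[..] {
        return None;
    }
    let mut version_bytes = [0u8; 4];
    version_bytes.copy_from_slice(&bytes[4..8]);
    let mut time_bytes = [0u8; 8];
    time_bytes.copy_from_slice(&bytes[8..16]);
    let version = u32::from_le_bytes(version_bytes);
    if version != VERSION {
        return None;
    }
    Some(MetaHeader {
        version,
        index_write_time_sec: u64::from_le_bytes(time_bytes),
    })
}
```

Writing each field explicitly avoids depending on compiler struct padding, which matters because the layout above mixes u32 and u64 fields.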

Roadmap (High Level)

  1. Core MVP
    • Finalise transform_bytes correctness (CRLF handling, tab conversions, newline guarantee).
    • Binary heuristics and configuration knobs.
    • Unit/property tests and fuzzers for core transforms.
  2. CLI MVP
    • Implement ignore-aware traversal with ignore crate.
    • --check, --stdin, and --dry-run support.
    • Graceful output, logging, exit codes.
  3. Cache System
    • Implement key storage, repo detection, lock handling.
    • macOS FSEvents + Windows USN integrations.
    • Linux pull-walk fallback with racy guard.
  4. Performance Iterations
    • Build an intentionally naive baseline (byte-by-byte scans, no cache) to establish ground truth timings.
    • Add compile-time feature flags to toggle each optimisation tier (SIMD scans, cache usage, parallel walk, mmap) so we can benchmark deltas.
    • Benchmark each tier with hyperfine on the DocSpring repository (/Users/ndbroadbent/code/docspring) for cold and warm cache scenarios, charting the progression from naive to fully optimised.
    • Investigate zero-copy writes (mmap) and fallocate collapse-range viability (likely avoid if portability suffers).
  5. Docs & Developer UX
    • Update docs with new CLI command list, cache explanation, configuration examples.
    • Taskfile targets for lint/test/format.
    • Integrate with lefthook for commit-time trimming.

Quality & Testing

  • Rustfmt + Clippy clean (warnings as errors).
  • cargo test, property tests, fuzz targets (particularly for whitespace detection and newline edge cases).
  • Snapshot tests for CLI outputs.
  • Cross-platform CI (macOS, Linux, Windows) including path edge cases.
  • End-to-end tests using temporary repos with ignore files and cache warm/cold scenarios.

Open Questions / Research Tracks

  • Evaluate ignore crate customisation for ignoring .gitmodules and repo submodules efficiently.
  • Determine optimal SIMD strategy (packed compare vs packed_simd/std::simd).
  • Explore OS-specific file change collapse operations (e.g., Linux fallocate(FALLOC_FL_COLLAPSE_RANGE)) but confirm portability.
  • Investigate incremental per-file chunk hashing (content-defined chunking) for future optional accelerations.

This plan captures the current intent: a safety-first, ultra-fast whitespace normaliser with a portable cache model that exploits OS-level notifications where available.