Add wiki-llm skill (Community)#100

Open
bitphill wants to merge 3 commits into zocomputer:main from bitphill:add-wiki-llm-skill

@bitphill bitphill commented Jun 27, 2026

Summary

Adds wiki-llm — a Community skill that maintains a persistent, incrementally-updated wiki for any project folder. Implements the LLM-Wiki pattern by Andrej Karpathy: the agent builds and maintains a structured, interlinked markdown knowledge base that compounds over time, instead of re-deriving knowledge via RAG on every query.

Why it matters — token / cost reduction

Without a wiki, every prompt that touches a project pays the same compounding tax:

  1. Re-scan: the agent must ls / grep / read_file over the project to rediscover what's there.
  2. Re-derive: it re-extracts the same architecture facts, entity definitions, ops history, conventions on every turn.
  3. Re-bloat context: the discovered files get loaded into the prompt as raw text, even when only a few lines actually matter.

For a non-trivial repo (a few thousand source files), each session can easily burn 50k–200k input tokens before the agent does any real work. That cost is paid every prompt, multiplied across every model tier the request fans out to. The wiki replaces that with a tiny, structured, pre-synthesized layer:

  • Each source is read once, summarized into pages/sources/<slug>.md, and cross-linked to entity / topic pages, so the agent reads ~10 KB of pre-condensed wiki instead of 2 MB of raw code.
  • The index.md catalog acts as an O(1) routing table: the agent jumps straight to the relevant page rather than grepping blindly.
  • The log.md gives versioned history of ops + decisions, so prior fixes (e.g. an env-var bug from last month) don't need re-derivation.
  • The Stop hook keeps the wiki incrementally fresh: only files that actually changed since the last turn are queued for re-ingestion. No global rebuild, no re-summarizing of unchanged code.
  • Cross-session compounding: a fact learned in session N is already in the wiki by session N+1, so the cost of learning it is paid exactly once.

Net: bigger projects benefit more (sub-linear context growth vs. linear re-scan), and downstream model fan-out (T1 → T5 cascade in interview-portal, batched eval pipelines, etc.) multiplies the savings further because every tier reads the smaller wiki instead of the raw repo.

Usage

All commands operate on a registered project — pass --project <path> or run from inside one. Every project lives in the central registry at ~/.wiki-llm/registry.json; per-wiki internals live at <wiki>/.wiki-llm/.

| Command | Purpose | Example |
| --- | --- | --- |
| `wiki init [path]` | Inspect folder, detect project type (js-ts, python, rust, go, data-folder, notes-folder, generic), emit questions JSON. Re-run with `--json <answers>` to scaffold. | `wiki init /home/workspace/my-app` then `wiki init --json /tmp/answers.json` |
| `wiki status` | List registered projects + pending update + trash counts. | `wiki status` |
| `wiki register [path]` | Add a project whose wiki already exists to the registry. | `wiki register /home/workspace/my-app` |
| `wiki unregister [path]` | Drop from registry; wiki files preserved. | `wiki unregister /home/workspace/my-app` |
| `wiki ingest <file>` | Mark a source file for ingestion; emits source content + agent instruction block (5 MB cap per file). | `wiki ingest src/server.ts --project /home/workspace/my-app` |
| `wiki update [--commit]` | Diff vs last-snapshot git head + dirty worktree, archive deletions to wiki trash, list pending. `--commit` advances state. | `wiki update --commit` |
| `wiki query "..."` | Hybrid BM25 + vector search via qmd if installed, else ripgrep fallback. | `wiki query "ZO_API_KEY auto-grader fallback"` |
| `wiki lint` | Health check: orphan pages (not referenced by index or any sibling), stale claims (sources newer than dependent page). | `wiki lint` |
| `wiki list-deleted` | Show files in wiki-native trash with timestamps. | `wiki list-deleted` |
| `wiki recover <relpath>` | Restore newest archived copy of a deleted source from wiki trash. | `wiki recover src/utils.ts` |
| `wiki hook` | Stop-hook entrypoint. Cheap and non-LLM: snapshots git diff for every registered project, archives deletions, queues changes. | `wiki hook` (invoked by Claude Code) |
| `wiki install-hook` | Register `wiki hook` in the `~/.claude/settings.json` Stop hook array. Idempotent. | `wiki install-hook` |
| `wiki uninstall-hook` | Remove the Stop hook. | `wiki uninstall-hook` |
| `wiki install-qmd` | Best-effort install of qmd search backend (npm/bun → cargo → rustup+cargo). | `wiki install-qmd` |
| `wiki doctor [--fix]` | Check / install deps. Required: bun, git. Optional: qmd, ripgrep, rustup, cargo. | `wiki doctor --fix` |
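
To make the orphan check behind `wiki lint` concrete, here is a minimal sketch of a single-pass markdown-link scan into a `Set<string>`. It is not the code from this PR: the `Page` shape, the `findOrphans` name, the link regex, and the assumption that links are written as wiki-root-relative paths are all hypothetical.

```typescript
// Hypothetical page record: path relative to the wiki root plus the markdown body.
interface Page {
  path: string;
  body: string;
}

// Matches inline markdown links like [text](pages/a.md) or [text](pages/a.md#section).
// Assumption: links carry no title string and point at wiki-root-relative paths.
const LINK_SOURCE = "\\[[^\\]]*\\]\\(([^)\\s#]+)(?:#[^)]*)?\\)";

function findOrphans(pages: Page[], roots: string[] = ["index.md"]): string[] {
  const referenced = new Set<string>();
  // One pass over every body: collect each link target once.
  for (const page of pages) {
    const re = new RegExp(LINK_SOURCE, "g");
    let m: RegExpExecArray | null;
    while ((m = re.exec(page.body)) !== null) {
      const target = m[1];
      if (target) referenced.add(target);
    }
  }
  // A page is an orphan if nothing links to it and it is not a root page.
  return pages
    .map((p) => p.path)
    .filter((p) => roots.indexOf(p) === -1 && !referenced.has(p));
}

const demoPages: Page[] = [
  { path: "index.md", body: "See [A](pages/a.md)." },
  { path: "pages/a.md", body: "Depends on [B](pages/b.md#setup)." },
  { path: "pages/b.md", body: "" },
  { path: "pages/c.md", body: "Nobody links here." },
];
const orphans = findOrphans(demoPages);
```

Each body is scanned once and membership tests are constant-time, so the cost grows with total wiki size rather than with the number of page pairs.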

What's in it (technical)

Architecture

  • Three layers per project (per the LLM-Wiki pattern): raw sources (the project, read-only), the wiki (markdown the agent owns), the schema (<wiki>/AGENTS.md — co-evolved conventions).
  • Central registry at ~/.wiki-llm/registry.json; per-wiki state at <wiki>/.wiki-llm/{config.json, state.json, queue.jsonl, trash/}.
  • Wiki skeleton: index.md (content catalog), log.md (chronological event log), pages/sources/, pages/entities/, pages/topics/ — populated by the agent based on AGENTS.md conventions.

Adaptive init

  • Auto-detects project type from filesystem signals: package.json → js-ts; pyproject.toml / requirements.txt / setup.py → python; Cargo.toml → rust; go.mod → go; ≥2 .csv/.parquet/.duckdb files → data-folder; ≥2 .md/.txt/.pdf → notes-folder; else generic.
  • Question set adapts to project type (different default include globs, scope choices: code+ops for code projects, all for notes/data) so the same skill scaffolds a code wiki and a research wiki differently.
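
The detection rules above can be sketched as a pure function over the file names found in the project root. This is an illustration only: the `detectProjectType` name, the top-to-bottom precedence, and the choice to look only at root-level names are assumptions, not confirmed behavior of the CLI.

```typescript
type ProjectType =
  | "js-ts"
  | "python"
  | "rust"
  | "go"
  | "data-folder"
  | "notes-folder"
  | "generic";

// Count files whose name ends with one of the given extensions (case-insensitive).
function countByExt(files: string[], exts: string[]): number {
  return files.filter((f) => exts.some((e) => f.toLowerCase().endsWith(e))).length;
}

// Assumption: rules are checked in the order they are listed, first match wins.
function detectProjectType(files: string[]): ProjectType {
  const names = new Set(files);
  if (names.has("package.json")) return "js-ts";
  if (["pyproject.toml", "requirements.txt", "setup.py"].some((f) => names.has(f))) {
    return "python";
  }
  if (names.has("Cargo.toml")) return "rust";
  if (names.has("go.mod")) return "go";
  if (countByExt(files, [".csv", ".parquet", ".duckdb"]) >= 2) return "data-folder";
  if (countByExt(files, [".md", ".txt", ".pdf"]) >= 2) return "notes-folder";
  return "generic";
}
```

With this ordering, a folder holding `requirements.txt` and `README.md` is classified as python rather than notes-folder, because the manifest check runs before the extension counts.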

Stop-hook auto-update

  • Hook runs after every Claude Code turn. For each registered project: git rev-parse HEAD + git status --porcelain + diff vs <wiki>/.wiki-llm/state.json:last_head.
  • Apply project include/exclude globs (minimal **, *, {a,b} glob translator).
  • Archive deletions to <wiki>/.wiki-llm/trash/<ISO-stamp>/<relpath> with a manifest line; queue add / modify / delete entries to queue.jsonl.
  • Zero LLM tokens spent in the hook — content regeneration is deferred to an explicit wiki update so token cost stays predictable.
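
A minimal glob translator supporting only `**`, `*`, and `{a,b}` might look like the sketch below. The function names and the exact matching semantics (for example, `**/` matching zero or more directories) are assumptions made for illustration; the translator in the PR may differ.

```typescript
// Translate a minimal glob (**, *, {a,b}) into an anchored RegExp.
function globToRegExp(glob: string): RegExp {
  let out = "";
  let i = 0;
  let inBraces = false;
  while (i < glob.length) {
    const c = glob.charAt(i);
    if (c === "*") {
      if (glob.charAt(i + 1) === "*") {
        if (glob.charAt(i + 2) === "/") {
          out += "(?:.*/)?"; // "**/" matches zero or more whole directories
          i += 3;
        } else {
          out += ".*"; // bare "**" matches anything, including slashes
          i += 2;
        }
      } else {
        out += "[^/]*"; // "*" stays within one path segment
        i += 1;
      }
    } else if (c === "{") {
      inBraces = true;
      out += "(?:";
      i += 1;
    } else if (c === "}" && inBraces) {
      inBraces = false;
      out += ")";
      i += 1;
    } else if (c === "," && inBraces) {
      out += "|";
      i += 1;
    } else {
      out += c.replace(/[.+?^$()|[\]\\{}]/g, "\\$&"); // escape regex metacharacters
      i += 1;
    }
  }
  if (inBraces) out += ")"; // tolerate an unclosed "{"
  return new RegExp("^" + out + "$");
}

// A path is kept when it matches some include glob and no exclude glob.
function isIncluded(relpath: string, include: string[], exclude: string[]): boolean {
  const hit = (globs: string[]) => globs.some((g) => globToRegExp(g).test(relpath));
  return hit(include) && !hit(exclude);
}
```

For example, `isIncluded("src/api/server.ts", ["src/**/*.{ts,tsx}"], ["**/*.test.ts"])` is true, while the same call for `src/api/server.test.ts` is false.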

Wiki-native trash + recovery

  • Independent of git: even if the user git reset --hard's away history, the trash still holds the deleted source.
  • Multi-ref recovery: tries HEAD → state.last_head → last commit touching the path (git log -n1 --pretty=%H -- <path>).
  • wiki list-deleted / wiki recover <relpath> work from the manifest, not git.
  • Configurable retention: trash_retention_days (default 30) per project; pruning runs during wiki update.
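
The retention rule can be sketched as a filter over trash directory names. This assumes each directory name is an ISO 8601 stamp that `Date.parse` understands; the real `<ISO-stamp>` format is not shown in this PR, and `expiredTrashDirs` is a hypothetical name.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Return the trash directory stamps that are older than the retention window.
function expiredTrashDirs(stamps: string[], now: Date, retentionDays = 30): string[] {
  const cutoff = now.getTime() - retentionDays * DAY_MS;
  return stamps.filter((stamp) => {
    const t = Date.parse(stamp);
    // Names that do not parse as dates are left alone rather than deleted.
    return !Number.isNaN(t) && t < cutoff;
  });
}

const expired = expiredTrashDirs(
  ["2026-05-01T00:00:00Z", "2026-06-20T00:00:00Z", "not-a-stamp"],
  new Date("2026-06-27T00:00:00Z"),
);
```

Only the first stamp falls outside the default 30-day window here; the caller would then remove those directories during `wiki update`.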

qmd integration

  • Hybrid BM25 + vector search over the wiki (qmd search).
  • Falls back to ripgrep -l with a heading-rank heuristic if qmd is missing.
  • wiki install-qmd tries bun add -g qmd → npm i -g qmd → cargo install qmd → bootstrap rustup then cargo.

First-run dependency preflight

  • When ~/.wiki-llm/registry.json doesn't exist (fresh Zo instance), the CLI auto-runs doctor --fix before any subcommand (except doctor and hook themselves to keep latency / recursion in check).
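
The preflight decision described above reduces to a small predicate. The sketch below is an assumption about its shape, not the CLI's actual code; `shouldPreflight` and `registryExists` are hypothetical names.

```typescript
import { existsSync } from "node:fs";
import { homedir } from "node:os";
import { join } from "node:path";

// Subcommands that never trigger the preflight (per the PR description).
const SKIP_PREFLIGHT = new Set<string>(["doctor", "hook"]);

// Run `doctor --fix` first only on a fresh instance and only for other subcommands.
function shouldPreflight(registryPresent: boolean, subcommand: string): boolean {
  return !registryPresent && !SKIP_PREFLIGHT.has(subcommand);
}

// The registry location comes from the PR; checking it is a plain existence test.
function registryExists(): boolean {
  return existsSync(join(homedir(), ".wiki-llm", "registry.json"));
}
```

A caller would evaluate `shouldPreflight(registryExists(), argv[0])` once, before dispatching to the subcommand.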

Hardening (security / perf / memory)

This PR was reviewed end-to-end for security, vulnerability exposure, memory leaks, and performance — relevant fixes folded in:

  • Command-injection-safe git invocations: all git calls go through spawnSync with explicit arg arrays — no shell interpolation of user-controlled data. Refs (HEAD, state.last_head, current head) are validated against ^[A-Za-z0-9._/-]{1,255}$ before any use, so a tampered state.json or a malicious branch name cannot break out of the argv boundary.
  • Path-traversal guards: every relpath touching trash, recover, or ingest is rejected if it is absolute, contains .. segments, or has null bytes. A defence-in-depth isUnder(root, dest) check then re-resolves the final write path and refuses anything that escapes the trash root, project root, or wiki root — so a corrupt manifest cannot get wiki recover to clobber /etc/passwd.
  • Memory caps: git show / git diff are capped at 50 MB via maxBuffer; ingest refuses sources over 5 MB with a clear error rather than OOM-ing the process.
  • Crash-resistant JSON: safeReadJSON() wraps every parse of registry / state / config / manifest / settings; a corrupt file falls back to defaults instead of throwing.
  • Atomic writes: registry, state, config, and ~/.claude/settings.json are written via .tmp + rename, so a crash mid-write can't leave them half-written.
  • Single-flight hook lock: wiki hook acquires a ~/.wiki-llm/hook.lock (O_EXCL with 60-s stale-lock reclaim), so two parallel Stop fires can't race-corrupt the queue.
  • Queue de-dup: appending the same file + action within the last 100 entries is skipped — prevents queue growth on chatty turns.
  • Trash retention: configurable per-project (default 30 days); pruning runs in-band during wiki update.
  • Lint perf: replaced the O(N²) cross-page substring scan with a single O(N) markdown-link scan into a Set<string> — lint over hundreds of pages is now linear, not quadratic.
  • which() cache: per-process memoization; binary detection no longer repeats which shell-outs across a single command's lifetime.
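
Three of the guards above (ref validation, path-traversal checks, queue de-dup) can be sketched as small pure functions. Only the ref regex and the `isUnder(root, dest)` signature are taken from this PR; the other names, the `QueueEntry` shape, and the function bodies are assumptions for illustration.

```typescript
import * as path from "node:path";

// Ref whitelist quoted in the PR description.
const REF_PATTERN = /^[A-Za-z0-9._/-]{1,255}$/;

function isValidRef(ref: string): boolean {
  return REF_PATTERN.test(ref);
}

// Reject empty, absolute, null-byte, or ".."-containing relative paths.
function isSafeRelpath(relpath: string): boolean {
  if (relpath.length === 0 || relpath.includes("\0")) return false;
  if (path.isAbsolute(relpath)) return false;
  return relpath.split(/[\\/]/).indexOf("..") === -1;
}

// Defence in depth: re-resolve both paths and require dest to sit strictly inside root.
function isUnder(root: string, dest: string): boolean {
  const rel = path.relative(path.resolve(root), path.resolve(dest));
  return (
    rel !== "" &&
    rel !== ".." &&
    !rel.startsWith(".." + path.sep) &&
    !path.isAbsolute(rel)
  );
}

// Hypothetical queue entry shape; the actions come from the Stop-hook section.
interface QueueEntry {
  file: string;
  action: "add" | "modify" | "delete";
}

// Skip the append when the same file + action already appears in the recent window.
function shouldAppend(queue: QueueEntry[], entry: QueueEntry, window = 100): boolean {
  return !queue
    .slice(-window)
    .some((e) => e.file === entry.file && e.action === entry.action);
}
```

Checking the relpath first and the resolved destination second means a manifest entry has to pass both a syntactic and a filesystem-level test before anything is written.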

Files

  • Community/wiki-llm/SKILL.md — instructions for agents
  • Community/wiki-llm/DISPLAY.json — catalog metadata (icon: book-open)
  • Community/wiki-llm/scripts/wiki — the Bun CLI
  • Community/wiki-llm/assets/wiki-template/AGENTS.md + index.md + log.md skeleton seeded into every new wiki
  • Community/wiki-llm/references/CONCEPT.md — the original LLM-Wiki concept doc

Validation

bun validate passes clean for Community/wiki-llm. Smoke tested end-to-end on an interview-portal project: init → seed → register qmd collection → ingest → update → lint → list-deleted → recover → Stop hook trigger → path-traversal-rejection → corrupt-JSON survival → trash-retention prune.

bitphill added 2 commits June 27, 2026 21:07
A skill that maintains a persistent, incrementally-updated wiki for any
project folder. Implements the LLM-Wiki pattern: rather than RAG over raw
files, the agent builds and maintains a structured markdown knowledge
base that compounds over time.

Highlights:
- Multi-project registry (`wiki init <path>` per project, all tracked
  centrally) with single-project mode supported.
- Adaptive init questions per project type (js-ts, python, rust, go,
  data-folder, notes-folder, generic-website, generic).
- Optional Claude Code Stop hook that auto-detects git diffs in any
  registered project, archives deletions to a recoverable trash, and
  queues pending ingests. Update content regeneration is deferred to an
  explicit `wiki update` so token cost stays predictable.
- Wiki-native trash: deleted source files are archived under
  `<wiki>/.wiki-llm/trash/<timestamp>/` with a manifest;
  `wiki list-deleted` and `wiki recover` work independently of git.
- qmd integration: hybrid BM25+vector search over the wiki via
  `qmd query` / `qmd index`; `wiki install-qmd` provisions it.
- Single Bun CLI: init, status, ingest, update, query, lint,
  list-deleted, recover, hook, install-hook, install-qmd.
- New `wiki doctor [--fix]` command: checks bun, git, qmd, ripgrep,
  rustup, cargo. With --fix, best-effort installs missing pieces
  (qmd via bun/npm → cargo → bootstrap rustup then cargo; git and
  ripgrep via apt-get when running as root).
- Auto-preflight: on the first ever invocation in a fresh Zo
  Computer instance (no `~/.wiki-llm/registry.json` yet), the CLI
  silently runs `doctor --fix` to install anything missing before
  the user's command runs. Skipped for `doctor` itself and `hook`
  (to keep Stop-hook latency minimal).
- `install-qmd` upgraded to multi-strategy install
  (bun → npm → cargo → bootstrap rustup then cargo).
- `which()` probes well-known install locations (`~/.cargo/bin`,
  `~/.bun/bin`, `/usr/local/bin`) so cargo/qmd are detected even
  when not on PATH in a non-login shell.
- SKILL.md documents both the new command and the first-run
  preflight behavior.
@bitphill

Author

Update: added wiki doctor command + first-run dependency preflight.

  • wiki doctor [--fix]: checks bun, git, qmd, ripgrep, rustup, cargo. With --fix, best-effort installs missing pieces (qmd via bun add -g → npm i -g → cargo install → bootstraps rustup if cargo is missing and retries; git/ripgrep via apt-get when running as root).
  • Auto-preflight: on the first ever invocation in a fresh Zo Computer instance (no ~/.wiki-llm/registry.json yet), the CLI silently runs doctor --fix to install anything missing before the user's command runs. Skipped for doctor itself and the cheap hook entrypoint.
  • install-qmd upgraded to multi-strategy install (bun → npm → cargo → bootstrap rustup then cargo).
  • which() probes ~/.cargo/bin, ~/.bun/bin, /usr/local/bin so cargo/qmd are detected even when not on PATH in a non-login shell.

Smoke-tested on a fresh shell: wiki doctor reports all 6 deps green; preflight stays silent when nothing is missing.

Security:
- Replace shelled git interpolation with spawnSync arg form; validate
  refs against ^[A-Za-z0-9._/-]{1,255}$ before any use
- Reject path-traversal (.., absolute, null bytes) on ingest/recover/
  trash dest; enforce isUnder(root, dest) on all writes
- Cap `git show`/`git diff` output at 50 MB (maxBuffer)
- Cap ingest source size at 5 MB

Robustness:
- safeReadJSON() everywhere — corrupt registry/state/manifest no
  longer crashes the CLI
- Atomic writes (.tmp + rename) for registry, state, config, settings
- Single-flight lockfile around `wiki hook` (stale-lock reclaim @60s)
- Queue dedup on append (same file+action within last 100 entries)

Performance / memory:
- Lint: O(N) reference scan over a Set, replacing O(N^2) substring
  pair-check across all bodies
- which() result cached per-process
- Trash retention: prune trash dirs older than `trash_retention_days`
  (default 30; configurable per-project)