Drew-Alleman/The_BOFfice

Automated pipeline for generating Beacon Object Files. A small team of named AI workers plans, writes, triages, and reviews BOFs; every shipped artifact has passed compile + lint + code-review gates, and every tier's review is persisted for audit.


Currently shipping 17 BOFs under output/ — file utilities (cat, cp, head, mv, rm, rmdir, stat, tail, touch, attrib), environment (cd, pwd, which, uptime), process (pkill), and network (dig, tcp_send). See the Generated BOFs table below.

Employee Roster

| First Name | Job Title | Backend | Employment Status | Job Description |
| --- | --- | --- | --- | --- |
| Michael | BOF Tier Planner | Claude Opus 4.7 | Active | Plans the BOF tiers based on a user-provided list of Linux tools to convert. Scoped to Linux coreutils; rejects out-of-scope tool names. |
| Pam | BOF Architect | Claude Sonnet 4.6 | Active | Writes an implementation spec for one tier at a time — DFR list, arg order, flow. Cheap structured output; an audit showed Sonnet sufficient here. |
| Jim | BOF Developer | Qwen2.5-Coder-7B (local, LoRA-fine-tuned, 4-bit NF4) | Active | Generates the C source from Pam's spec. Runs on the local GPU (~5 GiB VRAM in NF4) at zero API cost. |
| Stanley | Triage Analyst | Claude Opus 4.7 | On-call | Reads failed compiler/lint output, the source, and beacon.h; writes a one-sentence diagnosis + concrete fix that gets injected into Jim's retry prompt. Capped at one call per tier (fires on the second consecutive compile fail) so spend stays bounded. |
| Angela | Code Reviewer | Claude Opus 4.7 | Active | Reviews successfully-compiled code against a written rubric; skipped entirely when the build fails, since the compiler output is already the feedback. Full review persisted as output/<tool>/review.json. |
| Oscar | Manifest Writer | Claude Sonnet 4.6 | Active | Reads each shippable BOF's source + Pam's spec and writes source-gated descriptions for manifest.json. Runs on --build-manifest, never during tier generation. |

Error Handling Flow

Build failures don't go to Angela — compiler stderr says what's wrong. Stanley fires only when Jim is stuck (two consecutive compile fails in the same tier), and only once per tier — keeping the API bill bounded while still surfacing the why on the cases where the compiler error alone isn't enough for Jim.

```mermaid
flowchart TD
    Jim[Jim writes code] --> Compile{Compile?}

    Compile -->|OK| Lint{boflint?}
    Compile -->|FAIL| Streak{2+ consecutive<br/>compile fails<br/>this tier?}

    Streak -->|No| Retry
    Streak -->|Yes, not yet<br/>triaged this tier| Stanley[Stanley triage<br/>Claude Opus 4.7 · 1×/tier]
    Streak -->|Yes, already<br/>triaged| Retry

    Lint -->|OK| Angela[Angela review<br/>Claude Opus 4.7]
    Lint -->|FAIL| Retry

    Stanley --> Retry[Retry prompt<br/>+ diagnosis + suggested fix]
    Angela -->|Issues| Retry
    Angela -->|Pass| Done[Ship BOF ✓<br/>+ review.json]

    Retry --> Jim

    classDef jim fill:#e6f4ff,stroke:#0969da,color:#0b2a4a
    classDef paid fill:#fff8c5,stroke:#9a6700,color:#3b2300
    classDef gate fill:#f6f8fa,stroke:#57606a,color:#1f2328
    classDef done fill:#dafbe1,stroke:#2da44e,color:#114619

    class Jim jim
    class Stanley,Angela paid
    class Compile,Lint,Streak gate
    class Done done
```

Cost-saving rules baked into the flow:

  1. Angela is skipped whenever the build fails — compiler stderr and boflint output already say what's wrong. No Opus spend on obvious failures.
  2. Stanley is one Opus call per tier, max. He only fires on the second consecutive compile fail; first fails get retried with raw compiler output alone (cheaper, and most first fails are surface typos Jim corrects without help).
  3. Review caching by code_hash — if Jim regenerates identical code across retries, Angela reuses her prior review instead of re-spending. The pipeline also halts the tier when Jim regenerates identical code despite triage feedback — a sign the retry isn't going to converge.
  4. Jim runs locally at zero marginal cost — Qwen2.5-Coder-7B in 4-bit NF4 on a single 24 GiB card. Every iteration Jim-only is free; only the escalations (Pam, Stanley, Angela, Oscar) touch the Anthropic API.
  5. Every call is logged to ai_responses with input/output/cache tokens so per-tier spend is queryable via scripts/spend_report.py.
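Rules 1–3 can be sketched in a few lines of Python. This is an illustrative model of the gating logic, not the pipeline's actual code; the class name and attributes are hypothetical.

```python
import hashlib

class TierState:
    """Per-tier escalation state (hypothetical names; a sketch of the
    gating described above, not the pipeline's real internals)."""
    def __init__(self):
        self.consecutive_compile_fails = 0
        self.triaged = False       # Stanley fires at most once per tier
        self.review_cache = {}     # code_hash -> cached Angela review

    def should_call_stanley(self, compile_ok):
        if compile_ok:
            self.consecutive_compile_fails = 0
            return False
        self.consecutive_compile_fails += 1
        # Fire only on the 2nd consecutive fail, and only once per tier.
        if self.consecutive_compile_fails >= 2 and not self.triaged:
            self.triaged = True
            return True
        return False

    def cached_review(self, source):
        """Rule 3: identical regenerated code reuses the prior review."""
        code_hash = hashlib.sha256(source.encode()).hexdigest()
        return code_hash, self.review_cache.get(code_hash)
```

First fails retry with raw compiler output alone; the cache lookup keys on a hash of the exact source bytes, so any edit by Jim forces a fresh review.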

Angela's Review Rubric

Angela's prompt is the source of truth for what "shippable" means in this repo. It lives at data/configuration/prompts/critic.md. It opens with a mandatory pre-review scan that forces three structural passes before any topical review:

  1. Loop-body callback scan — any BeaconPrintf(CALLBACK_OUTPUT, ...) or BeaconOutput(..., CALLBACK_OUTPUT, ...) inside a for/while/do-while body is a major "algorithm" issue. N producer items = N separate C2 callback events; the correct shape is BeaconFormatAllocAppend per row → single BeaconOutput.
  2. I/O return scan — every ReadFile/WriteFile/SetFilePointer/RegQueryValueEx/DeviceIoControl/recv/send/WinHttp* call must have its return checked before any out-parameter (bytesRead, data buffer, etc.) is used. Missing check → major.
  3. Cap scan — every hard-coded upper bound (allocation size, loop count, buffer length) must emit an operator-visible notice when hit. Silent truncation → major.
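These scans are prompt instructions to the reviewer model, but the first one can be roughly approximated mechanically. A naive brace-counting Python heuristic, illustrative only and not part of the repo:

```python
import re

LOOP_RE = re.compile(r'\b(for|while|do)\b')
CALLBACK_RE = re.compile(r'Beacon(Printf|Output)\s*\(.*CALLBACK_OUTPUT')

def loop_body_callback_lines(source):
    """Flag CALLBACK_OUTPUT emissions inside a loop body (a per-row
    callback means N separate C2 events).  Line-based and brace-counting,
    so it misses brace-less loops and macros -- a sketch of the scan, not
    the LLM-driven review the repo actually performs."""
    flagged, loop_depths, depth = [], [], 0
    for lineno, line in enumerate(source.splitlines(), 1):
        if LOOP_RE.search(line):
            loop_depths.append(depth)          # a loop body opens here
        if loop_depths and CALLBACK_RE.search(line):
            flagged.append(lineno)
        depth += line.count("{") - line.count("}")
        while loop_depths and depth <= loop_depths[-1]:
            loop_depths.pop()                  # loop body closed
    return flagged
```

The fix the rubric prescribes is the inverse shape: accumulate each row with BeaconFormatAllocAppend, then emit one BeaconOutput after the loop.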

After the pre-scan, the rubric covers: DFR discipline, API signature correctness (including the Reg/LSA/NetAPI "return-value-is-the-error; don't call GetLastError" trap), allocator pairing, comment-truth, algorithm correctness (including the snprintf-return-check rule), forbidden CRT headers, OPSEC scrutiny (persistence keys, noisy API flags, disk artifacts, hooked-API sequences), and tier-appropriateness.

Acceptance rule: any critical or major issue = fail; any low OPSEC issue = fail; otherwise pass. Low non-OPSEC issues (unused DFR, small comment nits) ship with the BOF and surface in review.json for operator awareness.
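The acceptance rule reduces to a small predicate. A minimal sketch, assuming a simple list-of-dicts issue shape for illustration (the real review.json schema may differ):

```python
def accept(issues):
    """Acceptance rule as stated above: any critical or major issue
    fails; any low-severity OPSEC issue fails; everything else ships.
    Issue dicts with "severity" and "category" keys are an assumed shape."""
    for issue in issues:
        if issue["severity"] in ("critical", "major"):
            return False
        if issue["severity"] == "low" and issue["category"] == "opsec":
            return False
    return True
```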

Repository Outputs

Every successful tier produces artifacts at four layers; higher tiers overwrite the previous tier's output in each layer so the latest version always wins.

| Path | Written by | Contents |
| --- | --- | --- |
| output/<tool>/<tool>.c + .x64.o | Pipeline on tier success | Final accepted source + compiled BOF. |
| output/<tool>/review.json | Pipeline on tier success | Angela's full review — status, issues[], fix_suggestions[], raw response, token usage. Auditable after the fact. |
| manifest.json | --build-manifest (Oscar) | Blight schema v2 manifest: sha256 / imphash / stringhash, extracted args, Oscar's descriptions. One entry per tool. |
| training_data.jsonl | Live hook on tier success + --refresh-training-data | Minimal SFT corpus. One line per gated-clean tier: {"messages": [{"role": "system", ...}, {"role": "user", "content": "<Pam's spec>"}, {"role": "assistant", "content": "<accepted .c>"}]}. Only pass reviews with no critical/major issues land here; empty-Angela-response cases are still accepted since compile+lint already gated the code. |
| Makefile (root) | Hand-written | Rebuilds every output/<tool>/<tool>.c deterministically with the exact flags the pipeline used. No AI, no DB — make all. |
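One training_data.jsonl line in the messages format above can be built like this. The field contents are placeholders; the real system prompt and spec come from the pipeline's DB:

```python
import json

def training_record(system_prompt, pam_spec, accepted_source):
    """Serialize one SFT example in the messages format shown in the
    table above.  Argument values are placeholders for illustration."""
    return json.dumps({"messages": [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": pam_spec},
        {"role": "assistant", "content": accepted_source},
    ]})
```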

Tier-progression rule: execute_tool_tiers walks tiers in strict level order. A tier cannot run unless every lower tier is completed; any failed/in_progress blocker halts the tool with an actionable message. This prevents higher tiers from building on nothing. The executor also auto-retries failed/in_progress tiers at the start of a plan run, so --execute-plan on a plan with prior failures just re-attempts without a manual reset.
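The level-order gate can be sketched as follows. Status names match the prose above; the function name and the level-to-status mapping are illustrative, not the executor's actual schema:

```python
def runnable_tiers(tiers):
    """Strict level-order gate: only the lowest non-completed tier may
    run; everything above it is blocked.  failed/in_progress tiers count
    as runnable because the executor auto-retries them at plan start.
    `tiers` maps level -> status (an assumed shape for illustration)."""
    ready = []
    for level in sorted(tiers):
        status = tiers[level]
        if status == "completed":
            continue
        if status in ("pending", "failed", "in_progress"):
            ready.append(level)
        break  # any non-completed tier blocks all higher tiers
    return ready
```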

Commands

```
python main.py --create-plan [tools]       Create a BOF development plan via Michael (interactive if tools omitted)
python main.py --execute-plan <id|name>    Run all pending tiers across the plan, in level order (auto-retries failed)
python main.py --execute-tool <id>         Run a single tool's pending tiers
python main.py --execute-tier <id>         Run a single tier
python main.py --pipeline-status           Show worker loadout + backend configuration
python main.py --build-manifest            Rebuild manifest.json from output/ (Oscar writes descriptions)
python main.py --refresh-training-data     Rebuild training_data.jsonl from completed tiers in the DB
python scripts/spend_report.py             Estimate AI spend by worker / model / run / tool (--by, --since, --top, --csv)
python scripts/reset_tiers.py              List plan/tier state; --failed-only, --plan <name>, or tool names to reset
make                                       Rebuild every output/<tool>/<tool>.c -> .x64.o locally (needs MinGW)
```

Generated BOFs

Each BOF has its own page in the GitHub Wiki with arguments, behavior, and a working example. Source lives at output/<name>/<name>.c; compiled object at output/<name>/<name>.x64.o; Angela's accepting review at output/<name>/review.json.

| BOF | Pretty Name | Quick Description |
| --- | --- | --- |
| attrib | File Attributes | Read or toggle the HIDDEN / READONLY / SYSTEM attribute bits on a file. |
| cat | Cat | Read a file (up to 10 MB) and emit contents, optionally prefixed with line numbers. |
| cd | Change Directory | Change the beacon process's working directory; error path uses FormatMessageA for human-readable messages. |
| cp | Copy File | Copy a file via CopyFileW (Unicode-aware); refuses to overwrite by default. |
| dig | DNS Lookup | Resolve a hostname to its IPv4 A records via DnsQuery_A. |
| head | Head | Print the first N lines of a file (N ∈ 1–1000); stops at 512 KiB buffered. |
| mv | Move / Rename File | Move or rename a file via MoveFileExA; optional force flag to overwrite an existing destination. |
| pkill | Kill by Name | Terminate every process whose image name matches (case-insensitive); skips the beacon itself; uses PROCESS_TERMINATE for minimal privilege. |
| pwd | Print Working Dir | Return the beacon process's current working directory. |
| rm | Remove File | Delete a single file; refuses if the path is a directory and points the operator at rmdir. |
| rmdir | Remove Directory | Remove an empty directory; refuses paths under System32 / SysWOW64 / Windows. |
| stat | Stat | Report file size (64-bit), attributes, and creation/modified/accessed timestamps in UTC. |
| tail | Tail | Read the last N bytes of a file (capped at 1 MiB per read; files up to 2 GB). |
| tcp_send | TCP Send | Open a TCP socket to host:port, send a payload, return up to 4 KiB of reply. |
| touch | Touch | Create a file if absent, or update its last-write/last-access times if it exists. |
| uptime | Uptime | Report system uptime (via GetTickCount64, wrap-safe) plus current UTC time. |
| which | Which | Find an executable on the target's %PATH%; optional flag to enumerate every match. |

Local Model Setup (Jim)

Jim runs the fine-tuned Qwen2.5-Coder-7B LoRA checkpoint from a local bofit repo. Configured in data/configuration/workers.yaml:

```yaml
pam_local:
  type: "qwen"
  checkpoint: "C:/Users/drew/Desktop/bofit/output/models/sft_20260420_142812/final"
  base_model: "Qwen/Qwen2.5-Coder-7B-Instruct"
  temperature: 0.7
  max_new_tokens: 4096
  load_in_4bit: true
```

load_in_4bit uses bitsandbytes NF4 quantization + double-quant + bf16 compute, bringing the 7B model to ~5 GiB VRAM. On a 24 GiB card this leaves plenty of headroom for KV cache and activations. The worker also forces the memory-efficient SDPA kernel at generate time (Windows PyTorch has no FlashAttention wheels, so the default math kernel would otherwise materialize the full (batch, heads, seq, seq) attention matrix).
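The ~5 GiB figure checks out with back-of-envelope arithmetic over the weight bytes alone:

```python
# 7B parameters at 4 bits each (NF4 packs two weights per byte).
params = 7_000_000_000
weight_bytes = params * 4 / 8        # 0.5 bytes per parameter
weight_gib = weight_bytes / 2**30    # packed weights: roughly 3.26 GiB

# Quantization metadata (per-block absmax scales, double-quant constants),
# layers typically kept in higher precision (embeddings, lm_head), and
# CUDA context overhead account for the remainder of the observed ~5 GiB.
```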

Known Limitations

Generated BOFs are over-cautious by design. To keep the beacon stable, Pam's specs default to defensive ceilings on anything that could blow up memory, time, or recursion budget — and Jim faithfully bakes them into the C source. The cost is that the BOFs sometimes refuse work that would actually be fine. Current ceilings across the shipping pack:

| BOF | Hardcoded ceiling | Why it's there |
| --- | --- | --- |
| cat | rejects files > 10 MB | Prevents loading the whole file into beacon-process memory in one allocation. |
| head | reads up to 512 KiB; line count clamped 1–1000 | Caps the working buffer; huge log files don't warrant more in one shot. |
| tail | reads at most 1 MiB from the file tail; files > 2 GB not supported | SetFilePointer's 32-bit LONG form keeps the code simple at the cost of huge-file support. |
| tcp_send | receive buffer fixed at 4 KiB; timeout clamped 1–10000 ms | Single-shot request/reply pattern, not a streaming client. |
| pkill | PROCESS_TERMINATE only (not PROCESS_ALL_ACCESS) | Lower-privilege handle = less EDR telemetry on the open. |
| rmdir | refuses paths containing System32 / SysWOW64 / Windows | Defensive guard so an operator typo doesn't disturb the OS tree. |
| rm | refuses directories | Keeps rm's and rmdir's responsibilities distinct; directories must go through rmdir's guard. |

Open work-item: figure out a single, well-justified set of ceilings (or operator-supplied overrides) instead of each BOF inheriting whatever number Pam felt safe writing into the spec. Until then: if a BOF refuses your input with a "too large" error, the limit lives in output/<tool>/<tool>.c and you can rebuild with a higher cap via make.

Training Jim

The fine-tuning pipeline lives in a sibling repo (bofit); this repo provides the training corpus.

  1. Every successful tier appends one {system, user, assistant} record to training_data.jsonl, scrubbed of cosmetic contamination (}; artifacts, WHAT-comments, non-canonical includes, bare libc) before write.
  2. --refresh-training-data rebuilds the corpus from authoritative state (DB + manifest + output/) when you need to regenerate after cleanup.
  3. _critic_ok_for_training (in core/integrations/training_data.py) is the gate: only pass reviews with no critical/major issues land in training data. Empty-Angela-response cases are still accepted since the compile+lint gates already ran.
  4. scripts/compile_filter_dataset.py, scripts/merge_datasets.py, scripts/build_val_split.py, and scripts/scrub_training_data.py are the corpus-hygiene utilities.

Current checkpoint in production: sft_20260420_142812/final. Training pattern: continuous — every successful run adds new examples, with the expectation that a bofit retrain happens periodically.
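The gate in step 3 can be sketched directly from its description. The dict shape is assumed for illustration; the real _critic_ok_for_training in core/integrations/training_data.py may differ:

```python
def critic_ok_for_training(review):
    """Training-corpus gate as described above: only 'pass' reviews with
    no critical/major issues enter training data; an empty Angela
    response is accepted because compile + lint already gated the code.
    Review dict shape is an assumption for illustration."""
    if review is None:
        return True  # empty-response case: compile+lint gates already ran
    if review.get("status") != "pass":
        return False
    return not any(issue.get("severity") in ("critical", "major")
                   for issue in review.get("issues", []))
```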

Feature Roadmap

| Feature | Priority | Status | Description |
| --- | --- | --- | --- |
| Blight manifest.json generation | High | Done | Oscar writes Blight schema-v2 manifests with sha256 / imphash / stringhash + source-gated descriptions. python main.py --build-manifest. |
| Reproducible local build (Makefile) | High | Done | Root Makefile rebuilds every output/<tool>/<tool>.c -> .x64.o with the pipeline's exact flags, no AI required. |
| Training-data JSONL corpus | High | Done | Every gated-clean tier appends a messages-format SFT example to training_data.jsonl. --refresh-training-data rebuilds from authoritative state. |
| Review persistence | High | Done | Angela's full review (issues, fix-suggestions, raw response, tokens) saved as output/<tool>/review.json on every successful finalize. |
| Local-GPU worker for Jim | High | Done | Qwen2.5-Coder-7B LoRA in 4-bit NF4 on the operator's GPU — zero API spend per generation. |
| Cobalt Strike .cna file generation | High | Planned | Emit an Aggressor script alongside each BOF so CS operators can load them natively. |
| COFFLoader / Blight tester | High | Planned | Wire COFFLoader / Blight-WinRM in for automatic end-to-end BOF smoke-testing. |
| AI cost optimizations | Medium | Ongoing | Sonnet for structured output (Pam, Oscar) · Angela skipped on build fail · code-hash review cache · token usage logged to DB · Opus escalation on sticky tiers · duplicate-code guard halts tier on identical regenerations · 9-iteration retry budget. |
| Automatic retraining | Medium | Planned | Fire a bofit retrain from training_data.jsonl once a BOF-count threshold is reached. |
| Standardize BOF safety ceilings | Medium | Planned | Replace each BOF's arbitrary hardcoded caps (file-size, depth, output) with a principled shared set or runtime operator overrides. |
| Dwight Worker | Medium | TBD | One-line CRUD bug-fixer for the cases too small to justify a Jim iteration. |
