Automated pipeline for generating Beacon Object Files. A small team of named AI workers plans, writes, triages, and reviews BOFs; every shipped artifact has passed compile + lint + code-review gates, and every tier's review is persisted for audit.
Currently shipping 17 BOFs under `output/` — file utilities (cat, cp, head, mv, rm, rmdir, stat, tail, touch, attrib), environment (cd, pwd, which, uptime), process (pkill), and network (dig, tcp_send). See the Generated BOFs table below.
| First Name | Job Title | Backend | Employment Status | Job Description |
|---|---|---|---|---|
| Michael | BOF Tier Planner | Claude Opus 4.7 | Active | Plans the BOF tiers based on a user-provided list of Linux tools to convert. Scoped to Linux coreutils; rejects out-of-scope tool names. |
| Pam | BOF Architect | Claude Sonnet 4.6 | Active | Writes an implementation spec for one tier at a time — DFR list, arg order, flow. Cheap structured output; audit showed Sonnet sufficient here. |
| Jim | BOF Developer | Qwen2.5-Coder-7B (local, LoRA-fine-tuned, 4-bit NF4) | Active | Generates the C source from Pam's spec. Runs on the local GPU (~5 GiB VRAM in NF4) at zero API cost. |
| Stanley | Triage Analyst | Claude Opus 4.7 | On-call | Reads failed compiler/lint output, the source, and beacon.h; writes a one-sentence diagnosis + concrete fix that gets injected into Jim's retry prompt. Capped at one call per tier (fires on the second consecutive compile fail) so spend stays bounded. |
| Angela | Code Reviewer | Claude Opus 4.7 | Active | Reviews successfully-compiled code against a written rubric; skipped entirely when build fails since the compiler output is already the feedback. Full review persisted as output/<tool>/review.json. |
| Oscar | Manifest Writer | Claude Sonnet 4.6 | Active | Reads each shippable BOF's source + Pam's spec and writes source-gated descriptions for manifest.json. Runs on --build-manifest, never during tier generation. |
Build failures don't go to Angela — compiler stderr says what's wrong. Stanley fires only when Jim is stuck (two consecutive compile fails in the same tier), and only once per tier — keeping the API bill bounded while still surfacing the "why" in cases where the compiler error alone isn't enough for Jim.
```mermaid
flowchart TD
    Jim[Jim writes code] --> Compile{Compile?}
    Compile -->|OK| Lint{boflint?}
    Compile -->|FAIL| Streak{2+ consecutive<br/>compile fails<br/>this tier?}
    Streak -->|No| Retry
    Streak -->|Yes, not yet<br/>triaged this tier| Stanley[Stanley triage<br/>Claude Opus 4.7 · 1×/tier]
    Streak -->|Yes, already<br/>triaged| Retry
    Lint -->|OK| Angela[Angela review<br/>Claude Opus 4.7]
    Lint -->|FAIL| Retry
    Stanley --> Retry[Retry prompt<br/>+ diagnosis + suggested fix]
    Angela -->|Issues| Retry
    Angela -->|Pass| Done[Ship BOF ✓<br/>+ review.json]
    Retry --> Jim

    classDef jim fill:#e6f4ff,stroke:#0969da,color:#0b2a4a
    classDef paid fill:#fff8c5,stroke:#9a6700,color:#3b2300
    classDef gate fill:#f6f8fa,stroke:#57606a,color:#1f2328
    classDef done fill:#dafbe1,stroke:#2da44e,color:#114619
    class Jim jim
    class Stanley,Angela paid
    class Compile,Lint,Streak gate
    class Done done
```
Cost-saving rules baked into the flow:
- Angela is skipped whenever the build fails — compiler stderr and boflint output already say what's wrong. No Opus spend on obvious failures.
- Stanley is one Opus call per tier, max. He only fires on the second consecutive compile fail; first fails get retried with raw compiler output alone (cheaper, and most first fails are surface typos Jim corrects without help).
- Review caching by `code_hash` — if Jim regenerates identical code across retries, Angela reuses her prior review instead of re-spending. The pipeline also halts the tier when Jim regenerates identical code despite triage feedback — a sign the retry isn't going to converge.
- Jim runs locally at zero marginal cost — Qwen2.5-Coder-7B in 4-bit NF4 on a single 24 GiB card. Every Jim-only iteration is free; only the escalations (Pam, Stanley, Angela, Oscar) touch the Anthropic API.
- Every call is logged to `ai_responses` with input/output/cache tokens, so per-tier spend is queryable via `scripts/spend_report.py`.
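The escalation rule above amounts to a small per-tier state machine. A minimal sketch, assuming a per-tier state object — `TierState`, `should_triage`, and `record_compile_result` are illustrative names, not the repo's actual API:

```python
from dataclasses import dataclass

@dataclass
class TierState:
    consecutive_compile_fails: int = 0
    triaged: bool = False  # Stanley fires at most once per tier

def should_triage(state: TierState) -> bool:
    """Escalate to Stanley only on the 2nd+ consecutive compile fail,
    and only if this tier hasn't been triaged yet."""
    return state.consecutive_compile_fails >= 2 and not state.triaged

def record_compile_result(state: TierState, ok: bool) -> str:
    if ok:
        state.consecutive_compile_fails = 0
        return "lint"      # proceed to boflint, then Angela
    state.consecutive_compile_fails += 1
    if should_triage(state):
        state.triaged = True
        return "stanley"   # one Opus triage call, diagnosis injected into retry
    return "retry"         # raw compiler stderr alone (cheap)

state = TierState()
assert record_compile_result(state, ok=False) == "retry"    # 1st fail: cheap retry
assert record_compile_result(state, ok=False) == "stanley"  # 2nd fail: escalate once
assert record_compile_result(state, ok=False) == "retry"    # already triaged this tier
```

First fails never spend Opus; the single triage budget is enforced by the `triaged` flag rather than a counter.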
Angela's prompt is the source of truth for what "shippable" means in this repo. It lives at `data/configuration/prompts/critic.md` and opens with a mandatory pre-review scan that forces three structural passes before any topical review:
- Loop-body callback scan — any `BeaconPrintf(CALLBACK_OUTPUT, ...)` or `BeaconOutput(..., CALLBACK_OUTPUT, ...)` inside a `for`/`while`/`do-while` body is a *major* "algorithm" issue. N producer items = N separate C2 callback events; the correct shape is `BeaconFormatAlloc` → `Append` per row → single `BeaconOutput`.
- I/O return scan — every `ReadFile`/`WriteFile`/`SetFilePointer`/`RegQueryValueEx`/`DeviceIoControl`/`recv`/`send`/`WinHttp*` call must have its return checked before any out-parameter (bytesRead, data buffer, etc.) is used. Missing check → *major*.
- Cap scan — every hard-coded upper bound (allocation size, loop count, buffer length) must emit an operator-visible notice when hit. Silent truncation → *major*.
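For illustration, the loop-body callback scan could be approximated mechanically. This is a minimal heuristic sketch (brace-delimited loop bodies only) — not the actual `critic.md` prompt or boflint implementation:

```python
import re

LOOP_RE = re.compile(r"\b(for|while|do)\b")
CALLBACK_RE = re.compile(r"\b(BeaconPrintf|BeaconOutput)\s*\(.*CALLBACK_OUTPUT")

def scan_loop_body_callbacks(source: str) -> list[int]:
    """Return 1-based line numbers where a C2 callback fires inside a loop.

    Heuristic: tracks brace depth line by line; assumes loop bodies use braces.
    """
    findings, loop_depths, depth = [], [], 0
    for lineno, line in enumerate(source.splitlines(), 1):
        if LOOP_RE.search(line):
            loop_depths.append(depth)   # remember depth where the loop began
        if loop_depths and CALLBACK_RE.search(line):
            findings.append(lineno)     # per-row callback -> major issue
        depth += line.count("{") - line.count("}")
        while loop_depths and "}" in line and depth <= loop_depths[-1]:
            loop_depths.pop()           # loop body closed
    return findings

bad = """void run() {
  for (int i = 0; i < n; i++) {
    BeaconPrintf(CALLBACK_OUTPUT, "%s", rows[i]);
  }
}"""
assert scan_loop_body_callbacks(bad) == [3]
```

The correct shape — accumulate inside the loop, emit one `BeaconOutput` after it — yields an empty findings list under the same scan.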
After the pre-scan, the rubric covers: DFR discipline, API signature correctness (including the Reg/LSA/NetAPI "return-value-is-the-error; don't call `GetLastError`" trap), allocator pairing, comment-truth, algorithm correctness (including the `snprintf`-return-check rule), forbidden CRT headers, OPSEC scrutiny (persistence keys, noisy API flags, disk artifacts, hooked-API sequences), and tier-appropriateness.
Acceptance rule: any critical or major issue = fail; any low OPSEC issue = fail; otherwise pass. Low non-OPSEC issues (unused DFR, small comment nits) ship with the BOF and surface in review.json for operator awareness.
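The acceptance rule reduces to a few lines. A sketch assuming a flat issue list with `severity` and `category` fields (illustrative names — Angela's actual review schema may differ):

```python
def verdict(issues: list[dict]) -> str:
    """Any critical/major issue fails; any low OPSEC issue fails; otherwise
    pass (remaining low non-OPSEC issues ship but surface in review.json)."""
    for issue in issues:
        if issue["severity"] in ("critical", "major"):
            return "fail"
        if issue["severity"] == "low" and issue["category"] == "opsec":
            return "fail"
    return "pass"

assert verdict([]) == "pass"
assert verdict([{"severity": "low", "category": "comment-truth"}]) == "pass"
assert verdict([{"severity": "low", "category": "opsec"}]) == "fail"
assert verdict([{"severity": "major", "category": "algorithm"}]) == "fail"
```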
Every successful tier produces artifacts at four layers; higher tiers overwrite the previous tier's output in each layer so the latest version always wins.
| Path | Written by | Contents |
|---|---|---|
| `output/<tool>/<tool>.c` + `.x64.o` | Pipeline on tier success | Final accepted source + compiled BOF. |
| `output/<tool>/review.json` | Pipeline on tier success | Angela's full review — status, `issues[]`, `fix_suggestions[]`, raw response, token usage. Auditable after the fact. |
| `manifest.json` | `--build-manifest` (Oscar) | Blight schema v2 manifest: sha256 / imphash / stringhash, extracted args, Oscar's descriptions. One entry per tool. |
| `training_data.jsonl` | Live hook on tier success + `--refresh-training-data` | Minimal SFT corpus. One line per gated-clean tier: `{"messages": [{"role": "system", ...}, {"role": "user", "content": "<Pam's spec>"}, {"role": "assistant", "content": "<accepted .c>"}]}`. Only pass reviews with no critical/major issues land here; empty-Angela-response cases are still accepted since compile+lint already gated the code. |
| `Makefile` (root) | Hand-written | Rebuilds every `output/<tool>/<tool>.c` deterministically with the exact flags the pipeline used. No AI, no DB — `make all`. |
Tier-progression rule — `execute_tool_tiers` walks tiers in strict level order. A tier cannot run unless every lower tier is completed; any `failed`/`in_progress` blocker halts the tool with an actionable message. This prevents higher tiers from building on nothing. The executor also auto-retries `failed`/`in_progress` tiers at the start of a plan run, so `--execute-plan` on a plan with prior failures just re-attempts without a manual reset.
| Command | Purpose |
|---|---|
| `python main.py --create-plan [tools]` | Create a BOF development plan via Michael (interactive if tools omitted) |
| `python main.py --execute-plan <id\|name>` | Run all pending tiers across the plan, in level order (auto-retries failed) |
| `python main.py --execute-tool <id>` | Run a single tool's pending tiers |
| `python main.py --execute-tier <id>` | Run a single tier |
| `python main.py --pipeline-status` | Show worker loadout + backend configuration |
| `python main.py --build-manifest` | Rebuild `manifest.json` from `output/` (Oscar writes descriptions) |
| `python main.py --refresh-training-data` | Rebuild `training_data.jsonl` from completed tiers in the DB |
| `python scripts/spend_report.py` | Estimate AI spend by worker / model / run / tool (`--by`, `--since`, `--top`, `--csv`) |
| `python scripts/reset_tiers.py` | List plan/tier state; `--failed-only`, `--plan <name>`, or tool names to reset |
| `make` | Rebuild every `output/<tool>/<tool>.c` → `.x64.o` locally (needs MinGW) |
Each BOF has its own page in the GitHub Wiki with arguments, behavior, and a working example. Source lives at `output/<name>/<name>.c`; compiled object at `output/<name>/<name>.x64.o`; Angela's accepting review at `output/<name>/review.json`.
| BOF | Pretty Name | Quick Description |
|---|---|---|
| `attrib` | File Attributes | Read or toggle the HIDDEN / READONLY / SYSTEM attribute bits on a file. |
| `cat` | Cat | Read a file (up to 10 MB) and emit contents, optionally prefixed with line numbers. |
| `cd` | Change Directory | Change the beacon process's working directory; error path uses `FormatMessageA` for human-readable messages. |
| `cp` | Copy File | Copy a file via `CopyFileW` (Unicode-aware); refuses to overwrite by default. |
| `dig` | DNS Lookup | Resolve a hostname to its IPv4 A records via `DnsQuery_A`. |
| `head` | Head | Print the first N lines of a file (N ∈ 1–1000); stops at 512 KiB buffered. |
| `mv` | Move / Rename File | Move or rename a file via `MoveFileExA`; optional force flag to overwrite existing destination. |
| `pkill` | Kill by Name | Terminate every process whose image name matches (case-insensitive); skips the beacon itself; uses `PROCESS_TERMINATE` for minimal privilege. |
| `pwd` | Print Working Dir | Return the beacon process's current working directory. |
| `rm` | Remove File | Delete a single file; refuses if the path is a directory and points the operator at `rmdir`. |
| `rmdir` | Remove Directory | Remove an empty directory; refuses paths under System32 / SysWOW64 / Windows. |
| `stat` | Stat | Report file size (64-bit), attributes, and creation/modified/accessed timestamps in UTC. |
| `tail` | Tail | Read the last N bytes of a file (capped at 1 MiB per read; files up to 2 GB). |
| `tcp_send` | TCP Send | Open a TCP socket to host:port, send a payload, return up to 4 KiB of reply. |
| `touch` | Touch | Create a file if absent, or update its last-write/last-access times if it exists. |
| `uptime` | Uptime | Report system uptime (via `GetTickCount64`, wrap-safe) plus current UTC time. |
| `which` | Which | Find an executable on the target's `%PATH%`; optional flag to enumerate every match. |
Jim runs the fine-tuned Qwen2.5-Coder-7B LoRA checkpoint from a local bofit repo. Configured in `data/configuration/workers.yaml`:

```yaml
pam_local:
  type: "qwen"
  checkpoint: "C:/Users/drew/Desktop/bofit/output/models/sft_20260420_142812/final"
  base_model: "Qwen/Qwen2.5-Coder-7B-Instruct"
  temperature: 0.7
  max_new_tokens: 4096
  load_in_4bit: true
```

`load_in_4bit` uses bitsandbytes NF4 quantization + double-quant + bf16 compute, bringing the 7B model to ~5 GiB VRAM. On a 24 GiB card this leaves plenty of headroom for KV cache and activations. The worker also forces the memory-efficient SDPA kernel at generate time (Windows PyTorch has no FlashAttention wheels, so the default math kernel would otherwise materialize the full (batch, heads, seq, seq) attention matrix).
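A rough back-of-envelope for the ~5 GiB figure. The numbers here are assumptions, not measurements from the repo: ~7.6 B parameters for a 7B-class model, and ~1.5 GiB of overhead for quantization scales, LoRA adapters, higher-precision embeddings, and the CUDA context:

```python
PARAMS = 7.6e9                       # assumed parameter count (7B-class model)
weights_gib = PARAMS * 0.5 / 2**30   # NF4 stores 4 bits = 0.5 bytes per weight
total_gib = weights_gib + 1.5        # assumed fixed overhead (scales, LoRA, CUDA ctx)
headroom_gib = 24 - total_gib        # on the 24 GiB card
print(f"weights ~{weights_gib:.1f} GiB, total ~{total_gib:.1f} GiB, "
      f"headroom ~{headroom_gib:.1f} GiB")
```

The ~3.5 GiB of quantized weights plus overhead lands near the quoted ~5 GiB, leaving roughly 19 GiB for KV cache and activations.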
Generated BOFs are over-cautious by design. To keep the beacon stable, Pam's specs default to defensive ceilings on anything that could blow up memory, time, or recursion budget — and Jim faithfully bakes them into the C source. The cost is that the BOFs sometimes refuse work that would actually be fine. Current ceilings across the shipping pack:
| BOF | Hardcoded ceiling | Why it's there |
|---|---|---|
| `cat` | rejects files > 10 MB | Prevents loading the whole file into beacon-process memory in one allocation. |
| `head` | reads up to 512 KiB; line count clamped 1–1000 | Caps the working buffer; huge log files don't warrant more in one shot. |
| `tail` | reads at most 1 MiB from the file tail; files > 2 GB not supported | `SetFilePointer`'s 32-bit LONG form keeps the code simple at the cost of huge-file support. |
| `tcp_send` | receive buffer fixed at 4 KiB; timeout clamped 1–10000 ms | Single-shot request/reply pattern, not a streaming client. |
| `pkill` | `PROCESS_TERMINATE` only (not `PROCESS_ALL_ACCESS`) | Lower-privilege handle = less EDR telemetry on the open. |
| `rmdir` | refuses paths containing `System32` / `SysWOW64` / `Windows` | Defensive guard so an operator typo doesn't disturb the OS tree. |
| `rm` | refuses directories | Keeps `rm`'s and `rmdir`'s responsibilities distinct; directories must go through `rmdir`'s guard. |
Open work-item: settle on a single, well-justified set of ceilings (or operator-supplied overrides) instead of each BOF inheriting whatever number Pam felt safe writing into the spec. Until then: if a BOF refuses your input with a "too large" error, the limit lives in `output/<tool>/<tool>.c` and you can rebuild with a higher cap via `make`.
The fine-tuning pipeline lives in a sibling repo (bofit); this repo provides the training corpus.
- Every successful tier appends one `{system, user, assistant}` record to `training_data.jsonl`, scrubbed of cosmetic contamination (`};` artifacts, WHAT-comments, non-canonical includes, bare libc) before write.
- `--refresh-training-data` rebuilds the corpus from authoritative state (DB + manifest + `output/`) when you need to regenerate after cleanup.
- `_critic_ok_for_training` (in `core/integrations/training_data.py`) is the gate: only `pass` reviews with no `critical`/`major` issues land in training data. Empty-Angela-response cases are still accepted since the compile+lint gates already ran.
- `scripts/compile_filter_dataset.py`, `scripts/merge_datasets.py`, `scripts/build_val_split.py`, and `scripts/scrub_training_data.py` are the corpus-hygiene utilities.
Current checkpoint in production: `sft_20260420_142812/final`. Training pattern: continuous — every successful run adds new examples, with the expectation that a bofit retrain happens periodically.
| Feature | Priority | Status | Description |
|---|---|---|---|
| Blight `manifest.json` generation | High | Done | Oscar writes Blight schema-v2 manifests with sha256 / imphash / stringhash + source-gated descriptions. `python main.py --build-manifest`. |
| Reproducible local build (Makefile) | High | Done | Root `Makefile` rebuilds every `output/<tool>/<tool>.c` → `.x64.o` with the pipeline's exact flags, no AI required. |
| Training-data JSONL corpus | High | Done | Every gated-clean tier appends a messages-format SFT example to training_data.jsonl. --refresh-training-data rebuilds from authoritative state. |
| Review persistence | High | Done | Angela's full review (issues, fix-suggestions, raw response, tokens) saved as output/<tool>/review.json on every successful finalize. |
| Local-GPU worker for Jim | High | Done | Qwen2.5-Coder-7B LoRA in 4-bit NF4 on the operator's GPU — zero API spend per generation. |
| Cobalt Strike `.cna` file generation | High | Planned | Emit an Aggressor script alongside each BOF so CS operators can load them natively. |
| COFFLoader / Blight tester | High | Planned | Wire COFFLoader / Blight-WinRM in for automatic end-to-end BOF smoke-testing. |
| AI cost optimizations | Medium | Ongoing | Sonnet for structured output (Pam, Oscar) · Angela skipped on build fail · code-hash review cache · token usage logged to DB · Opus escalation on sticky tiers · duplicate-code guard halts tier on identical regenerations · 9-iteration retry budget. |
| Automatic retraining | Medium | Planned | Fire a bofit retrain from training_data.jsonl once a BOF-count threshold is reached. |
| Standardize BOF safety ceilings | Medium | Planned | Replace each BOF's arbitrary hardcoded caps (file-size, depth, output) with a principled shared set or runtime operator overrides. |
| Dwight Worker | Medium | TBD | One-line CRUD bug-fixer for the cases too small to justify a Jim iteration. |