Drew-Alleman/The_BOFfice

Automated pipeline for generating Beacon Object Files. A small team of named AI workers plans, writes, triages, and reviews BOFs; every shipped artifact has passed compile + lint + code-review gates, and every tier's review is persisted for audit.


Currently shipping 17 BOFs under output/ — file utilities (cat, cp, head, mv, rm, rmdir, stat, tail, touch, attrib), environment (cd, pwd, which, uptime), process (pkill), and network (dig, tcp_send). See the Generated BOFs table below.

Employee Roster

| First Name | Job Title | Backend | Employment Status | Job Description |
| --- | --- | --- | --- | --- |
| Michael | BOF Tier Planner | Claude Opus 4.7 | Active | Plans the BOF tiers based on a user-provided list of Linux tools to convert. Scoped to Linux coreutils; rejects out-of-scope tool names. |
| Pam | BOF Architect | Claude Sonnet 4.6 | Active | Writes an implementation spec for one tier at a time — DFR list, arg order, flow. Cheap structured output; an audit showed Sonnet sufficient here. |
| Jim | BOF Developer | Qwen2.5-Coder-7B (local, LoRA-fine-tuned, 4-bit NF4) | Active | Generates the C source from Pam's spec. Runs on the local GPU (~5 GiB VRAM in NF4) at zero API cost. |
| Stanley | Triage Analyst | Claude Opus 4.7 | On-call | Reads failed compiler/lint output, the source, and beacon.h; writes a one-sentence diagnosis + concrete fix that gets injected into Jim's retry prompt. Capped at one call per tier (fires on the second consecutive compile fail) so spend stays bounded. |
| Angela | Code Reviewer | Claude Opus 4.7 | Active | Reviews successfully-compiled code against a written rubric; skipped entirely when the build fails, since the compiler output is already the feedback. Full review persisted as output/<tool>/review.json. |
| Oscar | Manifest Writer | Claude Sonnet 4.6 | Active | Reads each shippable BOF's source + Pam's spec and writes source-gated descriptions for manifest.json. Runs on --build-manifest, never during tier generation. |

Error Handling Flow

Build failures don't go to Angela — compiler stderr says what's wrong. Stanley fires only when Jim is stuck (two consecutive compile fails in the same tier), and only once per tier — keeping the API bill bounded while still surfacing the why on the cases where the compiler error alone isn't enough for Jim.

```mermaid
flowchart TD
    Jim[Jim writes code] --> Compile{Compile?}

    Compile -->|OK| Lint{boflint?}
    Compile -->|FAIL| Streak{2+ consecutive<br/>compile fails<br/>this tier?}

    Streak -->|No| Retry
    Streak -->|Yes, not yet<br/>triaged this tier| Stanley[Stanley triage<br/>Claude Opus 4.7 · 1×/tier]
    Streak -->|Yes, already<br/>triaged| Retry

    Lint -->|OK| Angela[Angela review<br/>Claude Opus 4.7]
    Lint -->|FAIL| Retry

    Stanley --> Retry[Retry prompt<br/>+ diagnosis + suggested fix]
    Angela -->|Issues| Retry
    Angela -->|Pass| Done[Ship BOF ✓<br/>+ review.json]

    Retry --> Jim

    classDef jim fill:#e6f4ff,stroke:#0969da,color:#0b2a4a
    classDef paid fill:#fff8c5,stroke:#9a6700,color:#3b2300
    classDef gate fill:#f6f8fa,stroke:#57606a,color:#1f2328
    classDef done fill:#dafbe1,stroke:#2da44e,color:#114619

    class Jim jim
    class Stanley,Angela paid
    class Compile,Lint,Streak gate
    class Done done
```

Cost-saving rules baked into the flow:

  1. Angela is skipped whenever the build fails — compiler stderr and boflint output already say what's wrong. No Opus spend on obvious failures.
  2. Stanley is one Opus call per tier, max. He only fires on the second consecutive compile fail; first fails get retried with raw compiler output alone (cheaper, and most first fails are surface typos Jim corrects without help).
  3. Review caching by code_hash — if Jim regenerates identical code across retries, Angela reuses her prior review instead of re-spending. The pipeline also halts the tier when Jim regenerates identical code despite triage feedback — a sign the retry isn't going to converge.
  4. Jim runs locally at zero marginal cost — Qwen2.5-Coder-7B in 4-bit NF4 on a single 24 GiB card. Every iteration Jim-only is free; only the escalations (Pam, Stanley, Angela, Oscar) touch the Anthropic API.
  5. Every call is logged to ai_responses with input/output/cache tokens so per-tier spend is queryable via scripts/spend_report.py.
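Rules 1–3 can be sketched in a few lines of Python. This is an illustrative model of the gating logic, not the pipeline's actual code; the class name and attributes are hypothetical.

```python
import hashlib

class TierState:
    """Per-tier escalation state (hypothetical names; a sketch of the
    gating described above, not the pipeline's real internals)."""
    def __init__(self):
        self.consecutive_compile_fails = 0
        self.triaged = False       # Stanley fires at most once per tier
        self.review_cache = {}     # code_hash -> cached Angela review

    def should_call_stanley(self, compile_ok):
        if compile_ok:
            self.consecutive_compile_fails = 0
            return False
        self.consecutive_compile_fails += 1
        # Fire only on the 2nd consecutive fail, and only once per tier.
        if self.consecutive_compile_fails >= 2 and not self.triaged:
            self.triaged = True
            return True
        return False

    def cached_review(self, source):
        """Rule 3: identical regenerated code reuses the prior review."""
        code_hash = hashlib.sha256(source.encode()).hexdigest()
        return code_hash, self.review_cache.get(code_hash)
```

First fails retry with raw compiler output alone; the cache lookup keys on a hash of the exact source bytes, so any edit by Jim forces a fresh review.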

Angela's Review Rubric

Angela's prompt is the source of truth for what "shippable" means in this repo. It lives at data/configuration/prompts/critic.md. It opens with a mandatory pre-review scan that forces three structural passes before any topical review:

  1. Loop-body callback scan — any BeaconPrintf(CALLBACK_OUTPUT, ...) or BeaconOutput(..., CALLBACK_OUTPUT, ...) inside a for/while/do-while body is a major "algorithm" issue. N producer items = N separate C2 callback events; the correct shape is BeaconFormatAllocAppend per row → single BeaconOutput.
  2. I/O return scan — every ReadFile/WriteFile/SetFilePointer/RegQueryValueEx/DeviceIoControl/recv/send/WinHttp* call must have its return checked before any out-parameter (bytesRead, data buffer, etc.) is used. Missing check → major.
  3. Cap scan — every hard-coded upper bound (allocation size, loop count, buffer length) must emit an operator-visible notice when hit. Silent truncation → major.
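These scans are prompt instructions to the reviewer model, but the first one can be roughly approximated mechanically. A naive brace-counting Python heuristic, illustrative only and not part of the repo:

```python
import re

LOOP_RE = re.compile(r'\b(for|while|do)\b')
CALLBACK_RE = re.compile(r'Beacon(Printf|Output)\s*\(.*CALLBACK_OUTPUT')

def loop_body_callback_lines(source):
    """Flag CALLBACK_OUTPUT emissions inside a loop body (a per-row
    callback means N separate C2 events).  Line-based and brace-counting,
    so it misses brace-less loops and macros -- a sketch of the scan, not
    the LLM-driven review the repo actually performs."""
    flagged, loop_depths, depth = [], [], 0
    for lineno, line in enumerate(source.splitlines(), 1):
        if LOOP_RE.search(line):
            loop_depths.append(depth)          # a loop body opens here
        if loop_depths and CALLBACK_RE.search(line):
            flagged.append(lineno)
        depth += line.count("{") - line.count("}")
        while loop_depths and depth <= loop_depths[-1]:
            loop_depths.pop()                  # loop body closed
    return flagged
```

The fix the rubric prescribes is the inverse shape: accumulate each row with BeaconFormatAllocAppend, then emit one BeaconOutput after the loop.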

After the pre-scan, the rubric covers: DFR discipline, API signature correctness (including the Reg/LSA/NetAPI "return-value-is-the-error; don't call GetLastError" trap), allocator pairing, comment-truth, algorithm correctness (including the snprintf-return-check rule), forbidden CRT headers, OPSEC scrutiny (persistence keys, noisy API flags, disk artifacts, hooked-API sequences), and tier-appropriateness.

Acceptance rule: any critical or major issue = fail; any low OPSEC issue = fail; otherwise pass. Low non-OPSEC issues (unused DFR, small comment nits) ship with the BOF and surface in review.json for operator awareness.
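The acceptance rule reduces to a small predicate. A minimal sketch, assuming a simple list-of-dicts issue shape for illustration (the real review.json schema may differ):

```python
def accept(issues):
    """Acceptance rule as stated above: any critical or major issue
    fails; any low-severity OPSEC issue fails; everything else ships.
    Issue dicts with "severity" and "category" keys are an assumed shape."""
    for issue in issues:
        if issue["severity"] in ("critical", "major"):
            return False
        if issue["severity"] == "low" and issue["category"] == "opsec":
            return False
    return True
```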

Repository Outputs

Every successful tier produces artifacts at four layers; higher tiers overwrite the previous tier's output in each layer so the latest version always wins.

| Path | Written by | Contents |
| --- | --- | --- |
| output/<tool>/<tool>.c + .x64.o | Pipeline on tier success | Final accepted source + compiled BOF. |
| output/<tool>/review.json | Pipeline on tier success | Angela's full review — status, issues[], fix_suggestions[], raw response, token usage. Auditable after the fact. |
| manifest.json | --build-manifest (Oscar) | Blight schema v2 manifest: sha256 / imphash / stringhash, extracted args, Oscar's descriptions. One entry per tool. |
| training_data.jsonl | Live hook on tier success + --refresh-training-data | Minimal SFT corpus. One line per gated-clean tier: {"messages": [{"role": "system", ...}, {"role": "user", "content": "<Pam's spec>"}, {"role": "assistant", "content": "<accepted .c>"}]}. Only pass reviews with no critical/major issues land here; empty-Angela-response cases are still accepted since compile+lint already gated the code. |
| Makefile (root) | Hand-written | Rebuilds every output/<tool>/<tool>.c deterministically with the exact flags the pipeline used. No AI, no DB — make all. |
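One training_data.jsonl line in the messages format above can be built like this. The field contents are placeholders; the real system prompt and spec come from the pipeline's DB:

```python
import json

def training_record(system_prompt, pam_spec, accepted_source):
    """Serialize one SFT example in the messages format shown in the
    table above.  Argument values are placeholders for illustration."""
    return json.dumps({"messages": [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": pam_spec},
        {"role": "assistant", "content": accepted_source},
    ]})
```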

Tier-progression rule: execute_tool_tiers walks tiers in strict level order. A tier cannot run unless every lower tier is completed; any failed/in_progress blocker halts the tool with an actionable message. This prevents higher tiers from building on nothing. The executor also auto-retries failed/in_progress tiers at the start of a plan run, so --execute-plan on a plan with prior failures just re-attempts without a manual reset.
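The level-order gate can be sketched as follows. Status names match the prose above; the function name and the level-to-status mapping are illustrative, not the executor's actual schema:

```python
def runnable_tiers(tiers):
    """Strict level-order gate: only the lowest non-completed tier may
    run; everything above it is blocked.  failed/in_progress tiers count
    as runnable because the executor auto-retries them at plan start.
    `tiers` maps level -> status (an assumed shape for illustration)."""
    ready = []
    for level in sorted(tiers):
        status = tiers[level]
        if status == "completed":
            continue
        if status in ("pending", "failed", "in_progress"):
            ready.append(level)
        break  # any non-completed tier blocks all higher tiers
    return ready
```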

Commands

```
python main.py --create-plan [tools]       Create a BOF development plan via Michael (interactive if tools omitted)
python main.py --execute-plan <id|name>    Run all pending tiers across the plan, in level order (auto-retries failed)
python main.py --execute-tool <id>         Run a single tool's pending tiers
python main.py --execute-tier <id>         Run a single tier
python main.py --pipeline-status           Show worker loadout + backend configuration
python main.py --build-manifest            Rebuild manifest.json from output/ (Oscar writes descriptions)
python main.py --refresh-training-data     Rebuild training_data.jsonl from completed tiers in the DB
python scripts/spend_report.py             Estimate AI spend by worker / model / run / tool (--by, --since, --top, --csv)
python scripts/reset_tiers.py              List plan/tier state; --failed-only, --plan <name>, or tool names to reset
make                                       Rebuild every output/<tool>/<tool>.c -> .x64.o locally (needs MinGW)
```

Generated BOFs

Each BOF has its own page in the GitHub Wiki with arguments, behavior, and a working example. Source lives at output/<name>/<name>.c; compiled object at output/<name>/<name>.x64.o; Angela's accepting review at output/<name>/review.json.

| BOF | Pretty Name | Quick Description |
| --- | --- | --- |
| attrib | File Attributes | Read or toggle the HIDDEN / READONLY / SYSTEM attribute bits on a file. |
| cat | Cat | Read a file (up to 10 MB) and emit contents, optionally prefixed with line numbers. |
| cd | Change Directory | Change the beacon process's working directory; error path uses FormatMessageA for human-readable messages. |
| cp | Copy File | Copy a file via CopyFileW (Unicode-aware); refuses to overwrite by default. |
| dig | DNS Lookup | Resolve a hostname to its IPv4 A records via DnsQuery_A. |
| head | Head | Print the first N lines of a file (N ∈ 1–1000); stops at 512 KiB buffered. |
| mv | Move / Rename File | Move or rename a file via MoveFileExA; optional force flag to overwrite an existing destination. |
| pkill | Kill by Name | Terminate every process whose image name matches (case-insensitive); skips the beacon itself; uses PROCESS_TERMINATE for minimal privilege. |
| pwd | Print Working Dir | Return the beacon process's current working directory. |
| rm | Remove File | Delete a single file; refuses if the path is a directory and points the operator at rmdir. |
| rmdir | Remove Directory | Remove an empty directory; refuses paths under System32 / SysWOW64 / Windows. |
| stat | Stat | Report file size (64-bit), attributes, and creation/modified/accessed timestamps in UTC. |
| tail | Tail | Read the last N bytes of a file (capped at 1 MiB per read; files up to 2 GB). |
| tcp_send | TCP Send | Open a TCP socket to host:port, send a payload, return up to 4 KiB of reply. |
| touch | Touch | Create a file if absent, or update its last-write/last-access times if it exists. |
| uptime | Uptime | Report system uptime (via GetTickCount64, wrap-safe) plus current UTC time. |
| which | Which | Find an executable on the target's %PATH%; optional flag to enumerate every match. |

Local Model Setup (Jim)

Jim runs the fine-tuned Qwen2.5-Coder-7B LoRA checkpoint from a local bofit repo. Configured in data/configuration/workers.yaml:

```yaml
pam_local:
  type: "qwen"
  checkpoint: "C:/Users/drew/Desktop/bofit/output/models/sft_20260420_142812/final"
  base_model: "Qwen/Qwen2.5-Coder-7B-Instruct"
  temperature: 0.7
  max_new_tokens: 4096
  load_in_4bit: true
```

load_in_4bit uses bitsandbytes NF4 quantization + double-quant + bf16 compute, bringing the 7B model to ~5 GiB VRAM. On a 24 GiB card this leaves plenty of headroom for KV cache and activations. The worker also forces the memory-efficient SDPA kernel at generate time (Windows PyTorch has no FlashAttention wheels, so the default math kernel would otherwise materialize the full (batch, heads, seq, seq) attention matrix).
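The ~5 GiB figure checks out with back-of-envelope arithmetic over the weight bytes alone:

```python
# 7B parameters at 4 bits each (NF4 packs two weights per byte).
params = 7_000_000_000
weight_bytes = params * 4 / 8        # 0.5 bytes per parameter
weight_gib = weight_bytes / 2**30    # packed weights: roughly 3.26 GiB

# Quantization metadata (per-block absmax scales, double-quant constants),
# layers typically kept in higher precision (embeddings, lm_head), and
# CUDA context overhead account for the remainder of the observed ~5 GiB.
```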

Known Limitations

Generated BOFs are over-cautious by design. To keep the beacon stable, Pam's specs default to defensive ceilings on anything that could blow up memory, time, or recursion budget — and Jim faithfully bakes them into the C source. The cost is that the BOFs sometimes refuse work that would actually be fine. Current ceilings across the shipping pack:

| BOF | Hardcoded ceiling | Why it's there |
| --- | --- | --- |
| cat | rejects files > 10 MB | Prevents loading the whole file into beacon-process memory in one allocation. |
| head | reads up to 512 KiB; line count clamped 1–1000 | Caps the working buffer; huge log files don't warrant more in one shot. |
| tail | reads at most 1 MiB from the file tail; files > 2 GB not supported | SetFilePointer's 32-bit LONG form keeps the code simple at the cost of huge-file support. |
| tcp_send | receive buffer fixed at 4 KiB; timeout clamped 1–10000 ms | Single-shot request/reply pattern, not a streaming client. |
| pkill | PROCESS_TERMINATE only (not PROCESS_ALL_ACCESS) | Lower-privilege handle = less EDR telemetry on the open. |
| rmdir | refuses paths containing System32 / SysWOW64 / Windows | Defensive guard so an operator typo doesn't disturb the OS tree. |
| rm | refuses directories | Keeps rm's and rmdir's responsibilities distinct; directories must go through rmdir's guard. |

Open work-item: figure out a single, well-justified set of ceilings (or operator-supplied overrides) instead of each BOF inheriting whatever number Pam felt safe writing into the spec. Until then: if a BOF refuses your input with a "too large" error, the limit lives in output/<tool>/<tool>.c and you can rebuild with a higher cap via make.

Training Jim

The fine-tuning pipeline lives in a sibling repo (bofit); this repo provides the training corpus.

  1. Every successful tier appends one {system, user, assistant} record to training_data.jsonl, scrubbed of cosmetic contamination (}; artifacts, WHAT-comments, non-canonical includes, bare libc) before write.
  2. --refresh-training-data rebuilds the corpus from authoritative state (DB + manifest + output/) when you need to regenerate after cleanup.
  3. _critic_ok_for_training (in core/integrations/training_data.py) is the gate: only pass reviews with no critical/major issues land in training data. Empty-Angela-response cases are still accepted since the compile+lint gates already ran.
  4. scripts/compile_filter_dataset.py, scripts/merge_datasets.py, scripts/build_val_split.py, and scripts/scrub_training_data.py are the corpus-hygiene utilities.

Current checkpoint in production: sft_20260420_142812/final. Training pattern: continuous — every successful run adds new examples, with the expectation that a bofit retrain happens periodically.
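The gate in step 3 can be sketched directly from its description. The dict shape is assumed for illustration; the real _critic_ok_for_training in core/integrations/training_data.py may differ:

```python
def critic_ok_for_training(review):
    """Training-corpus gate as described above: only 'pass' reviews with
    no critical/major issues enter training data; an empty Angela
    response is accepted because compile + lint already gated the code.
    Review dict shape is an assumption for illustration."""
    if review is None:
        return True  # empty-response case: compile+lint gates already ran
    if review.get("status") != "pass":
        return False
    return not any(issue.get("severity") in ("critical", "major")
                   for issue in review.get("issues", []))
```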

Feature Roadmap

| Feature | Priority | Status | Description |
| --- | --- | --- | --- |
| Blight manifest.json generation | High | Done | Oscar writes Blight schema-v2 manifests with sha256 / imphash / stringhash + source-gated descriptions. python main.py --build-manifest. |
| Reproducible local build (Makefile) | High | Done | Root Makefile rebuilds every output/<tool>/<tool>.c -> .x64.o with the pipeline's exact flags, no AI required. |
| Training-data JSONL corpus | High | Done | Every gated-clean tier appends a messages-format SFT example to training_data.jsonl. --refresh-training-data rebuilds from authoritative state. |
| Review persistence | High | Done | Angela's full review (issues, fix-suggestions, raw response, tokens) saved as output/<tool>/review.json on every successful finalize. |
| Local-GPU worker for Jim | High | Done | Qwen2.5-Coder-7B LoRA in 4-bit NF4 on the operator's GPU — zero API spend per generation. |
| Cobalt Strike .cna file generation | High | Planned | Emit an Aggressor script alongside each BOF so CS operators can load them natively. |
| COFFLoader / Blight tester | High | Planned | Wire COFFLoader / Blight-WinRM in for automatic end-to-end BOF smoke-testing. |
| AI cost optimizations | Medium | Ongoing | Sonnet for structured output (Pam, Oscar) · Angela skipped on build fail · code-hash review cache · token usage logged to DB · Opus escalation on sticky tiers · duplicate-code guard halts tier on identical regenerations · 9-iteration retry budget. |
| Automatic retraining | Medium | Planned | Fire a bofit retrain from training_data.jsonl once a BOF-count threshold is reached. |
| Standardize BOF safety ceilings | Medium | Planned | Replace each BOF's arbitrary hardcoded caps (file-size, depth, output) with a principled shared set or runtime operator overrides. |
| Dwight Worker | Medium | TBD | One-line CRUD bug-fixer for the cases too small to justify a Jim iteration. |
