AI agents that triage infrastructure alerts, investigate root causes, and propose fixes — while a solo operator sleeps.
For the complete technical reference, see README.extensive.md.
One person. 310+ infrastructure objects across 6 sites. 3 firewalls, 12 Kubernetes nodes, self-hosted everything. When an alert fires at 3am, there's no team to call. There never is.
Three agentic subsystems that handle the detective work — ChatOps (infrastructure), ChatSecOps (security), ChatDevOps (CI/CD) — built on n8n orchestration, Matrix as the human interface, and a 3-tier agent architecture. The human stays in the loop for every infrastructure change. The system never acts without a thumbs-up or poll vote.
The system evaluates its own performance and auto-patches its prompts. Every session is scored by an LLM-as-a-Judge on 5 quality dimensions (gemma3:12b local-first since 2026-04-19, Haiku for calibration). When a dimension averages below threshold over 30 days, the preference-iterating patcher (IFRNLLEI01PRD-645, 2026-04-20) generates 3 candidate instruction variants (concise / detailed / examples) plus a no-patch control, and assigns each future matching session to one arm via a deterministic BLAKE2b hash. A daily cron runs a one-sided Welch t-test once every arm reaches 15 samples; the winner is promoted only if it beats control by ≥ 0.05 points with p < 0.1 — otherwise the trial is aborted. This is prompt-level policy iteration; no model weights are ever fine-tuned.
Session → LLM Judge (5 dims) → dimension trending below threshold
→ prompt-patch-trial.py generates 3 candidate variants + 1 control
→ future sessions hash-routed to arms → Welch t-test at 15+ samples/arm
→ winner promoted to config/prompt-patches.json (source: "trial:N:idx=I")
→ next eval cycle scores the new patch → loop continues
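The deterministic arm assignment in the loop above can be sketched as follows — a minimal illustration of the hash-routing idea, not the production code (function and parameter names are hypothetical):

```python
import hashlib

def assign_arm(session_id: str, trial_id: int, n_variants: int = 3) -> int:
    """Deterministically route a session to one of n_variants + 1 arms.

    Arm 0 is the no-patch control; arms 1..n_variants are the candidate
    prompt variants. Keying BLAKE2b on (trial_id, session_id) gives a
    stable assignment: the same session always lands on the same arm,
    with no coordination state to store.
    """
    digest = hashlib.blake2b(
        f"{trial_id}:{session_id}".encode(), digest_size=8
    ).digest()
    return int.from_bytes(digest, "big") % (n_variants + 1)
```

Because the mapping is a pure function of the IDs, the daily Welch-t-test cron can recompute any session's arm after the fact without a lookup table.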
Before Claude Code investigates, a Haiku planner generates a 3-5 step investigation plan. The planner queries AWX for matching Ansible playbooks from 41 proven templates (maintenance, cert sync, K8s drain, PVE updates, DMZ deployments). Plans naturally include "Run AWX Template 64 with dry_run=true" as remediation steps — bridging AI reasoning with proven automation.
Instead of only reacting after alerts fire, the system queries LibreNMS API daily for trending risk across both sites. Devices are scored on disk usage trends, alert frequency, and health signals. A daily top-10 risk report posts to Matrix before problems become incidents.
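The per-device score can be illustrated as a weighted blend of the three signals — a sketch only; the weights and saturation points here are illustrative, not the production values:

```python
def risk_score(disk_trend_pct_per_day: float, alerts_30d: int, health: float) -> float:
    """Blend disk-growth trend, 30-day alert count, and a 0-1 health
    signal into a 0-100 risk score. All constants are illustrative."""
    disk = min(max(disk_trend_pct_per_day, 0.0) / 2.0, 1.0)  # 2 %/day saturates
    alerts = min(alerts_30d / 20.0, 1.0)                     # 20 alerts saturates
    unhealth = 1.0 - min(max(health, 0.0), 1.0)              # invert health
    return round(100 * (0.4 * disk + 0.35 * alerts + 0.25 * unhealth), 1)
```

Sorting devices by this score and taking the top 10 yields the daily Matrix report.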
Retrieval uses Reciprocal Rank Fusion across 5 signals (semantic + keyword + compiled wiki + MemPalace transcripts + chaos baselines), plus a GraphRAG knowledge graph (360 entities, 193 relationships). Retrieval short-circuits via two intent detectors: temporal window ("last 48h", "72 hours ending YYYY-MM-DD") filters wiki on source_mtime, and mtime-sort intent ("name any three memory files created in the last 48h") bypasses semantic retrieval entirely and returns an mtime-ranked window. Results older than 7 days get age-proportional staleness warnings. A Haiku synth step composes cross-chunk answers when top rerank < threshold (3-4× faster p95 than the Ollama ensemble). SYNTH_HAIKU_FORCE_FAIL env supports 5 failure modes (429 / auth / timeout / network / empty) that all fall back cleanly to local qwen2.5.
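The fusion step is standard Reciprocal Rank Fusion — each signal contributes `1/(k + rank)` per document, so items ranked well by several signals float to the top. A minimal sketch:

```python
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Reciprocal Rank Fusion: score(d) = sum over ranked lists of
    1/(k + rank_of_d). Documents absent from a list contribute nothing."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

With 5 signal lists as input (semantic, keyword, wiki, transcripts, chaos baselines), a document that appears in three lists generally beats one that tops a single list — which is the point of the fusion.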
Following Andrej Karpathy's LLM Knowledge Bases pattern: raw data from 7+ sources (117 memory files, 55 CLAUDE.md files, 33 incidents, 27 lessons, 101 OpenClaw memories, 17 skills, ~5,200 lab docs) is compiled into a browsable 44-article wiki with auto-maintained indexes, daily SHA-256 incremental recompilation, and contradiction detection. All articles embedded into RAG as the 3rd fusion signal.
88,448 tool calls instrumented across 108 tool types with per-tool error rates and latency percentiles. 39K OTel spans across 94 traces exported to OpenObserve (OTLP). 10 Grafana dashboards (64+ panels) covering ChatOps, ChatSecOps, ChatDevOps, and trace analysis. 18,220 infrastructure commands logged across 232 devices.
58 scenarios across 3 eval sets (22 regression + 20 discovery + 16 holdout) + 54 adversarial red-team tests. Prompt Scorecard grades 19 surfaces daily on 6 dimensions. Agent Trajectory scoring on 8 infra / 4 dev steps. A/B variant testing (react_v1 vs react_v2). CI eval gate blocks bad merges. Monthly eval flywheel cycle.
The 2026-04-20 audit of openai/openai-agents-python flagged 11 gaps; 9 were implemented (issues IFRNLLEI01PRD-635..643). The system now has a versioned, typed, recoverable substrate the old string-based Matrix pipeline couldn't offer:
- Schema versioning on 9 session/audit tables + a central registry (`scripts/lib/schema_version.py`) mirroring the SDK's `RunState.CURRENT_SCHEMA_VERSION`/`SCHEMA_VERSION_SUMMARIES` pattern. Writers stamp `schema_version=CURRENT`; readers `check_row()` fail-fast on future versions.
- 13 typed events (`session_events.py`) in a new `event_log` table — `tool_started/ended`, `handoff_requested/completed/cycle_detected/compaction`, `reasoning_item_created`, `mcp_approval_*`, `agent_updated`, `message_output_created`, `tool_guardrail_rejection`, `agent_as_tool_call`. Replaces free-form Matrix strings with Grafana-queryable structured telemetry.
- Per-turn lifecycle hooks — `session-start.sh`, `post-tool-use.sh`, `user-prompt-submit.sh`, `session-end.sh` (new — the `on_final_output` equivalent) feeding a `session_turns` table with per-turn cost, tokens, duration, tool count.
- 3-behavior tool-guardrail taxonomy (`allow`/`reject_content`/`deny`) in `unified-guard.sh` + `audit-bash.sh` + `protect-files.sh`. `reject_content` sends Claude a retry hint instead of a wall; `deny` hard-halts. Every rejection is a typed event.
- `HandoffInputData` envelope (`scripts/lib/handoff.py`) — zlib-compressed base64 payload carrying `input_history`, `pre_handoff_items`, `new_items`, `run_context`. 176 KB history → 752 B on the wire (0.43% ratio). Eliminates the "re-derive context via RAG" cost on escalation.
- Transcript compaction (`scripts/compact-handoff-history.py`) — opt-in per escalation. Local `gemma3:12b` with Haiku fallback; circuit-breaker aware.
- Agent-as-tool wrapper (`scripts/agent_as_tool.py`) — wraps the 10 sub-agent definitions as callable tools so the orchestrator LLM can conditionally invoke them in the ambiguous-risk (0.4–0.6) band, complementing our deterministic routing.
- Handoff depth counter + cycle detection (`scripts/lib/handoff_depth.py`) — `handoff_depth >= 5` forces `[POLL]`; `>= 10` hard-halts; any agent appearing twice in the chain is refused and logged as `handoff_cycle_detected`.
- Immutable per-turn snapshots (`scripts/lib/snapshot.py`) — a snapshot is captured BEFORE each mutating tool call (`Bash`, `Edit`, `Write`, `Task`; read-only tools skipped); `rollback_to(id)` restores any prior `sessions` row. 7-day retention.
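The envelope's compression path can be sketched roughly like this — zlib over compact JSON, then base64 for the wire. Function names are hypothetical; the real `scripts/lib/handoff.py` layout may differ:

```python
import base64
import json
import zlib

def encode_envelope(input_history, pre_handoff_items, new_items, run_context) -> str:
    """Pack handoff context into a zlib-compressed, base64-encoded string.
    Repetitive transcript history compresses extremely well."""
    payload = json.dumps({
        "input_history": input_history,
        "pre_handoff_items": pre_handoff_items,
        "new_items": new_items,
        "run_context": run_context,
    }, separators=(",", ":")).encode()
    return base64.b64encode(zlib.compress(payload, level=9)).decode()

def decode_envelope(wire: str) -> dict:
    """Inverse of encode_envelope: base64 → zlib → JSON dict."""
    return json.loads(zlib.decompress(base64.b64decode(wire)))
```

Transcript history is highly repetitive (same tool names, paths, and phrasing over and over), which is why ratios like the quoted 0.43% are plausible for large payloads.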
Four new SQLite tables (event_log, handoff_log, session_state_snapshot, session_turns) bring the total to 35. Migrations 006–011 apply idempotently on both fresh and legacy DBs. Two follow-ups since then — the A/B prompt patcher (IFRNLLEI01PRD-645, prompt_patch_trial + session_trial_assignment) and the CLI-session RAG capture pipeline (-646/-647/-648, no new tables; chunks + tool calls + knowledge rows tagged issue_id='cli-<uuid>' on the existing schema) — bring the live total to 39.
Before this, only YT-backed Runner sessions had their transcripts, tool calls, and extracted knowledge written into the shared RAG tables. Interactive `claude` CLI sessions (human-in-the-loop dev work) were captured only by poll-claude-usage.sh for cost/tokens — their content was lost to retrieval.
A 3-tier pipeline (IFRNLLEI01PRD-646/-647/-648) closes the gap. A single cron line chains three idempotent steps over every CLI JSONL:
1. `archive-session-transcript.py` chunks exchange pairs → `session_transcripts` + `nomic-embed-text` embeddings + a doc-chain refined summary at `chunk_index=-1` (sessions ≥ 5000 assistant chars).
2. `parse-tool-calls.py` extracts `tool_use`/`tool_result` pairs → `tool_call_log` (issue_id resolves to `cli-<uuid>` via patched path inference).
3. `extract-cli-knowledge.py` runs `gemma3:12b` in strict-JSON mode over the summary rows → `incident_knowledge` with `project='chatops-cli'`, embedded for retrieval.
Retrieval weights chatops-cli rows at CLI_INCIDENT_WEIGHT=0.75 by default so real infra incidents still win close ties. Byte-offset watermark skips unchanged files. Soak test (10 files): 12 chunks + 245 tool-call rows + 4 knowledge extractions — gemma correctly classified one sample as subsystem=sqlite-schema, tags=[schema, migration, versioning, data] at 0.95 confidence.
scripts/qa/run-qa-suite.sh runs 23 per-issue suite files + 6 e2e + 2 bench = 31 files (~45 s) with a JSON scorecard and summary output:
- Per-issue suites — sanity + QA + integration for every adoption, plus 16 tests for the preference-iterating patcher (-645) and 12 tests for the CLI-session RAG pipeline (-646/-647/-648).
- Writer coverage — every script that `INSERT`s into a versioned table is asserted to stamp `schema_version=1`; same for all 5 n8n-workflow INSERT sites.
- Pattern-by-pattern coverage — 53 deny-pattern tests + 32 reject-pattern tests.
- Payload shape — every one of the 13 event types round-trips through the CLI + Python paths.
- Concurrent-bump fuzz — 8 parallel `handoff_depth.bump()` calls with a no-lost-updates assertion. Surfaced and fixed a real race condition.
- Mock HTTP server (`scripts/qa/lib/mock_http.py`) — stdlib-only fake ollama/anthropic endpoints for testing successful compaction offline.
- 6 e2e scenarios — happy path (all 9 adoptions in one flow), cycle prevention, crash + rollback, schema forward-compat, envelope-to-subagent, compaction in handoff.
- Benchmarks — p95 latencies for event emit (111 ms), handoff bump (108 ms), envelope encode (76 ms), snapshot capture (86 ms), unified-guard hook (198 ms), migration on a 10K-row legacy DB (~200 ms).
Alert → n8n → OpenClaw (GPT-5.1, 7-21s) → Haiku Planner (+AWX) → Claude Code (Opus 4.6, 5-15min) → Human (Matrix)
| Component | Role |
|---|---|
| n8n | 25 workflows (424 nodes) — alert intake, session management, knowledge population |
| OpenClaw v2026.4.11 (GPT-5.1) | Tier 1 — fast triage with 17 skills + Active Memory, handles 80%+ without escalation |
| Claude Code (Opus 4.6) | Tier 2 — 10 sub-agents, ReAct reasoning, interactive [POLL] approval |
| AWX | 41 Ansible playbooks wired into AI planner |
| Matrix (Synapse) | Human-in-the-loop — polls, reactions, replies |
| Prometheus + Grafana | 10 dashboards, 64+ panels, 10 metric exporters |
| OpenObserve | OTel tracing — 39K spans, OTLP export |
| Ollama (RTX 3090 Ti) | Local embeddings — nomic-embed-text, query rewriting |
| Compiled Wiki | 44 articles from 7+ sources, daily recompilation |
The system investigates freely but never executes infrastructure changes without human approval:
- Claude Code hooks — 7 injection detection groups + 59 destructive/exfiltration patterns blocked deterministically. Now emits the 3-behavior taxonomy (`allow`/`reject_content`/`deny`) — recoverable patterns get a retry hint instead of a wall. Every rejection lands in `event_log` as a typed `tool_guardrail_rejection` event.
- safe-exec.sh — code-level blocklist that prompt injection cannot bypass
- exec-approvals.json — 36 specific skill patterns (no wildcards)
- Evaluator-Optimizer — Haiku screens high-stakes responses before posting
- Confidence gating — < 0.5 stops, < 0.7 escalates
- Budget ceilings — EUR 5/session warning, $25/day plan-only mode
- Credential scanning — 16 PII patterns redacted, 39 credentials tracked with rotation
Plus: the handoff depth counter forces [POLL] at depth ≥ 5 and hard-halts at ≥ 10, and any agent cycling back into its own chain is refused. A weekly audit-risk-decisions.sh invariant check rejects any reject_content event with an empty message (which would blind the agent).
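The 3-behavior taxonomy reduces to a small decision function; a sketch in Python, where the patterns below are illustrative stand-ins for the real 59-pattern deny and reject lists:

```python
import re

# Illustrative patterns only — not the production deny/reject lists.
DENY_PATTERNS = [r"\brm\s+-rf\s+/", r"\bmkfs\b"]
REJECT_PATTERNS = [r"\bkubectl\s+delete\b", r"\bsystemctl\s+stop\b"]

def guard(command: str) -> tuple[str, str]:
    """Classify a command: 'deny' hard-halts, 'reject_content' returns a
    retry hint the model can act on, 'allow' passes through unchanged."""
    for pat in DENY_PATTERNS:
        if re.search(pat, command):
            return "deny", f"blocked: destructive pattern {pat!r}"
    for pat in REJECT_PATTERNS:
        if re.search(pat, command):
            return ("reject_content",
                    "risky command — rerun with a dry-run flag or request [POLL] approval")
    return "allow", ""
```

The design point is the middle tier: `reject_content` keeps the agent in the loop with a corrective message, whereas a bare block would leave it guessing why the tool call vanished.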
| Metric | Value |
|---|---|
| Operational activation audit | A (91.8%) — 23 tables populated, 148K+ rows |
| Agentic design patterns | 21/21 at A+ (tri-source audit: 11/11 dimensions) |
| OpenAI Agents SDK adoption batch | 9/9 implemented (issues 635–643), 45 files changed, 6 migrations, 4 new tables |
| Preference-iterating prompt patcher | Live (issue 645) — N-candidate A/B trials, Welch t-test, auto-promote |
| CLI-session RAG capture | Live (issues 646/647/648) — transcripts + tool-calls + knowledge extraction |
| QA suite | 283/285 PASS (99.3%) across 31 suite files — ~45s run, JSON scorecard |
| Handoff envelope compression | 0.43% ratio (176 KB input_history → 752 B on the wire, zlib+b64) |
| AWX/Ansible runbooks | 41 playbooks wired into Plan-and-Execute |
| Tool call instrumentation | 88,448 calls across 108 types, per-tool error rates + latency p50/p95 |
| OTel tracing | 39K spans → OpenObserve + Prometheus metrics |
| Typed session events | 13 event classes, queryable event_log table + Prom exporter |
| GraphRAG knowledge graph | 360 entities, 193 relationships |
| Self-improving prompt patches | 5 active (auto-generated from eval scores) |
| Predictive risk scoring | 123 devices scanned daily, 23 at elevated risk |
| Holistic health check | 96%+ — 142 checks (functional + e2e + cross-site) |
| Session-holistic E2E | 100% (23/23) — covers 18 YT issues with before/after scoring |
| SQLite tables | 39 (31 base + 2 risk/circuit breakers [-631/-632] + 4 adoption [-635..643] + 2 prompt-trials [-645]) |
| Industry benchmark | 4.10/5.00 (82%) — 15 dimensions, 23 industry sources, E2E certified (39/39) |
| RAGAS golden set | 33 queries (15 hard-eval tagged) — multi-hop / temporal / negation / meta / cross-corpus |
| Weekly hard-eval (50-q) | judge-graded hit@5 = 0.90, p50 5.7s, p95 13.6s |
| RAGAS RAG quality | Faithfulness 0.88, Precision 0.86, Recall 0.88 (18 evaluations via Claude Haiku) |
| NIST behavioral telemetry | 5/5 AG-MS.1 signals active (action velocity, permission escalation, cross-boundary, delegation depth, exception rate) |
| Adversarial red-team | 54 tests (32 baseline + 22 adversarial), quarterly schedule, 12 bypass vectors hardened |
| Governance compliance | EU AI Act limited-risk assessment, QMS (Art. 17), NIST oversight boundary framework |
| Supply chain security | CycloneDX SBOM in CI, model provenance chain, agent decommissioning procedure |
| Document | What it covers |
|---|---|
| Operational Activation Audit | Scores data activation — 21/21 tables, 109K rows |
| Tri-Source Audit | 11/11 dimensions A+ (Gulli + Anthropic + industry) |
| External Source Mapping | atlas-agents + claude-code-from-source techniques applied |
| Agentic Patterns Audit | 21/21 pattern scorecard |
| Evaluation Process | 3-set eval, flywheel, CI gate |
| ACI Tool Audit | 10 MCP tools against 8-point checklist |
| Compiled Wiki | 44 auto-compiled articles |
| Industry Benchmark | 15-dimension scored assessment against 23 industry sources |
| EU AI Act Assessment | Risk classification + article mapping |
| Tool Risk Classification | 153 MCP tools classified (NIST AG-MP.1) |
| Agent Decommissioning | Per-tier lifecycle procedures |
| Installation Guide | Setup steps + cron configuration |
git clone https://github.com/papadopouloskyriakos/agentic-chatops.git
cd agentic-chatops
cp .env.example .env  # Add your credentials

See the Installation Guide for full setup.
- Agentic Design Patterns by Antonio Gulli (Springer, 2025) — 21 patterns, all implemented
- Claude Certified Architect – Foundations (Anthropic) — sub-agent design
- Industry References — Anthropic, OpenAI, LangChain, Microsoft
- atlas-agents + claude-code-from-source — external techniques applied
Sanitized mirror of a private GitLab repository. Provided as-is for educational and reference purposes.
Built by a solo infrastructure operator who got tired of waking up at 3am for alerts that an AI could triage.
