agentic-chatops

AI agents that triage infrastructure alerts, investigate root causes, and propose fixes — while a solo operator sleeps.

For the complete technical reference, see README.extensive.md.


The Problem

One person. 310+ infrastructure objects across 6 sites. 3 firewalls, 12 Kubernetes nodes, self-hosted everything. When an alert fires at 3am, there's no team to call. There never is.

The Solution

Three agentic subsystems that handle the detective work — ChatOps (infrastructure), ChatSecOps (security), ChatDevOps (CI/CD) — built on n8n orchestration, Matrix as the human interface, and a 3-tier agent architecture. The human stays in the loop for every infrastructure change. The system never acts without a thumbs-up or poll vote.


What Makes This Different

Self-Improving Prompts — now with A/B trials

The system evaluates its own performance and auto-patches its prompts. Every session is scored by an LLM-as-a-Judge on 5 quality dimensions (gemma3:12b local-first since 2026-04-19, Haiku for calibration). When a dimension averages below threshold over 30 days, the preference-iterating patcher (IFRNLLEI01PRD-645, 2026-04-20) generates 3 candidate instruction variants (concise / detailed / examples) and assigns each future matching session to one arm via deterministic BLAKE2b hash — plus a no-patch control. A daily cron runs a one-sided Welch t-test once every arm reaches 15 samples; the winner is promoted only if it beats control by ≥ 0.05 points with p < 0.1. Otherwise the trial is aborted. Prompt-level policy iteration — no model weights are ever fine-tuned.

Session → LLM Judge (5 dims) → dimension trending below threshold
  → prompt-patch-trial.py generates 3 candidate variants + 1 control
  → future sessions hash-routed to arms → Welch t-test at 15+ samples/arm
  → winner promoted to config/prompt-patches.json (source: "trial:N:idx=I")
  → next eval cycle scores the new patch → loop continues
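The routing and promotion logic above can be sketched in a few lines. This is a minimal stdlib illustration, not the repo's actual prompt-patch-trial.py internals: function names are invented, and the p-value uses a normal approximation to the t distribution (reasonable at 15+ samples per arm, where a production pipeline would likely use scipy).

```python
import hashlib
import math
import statistics

ARMS = ["concise", "detailed", "examples", "control"]  # 3 candidates + no-patch control

def assign_arm(trial_id: int, session_id: str) -> str:
    """Deterministic BLAKE2b routing: the same session always lands in the same arm."""
    h = hashlib.blake2b(f"{trial_id}:{session_id}".encode(), digest_size=8)
    return ARMS[int.from_bytes(h.digest(), "big") % len(ARMS)]

def welch_one_sided_p(candidate: list[float], control: list[float]) -> float:
    """One-sided Welch t-test p-value (H1: candidate scores > control scores)."""
    se = math.sqrt(statistics.variance(candidate) / len(candidate)
                   + statistics.variance(control) / len(control))
    t = (statistics.mean(candidate) - statistics.mean(control)) / se
    return 0.5 * math.erfc(t / math.sqrt(2))  # upper-tail probability, normal approx

def should_promote(candidate: list[float], control: list[float],
                   min_lift: float = 0.05, alpha: float = 0.1) -> bool:
    """Promote only if the candidate beats control by >= min_lift with p < alpha."""
    lift = statistics.mean(candidate) - statistics.mean(control)
    return lift >= min_lift and welch_one_sided_p(candidate, control) < alpha
```

Deterministic hashing means no assignment table is strictly needed: re-hashing any session ID always reproduces its arm.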

AI Planner Wired to Proven Ansible Playbooks

Before Claude Code investigates, a Haiku planner generates a 3-5 step investigation plan. The planner queries AWX for matching Ansible playbooks from 41 proven templates (maintenance, cert sync, K8s drain, PVE updates, DMZ deployments). Plans naturally include "Run AWX Template 64 with dry_run=true" as remediation steps — bridging AI reasoning with proven automation.

Predictive Alerting

Instead of only reacting after alerts fire, the system queries LibreNMS API daily for trending risk across both sites. Devices are scored on disk usage trends, alert frequency, and health signals. A daily top-10 risk report posts to Matrix before problems become incidents.
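A toy version of that scoring might look like the following. The weights, saturation points, and signal names here are hypothetical, chosen only to illustrate the shape of the computation, not the production LibreNMS formula:

```python
def risk_score(disk_growth_pct_per_day: float, alerts_last_30d: int,
               health_ok_ratio: float) -> float:
    """Toy device risk score in [0, 1]. All weights below are illustrative."""
    disk = min(disk_growth_pct_per_day / 2.0, 1.0)   # 2%/day growth saturates the signal
    alerts = min(alerts_last_30d / 10.0, 1.0)        # 10 alerts/month saturates
    health = 1.0 - health_ok_ratio                   # fraction of failing health checks
    return round(0.5 * disk + 0.3 * alerts + 0.2 * health, 3)

def top_n(devices: dict[str, tuple[float, int, float]], n: int = 10) -> list[str]:
    """Rank devices by risk for a daily top-N report."""
    return sorted(devices, key=lambda d: risk_score(*devices[d]), reverse=True)[:n]
```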

5-Signal RAG + GraphRAG + Staleness + Temporal Filter + mtime-Sort

Retrieval uses Reciprocal Rank Fusion across 5 signals (semantic + keyword + compiled wiki + MemPalace transcripts + chaos baselines), plus a GraphRAG knowledge graph (360 entities, 193 relationships). Retrieval short-circuits via two intent detectors: a temporal window ("last 48h", "72 hours ending YYYY-MM-DD") filters the wiki on source_mtime, and an mtime-sort intent ("name any three memory files created in the last 48h") bypasses semantic retrieval entirely and returns an mtime-ranked window. Results older than 7 days get age-proportional staleness warnings. A Haiku synth step composes cross-chunk answers when the top rerank score falls below threshold (3-4× faster p95 than the Ollama ensemble). The SYNTH_HAIKU_FORCE_FAIL env var supports 5 failure modes (429 / auth / timeout / network / empty), all of which fall back cleanly to local qwen2.5.
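The fusion step is standard Reciprocal Rank Fusion. A minimal sketch (signal names and uniform weighting are illustrative; k=60 is the conventional constant from the original RRF paper):

```python
def rrf_fuse(rankings: dict[str, list[str]], k: int = 60) -> list[str]:
    """Fuse per-signal ranked lists: each signal contributes 1/(k + rank) per doc.
    Documents that rank well across several signals accumulate the highest score."""
    scores: dict[str, float] = {}
    for signal, ranked in rankings.items():
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document that appears in three signals at modest ranks beats one that tops a single signal, which is the property that makes RRF robust to any one noisy retriever.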

Karpathy-Style Compiled Knowledge Base

Following Andrej Karpathy's LLM Knowledge Bases pattern: raw data from 7+ sources (117 memory files, 55 CLAUDE.md files, 33 incidents, 27 lessons, 101 OpenClaw memories, 17 skills, ~5,200 lab docs) is compiled into a browsable 44-article wiki with auto-maintained indexes, daily SHA-256 incremental recompilation, and contradiction detection. All articles embedded into RAG as the 3rd fusion signal.
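The SHA-256 incremental step amounts to hashing each source and recompiling only what changed. A minimal sketch under stated assumptions — the cache-file name and layout are invented, not the pipeline's real state format:

```python
import hashlib
import json
import pathlib

def changed_sources(paths: list[str], cache_file: str) -> list[str]:
    """Return only sources whose content hash differs from the last compile run."""
    cache_path = pathlib.Path(cache_file)
    old = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    new, dirty = {}, []
    for p in paths:
        digest = hashlib.sha256(pathlib.Path(p).read_bytes()).hexdigest()
        new[p] = digest
        if old.get(p) != digest:
            dirty.append(p)          # only these feed the recompiler
    cache_path.write_text(json.dumps(new))
    return dirty
```

Content hashing (rather than mtime alone) means a touch without an edit costs nothing, and a restored backup with an old mtime still triggers recompilation.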

Full Observability Stack with OTel

88,448 tool calls instrumented across 108 tool types with per-tool error rates and latency percentiles. 39K OTel spans across 94 traces exported to OpenObserve (OTLP). 10 Grafana dashboards (64+ panels) covering ChatOps, ChatSecOps, ChatDevOps, and trace analysis. 18,220 infrastructure commands logged across 232 devices.

Formal Evaluation Pipeline

58 scenarios across 3 eval sets (22 regression + 20 discovery + 16 holdout) + 54 adversarial red-team tests. Prompt Scorecard grades 19 surfaces daily on 6 dimensions. Agent Trajectory scoring on 8 infra / 4 dev steps. A/B variant testing (react_v1 vs react_v2). CI eval gate blocks bad merges. Monthly eval flywheel cycle.

Structured Agentic Substrate — 9 adoptions from the OpenAI Agents SDK

The 2026-04-20 audit of openai/openai-agents-python flagged 11 gaps; 9 were implemented (issues IFRNLLEI01PRD-635..643). The system now has a versioned, typed, recoverable substrate the old string-based Matrix pipeline couldn't offer:

  • Schema versioning on 9 session/audit tables + a central registry (scripts/lib/schema_version.py) mirroring the SDK's RunState.CURRENT_SCHEMA_VERSION / SCHEMA_VERSION_SUMMARIES pattern. Writers stamp schema_version=CURRENT; readers check_row() fail-fast on future versions.
  • 13 typed events (session_events.py) in a new event_log table — tool_started/ended, handoff_requested/completed/cycle_detected/compaction, reasoning_item_created, mcp_approval_*, agent_updated, message_output_created, tool_guardrail_rejection, agent_as_tool_call. Replaces free-form Matrix strings with Grafana-queryable structured telemetry.
  • Per-turn lifecycle hooks — session-start.sh, post-tool-use.sh, user-prompt-submit.sh, session-end.sh (new — the on_final_output equivalent) feeding a session_turns table with per-turn cost, tokens, duration, tool count.
  • 3-behavior tool-guardrail taxonomy (allow / reject_content / deny) in unified-guard.sh + audit-bash.sh + protect-files.sh. reject_content sends Claude a retry hint instead of a wall; deny hard-halts. Every rejection is a typed event.
  • HandoffInputData envelope (scripts/lib/handoff.py) — zlib-compressed base64 payload carrying input_history, pre_handoff_items, new_items, run_context. 176 KB history → 752 B on the wire (0.43% ratio). Eliminates the "re-derive context via RAG" cost on escalation.
  • Transcript compaction (scripts/compact-handoff-history.py) — opt-in per escalation. Local gemma3:12b with Haiku fallback; circuit-breaker aware.
  • Agent-as-tool wrapper (scripts/agent_as_tool.py) — wraps the 10 sub-agent definitions as callable tools so the orchestrator LLM can conditionally invoke them in the ambiguous-risk (0.4–0.6) band, complementing our deterministic routing.
  • Handoff depth counter + cycle detection (scripts/lib/handoff_depth.py) — handoff_depth >= 5 forces [POLL]; >= 10 hard-halts; any agent twice in the chain is refused and logged as handoff_cycle_detected.
  • Immutable per-turn snapshots (scripts/lib/snapshot.py) — a snapshot is captured BEFORE each mutating tool call (Bash, Edit, Write, Task; read-only tools skipped); rollback_to(id) restores any prior sessions row. 7-day retention.
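The HandoffInputData envelope above is, at its core, JSON → zlib → base64. A minimal sketch — field names mirror the bullet, but the exact wire format of scripts/lib/handoff.py is assumed, not copied:

```python
import base64
import json
import zlib

def encode_envelope(input_history: list[dict], pre_handoff_items: list[dict],
                    new_items: list[dict], run_context: dict) -> str:
    """Pack handoff context into a zlib-compressed, base64-encoded string."""
    payload = {"input_history": input_history,
               "pre_handoff_items": pre_handoff_items,
               "new_items": new_items,
               "run_context": run_context}
    raw = json.dumps(payload, separators=(",", ":")).encode()
    return base64.b64encode(zlib.compress(raw, level=9)).decode()

def decode_envelope(blob: str) -> dict:
    """Inverse: base64 decode, decompress, parse."""
    return json.loads(zlib.decompress(base64.b64decode(blob)))
```

Chat transcripts are highly repetitive (repeated role keys, tool names, boilerplate), which is exactly what DEFLATE exploits — that is how a 176 KB history can shrink to well under 1 KB on the wire.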

Four new SQLite tables (event_log, handoff_log, session_state_snapshot, session_turns) bring the total to 35. Migrations 006–011 apply idempotently on both fresh and legacy DBs. Two follow-ups since then — the A/B prompt patcher (IFRNLLEI01PRD-645, prompt_patch_trial + session_trial_assignment) and the CLI-session RAG capture pipeline (-646/-647/-648, no new tables; chunks + tool calls + knowledge rows tagged issue_id='cli-<uuid>' on the existing schema) — bring the live total to 39.

CLI-Session RAG Capture — interactive claude sessions flow into RAG too (2026-04-20)

Before this, only YT-backed Runner sessions had their transcripts/tool-calls/extracted knowledge written into the shared RAG tables. Interactive claude CLI sessions (human-in-the-loop dev work) were only captured by poll-claude-usage.sh for cost/tokens — their content was lost to retrieval.

A 3-tier pipeline (IFRNLLEI01PRD-646/-647/-648) closes the gap. A single cron line chains three idempotent steps over every CLI JSONL:

  1. archive-session-transcript.py chunks exchange pairs → session_transcripts + nomic-embed-text embeddings + doc-chain refined summary at chunk_index=-1 (sessions ≥ 5000 assistant chars).
  2. parse-tool-calls.py extracts tool_use / tool_result pairs → tool_call_log (issue_id resolves to cli-<uuid> via patched path inference).
  3. extract-cli-knowledge.py runs gemma3:12b in strict-JSON mode over the summary rows → incident_knowledge with project='chatops-cli', embedded for retrieval.

Retrieval weights chatops-cli rows at CLI_INCIDENT_WEIGHT=0.75 by default so real infra incidents still win close ties. Byte-offset watermark skips unchanged files. Soak test (10 files): 12 chunks + 245 tool-call rows + 4 knowledge extractions — gemma correctly classified one sample as subsystem=sqlite-schema, tags=[schema, migration, versioning, data] at 0.95 confidence.
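The byte-offset watermark can be sketched as follows. The state-file layout ({path: byte offset}) is an assumption about the pipeline, not the real implementation:

```python
import json
import os

def new_bytes(jsonl_path: str, state_file: str) -> bytes:
    """Read only the bytes appended since the last run; unchanged files cost one stat."""
    try:
        with open(state_file) as f:
            marks = json.load(f)
    except FileNotFoundError:
        marks = {}
    offset = marks.get(jsonl_path, 0)
    size = os.path.getsize(jsonl_path)
    if size <= offset:               # unchanged (or truncated): nothing new to process
        return b""
    with open(jsonl_path, "rb") as f:
        f.seek(offset)
        data = f.read()
    marks[jsonl_path] = size         # advance the watermark only after a full read
    with open(state_file, "w") as f:
        json.dump(marks, f)
    return data
```

Because JSONL is append-only, seeking to the old offset yields exactly the new records, which keeps the nightly cron idempotent and cheap.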

QA Suite — 283/285 PASS (99.3%)

scripts/qa/run-qa-suite.sh runs 23 per-issue suite files + 6 e2e + 2 bench = 31 files (~45 s) with JSON scorecard + summary output:

  • Per-issue suites — sanity + QA + integration for every adoption, plus 16 tests for the preference-iterating patcher (-645) and 12 tests for the CLI-session RAG pipeline (-646/-647/-648).
  • Writer coverage — every script that INSERTs into a versioned table is asserted to stamp schema_version=1; same for all 5 n8n-workflow INSERT sites.
  • Pattern-by-pattern coverage — 53 deny-pattern tests + 32 reject-pattern tests.
  • Payload shape — every one of the 13 event types round-trips through the CLI + Python paths.
  • Concurrent-bump fuzz — 8 parallel handoff_depth.bump() calls with no-lost-updates assertion. Surfaced and fixed a real race condition.
  • Mock HTTP server (scripts/qa/lib/mock_http.py) — stdlib-only fake ollama/anthropic endpoints for testing successful compaction offline.
  • 6 e2e scenarios — happy path (all 9 adoptions in one flow), cycle prevention, crash + rollback, schema forward-compat, envelope-to-subagent, compaction in handoff.
  • Benchmarks — p95 latencies for event emit (111 ms), handoff bump (108 ms), envelope encode (76 ms), snapshot capture (86 ms), unified-guard hook (198 ms), migration on a 10K-row legacy DB (~200 ms).
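The race the concurrent-bump fuzz surfaced is the classic lost update: SELECT depth, then UPDATE with depth+1 computed in Python lets two parallel callers read the same value. The fix is to push the increment into a single SQL statement. A sketch, with table and column names as illustrative stand-ins for the real handoff_log schema:

```python
import sqlite3

def bump(db_path: str, session_id: str) -> None:
    """Atomic depth increment: the read-modify-write happens inside one UPDATE,
    so parallel callers cannot lose updates."""
    con = sqlite3.connect(db_path, timeout=5.0)  # timeout rides out writer locks
    try:
        con.execute(
            "UPDATE handoff_log SET depth = depth + 1 WHERE session_id = ?",
            (session_id,))
        con.commit()
    finally:
        con.close()
```

Eight parallel bump() calls now always land at depth 8, which is exactly the no-lost-updates assertion the QA fuzz test makes.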

Architecture

Alert → n8n → OpenClaw (GPT-5.1, 7-21s) → Haiku Planner (+AWX) → Claude Code (Opus 4.6, 5-15min) → Human (Matrix)
| Component | Role |
| --- | --- |
| n8n | 25 workflows (424 nodes) — alert intake, session management, knowledge population |
| OpenClaw v2026.4.11 (GPT-5.1) | Tier 1 — fast triage with 17 skills + Active Memory, handles 80%+ without escalation |
| Claude Code (Opus 4.6) | Tier 2 — 10 sub-agents, ReAct reasoning, interactive [POLL] approval |
| AWX | 41 Ansible playbooks wired into AI planner |
| Matrix (Synapse) | Human-in-the-loop — polls, reactions, replies |
| Prometheus + Grafana | 10 dashboards, 64+ panels, 10 metric exporters |
| OpenObserve | OTel tracing — 39K spans, OTLP export |
| Ollama (RTX 3090 Ti) | Local embeddings — nomic-embed-text, query rewriting |
| Compiled Wiki | 44 articles from 7+ sources, daily recompilation |

Safety — 7 Layers

The system investigates freely but never executes infrastructure changes without human approval:

  1. Claude Code hooks — 7 injection detection groups + 59 destructive/exfiltration patterns blocked deterministically. Now emits the 3-behavior taxonomy (allow / reject_content / deny) — recoverable patterns get a retry hint instead of a wall. Every rejection lands in event_log as a typed tool_guardrail_rejection event.
  2. safe-exec.sh — code-level blocklist that prompt injection cannot bypass
  3. exec-approvals.json — 36 specific skill patterns (no wildcards)
  4. Evaluator-Optimizer — Haiku screens high-stakes responses before posting
  5. Confidence gating — < 0.5 stops, < 0.7 escalates
  6. Budget ceilings — EUR 5/session warning, $25/day plan-only mode
  7. Credential scanning — 16 PII patterns redacted, 39 credentials tracked with rotation

Plus: handoff depth counter forces [POLL] at depth ≥ 5 / hard-halts at ≥ 10, and any agent cycling back into its own chain is refused. An audit-risk-decisions.sh weekly invariant check rejects any reject_content event with an empty message (would blind the agent).
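The 3-behavior taxonomy from layer 1 reduces to a small decision function. This sketch is illustrative only — the two patterns shown stand in for the real unified-guard.sh rule sets (59 deny + separate reject patterns), and the hint strings are invented:

```python
import re
from enum import Enum

class Behavior(Enum):
    ALLOW = "allow"
    REJECT_CONTENT = "reject_content"   # recoverable: agent gets a retry hint
    DENY = "deny"                       # hard halt, no retry

# Tiny illustrative subsets, not the production rules.
DENY_PATTERNS = [r"\brm\s+-rf\s+/", r"\bmkfs\."]
REJECT_PATTERNS = [r"\bkubectl\s+delete\b(?!.*--dry-run)"]

def guard(command: str) -> tuple[Behavior, str]:
    """Classify a shell command into the 3-behavior taxonomy."""
    for pat in DENY_PATTERNS:
        if re.search(pat, command):
            return Behavior.DENY, "blocked: destructive pattern"
    for pat in REJECT_PATTERNS:
        if re.search(pat, command):
            # Non-empty hint is mandatory — an empty message would blind the agent,
            # which is exactly the invariant audit-risk-decisions.sh checks.
            return Behavior.REJECT_CONTENT, "retry with --dry-run=client first"
    return Behavior.ALLOW, ""
```

The reject tier is what makes the guardrail recoverable: the agent learns why it was stopped and can self-correct, instead of hitting a silent wall.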

Key Numbers

| Metric | Value |
| --- | --- |
| Operational activation audit | A (91.8%) — 23 tables populated, 148K+ rows |
| Agentic design patterns | 21/21 at A+ (tri-source audit: 11/11 dimensions) |
| OpenAI Agents SDK adoption batch | 9/9 implemented (issues 635–643), 45 files changed, 6 migrations, 4 new tables |
| Preference-iterating prompt patcher | Live (issue 645) — N-candidate A/B trials, Welch t-test, auto-promote |
| CLI-session RAG capture | Live (issues 646/647/648) — transcripts + tool-calls + knowledge extraction |
| QA suite | 283/285 PASS (99.3%) across 31 suite files — ~45 s run, JSON scorecard |
| Handoff envelope compression | 0.43% ratio (176 KB input_history → 752 B on the wire, zlib+b64) |
| AWX/Ansible runbooks | 41 playbooks wired into Plan-and-Execute |
| Tool call instrumentation | 88,448 calls across 108 types, per-tool error rates + latency p50/p95 |
| OTel tracing | 39K spans → OpenObserve + Prometheus metrics |
| Typed session events | 13 event classes, queryable event_log table + Prom exporter |
| GraphRAG knowledge graph | 360 entities, 193 relationships |
| Self-improving prompt patches | 5 active (auto-generated from eval scores) |
| Predictive risk scoring | 123 devices scanned daily, 23 at elevated risk |
| Holistic health check | 96%+ — 142 checks (functional + e2e + cross-site) |
| Session-holistic E2E | 100% (23/23) — covers 18 YT issues with before/after scoring |
| SQLite tables | 39 (31 base + 2 risk/circuit breakers [-631/-632] + 4 adoption [-635..643] + 2 prompt-trials [-645]) |
| Industry benchmark | 4.10/5.00 (82%) — 15 dimensions, 23 industry sources, E2E certified (39/39) |
| RAGAS golden set | 33 queries (15 hard-eval tagged) — multi-hop / temporal / negation / meta / cross-corpus |
| Weekly hard-eval (50-q) | judge-graded hit@5 = 0.90, p50 5.7 s, p95 13.6 s |
| RAGAS RAG quality | Faithfulness 0.88, Precision 0.86, Recall 0.88 (18 evaluations via Claude Haiku) |
| NIST behavioral telemetry | 5/5 AG-MS.1 signals active (action velocity, permission escalation, cross-boundary, delegation depth, exception rate) |
| Adversarial red-team | 54 tests (32 baseline + 22 adversarial), quarterly schedule, 12 bypass vectors hardened |
| Governance compliance | EU AI Act limited-risk assessment, QMS (Art. 17), NIST oversight boundary framework |
| Supply chain security | CycloneDX SBOM in CI, model provenance chain, agent decommissioning procedure |

Documentation

| Document | What it covers |
| --- | --- |
| Operational Activation Audit | Scores data activation — 21/21 tables, 109K rows |
| Tri-Source Audit | 11/11 dimensions A+ (Gulli + Anthropic + industry) |
| External Source Mapping | atlas-agents + claude-code-from-source techniques applied |
| Agentic Patterns Audit | 21/21 pattern scorecard |
| Evaluation Process | 3-set eval, flywheel, CI gate |
| ACI Tool Audit | 10 MCP tools against 8-point checklist |
| Compiled Wiki | 44 auto-compiled articles |
| Industry Benchmark | 15-dimension scored assessment against 23 industry sources |
| EU AI Act Assessment | Risk classification + article mapping |
| Tool Risk Classification | 153 MCP tools classified (NIST AG-MP.1) |
| Agent Decommissioning | Per-tier lifecycle procedures |
| Installation Guide | Setup steps + cron configuration |

Quick Start

git clone https://github.com/papadopouloskyriakos/agentic-chatops.git
cd agentic-chatops
cp .env.example .env   # Add your credentials

See the Installation Guide for full setup.

References

  1. Agentic Design Patterns by Antonio Gulli (Springer, 2025) — 21 patterns, all implemented
  2. Claude Certified Architect – Foundations (Anthropic) — sub-agent design
  3. Industry References — Anthropic, OpenAI, LangChain, Microsoft
  4. atlas-agents + claude-code-from-source — external techniques applied

License

Sanitized mirror of a private GitLab repository. Provided as-is for educational and reference purposes.


Built by a solo infrastructure operator who got tired of waking up at 3am for alerts that an AI could triage.
