# LLM Ops Observability Demo

This system monitors whether LLM services are moving into dangerous operational states through cost, latency, error, and guardrail stress signals, not whether the LLM generates harmful content.

This repository contains an LLM operations observability prototype built for a hackathon setting. The focus of the project is operational risk detection: tracking whether an LLM service is drifting into unsafe operational states due to cost, latency, error rates, or guardrail pressure.
This system does not attempt to judge model output quality or content correctness. Instead, it treats the LLM as a black-box service and monitors it the same way SRE teams monitor critical production systems.
## 🎯 Design Goal
Most LLM monitoring today focuses on:

- Prompt/output inspection
- Content moderation results

This project explores a different question:

> “Can we detect dangerous LLM behavior purely from operational signals?”
The system demonstrates:

- Cost and latency anomaly detection
- Safety-pattern rate monitoring
- Signal correlation across metrics, logs, and traces
- Incident creation using standard observability tooling
## 🧱 Technology Stack

- **Backend:** Python (FastAPI)
- **Frontend:** Next.js + Tailwind CSS
- **LLM Provider:** Google Gemini / Vertex AI
- **Runtime:** Google Cloud Run
- **Observability:** Datadog (metrics, logs, traces)
- **Telemetry:** OpenTelemetry (OTLP)
- **Load & Failure Simulation:** Python-based traffic generator
## 🏗️ System Architecture

Key architectural decisions:
**Single LLM Entry Point.** All LLM requests flow through a single endpoint:

`POST /api/chat`

This guarantees consistent instrumentation and simplifies monitoring.
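A minimal sketch of what this single entry point could look like in FastAPI; the request model and the `call_llm` placeholder are illustrative, not the repository's actual code:

```python
# Minimal sketch of the single chat entry point (illustrative names, not the repo's code).
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    prompt: str

async def call_llm(prompt: str) -> str:
    # Placeholder for the Gemini / Vertex AI call; the real client is swapped in here.
    return f"echo: {prompt}"

@app.post("/api/chat")
async def chat(req: ChatRequest):
    # Every LLM request passes through this handler, so latency, token,
    # cost, and error telemetry can all be recorded around one call site.
    reply = await call_llm(req.prompt)
    return {"reply": reply}
```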
**Unified Telemetry Pipeline.** Logs, metrics, and traces are generated from the backend using OpenTelemetry and exported to Datadog.
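A sketch of the tracing half of that pipeline, assuming the standard OpenTelemetry Python SDK and a Datadog Agent (or OTel Collector) receiving OTLP on `localhost:4317`; the endpoint and service name are assumptions:

```python
# Sketch: OpenTelemetry tracing exported over OTLP (endpoint and service name are assumptions).
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

resource = Resource.create({"service.name": "llm-ops-backend"})
provider = TracerProvider(resource=resource)
# A Datadog Agent (or OTel Collector) is assumed to accept OTLP over gRPC on port 4317.
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("llm_ops_demo")
```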
**End-to-End Correlation.** A `request_id` is propagated across:

- application logs
- distributed traces

allowing direct correlation during incident analysis.
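One way this propagation might look, stamping the same `request_id` on the active span and the structured log record (a sketch, not the project's exact implementation):

```python
# Sketch: stamping the same request_id on the span and the log record.
import logging
import uuid

from opentelemetry import trace

logger = logging.getLogger("llm_ops_demo")
tracer = trace.get_tracer("llm_ops_demo")

def handle_chat(prompt: str) -> str:
    request_id = str(uuid.uuid4())
    with tracer.start_as_current_span("llm.chat") as span:
        # The same identifier lands on the trace and in the log line,
        # so Datadog can pivot between the two during incident analysis.
        span.set_attribute("request_id", request_id)
        logger.info("llm request started", extra={"request_id": request_id})
        # ... call the model, record metrics, etc.
        return request_id
```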
**Low-Cardinality Metrics by Design.** Metrics intentionally avoid high-cardinality tags (e.g., `request_id`) to keep Datadog cost and query performance under control.
## 🧪 Failure Injection (Signal Generation)
To validate monitors and dashboards, the system includes explicit failure injection modes that simulate realistic LLM operational issues.
| Mode | Behavior | What It Simulates |
|------|----------|-------------------|
| `latency_spike` | 3–8s artificial delay | Model slowness / upstream latency |
| `cost_spike` | Token padding (≤ 2000 tokens) | Cost runaway / quota exhaustion |
| `error_burst` | ~40% HTTP 500 responses | Partial service outage |
| `safety_injection` | Pattern-based prompt manipulation | Guardrail stress / abuse patterns |
Detected safety patterns:

- `prefix_override`
- `role_shift`
- `instruction_loop`
- `delimiter_attack`
These are treated as signals, not definitive attacks.
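A minimal sketch of how such a mode switch could be applied before the real LLM call; the function name and mode field mirror the table above but are otherwise illustrative:

```python
# Sketch: mode-driven failure injection applied before the real LLM call (names are illustrative).
import asyncio
import random

from fastapi import HTTPException

async def inject_failure(mode: str | None) -> str:
    """Return prompt padding and/or raise, depending on the selected failure mode."""
    padding = ""
    if mode == "latency_spike":
        # Model slowness / upstream latency: 3-8 s artificial delay.
        await asyncio.sleep(random.uniform(3, 8))
    elif mode == "cost_spike":
        # Cost runaway: pad the request to inflate token usage (on the order of 2000 tokens).
        padding = "lorem " * 2000
    elif mode == "error_burst":
        # Partial outage: roughly 40% of requests fail with HTTP 500.
        if random.random() < 0.4:
            raise HTTPException(status_code=500, detail="injected failure")
    return padding
```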
## 📊 Observability Strategy (Datadog)

### Metrics
- `llm.request.latency_ms` (Histogram)
- `llm.request.tokens_total` (Counter)
- `llm.request.cost_usd_est` (Counter)
- `llm.request.error.count` (Counter)
- `llm.safety.injection.count` (Counter, tagged by `pattern_type`)
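A sketch of how these instruments could be created and recorded with the OpenTelemetry metrics API, keeping attributes low-cardinality (only `pattern_type`, never `request_id`); the helper function is an assumption, and a MeterProvider already configured to export via OTLP is presumed:

```python
# Sketch: emitting the custom metrics with low-cardinality attributes
# (assumes a MeterProvider already configured to export via OTLP, as in the tracing setup).
from opentelemetry import metrics

meter = metrics.get_meter("llm_ops_demo")

latency_ms = meter.create_histogram("llm.request.latency_ms", unit="ms")
tokens_total = meter.create_counter("llm.request.tokens_total")
cost_usd_est = meter.create_counter("llm.request.cost_usd_est")
error_count = meter.create_counter("llm.request.error.count")
injection_count = meter.create_counter("llm.safety.injection.count")

def record_request(duration_ms: float, tokens: int, cost_usd: float,
                   pattern: str | None = None, failed: bool = False) -> None:
    # Attributes stay low-cardinality by design: no request_id, no raw prompt text.
    latency_ms.record(duration_ms)
    tokens_total.add(tokens)
    cost_usd_est.add(cost_usd)
    if failed:
        error_count.add(1)
    if pattern is not None:
        injection_count.add(1, {"pattern_type": pattern})
```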
### Rate-Based Detection (Not Threshold Spam)
Instead of triggering alerts on single events, the system uses time-windowed rate thresholds.
Examples:
- **Safety Pattern Monitor:** Triggers only when injection patterns exceed a sustained rate over 5 minutes.
- **Cost + Safety Correlation Monitor:** Flags scenarios where token usage spikes coincide with safety-pattern activity, a common signal in instruction-loop attacks.
This approach reduces noise and better reflects real operational risk.
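For illustration, the safety-pattern monitor could be expressed as a time-windowed Datadog metric alert and created through the monitors API; the threshold, notification handle, and environment variable names below are assumptions, not the exported monitor definitions in `datadog/`:

```python
# Sketch: creating the rate-based safety monitor via the Datadog monitors API
# (threshold, notification handle, and env var names are assumptions).
import os

import requests

monitor = {
    "name": "LLM safety pattern rate (sustained, 5m)",
    "type": "metric alert",
    # Fires only when injection patterns exceed a sustained count over a 5-minute window.
    "query": "sum(last_5m):sum:llm.safety.injection.count{*}.as_count() > 30",
    "message": "Sustained safety-pattern activity detected on the LLM service. @slack-llm-ops",
}

resp = requests.post(
    "https://api.datadoghq.com/api/v1/monitor",
    headers={
        "DD-API-KEY": os.environ["DD_API_KEY"],
        "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
    },
    json=monitor,
)
resp.raise_for_status()
```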
## 📈 SLO Definitions

- **Availability SLO:** 99.5% successful requests
- **Latency SLO:** 95th percentile < 2000ms
SLOs are evaluated using Datadog’s native SLO framework.
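As a sketch, the availability SLO could be expressed as a Datadog metric-based SLO payload (posted to `/api/v1/slo`); the good/total metric names here are assumptions rather than the project's actual definitions:

```python
# Sketch: metric-based availability SLO payload (metric names are assumptions).
availability_slo = {
    "name": "LLM chat availability",
    "type": "metric",
    "query": {
        # "Good" requests divided by total requests over the SLO window.
        "numerator": "sum:llm.request.count{status:ok}.as_count()",
        "denominator": "sum:llm.request.count{*}.as_count()",
    },
    "thresholds": [{"timeframe": "30d", "target": 99.5}],
}
```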
## 🚦 Traffic Generator
A standalone traffic generator is included to exercise different operational states:
- `normal_operation` – baseline healthy traffic
- `cost_attack` – token explosion scenario
- `instruction_loop_attack` – combined safety + cost stress
- `violation_threshold_test` – sustained rate increase to trigger monitors
This allows repeatable validation of dashboards, monitors, and incident flows.
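A minimal sketch of a scenario-driven generator loop; the request rates, the `mode` request field, and the backend URL are assumptions used for illustration:

```python
# Sketch: scenario-driven traffic generator (rates, mode field, and URL are assumptions).
import time

import requests

SCENARIOS = {
    "normal_operation": {"rps": 2, "mode": None},
    "cost_attack": {"rps": 5, "mode": "cost_spike"},
    "instruction_loop_attack": {"rps": 5, "mode": "safety_injection"},
    "violation_threshold_test": {"rps": 10, "mode": "safety_injection"},
}

def run(scenario: str, duration_s: int = 300, base_url: str = "http://localhost:8080") -> None:
    cfg = SCENARIOS[scenario]
    deadline = time.time() + duration_s
    while time.time() < deadline:
        # The backend is assumed to read the failure mode from the request body.
        requests.post(
            f"{base_url}/api/chat",
            json={"prompt": "hello", "mode": cfg["mode"]},
            timeout=30,
        )
        time.sleep(1 / cfg["rps"])

if __name__ == "__main__":
    run("normal_operation")
```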
## 🚀 Deployment Requirements

- Google Cloud project with Vertex AI / Gemini enabled
- Datadog API key and site
- `GOOGLE_API_KEY` or GCP service account credentials

**Local (Docker):** `docker-compose up --build`

**Cloud Run:** `sh deploy.sh`
## 🚨 Incident Handling

When monitors trigger:

**Primary path**

- Automatic Datadog Case / Incident creation

**Fallback path**

- Datadog Event emission
- Webhook notification (e.g., Slack or Discord)
This ensures visibility even if incident automation permissions are limited.
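A sketch of the fallback path, emitting a Datadog event via the v1 events API and optionally posting to a chat webhook; the event title, tags, and environment variable names are assumptions:

```python
# Sketch: fallback alerting via a Datadog event plus a chat webhook (values are assumptions).
import os

import requests

def emit_fallback_alert(summary: str) -> None:
    # Datadog event: visible in the event stream and on dashboards even without
    # permission to create Cases/Incidents programmatically.
    requests.post(
        "https://api.datadoghq.com/api/v1/events",
        headers={"DD-API-KEY": os.environ["DD_API_KEY"]},
        json={
            "title": "LLM ops monitor triggered",
            "text": summary,
            "tags": ["service:llm-ops-backend"],
        },
    )
    # Optional webhook notification (Slack/Discord-style incoming webhook).
    webhook_url = os.environ.get("ALERT_WEBHOOK_URL")
    if webhook_url:
        requests.post(webhook_url, json={"text": summary})
```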
## 📁 Repository Structure

- `backend/` – FastAPI service and telemetry logic
- `frontend/` – Next.js dashboard UI
- `traffic_generator/` – failure and load simulation scripts
- `datadog/` – exported dashboards, monitors, and SLO definitions
## 🧠 What This Project Is (and Is Not)

This is:

- An ops-focused LLM monitoring prototype
- A demonstration of signal-based risk detection
- Aligned with SRE and observability best practices

This is not:

- A content moderation system
- A jailbreak classifier
- A prompt quality evaluator