LLM-Ops-Observability

This system monitors whether LLM services are moving into dangerous operational states through cost, latency, error, and guardrail stress signals, not whether the LLM generates harmful content.

This repository contains an LLM operations observability prototype built for a hackathon setting. The focus of this project is operational risk detection—tracking whether an LLM service is drifting into unsafe operational states due to cost, latency, error rates, or guardrail pressure.

This system does not attempt to judge model output quality or content correctness. Instead, it treats the LLM as a black-box service and monitors it the same way SRE teams monitor critical production systems.

🎯 Design Goal

Most LLM monitoring today focuses on:

Prompt/output inspection

Content moderation results

This project explores a different question:

“Can we detect dangerous LLM behavior purely from operational signals?”

The system demonstrates:

Cost and latency anomaly detection

Safety-pattern rate monitoring

Signal correlation across metrics, logs, and traces

Incident creation using standard observability tooling

🧱 Technology Stack

Backend: Python (FastAPI)

Frontend: Next.js + Tailwind CSS

LLM Provider: Google Gemini / Vertex AI

Runtime: Google Cloud Run

Observability: Datadog (Metrics, Logs, Traces)

Telemetry: OpenTelemetry (OTLP)

Load & Failure Simulation: Python-based traffic generator

🏗️ System Architecture

Key architectural decisions:

Single LLM Entry Point
All LLM requests flow through a single endpoint:

POST /api/chat

This guarantees consistent instrumentation and simplifies monitoring.
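A minimal sketch of what this single entry point might look like; the `ChatRequest` model, the `mode` field, and the `call_gemini` stub are illustrative placeholders rather than the repository's actual code:

```python
import time
import uuid

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    prompt: str
    mode: str = "normal"  # failure-injection selector (assumed request field)

async def call_gemini(prompt: str) -> str:
    """Placeholder for the real Gemini / Vertex AI call."""
    return f"echo: {prompt}"

@app.post("/api/chat")
async def chat(req: ChatRequest):
    request_id = str(uuid.uuid4())        # propagated into logs and traces
    start = time.perf_counter()
    reply = await call_gemini(req.prompt)
    latency_ms = (time.perf_counter() - start) * 1000
    return {"request_id": request_id, "reply": reply, "latency_ms": latency_ms}
```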

Unified Telemetry Pipeline
Logs, metrics, and traces are generated from the backend using OpenTelemetry and exported to Datadog.
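A sketch of the tracer setup under the assumption that an OTLP endpoint (for example a local Datadog Agent or OpenTelemetry Collector) is listening on localhost:4317; the service name and endpoint are assumptions, not values taken from the repository:

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Identify the service so traces group correctly in Datadog.
resource = Resource.create({"service.name": "llm-ops-backend"})
provider = TracerProvider(resource=resource)

# Export spans via OTLP; the endpoint is an assumption about the deployment.
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("llm-ops")
```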

End-to-End Correlation
A request_id is propagated across:

application logs

distributed traces

This allows direct correlation during incident analysis.
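As a rough sketch of the correlation idea, the same request_id can be written both as a span attribute and as a structured log field; the helper name is hypothetical, and the log field only surfaces if the log formatter emits extra attributes:

```python
import logging

from opentelemetry import trace

logger = logging.getLogger("llm-ops")

def tag_request(request_id: str) -> None:
    # Attach the id to the active span so it is searchable on the trace.
    trace.get_current_span().set_attribute("request_id", request_id)
    # Emit it as a structured log field (assumes a JSON-style formatter
    # that includes extra record attributes).
    logger.info("handled /api/chat request", extra={"request_id": request_id})
```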

Low-Cardinality Metrics by Design
Metrics intentionally avoid high-cardinality tags (e.g., request_id) to keep Datadog cost and query performance under control.

🧪 Failure Injection (Signal Generation)

To validate monitors and dashboards, the system includes explicit failure injection modes that simulate realistic LLM operational issues.

| Mode | Behavior | What It Simulates |
| --- | --- | --- |
| latency_spike | 3–8s artificial delay | Model slowness / upstream latency |
| cost_spike | Token padding (≤ 2000 tokens) | Cost runaway / quota exhaustion |
| error_burst | ~40% HTTP 500 responses | Partial service outage |
| safety_injection | Pattern-based prompt manipulation | Guardrail stress / abuse patterns |
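A minimal sketch of how these modes could be applied inside the request path; the function name, padding amount, and exact probabilities are assumptions chosen to match the table above:

```python
import asyncio
import random

from fastapi import HTTPException

async def apply_failure_mode(mode: str, prompt: str) -> str:
    if mode == "latency_spike":
        await asyncio.sleep(random.uniform(3, 8))   # 3-8s artificial delay
    elif mode == "cost_spike":
        prompt += " pad" * 500                      # bounded token padding for the demo
    elif mode == "error_burst" and random.random() < 0.4:
        raise HTTPException(status_code=500, detail="injected failure")  # ~40% 500s
    return prompt
```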

Detected safety patterns:

prefix_override

role_shift

instruction_loop

delimiter_attack

These are treated as signals, not definitive attacks.
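A sketch of pattern-based detection; the regexes below are illustrative only and may differ from the repository's actual definitions:

```python
import re

SAFETY_PATTERNS = {
    "prefix_override": re.compile(r"ignore (all|previous) instructions", re.I),
    "role_shift": re.compile(r"you are now (?!an assistant)", re.I),
    "instruction_loop": re.compile(r"(repeat|again){3,}", re.I),
    "delimiter_attack": re.compile(r"(```|---|###){2,}"),
}

def detect_patterns(prompt: str) -> list[str]:
    """Return the names of matched patterns; treated as signals, not verdicts."""
    return [name for name, rx in SAFETY_PATTERNS.items() if rx.search(prompt)]
```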

📊 Observability Strategy (Datadog)

Metrics

llm.request.latency_ms (Histogram)

llm.request.tokens_total (Counter)

llm.request.cost_usd_est (Counter)

llm.request.error.count (Counter)

llm.safety.injection.count (Counter, tagged by pattern_type)
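A sketch of how these instruments might be declared with the OpenTelemetry metrics API, keeping attributes low-cardinality (mode, pattern_type) rather than per-request ids; it assumes a MeterProvider configured to export via OTLP as above:

```python
from opentelemetry import metrics

meter = metrics.get_meter("llm-ops")

latency_hist = meter.create_histogram("llm.request.latency_ms", unit="ms")
tokens_total = meter.create_counter("llm.request.tokens_total")
cost_total = meter.create_counter("llm.request.cost_usd_est")
error_count = meter.create_counter("llm.request.error.count")
injection_count = meter.create_counter("llm.safety.injection.count")

def record_request(latency_ms: float, tokens: int, cost_usd: float, mode: str) -> None:
    attrs = {"mode": mode}  # low-cardinality attributes only, never request_id
    latency_hist.record(latency_ms, attributes=attrs)
    tokens_total.add(tokens, attributes=attrs)
    cost_total.add(cost_usd, attributes=attrs)

def record_injection(pattern_type: str) -> None:
    injection_count.add(1, attributes={"pattern_type": pattern_type})
```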

Rate-Based Detection (Not Threshold Spam)

Instead of triggering alerts on single events, the system uses time-windowed rate thresholds.

Examples:

Safety Pattern Monitor Triggers only when injection patterns exceed a sustained rate over 5 minutes.

Cost + Safety Correlation Monitor Flags scenarios where token usage spikes coincide with safety-pattern activity, a common signal in instruction-loop attacks.

This approach reduces noise and better reflects real operational risk.
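For illustration, a rate-based monitor of this kind might be exported with a shape like the following; the query and threshold are assumptions, not the repository's actual exported values:

```python
# Illustrative Datadog metric-alert definition for the safety-pattern monitor.
safety_rate_monitor = {
    "name": "Sustained safety-pattern rate",
    "type": "metric alert",
    # Fires only on a sustained 5-minute rate, not on single events.
    "query": "sum(last_5m):sum:llm.safety.injection.count{*}.as_count() > 20",
    "message": "Injection-pattern rate exceeded the 5-minute threshold. @webhook-llm-ops",
    "options": {"thresholds": {"critical": 20}},
}
```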

📈 SLO Definitions

Availability SLO: 99.5% successful requests

Latency SLO: 95th percentile < 2000ms

SLOs are evaluated using Datadog’s native SLO framework.

🚦 Traffic Generator

A standalone traffic generator is included to exercise different operational states:

normal_operation – baseline healthy traffic

cost_attack – token explosion scenario

instruction_loop_attack – combined safety + cost stress

violation_threshold_test – sustained rate increase to trigger monitors

This allows repeatable validation of dashboards, monitors, and incident flows.
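A minimal sketch of a scenario-driven generator; the endpoint URL, request rates, and scenario-to-mode mapping are assumptions for illustration:

```python
import time

import requests

SCENARIOS = {
    "normal_operation": {"mode": "normal", "rps": 2},
    "cost_attack": {"mode": "cost_spike", "rps": 5},
    "instruction_loop_attack": {"mode": "safety_injection", "rps": 5},
    "violation_threshold_test": {"mode": "safety_injection", "rps": 10},
}

def run(scenario: str, duration_s: int = 300,
        url: str = "http://localhost:8080/api/chat") -> None:
    cfg = SCENARIOS[scenario]
    end = time.time() + duration_s
    while time.time() < end:
        requests.post(url, json={"prompt": "hello", "mode": cfg["mode"]}, timeout=30)
        time.sleep(1 / cfg["rps"])  # crude pacing to hold the target request rate
```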

🚀 Deployment Requirements

Google Cloud project with Vertex AI / Gemini enabled

Datadog API key and site

GOOGLE_API_KEY or GCP service account credentials

Local (Docker)

docker-compose up --build

Cloud Run

sh deploy.sh

🚨 Incident Handling

When monitors trigger:

Primary path

Automatic Datadog Case / Incident creation

Fallback path

Datadog Event emission

Webhook notification (e.g., Slack or Discord)

This ensures visibility even if incident automation permissions are limited.
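A sketch of the webhook fallback, assuming a Slack/Discord-style incoming webhook URL (the URL and payload shape are placeholders):

```python
import requests

def notify_fallback(webhook_url: str, monitor: str, details: str) -> None:
    # Post a simple text notification when Datadog incident creation is unavailable.
    payload = {"text": f"ALERT: {monitor} triggered\n{details}"}
    resp = requests.post(webhook_url, json=payload, timeout=10)
    resp.raise_for_status()
```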

📁 Repository Structure

backend/ – FastAPI service and telemetry logic

frontend/ – Next.js dashboard UI

traffic_generator/ – failure and load simulation scripts

datadog/ – exported dashboards, monitors, and SLO definitions

🧠 What This Project Is (and Is Not)

This is:

An ops-focused LLM monitoring prototype

A demonstration of signal-based risk detection

Aligned with SRE and observability best practices

This is not:

A content moderation system

A jailbreak classifier

A prompt quality evaluator
