Conversation
* feat(ui/): new guardrails monitor 'demo mock representation of what guardrails monitor looks like * fix: ui updates * style(ui/): fix styling * feat: enable running ai monitor on individual guardrails * feat: add backend logic for guardrail monitoring
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
Greptile SummaryThis PR adds a Guardrails Monitor feature — a new dashboard page for viewing guardrail and policy performance metrics (pass/fail rates, trends, logs). It introduces:
Key concerns:
Confidence Score: 3/5
|
| Filename | Overview |
|---|---|
| litellm/proxy/guardrails/usage_endpoints.py | New guardrail/policy usage API endpoints. Unbounded previous-period query in detail endpoint will grow over time; unbounded guardrails table fetch in overview. |
| litellm/proxy/guardrails/usage_tracking.py | New usage tracking logic that processes spend logs for guardrail metrics. Well-structured with proper error handling and skip_duplicates. Sequential upserts could be batched for performance at scale. |
| litellm/proxy/utils.py | Modified spend log processing to pop batch once and pass to both spend log writer and guardrail usage tracker. TOCTOU race condition between two separate lock acquisitions in update_spend_logs_job. Import formatting changes throughout. |
| litellm/integrations/custom_guardrail.py | Changed fallback behavior from logging a warning to creating empty metadata dict and attaching guardrail info. Minor risk of overwriting metadata if key is added concurrently. |
| litellm/proxy/schema.prisma | Adds three new models (DailyGuardrailMetrics, DailyPolicyMetrics, SpendLogGuardrailIndex) with appropriate composite keys and indexes. Also removes spec_path from MCPServerTable. |
| ui/litellm-dashboard/src/components/GuardrailsMonitor/GuardrailsOverview.tsx | Overview dashboard component using react-query. Uses deprecated Tremor components (Card, Col, Grid, Title) per AGENTS.md guidance. Functional logic is sound. |
| ui/litellm-dashboard/src/components/GuardrailsMonitor/GuardrailDetail.tsx | Detail view for individual guardrails with metrics, root cause analysis (hardcoded/demo), and logs. Uses deprecated Tremor components. Functional logic is clean. |
| ui/litellm-dashboard/src/components/GuardrailsMonitor/ScoreChart.tsx | Chart component using deprecated Tremor BarChart. Should be replaced with a non-Tremor chart library per project guidelines. |
Sequence Diagram
sequenceDiagram
participant UI as Dashboard UI
participant API as Usage Endpoints
participant DB as PostgreSQL
participant SLQ as Spend Log Queue
participant UT as Usage Tracking
Note over SLQ,UT: Background Job Flow
SLQ->>SLQ: _monitor_spend_logs_queue polls
SLQ->>DB: update_spend_logs (batch write)
SLQ->>UT: process_spend_logs_guardrail_usage
UT->>DB: create_many SpendLogGuardrailIndex
UT->>DB: upsert DailyGuardrailMetrics
Note over UI,DB: Dashboard Request Flow
UI->>API: GET /guardrails/usage/overview
API->>DB: find_many GuardrailsTable
API->>DB: find_many DailyGuardrailMetrics (current + prev period)
API-->>UI: UsageOverviewResponse (rows, chart, totals)
UI->>API: GET /guardrails/usage/detail/{id}
API->>DB: find_unique GuardrailsTable
API->>DB: find_many DailyGuardrailMetrics
API-->>UI: UsageDetailResponse (metrics, time_series)
UI->>API: GET /guardrails/usage/logs
API->>DB: find_many SpendLogGuardrailIndex
API->>DB: find_many SpendLogs (by request_ids)
API-->>UI: UsageLogsResponse (paginated logs)
Last reviewed commit: 886f1a3
| # Atomically pop batch from queue | ||
| async with prisma_client._spend_log_transactions_lock: | ||
| queue_size = len(prisma_client.spend_log_transactions) | ||
|
|
||
| if queue_size == 0: | ||
| return | ||
|
|
||
| async with prisma_client._spend_log_transactions_lock: | ||
| logs_to_process = prisma_client.spend_log_transactions[ | ||
| :MAX_LOGS_PER_INTERVAL | ||
| ] | ||
| prisma_client.spend_log_transactions = prisma_client.spend_log_transactions[ | ||
| len(logs_to_process) : | ||
| ] |
There was a problem hiding this comment.
TOCTOU race between queue-size check and pop
The queue size is checked under one lock acquisition (line 4298-4299), then released, and the actual pop happens under a second lock acquisition (line 4303-4309). Between the two, another coroutine could drain the queue, making this check unreliable. While the consequences are minor (an empty pop is harmless), the double-lock pattern is unnecessary overhead and misleading — it implies atomicity that isn't actually achieved.
Consider merging both into a single lock acquisition:
| # Atomically pop batch from queue | |
| async with prisma_client._spend_log_transactions_lock: | |
| queue_size = len(prisma_client.spend_log_transactions) | |
| if queue_size == 0: | |
| return | |
| async with prisma_client._spend_log_transactions_lock: | |
| logs_to_process = prisma_client.spend_log_transactions[ | |
| :MAX_LOGS_PER_INTERVAL | |
| ] | |
| prisma_client.spend_log_transactions = prisma_client.spend_log_transactions[ | |
| len(logs_to_process) : | |
| ] | |
| # Atomically pop batch from queue | |
| async with prisma_client._spend_log_transactions_lock: | |
| queue_size = len(prisma_client.spend_log_transactions) | |
| if queue_size == 0: | |
| return | |
| logs_to_process = prisma_client.spend_log_transactions[ | |
| :MAX_LOGS_PER_INTERVAL | |
| ] | |
| prisma_client.spend_log_transactions = prisma_client.spend_log_transactions[ | |
| len(logs_to_process) : | |
| ] |
| metrics_prev = await prisma_client.db.litellm_dailyguardrailmetrics.find_many( | ||
| where={"guardrail_id": guardrail_id, "date": {"lt": start}} |
There was a problem hiding this comment.
Unbounded query fetches all historical metrics
metrics_prev has no lower-bound on date, so it fetches every historical row for this guardrail before start. Over time this will pull increasingly large result sets into memory. The overview endpoint (guardrails_usage_overview) correctly limits the previous period to 7 days — this endpoint should do the same.
| metrics_prev = await prisma_client.db.litellm_dailyguardrailmetrics.find_many( | |
| where={"guardrail_id": guardrail_id, "date": {"lt": start}} | |
| start_prev = ( | |
| datetime.strptime(start, "%Y-%m-%d") - timedelta(days=7) | |
| ).strftime("%Y-%m-%d") | |
| metrics_prev = await prisma_client.db.litellm_dailyguardrailmetrics.find_many( | |
| where={"guardrail_id": guardrail_id, "date": {"gte": start_prev, "lt": start}} | |
| ) |
|
|
||
| try: | ||
| # Guardrails from DB | ||
| guardrails = await prisma_client.db.litellm_guardrailstable.find_many() |
There was a problem hiding this comment.
Unbounded find_many() loads all guardrails into memory
find_many() without any take limit will return the full guardrails table. For most deployments this is fine, but for large instances with hundreds of guardrails this could be expensive per dashboard page load. Consider adding a reasonable limit or pagination to align with the custom rule about avoiding heavy DB queries on dashboard endpoints.
Context Used: Rule from dashboard - What: Avoid creating new database requests or Router objects in the critical request path.
Why: Cre... (source)
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
| @@ -0,0 +1,39 @@ | |||
| import { BarChart, Card, Title } from "@tremor/react"; | |||
There was a problem hiding this comment.
Deprecated Tremor components used in new feature
Per AGENTS.md, Tremor is deprecated for new features — the only exception is the Tremor Table component. This file uses BarChart, Card, and Title from @tremor/react. Consider replacing these with Ant Design or native HTML equivalents to stay consistent with the project's direction away from Tremor.
Context Used: Context from dashboard - AGENTS.md (source)
| WarningOutlined, | ||
| } from "@ant-design/icons"; | ||
| import { useQuery } from "@tanstack/react-query"; | ||
| import { Card, Col, Grid, Title } from "@tremor/react"; |
There was a problem hiding this comment.
Deprecated Tremor components used in new feature
Per AGENTS.md, Tremor is deprecated — only the Table component is exempt. This file imports Card, Col, Grid, and Title from @tremor/react. The same applies to GuardrailDetail.tsx. Consider using Ant Design components instead for new features.
Context Used: Context from dashboard - AGENTS.md (source)
| request_data["metadata"] = {} | ||
| _append_guardrail_info(request_data["metadata"]) |
There was a problem hiding this comment.
Unconditionally overwrites request_data["metadata"]
If request_data has neither "metadata" nor "litellm_metadata" keys, this creates request_data["metadata"] = {}. However, if something else in the pipeline later checks for the absence of "metadata" to decide whether to initialize it, this empty dict could interfere. The original code logged a warning here, which was safer. Consider using request_data.setdefault("metadata", {}) to avoid overwriting if a key is somehow added concurrently:
| request_data["metadata"] = {} | |
| _append_guardrail_info(request_data["metadata"]) | |
| request_data.setdefault("metadata", {}) | |
| _append_guardrail_info(request_data["metadata"]) |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
mock representation of what guardrails monitor looks like
fix: ui updates
style(ui/): fix styling
feat: enable running ai monitor on individual guardrails
feat: add backend logic for guardrail monitoring
Relevant issues
Pre-Submission checklist
Please complete all items before asking a LiteLLM maintainer to review your PR
tests/litellm/directory, Adding at least 1 test is a hard requirement - see detailsmake test-unit@greptileaiand received a Confidence Score of at least 4/5 before requesting a maintainer reviewCI (LiteLLM team)
Branch creation CI run
Link:
CI run for the last commit
Link:
Merge / cherry-pick CI run
Links:
Type
🆕 New Feature
🐛 Bug Fix
🧹 Refactoring
📖 Documentation
🚄 Infrastructure
✅ Test
Changes