Policy engine reliability #21873
krrishdholakia wants to merge 10 commits into litellm_dev_02_21_2026_p2_clean from
[Docs] store_model_in_db Release Docs
1. Fix stale _policies_by_id cache after status transitions:
- Add _update_policies_by_id_cache() helper method
- Update cache when draft->published transition occurs
- Remove entry from cache when promoting to production (resolved by name)
2. Fix race condition in create_new_version:
- Wrap find_first + update_many + create in a Prisma transaction
- Prevents concurrent version number collisions and orphaned is_latest state
3. Validate version_status query parameter in list_policies:
- Use Literal['draft', 'published', 'production'] type
- Returns 422 for invalid values instead of silently returning empty results
4. Add Literal validation to PolicyVersionStatusUpdateRequest:
- Change version_status field from str to Literal['published', 'production']
- Validates at request parsing level rather than at runtime
5. Fix duplicate auth dependency in endpoints:
- Remove decorator-level dependencies=[Depends(user_api_key_auth)] when
the function parameter already uses Depends(user_api_key_auth)
- Prevents auth check from running twice per request
6. Update tests to mock Prisma transaction context manager
Co-authored-by: Krish Dholakia <krrishdholakia@gmail.com>
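Items 3 and 4 move status validation from runtime checks to the type level. The sketch below illustrates the idea using only the standard library (`typing.Literal` and `get_args`). In the PR itself the check is performed by FastAPI/Pydantic at request parsing time, so this is a stand-in, and the names `VersionStatus` and `parse_version_status` are hypothetical.

```python
from typing import Literal, get_args

# Allowed values, mirroring the Literal described for the list_policies query parameter.
VersionStatus = Literal["draft", "published", "production"]


def parse_version_status(raw: str) -> str:
    """Reject unknown statuses up front instead of silently matching nothing."""
    allowed = get_args(VersionStatus)
    if raw not in allowed:
        raise ValueError(f"version_status must be one of {allowed}, got {raw!r}")
    return raw
```

With a check like this, a typo such as `version_status=publishd` fails loudly rather than producing an empty result list, which is the behavior change item 3 describes (a 422 response in the real endpoint).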
Greptile Summary
This PR addresses code review feedback from PR #21862, improving the policy engine's reliability around versioning. Key changes:
Also includes unrelated documentation additions for the Store Model in DB UI setting.
Confidence Score: 4/5

| Filename | Overview |
|---|---|
| litellm/proxy/policy_engine/policy_endpoints.py | Adds versioning endpoints, removes duplicate auth dependencies, adds version_status query param validation via Literal type, and updates attachment creation to verify production version exists. Minor style issues: duplicate DB read in update path, and fetching all production policies to validate one name. |
| litellm/proxy/policy_engine/policy_registry.py | Major additions: version lifecycle methods (create_new_version, update_version_status, compare_versions, delete_all_versions), _policies_by_id cache, transaction usage for create_new_version and promote-to-production. Properly handles cache invalidation for most status transitions. Extracted _row_to_policy_db_response helper. |
| litellm/types/proxy/policy_engine/resolver_types.py | Adds versioning fields to PolicyDBResponse and new types: PolicyVersionCreateRequest, PolicyVersionStatusUpdateRequest (with Literal validation), PolicyVersionListResponse, PolicyVersionCompareResponse. Well-structured with proper Pydantic validation. |
| litellm/types/proxy/policy_engine/init.py | Re-exports the four new versioning types. Import style changed to parenthesized single-line format (cosmetic only). |
| tests/test_litellm/proxy/policy_engine/test_policy_versioning.py | Comprehensive unit tests covering: _row_to_policy_db_response, sync with production-only filter, draft-only updates, delete cache cleanup, create_new_version with transaction mocking, status transitions, delete_all_versions cleanup, compare_versions, and singleton behavior. All mock-based, no real network calls. |
| tests/test_litellm/proxy/policy_engine/test_policy_versioning_e2e.py | Integration-style lifecycle test covering create -> draft -> edit -> publish -> promote flow, plus attachment resolution against production. All mock-based with proper transaction context manager mocking. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["create_policy\n(v1 production)"] --> B["_policies\n(by name, in-memory)"]
C["create_new_version\n(TX: find latest, update is_latest, create)"] --> D["New draft\n(DB only)"]
D --> E["update_policy_in_db\n(draft only)"]
E --> D
D --> F["update_version_status\n(draft → published)"]
F --> G["Published version"]
G --> H["_policies_by_id\n(cache updated)"]
G --> I["update_version_status\n(TX: demote old prod, promote)"]
I --> J["New production"]
J --> B
J --> K["Remove from\n_policies_by_id"]
L["delete_policy_from_db"] -->|production| M["remove_policy\n(clear _policies)"]
L -->|draft/published| N["pop from\n_policies_by_id"]
O["delete_all_versions"] --> M
O --> P["Clean all matching\n_policies_by_id entries"]
Last reviewed commit: 33cceaf
# Demote current production to published
await prisma_client.db.litellm_policytable.update_many(
    where={
        "policy_name": policy_name,
        "version_status": "production",
    },
    data={
        "version_status": "published",
        "updated_at": now,
        "updated_by": updated_by,
    },
)

# Promote this version to production
updated = await prisma_client.db.litellm_policytable.update(
    where={"policy_id": policy_id},
    data={
        "version_status": "production",
        "production_at": now,
        "updated_at": now,
        "updated_by": updated_by,
    },
)
Race condition in promote-to-production path
The update_version_status promotion path (demote old production + promote new version) is not wrapped in a database transaction, unlike create_new_version which correctly uses prisma_client.db.tx(). If two concurrent requests promote different versions of the same policy to production, both update_many calls may target the same row and both update calls can succeed, leaving two versions with version_status="production" in the database.
Consider wrapping lines 846-867 in a Prisma transaction, mirroring what create_new_version does:
async with prisma_client.db.tx() as tx:
    await tx.litellm_policytable.update_many(
        where={
            "policy_name": policy_name,
            "version_status": "production",
        },
        data={
            "version_status": "published",
            "updated_at": now,
            "updated_by": updated_by,
        },
    )
    updated = await tx.litellm_policytable.update(
        where={"policy_id": policy_id},
        data={
            "version_status": "production",
            "production_at": now,
            "updated_at": now,
            "updated_by": updated_by,
        },
    )
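To make the interleaving concrete, here is a small in-memory simulation of the demote-then-promote sequence. It is only a sketch: a plain dict stands in for the policy table, `asyncio.sleep(0)` marks the point where a second request can interleave, and an `asyncio.Lock` stands in for whatever serialization the database transaction provides (the real guarantee depends on the database and its isolation level). The names `promote` and `simulate` are hypothetical.

```python
import asyncio
from typing import Dict, Optional


async def promote(rows: Dict[str, str], policy_id: str, lock: Optional[asyncio.Lock]) -> None:
    """Demote the current production rows, then promote policy_id."""

    async def _run() -> None:
        # Step 1 (update_many analogue): find the rows that are production right now.
        to_demote = [pid for pid, status in rows.items() if status == "production"]
        await asyncio.sleep(0)  # yield control, so another request can interleave here
        for pid in to_demote:
            rows[pid] = "published"
        # Step 2 (update analogue): promote the requested version.
        rows[policy_id] = "production"

    if lock is None:
        await _run()
    else:
        async with lock:
            await _run()


async def simulate(use_lock: bool) -> int:
    """Run two concurrent promotions and count the production rows afterwards."""
    rows = {"v1": "production", "v2": "published", "v3": "published"}
    lock = asyncio.Lock() if use_lock else None
    await asyncio.gather(promote(rows, "v2", lock), promote(rows, "v3", lock))
    return sum(1 for status in rows.values() if status == "production")
```

In this simulation the unserialized run (`simulate(False)`) ends with two production rows, because both requests only see `v1` as production before either promotes. The serialized run (`simulate(True)`) ends with exactly one.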
try:
    await prisma_client.db.litellm_policytable.delete_many(
        where={"policy_name": policy_name}
    )
    self.remove_policy(policy_name)
    return {
        "message": f"All versions of policy '{policy_name}' deleted successfully"
    }
except Exception as e:
    verbose_proxy_logger.exception(f"Error deleting all versions: {e}")
    raise Exception(f"Error deleting all versions: {str(e)}")
Stale _policies_by_id cache after delete_all_versions
delete_all_versions removes the production entry from _policies via self.remove_policy(), but does not clean up _policies_by_id, which may still contain entries for draft/published versions of this policy. After deletion, get_policy_by_id_for_request() could return stale data for deleted policy versions.
Consider adding cleanup like:
try:
    await prisma_client.db.litellm_policytable.delete_many(
        where={"policy_name": policy_name}
    )
    self.remove_policy(policy_name)
    # Also clean up draft/published versions from the by-id cache
    stale_ids = [
        pid for pid, (pname, _) in self._policies_by_id.items()
        if pname == policy_name
    ]
    for pid in stale_ids:
        del self._policies_by_id[pid]
    return {
        "message": f"All versions of policy '{policy_name}' deleted successfully"
    }
except Exception as e:
    verbose_proxy_logger.exception(f"Error deleting all versions: {e}")
    raise Exception(f"Error deleting all versions: {str(e)}")
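The suggested cleanup collects the stale ids first and deletes afterwards, since removing keys while iterating over `.items()` raises a `RuntimeError`. A standalone sketch of that pattern follows. It assumes the cache maps `policy_id` to a `(policy_name, policy)` tuple, which is inferred from the tuple unpacking in the suggestion, and the function name `drop_versions_for_policy` is hypothetical.

```python
from typing import Any, Dict, Tuple


def drop_versions_for_policy(cache: Dict[str, Tuple[str, Any]], policy_name: str) -> int:
    """Remove every cached version belonging to policy_name; return how many were removed."""
    # Collect the keys first: mutating the dict during iteration would raise RuntimeError.
    stale_ids = [pid for pid, (pname, _) in cache.items() if pname == policy_name]
    for pid in stale_ids:
        del cache[pid]
    return len(stale_ids)
```

Entries for other policy names are left untouched, so the scan is safe to run even when the cache holds versions of many policies.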
  version_status = getattr(policy, "version_status", "production")
  policy_name = policy.policy_name

  # Delete from DB
  await prisma_client.db.litellm_policytable.delete(
      where={"policy_id": policy_id}
  )

- # Remove from in-memory registry
- self.remove_policy(policy.policy_name)
  result: Dict[str, Any] = {
      "message": f"Policy {policy_id} deleted successfully"
  }

  # Remove from in-memory registry only if this was the production version
  if version_status == "production":
      self.remove_policy(policy_name)
      result["warning"] = (
          "Production version was deleted. No other version was promoted. "
          "Promote another version to production if this policy should remain active."
      )

- return {"message": f"Policy {policy_id} deleted successfully"}
  return result
Stale _policies_by_id cache after deleting draft/published version
When deleting a draft or published version, this method does not remove the deleted policy_id from _policies_by_id. Subsequent calls to get_policy_by_id_for_request(policy_id) would still return the deleted version's data from the in-memory cache.
version_status = getattr(policy, "version_status", "production")
policy_name = policy.policy_name
# Delete from DB
await prisma_client.db.litellm_policytable.delete(
    where={"policy_id": policy_id}
)
result: Dict[str, Any] = {
    "message": f"Policy {policy_id} deleted successfully"
}
# Remove from in-memory registry only if this was the production version
if version_status == "production":
    self.remove_policy(policy_name)
    result["warning"] = (
        "Production version was deleted. No other version was promoted. "
        "Promote another version to production if this policy should remain active."
    )
else:
    # Remove draft/published from the by-id cache
    self._policies_by_id.pop(policy_id, None)
return result
"""
Update a policy version's status. Valid transitions:
- draft -> published (sets published_at)
- published -> production (sets production_at, demotes current production to published, updates in-memory)
- production -> published (demotes, removes from in-memory)
- draft -> production: NOT allowed (must publish first)
- published -> draft: NOT allowed
Docstring documents unimplemented transition
The docstring states production -> published (demotes, removes from in-memory) is a valid transition, but the code at line 818 rejects it: "Only draft versions can be published." This means a production version cannot be demoted via this method, which contradicts both this docstring and the endpoint docstring in policy_endpoints.py:233. Either implement the production -> published transition or update both docstrings to reflect the actual one-way escalation path (draft -> published -> production).
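If the docstrings are updated to match the code, the one-way escalation path can be written down as a small transition table. The sketch below is illustrative only and not the registry's implementation. `ALLOWED_TRANSITIONS` and `check_transition` are hypothetical names.

```python
from typing import Dict, Set

# One-way escalation path: draft -> published -> production.
ALLOWED_TRANSITIONS: Dict[str, Set[str]] = {
    "draft": {"published"},
    "published": {"production"},
    "production": set(),  # no demotion through this path
}


def check_transition(current: str, target: str) -> None:
    """Raise ValueError unless current -> target is an allowed status change."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"Invalid transition: {current} -> {target}")
```

Keeping the table next to the docstring makes mismatches like the one flagged here easier to spot, because the documented transitions and the enforced ones come from the same place.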
Additional Comments (1)
The endpoint fetches the policy via [...] Consider either removing the draft check here (letting [...]
1. Fix race condition in update_version_status promote-to-production:
- Wrap demote (update_many) + promote (update) in a Prisma transaction
- Prevents concurrent promote calls from leaving two versions as production
2. Fix stale _policies_by_id cache in delete methods:
- delete_policy_from_db: remove entry from _policies_by_id when deleting
a draft/published version
- delete_all_versions: scan and remove all _policies_by_id entries
matching the deleted policy_name
3. Fix docstring for update_version_status:
- Remove documented-but-unimplemented 'production -> published' transition
- Clarify which transitions are valid vs invalid
4. Add tests for new behavior:
- test_published_to_production_removes_from_policies_by_id
- test_delete_draft_removes_from_policies_by_id_cache
- test_delete_all_versions_cleans_policies_by_id_cache
- Update existing promote-to-production tests to mock transaction
Co-authored-by: Krish Dholakia <krrishdholakia@gmail.com>
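Mocking a transaction for these tests means the object returned by `db.tx()` has to work inside `async with`. The sketch below shows one way to build such a mock with `unittest.mock`. It is not the PR's actual test code: `make_prisma_mock` and `promote_in_tx` are hypothetical helpers, and only the table and method names are taken from the snippets in this review.

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock


def make_prisma_mock():
    """Build a fake client whose db.tx() can be used in `async with`."""
    tx = MagicMock()
    tx.litellm_policytable.update_many = AsyncMock(return_value=1)
    tx.litellm_policytable.update = AsyncMock(
        return_value={"version_status": "production"}
    )

    # The async context manager returned by db.tx(): entering it yields the tx object.
    tx_ctx = MagicMock()
    tx_ctx.__aenter__ = AsyncMock(return_value=tx)
    tx_ctx.__aexit__ = AsyncMock(return_value=False)  # False: do not swallow exceptions

    prisma_client = MagicMock()
    prisma_client.db.tx = MagicMock(return_value=tx_ctx)
    return prisma_client, tx


async def promote_in_tx(prisma_client, policy_name: str, policy_id: str):
    """Hypothetical code under test: demote then promote inside one transaction."""
    async with prisma_client.db.tx() as tx:
        await tx.litellm_policytable.update_many(
            where={"policy_name": policy_name, "version_status": "production"},
            data={"version_status": "published"},
        )
        return await tx.litellm_policytable.update(
            where={"policy_id": policy_id},
            data={"version_status": "production"},
        )
```

A test can then assert on `tx.litellm_policytable.update_many.assert_awaited_once()` and on `prisma_client.db.tx.assert_called_once()` to confirm that both writes went through the transaction handle rather than the top-level client.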
@greptile please re-review
Additional Comments (2)
The endpoint calls [...] Consider removing the pre-check here and letting [...]
Relevant issues
Addresses code review comments from Greptile on PR #21862. Specifically:
- [...] create_new_version.
- [...] version_status query parameter.
- [...] PolicyVersionStatusUpdateRequest.
Pre-Submission checklist
Please complete all items before asking a LiteLLM maintainer to review your PR
- [...] tests/litellm/ directory, Adding at least 1 test is a hard requirement - see details
- [...] make test-unit
- [...] @greptileai and received a Confidence Score of at least 4/5 before requesting a maintainer review
CI (LiteLLM team)
Branch creation CI run
Link:
CI run for the last commit
Link:
Merge / cherry-pick CI run
Links:
Type
🐛 Bug Fix
🧹 Refactoring
✅ Test
Changes
This PR addresses several issues identified in the policy engine, including:
- Stale _policies_by_id cache after status transitions (policy_registry.py): adds _update_policies_by_id_cache() to refresh or remove cached policy entries when their status changes (e.g., draft to published, published to production).
- Race condition in create_new_version (policy_registry.py): wraps the database operations (find_first, update_many, create) within a Prisma interactive transaction (prisma_client.db.tx()) to ensure atomicity.
- version_status query parameter validation (policy_endpoints.py): updates the list_policies endpoint to use Literal["draft", "published", "production"] for the version_status query parameter, providing early validation.
- PolicyVersionStatusUpdateRequest validation (resolver_types.py): changes version_status in PolicyVersionStatusUpdateRequest to Literal["published", "production"] for Pydantic-level validation.
- Duplicate auth dependency (policy_endpoints.py): removes dependencies=[Depends(user_api_key_auth)] from six API endpoint decorators where user_api_key_auth was already present as a function parameter, preventing double execution of the auth check.
- Tests: updates test_policy_versioning.py and test_policy_versioning_e2e.py to correctly mock the Prisma transaction context manager used by create_new_version.