Skip to content

Comments

Policy engine reliability#21873

Open
krrishdholakia wants to merge 10 commits intolitellm_dev_02_21_2026_p2_cleanfrom
cursor/policy-engine-reliability-d681
Open

Policy engine reliability#21873
krrishdholakia wants to merge 10 commits intolitellm_dev_02_21_2026_p2_cleanfrom
cursor/policy-engine-reliability-d681

Conversation

@krrishdholakia
Copy link
Member

Relevant issues

Addresses code review comments from Greptile on PR #21862. Specifically:

  • Stale cache after policy status transitions.
  • Race condition in create_new_version.
  • Missing validation for version_status query parameter.
  • Missing validation for PolicyVersionStatusUpdateRequest.
  • Duplicate authentication dependency in API endpoints.

Pre-Submission checklist

Please complete all items before asking a LiteLLM maintainer to review your PR

  • I have Added testing in the tests/litellm/ directory, Adding at least 1 test is a hard requirement - see details
  • My PR passes all unit tests on make test-unit
  • My PR's scope is as isolated as possible, it only solves 1 specific problem
  • I have requested a Greptile review by commenting @greptileai and received a Confidence Score of at least 4/5 before requesting a maintainer review

CI (LiteLLM team)

CI status guideline:

  • 50-55 passing tests: main is stable with minor issues.
  • 45-49 passing tests: acceptable but needs attention
  • <= 40 passing tests: unstable; be careful with your merges and assess the risk.
  • Branch creation CI run
    Link:

  • CI run for the last commit
    Link:

  • Merge / cherry-pick CI run
    Links:

Type

🐛 Bug Fix
🧹 Refactoring
✅ Test

Changes

This PR addresses several issues identified in the policy engine, including:

  1. Stale _policies_by_id cache after status transitions (policy_registry.py):
    • Implemented _update_policies_by_id_cache() to refresh or remove cached policy entries when their status changes (e.g., draft to published, published to production).
  2. Race condition in create_new_version (policy_registry.py):
    • Wrapped the version number computation and database updates (find_first, update_many, create) within a Prisma interactive transaction (prisma_client.db.tx()) to ensure atomicity.
  3. version_status query parameter validation (policy_endpoints.py):
    • Updated the list_policies endpoint to use Literal["draft", "published", "production"] for the version_status query parameter, providing early validation.
  4. PolicyVersionStatusUpdateRequest validation (resolver_types.py):
    • Changed version_status in PolicyVersionStatusUpdateRequest to Literal["published", "production"] for Pydantic-level validation.
  5. Duplicate auth dependency (policy_endpoints.py):
    • Removed redundant dependencies=[Depends(user_api_key_auth)] from six API endpoint decorators where user_api_key_auth was already present as a function parameter, preventing double execution of the auth check.
  6. Test updates:
    • Modified test_policy_versioning.py and test_policy_versioning_e2e.py to correctly mock the Prisma transaction context manager used by create_new_version.

Open in Web Open in Cursor 

yuneng-jiang and others added 9 commits February 21, 2026 16:06
1. Fix stale _policies_by_id cache after status transitions:
   - Add _update_policies_by_id_cache() helper method
   - Update cache when draft->published transition occurs
   - Remove entry from cache when promoting to production (resolved by name)

2. Fix race condition in create_new_version:
   - Wrap find_first + update_many + create in a Prisma transaction
   - Prevents concurrent version number collisions and orphaned is_latest state

3. Validate version_status query parameter in list_policies:
   - Use Literal['draft', 'published', 'production'] type
   - Returns 422 for invalid values instead of silently returning empty results

4. Add Literal validation to PolicyVersionStatusUpdateRequest:
   - Change version_status field from str to Literal['published', 'production']
   - Validates at request parsing level rather than at runtime

5. Fix duplicate auth dependency in endpoints:
   - Remove decorator-level dependencies=[Depends(user_api_key_auth)] when
     the function parameter already uses Depends(user_api_key_auth)
   - Prevents auth check from running twice per request

6. Update tests to mock Prisma transaction context manager

Co-authored-by: Krish Dholakia <krrishdholakia@gmail.com>
@cursor
Copy link

cursor bot commented Feb 22, 2026

Cursor Agent can help with this pull request. Just @cursor in comments and I'll start working on changes in this branch.
Learn more about Cursor Agents

@vercel
Copy link

vercel bot commented Feb 22, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
litellm Ready Ready Preview, Comment Feb 22, 2026 4:00am

Request Review

@chatgpt-codex-connector
Copy link

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

@CLAassistant
Copy link

CLAassistant commented Feb 22, 2026

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
2 out of 3 committers have signed the CLA.

✅ yuneng-jiang
✅ ishaan-jaff
❌ cursoragent
You have signed the CLA already but the status is still pending? Let us recheck it.

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 22, 2026

Greptile Summary

This PR addresses code review feedback from PR #21862, improving the policy engine's reliability around versioning. Key changes:

  • Stale cache fix: Adds _update_policies_by_id_cache() to refresh in-memory cache entries when policy versions transition status (draft → published, published → production), and cleans up _policies_by_id on deletions and delete_all_versions.
  • Race condition fix: Wraps create_new_version's version-number computation and row creation in a Prisma interactive transaction (prisma_client.db.tx()), and similarly wraps the promote-to-production path in update_version_status.
  • Input validation: Uses Literal["draft", "published", "production"] on the list_policies query param and Literal["published", "production"] on PolicyVersionStatusUpdateRequest for Pydantic-level validation.
  • Duplicate auth removal: Removes redundant dependencies=[Depends(user_api_key_auth)] from six endpoint decorators where user_api_key_auth was already injected as a function parameter.
  • Draft-only editing: update_policy_in_db now rejects updates to non-draft versions, and the endpoint adds an early HTTP 400 guard.
  • Attachment validation: create_policy_attachment now verifies the policy has a production version before creating an attachment.
  • Tests: Adds comprehensive mock-based unit and integration tests covering the full version lifecycle, cache cleanup, and status transitions.

Also includes unrelated documentation additions for the Store Model in DB UI setting.

Confidence Score: 4/5

  • This PR is safe to merge — it fixes real reliability issues with proper transaction usage and cache management, and includes thorough test coverage.
  • The core changes (transaction wrapping, cache invalidation, input validation, duplicate auth removal) are well-implemented and address concrete reliability issues. Tests are comprehensive and mock-only. The only concerns are minor style/performance issues: a duplicate DB read in the update endpoint and an unnecessarily broad query in attachment creation validation. No functional bugs found.
  • litellm/proxy/policy_engine/policy_endpoints.py has a duplicate DB read on the update path and an inefficient full-table scan for attachment validation.

Important Files Changed

Filename Overview
litellm/proxy/policy_engine/policy_endpoints.py Adds versioning endpoints, removes duplicate auth dependencies, adds version_status query param validation via Literal type, and updates attachment creation to verify production version exists. Minor style issues: duplicate DB read in update path, and fetching all production policies to validate one name.
litellm/proxy/policy_engine/policy_registry.py Major additions: version lifecycle methods (create_new_version, update_version_status, compare_versions, delete_all_versions), _policies_by_id cache, transaction usage for create_new_version and promote-to-production. Properly handles cache invalidation for most status transitions. Extracted _row_to_policy_db_response helper.
litellm/types/proxy/policy_engine/resolver_types.py Adds versioning fields to PolicyDBResponse and new types: PolicyVersionCreateRequest, PolicyVersionStatusUpdateRequest (with Literal validation), PolicyVersionListResponse, PolicyVersionCompareResponse. Well-structured with proper Pydantic validation.
litellm/types/proxy/policy_engine/init.py Re-exports the four new versioning types. Import style changed to parenthesized single-line format (cosmetic only).
tests/test_litellm/proxy/policy_engine/test_policy_versioning.py Comprehensive unit tests covering: _row_to_policy_db_response, sync with production-only filter, draft-only updates, delete cache cleanup, create_new_version with transaction mocking, status transitions, delete_all_versions cleanup, compare_versions, and singleton behavior. All mock-based, no real network calls.
tests/test_litellm/proxy/policy_engine/test_policy_versioning_e2e.py Integration-style lifecycle test covering create -> draft -> edit -> publish -> promote flow, plus attachment resolution against production. All mock-based with proper transaction context manager mocking.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["create_policy\n(v1 production)"] --> B["_policies\n(by name, in-memory)"]
    
    C["create_new_version\n(TX: find latest, update is_latest, create)"] --> D["New draft\n(DB only)"]
    
    D --> E["update_policy_in_db\n(draft only)"]
    E --> D
    
    D --> F["update_version_status\n(draft → published)"]
    F --> G["Published version"]
    G --> H["_policies_by_id\n(cache updated)"]
    
    G --> I["update_version_status\n(TX: demote old prod, promote)"]
    I --> J["New production"]
    J --> B
    J --> K["Remove from\n_policies_by_id"]
    
    L["delete_policy_from_db"] -->|production| M["remove_policy\n(clear _policies)"]
    L -->|draft/published| N["pop from\n_policies_by_id"]
    
    O["delete_all_versions"] --> M
    O --> P["Clean all matching\n_policies_by_id entries"]
Loading

Last reviewed commit: 33cceaf

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

6 files reviewed, 5 comments

Edit Code Review Agent Settings | Greptile

Comment on lines 845 to 867
# Demote current production to published
await prisma_client.db.litellm_policytable.update_many(
where={
"policy_name": policy_name,
"version_status": "production",
},
data={
"version_status": "published",
"updated_at": now,
"updated_by": updated_by,
},
)

# Promote this version to production
updated = await prisma_client.db.litellm_policytable.update(
where={"policy_id": policy_id},
data={
"version_status": "production",
"production_at": now,
"updated_at": now,
"updated_by": updated_by,
},
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Race condition in promote-to-production path

The update_version_status promotion path (demote old production + promote new version) is not wrapped in a database transaction, unlike create_new_version which correctly uses prisma_client.db.tx(). If two concurrent requests promote different versions of the same policy to production, both update_many calls may target the same row and both update calls can succeed, leaving two versions with version_status="production" in the database.

Consider wrapping lines 846-867 in a Prisma transaction, mirroring what create_new_version does:

async with prisma_client.db.tx() as tx:
    await tx.litellm_policytable.update_many(
        where={
            "policy_name": policy_name,
            "version_status": "production",
        },
        data={
            "version_status": "published",
            "updated_at": now,
            "updated_by": updated_by,
        },
    )
    updated = await tx.litellm_policytable.update(
        where={"policy_id": policy_id},
        data={
            "version_status": "production",
            "production_at": now,
            "updated_at": now,
            "updated_by": updated_by,
        },
    )

Comment on lines 966 to 976
try:
await prisma_client.db.litellm_policytable.delete_many(
where={"policy_name": policy_name}
)
self.remove_policy(policy_name)
return {
"message": f"All versions of policy '{policy_name}' deleted successfully"
}
except Exception as e:
verbose_proxy_logger.exception(f"Error deleting all versions: {e}")
raise Exception(f"Error deleting all versions: {str(e)}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale _policies_by_id cache after delete_all_versions

delete_all_versions removes the production entry from _policies via self.remove_policy(), but does not clean up _policies_by_id, which may still contain entries for draft/published versions of this policy. After deletion, get_policy_by_id_for_request() could return stale data for deleted policy versions.

Consider adding cleanup like:

Suggested change
try:
await prisma_client.db.litellm_policytable.delete_many(
where={"policy_name": policy_name}
)
self.remove_policy(policy_name)
return {
"message": f"All versions of policy '{policy_name}' deleted successfully"
}
except Exception as e:
verbose_proxy_logger.exception(f"Error deleting all versions: {e}")
raise Exception(f"Error deleting all versions: {str(e)}")
try:
await prisma_client.db.litellm_policytable.delete_many(
where={"policy_name": policy_name}
)
self.remove_policy(policy_name)
# Also clean up draft/published versions from the by-id cache
stale_ids = [
pid for pid, (pname, _) in self._policies_by_id.items()
if pname == policy_name
]
for pid in stale_ids:
del self._policies_by_id[pid]
return {
"message": f"All versions of policy '{policy_name}' deleted successfully"
}

Comment on lines 418 to 438
version_status = getattr(policy, "version_status", "production")
policy_name = policy.policy_name

# Delete from DB
await prisma_client.db.litellm_policytable.delete(
where={"policy_id": policy_id}
)

# Remove from in-memory registry
self.remove_policy(policy.policy_name)
result: Dict[str, Any] = {
"message": f"Policy {policy_id} deleted successfully"
}

# Remove from in-memory registry only if this was the production version
if version_status == "production":
self.remove_policy(policy_name)
result["warning"] = (
"Production version was deleted. No other version was promoted. "
"Promote another version to production if this policy should remain active."
)

return {"message": f"Policy {policy_id} deleted successfully"}
return result
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale _policies_by_id cache after deleting draft/published version

When deleting a draft or published version, this method does not remove the deleted policy_id from _policies_by_id. Subsequent calls to get_policy_by_id_for_request(policy_id) would still return the deleted version's data from the in-memory cache.

Suggested change
version_status = getattr(policy, "version_status", "production")
policy_name = policy.policy_name
# Delete from DB
await prisma_client.db.litellm_policytable.delete(
where={"policy_id": policy_id}
)
# Remove from in-memory registry
self.remove_policy(policy.policy_name)
result: Dict[str, Any] = {
"message": f"Policy {policy_id} deleted successfully"
}
# Remove from in-memory registry only if this was the production version
if version_status == "production":
self.remove_policy(policy_name)
result["warning"] = (
"Production version was deleted. No other version was promoted. "
"Promote another version to production if this policy should remain active."
)
return {"message": f"Policy {policy_id} deleted successfully"}
return result
version_status = getattr(policy, "version_status", "production")
policy_name = policy.policy_name
# Delete from DB
await prisma_client.db.litellm_policytable.delete(
where={"policy_id": policy_id}
)
result: Dict[str, Any] = {
"message": f"Policy {policy_id} deleted successfully"
}
# Remove from in-memory registry only if this was the production version
if version_status == "production":
self.remove_policy(policy_name)
result["warning"] = (
"Production version was deleted. No other version was promoted. "
"Promote another version to production if this policy should remain active."
)
else:
# Remove draft/published from the by-id cache
self._policies_by_id.pop(policy_id, None)

Comment on lines 784 to 790
"""
Update a policy version's status. Valid transitions:
- draft -> published (sets published_at)
- published -> production (sets production_at, demotes current production to published, updates in-memory)
- production -> published (demotes, removes from in-memory)
- draft -> production: NOT allowed (must publish first)
- published -> draft: NOT allowed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docstring documents unimplemented transition

The docstring states production -> published (demotes, removes from in-memory) is a valid transition, but the code at line 818 rejects it: "Only draft versions can be published." This means a production version cannot be demoted via this method, which contradicts both this docstring and the endpoint docstring in policy_endpoints.py:233. Either implement the production -> published transition or update both docstrings to reflect the actual one-way escalation path (draft -> published -> production).

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 22, 2026

Additional Comments (1)

litellm/proxy/policy_engine/policy_endpoints.py
Duplicate draft-status check causes redundant DB query

The endpoint fetches the policy via get_policy_by_id_from_db (line 391) to check existence and draft status, then update_policy_in_db (line 406) immediately performs the same find_unique and draft-status check again internally. This results in two DB queries for the same policy_id on every update request.

Consider either removing the draft check here (letting update_policy_in_db handle it and catching the exception as a 400), or removing the duplicate check from update_policy_in_db and having it trust the caller. Either way, one of the two checks should be removed to avoid the extra DB round-trip.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

@krrishdholakia krrishdholakia changed the base branch from main to litellm_dev_02_21_2026_p2_clean February 22, 2026 03:54
1. Fix race condition in update_version_status promote-to-production:
   - Wrap demote (update_many) + promote (update) in a Prisma transaction
   - Prevents concurrent promote calls from leaving two versions as production

2. Fix stale _policies_by_id cache in delete methods:
   - delete_policy_from_db: remove entry from _policies_by_id when deleting
     a draft/published version
   - delete_all_versions: scan and remove all _policies_by_id entries
     matching the deleted policy_name

3. Fix docstring for update_version_status:
   - Remove documented-but-unimplemented 'production -> published' transition
   - Clarify which transitions are valid vs invalid

4. Add tests for new behavior:
   - test_published_to_production_removes_from_policies_by_id
   - test_delete_draft_removes_from_policies_by_id_cache
   - test_delete_all_versions_cleans_policies_by_id_cache
   - Update existing promote-to-production tests to mock transaction

Co-authored-by: Krish Dholakia <krrishdholakia@gmail.com>
@krrishdholakia
Copy link
Member Author

@greptile please re-review

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

11 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 22, 2026

Additional Comments (2)

litellm/proxy/policy_engine/policy_endpoints.py
Duplicate DB read on update path

The endpoint calls get_policy_by_id_from_db (line 391, which does find_unique) to check existence and draft status, then calls update_policy_in_db (line 406), which internally does another find_unique and repeats the same draft-status check at policy_registry.py:345-354. This results in two database round-trips for the same policy on every update request.

Consider removing the pre-check here and letting update_policy_in_db handle both the not-found and non-draft cases, catching its exceptions and mapping them to the appropriate HTTP status codes — similar to how update_policy_version_status works at line 251.

Context Used: Rule from dashboard - What: Avoid creating new database requests or Router objects in the critical request path.

Why: Cre... (source)


litellm/proxy/policy_engine/policy_endpoints.py
Fetches all production policies to validate one name

get_all_policies_from_db(prisma_client, version_status="production") loads every production policy row from the database just to check whether request.policy_name exists. For deployments with many policies this is wasteful. A targeted find_first(where={"policy_name": request.policy_name, "version_status": "production"}) would be more efficient.

Context Used: Rule from dashboard - What: Avoid creating new database requests or Router objects in the critical request path.

Why: Cre... (source)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants