feat: vllm=0.16.0, LMCache, uv installer, add messages and responses endpoints. #277
velaraptor-runpod wants to merge 7 commits into main
Conversation
velaraptor-runpod
commented
Mar 18, 2026
- Add BUILD_ARG for LMCache (https://docs.lmcache.ai/)
- Update vllm to 0.16.0
- Use uv instead of pip
- Add responses & messages endpoints (note: these will not be exposed with normal queue delay)
- Auto-fix: if LMCache is enabled, require HMA to be disabled
Also, Docker build times are much faster with uv.
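For context, a pip-to-uv swap in a Dockerfile typically looks something like the fragment below (a sketch, not the exact diff from this PR; the requirements file path is a placeholder):

```dockerfile
# Sketch: install uv once, then use it in place of pip.
# uv resolves and installs packages considerably faster than pip,
# which is where the improved Docker build times come from.
RUN pip install uv
RUN uv pip install --system -r requirements.txt
```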
TimPietruskyRunPod
left a comment
Review: PR #277
Thanks for the work here, Chris! The uv migration and LMCache support look great. I have a few concerns to address before merging.
Bug: ErrorResponse attribute access in _handle_messages_request
In src/engine.py, the messages error handler does:
```python
if isinstance(response, ErrorResponse):
    yield AnthropicErrorResponse(
        error=AnthropicError(type=response.error.type, message=response.error.message)
    ).model_dump()
```

ErrorResponse (from vllm.entrypoints.openai.engine.protocol) has top-level .type and .message attributes — there is no .error nested object. This will raise an AttributeError at runtime. Should be:

```python
error=AnthropicError(type=response.type, message=response.message)
```

Major version bump: transformers>=5.2.0
This jumps from >=4.57.0 to >=5.2.0 — a major version change. Is this actually required by vLLM 0.16.0 or the new endpoints? If not strictly necessary, I'd prefer keeping the lower bound at 4.x to avoid breaking existing builds. If it is required, let's call it out explicitly in the PR description so we know the reasoning.
Missing newline at end of engine.py
The diff shows \ No newline at end of file. Please add a trailing newline.
PR title is misleading: vLLM is already at 0.16.0 on main
The title says "vllm=0.16.0" but main already has vllm[flashinfer]==0.16.0 (merged in #272). The actual changes here are the uv migration, LMCache support, and the new endpoints. Consider updating the title to reflect what's actually new, e.g.:
feat: uv installer, LMCache support, add /v1/responses and /v1/messages endpoints
LMCache: no version pin
uv pip install --system lmcache has no version constraint. For reproducible builds, pin it (e.g., lmcache==x.y.z or at least lmcache>=x.y).
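For example, a pinned install could be expressed with a build arg (sketch only; the actual version to pin is whatever release has been tested against vLLM 0.16.0, which I'm not specifying here):

```dockerfile
# Sketch: pin LMCache for reproducible builds. Pass the tested
# version at build time, e.g. --build-arg LMCACHE_VERSION=...
ARG LMCACHE_VERSION
RUN uv pip install --system "lmcache==${LMCACHE_VERSION}"
```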
New endpoints not documented
The /v1/responses and /v1/messages routes are added but not mentioned in any docs or README. The PR description says "note these will not be exposed with normal queue delay" — can you elaborate? If they're user-facing, they should be documented. If they're experimental/internal, a code comment would help future readers.
"RunPod" → "Runpod" branding change
Is this an official branding decision? The codebase (handler.py comments, engine.py comments, CLAUDE.md, etc.) still uses "RunPod" extensively. If this is intentional, it should probably be a separate follow-up PR that does a complete sweep, not mixed in with feature work.
Minor: inconsistent engine initialization params
responses_engine gets enable_log_outputs but messages_engine does not. Is that intentional, or should messages_engine also support it (if the AnthropicServingMessages constructor accepts it)?
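To make the suggestion concrete: building the shared options once and spreading them into both constructors keeps the two engines from drifting apart. The classes below are stand-ins for illustration only, not the real vLLM/worker serving classes:

```python
# Stand-in engine classes to illustrate sharing constructor options;
# the real classes (e.g. the responses and messages serving engines)
# live in the worker and vLLM code.
class ResponsesEngine:
    def __init__(self, enable_log_outputs=False):
        self.enable_log_outputs = enable_log_outputs

class MessagesEngine:
    def __init__(self, enable_log_outputs=False):
        self.enable_log_outputs = enable_log_outputs

# Build the shared kwargs once so both engines get the same flags.
common_kwargs = {"enable_log_outputs": True}
responses_engine = ResponsesEngine(**common_kwargs)
messages_engine = MessagesEngine(**common_kwargs)
```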
Summary
The core changes (uv, LMCache, new endpoints) are solid. Main blockers:
- Bug: fix response.error.type → response.type in the messages handler
- Clarify: is transformers>=5.2.0 actually required?
- Pin: the lmcache version
The rest are smaller items. Happy to re-review once the above are addressed!
Correction to my review: disregard the point about the PR title being misleading. The vLLM package was bumped in #272, but this PR is about wiring up the new 0.16.0 features (e.g., the Anthropic messages endpoint). The remaining items from my review still stand.