feat: vllm=0.16.0, LMCache, uv installer, add messages and responses endpoints.#277

Open
velaraptor-runpod wants to merge 7 commits into main from feat/lmcache

Conversation

@velaraptor-runpod
Contributor

  • Add BUILD_ARG for LMCache (https://docs.lmcache.ai/)
  • Update vllm to 0.16.0
  • Use uv instead of pip
  • Add responses and messages endpoints (note: these will not be exposed with the normal queue delay)
  • Auto-fix: if LMCache is enabled, require HMA to be disabled
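The auto-fix in the last bullet could look roughly like the sketch below. This is only an illustration of the idea; the env var name ENABLE_LMCACHE and the engine flag disable_hybrid_kv_cache_manager are assumed names and may not match what the PR actually uses.

```python
import os

def lmcache_compat(engine_kwargs: dict) -> dict:
    """Force the incompatible allocator off whenever LMCache is enabled.

    ENABLE_LMCACHE and disable_hybrid_kv_cache_manager are illustrative
    names; the PR's actual env var and engine flag may differ.
    """
    if os.getenv("ENABLE_LMCACHE", "0").lower() in ("1", "true"):
        # LMCache requires the hybrid allocator to be off, so we flip the
        # flag automatically instead of failing at engine startup.
        engine_kwargs["disable_hybrid_kv_cache_manager"] = True
    return engine_kwargs
```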

@velaraptor-runpod
Contributor Author

Also, Docker build times are much faster with uv.

@TimPietruskyRunPod (Contributor) left a comment


Review: PR #277

Thanks for the work here, Chris! The uv migration and LMCache support look great. I have a few concerns to address before merging.


Bug: ErrorResponse attribute access in _handle_messages_request

In src/engine.py, the messages error handler does:

```python
if isinstance(response, ErrorResponse):
    yield AnthropicErrorResponse(
        error=AnthropicError(type=response.error.type, message=response.error.message)
    ).model_dump()
```

ErrorResponse (from vllm.entrypoints.openai.protocol) has top-level .type and .message attributes; there is no nested .error object, so this raises an AttributeError at runtime. It should be:

```python
error=AnthropicError(type=response.type, message=response.message)
```
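For illustration, here is a minimal, self-contained sketch of the corrected mapping. The dataclasses below are stand-ins for vLLM's ErrorResponse and the project's Anthropic models (the real definitions live in vLLM and in this repo), and to_anthropic_error is a hypothetical helper, not a function from the PR:

```python
from dataclasses import dataclass

# Stand-ins only: real code imports ErrorResponse from vLLM and the
# Anthropic error models from the project's own modules.
@dataclass
class ErrorResponse:
    type: str
    message: str

@dataclass
class AnthropicError:
    type: str
    message: str

def to_anthropic_error(response: ErrorResponse) -> dict:
    # Read the top-level attributes; there is no nested response.error.
    err = AnthropicError(type=response.type, message=response.message)
    return {"type": "error", "error": {"type": err.type, "message": err.message}}
```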

Major version bump: transformers>=5.2.0

This jumps from >=4.57.0 to >=5.2.0 — a major version change. Is this actually required by vLLM 0.16.0 or the new endpoints? If not strictly necessary, I'd prefer keeping the lower bound at 4.x to avoid breaking existing builds. If it is required, let's call it out explicitly in the PR description so we know the reasoning.


Missing newline at end of engine.py

The diff shows \ No newline at end of file. Please add a trailing newline.


PR title is misleading: vLLM is already at 0.16.0 on main

The title says "vllm=0.16.0" but main already has vllm[flashinfer]==0.16.0 (merged in #272). The actual changes here are the uv migration, LMCache support, and the new endpoints. Consider updating the title to reflect what's actually new, e.g.:
feat: uv installer, LMCache support, add /v1/responses and /v1/messages endpoints


LMCache: no version pin

uv pip install --system lmcache has no version constraint. For reproducible builds, pin it (e.g., lmcache==x.y.z or at least lmcache>=x.y).


New endpoints not documented

The /v1/responses and /v1/messages routes are added but not mentioned in any docs or README. The PR description says "note these will not be exposed with normal queue delay" — can you elaborate? If they're user-facing, they should be documented. If they're experimental/internal, a code comment would help future readers.
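If these routes do get documented, hedged request examples like the following could go in the README. Field names follow the public Anthropic Messages and OpenAI Responses API shapes, which vLLM's servers mirror; "my-model" is a placeholder, and the worker may wrap these bodies in its own input envelope:

```python
# Illustrative request bodies only; the worker's exact input wrapping
# (e.g. an outer {"input": {...}} envelope) may differ.
messages_request = {
    "model": "my-model",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "Hello"}],
}

responses_request = {
    "model": "my-model",
    "input": "Hello",
}
```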


"RunPod" → "Runpod" branding change

Is this an official branding decision? The codebase (handler.py comments, engine.py comments, CLAUDE.md, etc.) still uses "RunPod" extensively. If this is intentional, it should probably be a separate follow-up PR that does a complete sweep, not mixed in with feature work.


Minor: inconsistent engine initialization params

responses_engine gets enable_log_outputs but messages_engine does not. Is that intentional, or should messages_engine also support it (if the AnthropicServingMessages constructor accepts it)?


Summary

The core changes (uv, LMCache, new endpoints) are solid. Main blockers:

  1. Bug: Fix response.error.type → response.type in messages handler
  2. Clarify: Is transformers>=5.2.0 required?
  3. Pin: lmcache version

The rest are smaller items. Happy to re-review once the above are addressed!

@TimPietruskyRunPod
Contributor

Correction to my review: Disregard the point about the PR title being misleading. The vLLM package was bumped in #272, but this PR is about wiring up the new 0.16.0 features (Anthropic /v1/messages, OpenAI /v1/responses, LMCache support) — so the title is accurate. The transformers>=5.2.0 bump is also likely required by these new vLLM 0.16.0 APIs, though it'd still be good to confirm.

The remaining items from my review still stand:

  1. Bug: response.error.type → response.type in _handle_messages_request
  2. Pin: lmcache version for reproducible builds
  3. Minor: missing newline at EOF in engine.py, docs for new endpoints, branding consistency, enable_log_outputs parity
