
Add anthropic endpoint #21341


Status: Open. Wants to merge 13 commits into base: main.
2 changes: 1 addition & 1 deletion docs/community/meetups.md
Member comment: Doc changes persist.

```diff
@@ -1,6 +1,6 @@
 # Meetups
 
-We host regular meetups in San Francisco Bay Area every 2 months. We will share the project updates from the vLLM team and have guest speakers from the industry to share their experience and insights. Please find the materials of our previous meetups below:
+We host regular meetups at San Francisco Bay Area every 2 months. We will share the project updates from the vLLM team and have guest speakers from the industry to share their experience and insights. Please find the materials of our previous meetups below:
 
 - [NYC vLLM Meetup](https://lu.ma/c1rqyf1f), May 7th, 2025. [[Slides]](https://docs.google.com/presentation/d/1_q_aW_ioMJWUImf1s1YM-ZhjXz8cUeL0IJvaquOYBeA/edit?usp=sharing)
 - [Asia Developer Day](https://www.sginnovate.com/event/limited-availability-morning-evening-slots-remaining-inaugural-vllm-asia-developer-day), April 3rd 2025. [[Slides]](https://docs.google.com/presentation/d/19cp6Qu8u48ihB91A064XfaXruNYiBOUKrBxAmDOllOo/edit?usp=sharing).
```
2 changes: 1 addition & 1 deletion docs/models/extensions/fastsafetensor.md
Member comment: Doc changes persist.

```diff
@@ -2,4 +2,4 @@ Loading Model weights with fastsafetensors
 ===================================================================
 
 Using fastsafetensors library enables loading model weights to GPU memory by leveraging GPU direct storage. See [their GitHub repository](https://github.com/foundation-model-stack/fastsafetensors) for more details.
-For enabling this feature, set the environment variable ``USE_FASTSAFETENSOR`` to ``true``
+To enable this feature, set the environment variable ``USE_FASTSAFETENSOR`` to ``true``
```
41 changes: 41 additions & 0 deletions vllm/entrypoints/openai/api_server.py
```diff
@@ -106,13 +106,54 @@
 from vllm.v1.metrics.prometheus import get_prometheus_registry
 from vllm.version import __version__ as VLLM_VERSION
 
+from fastapi import APIRouter, Request, HTTPException
+from uuid import uuid4
+from .schemas import AnthropicMessagesRequest, AnthropicMessagesResponse
```
Contributor comment (critical): The import path for the Anthropic schemas appears to be incorrect. The new schemas are defined in `vllm/entrypoints/openai/schemas_anthropic.py`, but you are importing from `.schemas`. This will likely cause an `ImportError` at runtime.

Suggested change:

```diff
-from .schemas import AnthropicMessagesRequest, AnthropicMessagesResponse
+from .schemas_anthropic import AnthropicMessagesRequest, AnthropicMessagesResponse
```


```diff
 prometheus_multiproc_dir: tempfile.TemporaryDirectory
 
 # Cannot use __name__ (https://github.com/vllm-project/vllm/pull/4765)
 logger = init_logger('vllm.entrypoints.openai.api_server')
 
 _running_tasks: set[asyncio.Task] = set()
 
 router = APIRouter()
 
+@router.post("/v1/messages")
+async def anthropic_messages(request: Request):
+    body = await request.json()
+    # Validate Anthropic headers and fields
+    api_key = request.headers.get("x-api-key")
+    version = request.headers.get("anthropic-version")
+    if not api_key or not version:
+        raise HTTPException(status_code=400, detail="Missing required Anthropic headers.")
+
+    # Convert messages to prompt
+    prompt = convert_messages_to_prompt(body["messages"])
+
+    # Call existing vLLM generation logic
+    llm_response = await vllm_generate(
+        model=body["model"],
+        prompt=prompt,
+        max_tokens=body.get("max_tokens", 1024)
+    )
+
+    # Return response in Anthropic format
+    output = {
+        "id": f"msg_{uuid4().hex[:24]}",
+        "type": "message",
+        "role": "assistant",
+        "content": [{"type": "text", "text": llm_response["text"]}],
+        "model": body["model"],
+        "stop_reason": llm_response.get("stop_reason", "end_turn"),
+        "stop_sequence": None,
+        "usage": {
+            "input_tokens": llm_response["prompt_tokens"],
+            "output_tokens": llm_response["completion_tokens"],
+        }
+    }
+    return output
```

CI (GitHub Actions / pre-commit) failures reported on this hunk:

- Line 132: Name "convert_messages_to_prompt" is not defined [name-defined]
- Line 133 (Ruff F821): undefined name `convert_messages_to_prompt`
- Line 135: Name "vllm_generate" is not defined [name-defined]
- Line 136 (Ruff F821): undefined name `vllm_generate`
Contributor comment on lines +122 to +155 (critical):

This endpoint implementation has several critical issues that will prevent it from working correctly:

1. Undefined Functions: `convert_messages_to_prompt` and `vllm_generate` are not defined or imported in this file, which will cause a `NameError` at runtime.
2. Unsafe Request Handling: The request body is parsed manually with `request.json()`, and dictionary keys are accessed directly (e.g., `body["messages"]`). This is unsafe and can lead to `KeyError` exceptions and 500 errors if the request is malformed. You should leverage FastAPI's Pydantic integration for automatic request validation and parsing.
3. Not Reusing Existing Logic: The PR description mentions reusing existing logic, but the call to the undefined `vllm_generate` function doesn't do that. The server already has a robust completion generation pipeline that should be used here.

I suggest rewriting this function to address these points by using the `AnthropicMessagesRequest` Pydantic model for validation and calling the existing completion handler. This will make the implementation robust and consistent with the rest of the API server.

You will need to add the following imports at the top of the file:

```python
from vllm.entrypoints.openai.protocol import (CompletionRequest,
                                              CompletionResponse, ErrorResponse)
from vllm.entrypoints.openai.tool_parsers.utils import (
    convert_messages_to_prompt)
```

Here is the suggested implementation for the `anthropic_messages` function:

```python
@router.post("/v1/messages", response_model=AnthropicMessagesResponse)
async def anthropic_messages(anthropic_request: AnthropicMessagesRequest,
                             raw_request: Request):
    # Validate Anthropic headers
    api_key = raw_request.headers.get("x-api-key")
    version = raw_request.headers.get("anthropic-version")
    if not api_key or not version:
        raise HTTPException(status_code=400,
                            detail="Missing required Anthropic headers.")

    # Convert messages to prompt
    prompt = convert_messages_to_prompt(anthropic_request.messages)

    # Create a vLLM CompletionRequest
    completion_request = CompletionRequest(
        model=anthropic_request.model,
        prompt=prompt,
        max_tokens=anthropic_request.max_tokens,
        stream=False,  # This endpoint is non-streaming.
    )

    # Get the completion handler and call it
    completion_handler = completion(raw_request)
    if completion_handler is None:
        raise HTTPException(status_code=500,
                            detail="Completion handler is not available.")

    result = await completion_handler.create_completion(completion_request,
                                                        raw_request)

    if isinstance(result, ErrorResponse):
        raise HTTPException(status_code=result.code, detail=result.message)

    assert isinstance(result, CompletionResponse)

    # Return response in Anthropic format
    return AnthropicMessagesResponse(
        id=f"msg_{uuid4().hex[:24]}",
        type="message",
        role="assistant",
        content=[{
            "type": "text",
            "text": result.choices[0].text
        }],
        model=anthropic_request.model,
        stop_reason=result.choices[0].finish_reason,
        stop_sequence=None,
        usage={
            "input_tokens": result.usage.prompt_tokens,
            "output_tokens": result.usage.completion_tokens,
        })
```



```diff
 
 @asynccontextmanager
 async def lifespan(app: FastAPI):
```
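For orientation, a client call against the new endpoint would look roughly like the sketch below. This is a hypothetical example, not part of the PR: the host, port, API key value, and model name are placeholders; the path and the two required headers are taken from the endpoint code above, which only checks that the headers are present.

```python
import requests

# Hypothetical call against a locally running vLLM server with this PR applied.
response = requests.post(
    "http://localhost:8000/v1/messages",
    headers={
        "x-api-key": "dummy-key",           # placeholder; only presence is checked
        "anthropic-version": "2023-06-01",  # any non-empty value passes the check
    },
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
        "max_tokens": 256,
        "messages": [{"role": "user", "content": "Hello!"}],
    },
)
print(response.json())  # Anthropic-style message object on success
```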
23 changes: 23 additions & 0 deletions vllm/entrypoints/openai/protocol_anthropic.py
Member comment: Sorry to ask you to move this again, could it instead be moved to vllm/entrypoints/anthropic/protocol.py?

```diff
@@ -0,0 +1,23 @@
+from pydantic import BaseModel
+from typing import List, Dict, Any, Optional
+
+class AnthropicMessageBlock(BaseModel):
+    role: str  # "user" | "assistant"
+    content: Any
+
+class AnthropicMessagesRequest(BaseModel):
+    model: str
+    messages: List[AnthropicMessageBlock]
+    max_tokens: int
+    system: Optional[str] = None
+    # Add further optional fields per API docs
+
+class AnthropicMessagesResponse(BaseModel):
+    id: str
+    type: str = "message"
+    role: str = "assistant"
+    content: List[Dict[str, Any]]
+    model: str
+    stop_reason: Optional[str]
+    stop_sequence: Optional[str]
+    usage: Dict[str, int]
```

CI (GitHub Actions / pre-commit) failures reported on this file:

- Line 2 (Ruff UP035): `typing.Dict` is deprecated, use `dict` instead; `typing.List` is deprecated, use `list` instead
- Line 12 (Ruff UP006): use `list` instead of `List` for type annotation
- Line 22 (Ruff UP006): use `dict` instead of `Dict` and `list` instead of `List` for type annotations
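As a quick illustration of what these models buy over manual dict access, here is a minimal standalone sketch. It mirrors the request schema above (rewritten with built-in `list` generics, which would also satisfy the Ruff UP035/UP006 findings) and validates a made-up payload; a malformed request fails fast with a `ValidationError` instead of a later `KeyError`.

```python
from typing import Any, Optional

from pydantic import BaseModel, ValidationError


class AnthropicMessageBlock(BaseModel):
    role: str  # "user" | "assistant"
    content: Any


class AnthropicMessagesRequest(BaseModel):
    model: str
    messages: list[AnthropicMessageBlock]
    max_tokens: int
    system: Optional[str] = None


# A well-formed payload validates and coerces message dicts into blocks.
req = AnthropicMessagesRequest.model_validate({
    "model": "example-model",
    "max_tokens": 128,
    "messages": [{"role": "user", "content": "Hi there"}],
})
print(req.messages[0].role)  # -> "user"

# A payload missing a required field is rejected up front.
try:
    AnthropicMessagesRequest.model_validate({"model": "example-model"})
except ValidationError as e:
    print(e.error_count(), "validation errors")
```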
11 changes: 11 additions & 0 deletions vllm/entrypoints/openai/tool_parsers/utils.py
```diff
@@ -122,3 +122,14 @@ def consume_space(i: int, s: str) -> int:
     while i < len(s) and s[i].isspace():
         i += 1
     return i
+
+def convert_messages_to_prompt(messages):
+    # Converts an Anthropic-style conversation to a plain prompt string.
+    prompt = ""
+    for msg in messages:
+        if msg["role"] == "user":
+            prompt += f"Human: {msg['content']}\n"
+        elif msg["role"] == "assistant":
+            prompt += f"Assistant: {msg['content']}\n"
+    return prompt
```
Contributor comment on lines +126 to +134 (high):

This function is not robust enough for the Anthropic API and has several issues:

1. No Type Hints: The function signature lacks type hints, which makes it harder to understand and use correctly.
2. Unsafe Dictionary Access: It uses direct dictionary access (e.g., `msg["role"]`), which is unsafe and will raise a `KeyError` if a message is malformed, leading to a 500 error.
3. Incorrect Content Handling: It assumes `msg['content']` is always a string. However, the Anthropic API allows content to be a string or a list of content blocks (e.g., `[{"type": "text", "text": "..."}]`). Simply stringifying a list of blocks will result in an incorrect prompt.

I suggest a more robust implementation that handles these cases gracefully. You will need to add `from typing import Any, Dict, List` to the imports at the top of the file.

```python
def convert_messages_to_prompt(messages: List[Dict[str, Any]]) -> str:
    # Converts an Anthropic-style conversation to a plain prompt string.
    prompt = ""
    for msg in messages:
        role = msg.get("role")
        content = msg.get("content")

        if role == "user":
            role_str = "Human"
        elif role == "assistant":
            role_str = "Assistant"
        else:
            # Skip unknown roles
            continue

        text_content = ""
        if isinstance(content, str):
            text_content = content
        elif isinstance(content, list):
            for block in content:
                if isinstance(block, dict) and block.get("type") == "text":
                    text_content += block.get("text", "")

        if text_content:
            prompt += f"{role_str}: {text_content}\n"
    return prompt
```
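Assuming the suggested implementation above is adopted, usage would look like the following sketch (the import path follows this PR's diff; the messages are made up):

```python
from vllm.entrypoints.openai.tool_parsers.utils import convert_messages_to_prompt

messages = [
    {"role": "user", "content": "What is vLLM?"},
    # Block-list content is flattened to its text parts.
    {"role": "assistant",
     "content": [{"type": "text", "text": "A fast LLM serving engine."}]},
    {"role": "system", "content": "ignored"},  # unknown role is skipped
]
print(convert_messages_to_prompt(messages))
# Human: What is vLLM?
# Assistant: A fast LLM serving engine.
```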

Member comment on lines +126 to +134: Not sure this is a tool parser.


10 changes: 10 additions & 0 deletions vllm/v1/engine/output_processor.py
```diff
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: Apache-2.0
 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
 
+import time
 import asyncio
 from collections.abc import Iterable
 from dataclasses import dataclass
@@ -10,6 +11,7 @@
 
 from vllm.outputs import (CompletionOutput, PoolingOutput,
                           PoolingRequestOutput, RequestOutput)
+from vllm.sequence import RequestMetrics
 from vllm.sampling_params import RequestOutputKind
 from vllm.transformers_utils.tokenizer import AnyTokenizer
 from vllm.transformers_utils.tokenizer_group import TokenizerGroup
@@ -410,6 +412,14 @@ def process_outputs(
             if request_output := req_state.make_request_output(
                     new_token_ids, pooling_output, finish_reason, stop_reason,
                     kv_transfer_params, num_cached_tokens):
+                request_output.metrics = RequestMetrics(
+                    arrival_time=req_state.stats.arrival_time,
+                    last_token_time=req_state.stats.last_token_ts,
+                    first_scheduled_time=req_state.stats.scheduled_ts,
+                    first_token_time=req_state.stats.first_token_ts,
+                    time_in_queue=req_state.stats.scheduled_ts - req_state.stats.arrival_time,
+                    finished_time=time.monotonic()
+                )
                 if req_state.queue is not None:
                     # AsyncLLM: put into queue for handling by generate().
                     req_state.queue.put(request_output)
```
2 changes: 1 addition & 1 deletion vllm/v1/engine/processor.py
```diff
@@ -247,7 +247,7 @@ def process_inputs(
                 f"is out of range [0, {data_parallel_size}).")
 
         if arrival_time is None:
-            arrival_time = time.time()
+            arrival_time = time.monotonic()
 
         # Process inputs, which includes:
         # 1. Tokenize text prompt, with LoRA request if one exists.
```
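This change keeps `arrival_time` on the same clock as the monotonic timestamps used for the new `RequestMetrics` fields in `output_processor.py` above: subtracting a `time.time()` value from a `time.monotonic()` value (as `time_in_queue` would otherwise do) yields a meaningless number. A quick standalone illustration of why the two clocks cannot be mixed:

```python
import time

wall = time.time()       # seconds since the Unix epoch
mono = time.monotonic()  # seconds from an arbitrary, platform-dependent origin

# Cross-clock differences are meaningless (typically a huge value):
print(wall - mono)

# Durations are only valid within a single clock:
start = time.monotonic()
time.sleep(0.01)
print(time.monotonic() - start)  # ~0.01 s, unaffected by wall-clock adjustments
```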