36 changes: 34 additions & 2 deletions .runpod/README.md
@@ -1,4 +1,4 @@
![vLLM worker banner](https://cpjrphpz3t5wbwfe.public.blob.vercel-storage.com/worker-vllm_banner.jpeg)
![vLLM worker banner](https://image.runpod.ai/preview/vllm/vllm-banner.png)

Run LLMs using [vLLM](https://docs.vllm.ai) with an OpenAI-compatible API

@@ -32,6 +32,9 @@ All behaviour is controlled through environment variables:

For complete configuration options, see the [full configuration documentation](https://github.com/runpod-workers/worker-vllm/blob/main/docs/configuration.md).

### Specify Transformers Version
To change the version of the [Transformers library](https://github.com/huggingface/transformers), set the `TRANSFORMERS_VERSION` environment variable to the version you want to use. Note that this may break the handler, so use it for development purposes only.

## API Usage

This worker supports two API formats: **RunPod native** and **OpenAI-compatible**.
@@ -157,6 +160,35 @@ For external clients and SDKs, use the `/openai/v1` path prefix with your RunPod
{}
```

#### OpenAI Responses API

**Path:** `/openai/v1/responses`

Supports the [OpenAI Responses API](https://platform.openai.com/docs/api-reference/responses) format. Note: this route bypasses the RunPod queue and is served directly; use the `/openai/`-prefixed paths rather than the RunPod job queue for these endpoints.

```json
{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"input": "Tell me a joke."
}
```
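As a sketch, the request above can be sent with plain `requests`. The `RUNPOD_ENDPOINT_ID` variable is a hypothetical convenience for this example (substitute your own endpoint ID), and the request is only fired when `RUNPOD_API_KEY` is set:

```python
import os

# Hypothetical env var for this example; substitute your own endpoint ID
endpoint_id = os.getenv("RUNPOD_ENDPOINT_ID", "<ENDPOINT_ID>")
url = f"https://api.runpod.ai/v2/{endpoint_id}/openai/v1/responses"

payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "input": "Tell me a joke.",
}
headers = {"Authorization": f"Bearer {os.getenv('RUNPOD_API_KEY', '')}"}

# Only send the request when an API key is actually configured
if os.getenv("RUNPOD_API_KEY"):
    import requests  # imported here so the snippet runs without the dependency

    response = requests.post(url, json=payload, headers=headers, timeout=60)
    print(response.json())
```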

#### Anthropic Messages API

**Path:** `/anthropic/v1/messages`

Supports the [Anthropic Messages API](https://docs.anthropic.com/en/api/messages) format. Served directly, bypassing the RunPod queue.

```json
{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"max_tokens": 256,
"messages": [
{"role": "user", "content": "Hello!"}
]
}
```
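A minimal sketch of the same call from Python, assuming this route accepts the same `Bearer` token auth as the OpenAI-prefixed routes (the `RUNPOD_ENDPOINT_ID` variable is a hypothetical placeholder for this example):

```python
import os

# Hypothetical env var for this example; substitute your own endpoint ID
endpoint_id = os.getenv("RUNPOD_ENDPOINT_ID", "<ENDPOINT_ID>")
url = f"https://api.runpod.ai/v2/{endpoint_id}/anthropic/v1/messages"

payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "Hello!"}],
}
headers = {"Authorization": f"Bearer {os.getenv('RUNPOD_API_KEY', '')}"}

# Only send the request when an API key is actually configured
if os.getenv("RUNPOD_API_KEY"):
    import requests  # imported here so the snippet runs without the dependency

    response = requests.post(url, json=payload, headers=headers, timeout=60)
    print(response.json())
```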

#### Response Format

Both APIs return the same response format:
@@ -190,7 +222,7 @@ Minimal Python example using the official `openai` SDK:
from openai import OpenAI
import os

# Initialize the OpenAI Client with your RunPod API Key and Endpoint URL
# Initialize the OpenAI Client with your Runpod API Key and Endpoint URL
client = OpenAI(
api_key=os.getenv("RUNPOD_API_KEY"),
base_url=f"https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",
24 changes: 13 additions & 11 deletions Dockerfile
@@ -1,20 +1,21 @@
FROM nvidia/cuda:12.9.1-base-ubuntu22.04

RUN apt-get update -y \
&& apt-get install -y python3-pip
&& apt-get install -y python3-pip curl \
&& curl -LsSf https://astral.sh/uv/0.10.9/install.sh | sh

RUN ldconfig /usr/local/cuda-12.9/compat/

# Install vLLM with FlashInfer - use CUDA 12.8 PyTorch wheels (compatible with vLLM 0.15.1)
RUN python3 -m pip install --upgrade pip && \
python3 -m pip install "vllm[flashinfer]==0.16.0" --extra-index-url https://download.pytorch.org/whl/cu129
ENV PATH="/root/.local/bin:$PATH"

RUN ldconfig /usr/local/cuda-12.9/compat/

# Install vLLM with FlashInfer - use CUDA 12.9 PyTorch wheels
RUN uv pip install --system "packaging>=24.2" && \
uv pip install --system "vllm[flashinfer]==0.16.0" --extra-index-url https://download.pytorch.org/whl/cu129

# Install additional Python dependencies (after vLLM to avoid PyTorch version conflicts)
COPY builder/requirements.txt /requirements.txt
RUN --mount=type=cache,target=/root/.cache/pip \
python3 -m pip install --upgrade -r /requirements.txt
RUN --mount=type=cache,target=/root/.cache/uv \
uv pip install --system -r /requirements.txt

# Setup for Option 2: Building the Image with the Model included
ARG MODEL_NAME=""
@@ -46,12 +47,13 @@ ENV MODEL_NAME=$MODEL_NAME \
ENV PYTHONPATH="/:/vllm-workspace"

RUN if [ "${VLLM_NIGHTLY}" = "true" ]; then \
pip install -U vllm --pre --index-url https://pypi.org/simple --extra-index-url https://wheels.vllm.ai/nightly && \
uv pip install --system -U vllm --pre --index-url https://pypi.org/simple --extra-index-url https://wheels.vllm.ai/nightly && \
apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/* && \
pip install git+https://github.com/huggingface/transformers.git; \
uv pip install --system git+https://github.com/huggingface/transformers.git; \
fi

COPY src /src
RUN chmod +x /src/start.sh
RUN --mount=type=secret,id=HF_TOKEN,required=false \
if [ -f /run/secrets/HF_TOKEN ]; then \
export HF_TOKEN=$(cat /run/secrets/HF_TOKEN); \
@@ -61,4 +63,4 @@ RUN --mount=type=secret,id=HF_TOKEN,required=false \
fi

# Start the handler
CMD ["python3", "/src/handler.py"]
CMD ["/bin/bash", "/src/start.sh"]
40 changes: 25 additions & 15 deletions README.md
@@ -2,10 +2,16 @@

# OpenAI-Compatible vLLM Serverless Endpoint Worker

Deploy OpenAI-Compatible Blazing-Fast LLM Endpoints powered by the [vLLM](https://github.com/vllm-project/vllm) Inference Engine on RunPod Serverless with just a few clicks.
Deploy OpenAI-Compatible Blazing-Fast LLM Endpoints powered by the [vLLM](https://github.com/vllm-project/vllm) Inference Engine on Runpod Serverless with just a few clicks.

</div>

![vLLM worker banner](https://image.runpod.ai/preview/vllm/vllm-banner.png)

Current vLLM version: [0.16.0](https://github.com/vllm-project/vllm/releases/tag/v0.16.0)

> Check out our Load Balancer implementation here: [vLLM Load Balancer](https://github.com/runpod-workers/vllm-loadbalancer-ep)

## Table of Contents

- [Setting up the Serverless Worker](#setting-up-the-serverless-worker)
@@ -21,7 +27,7 @@ Deploy OpenAI-Compatible Blazing-Fast LLM Endpoints powered by the [vLLM](https:
- [Modifying your OpenAI Codebase to use your deployed vLLM Worker](#modifying-your-openai-codebase-to-use-your-deployed-vllm-worker)
- [OpenAI Request Input Parameters](#openai-request-input-parameters)
- [Chat Completions [RECOMMENDED]](#chat-completions-recommended)
- [Examples: Using your RunPod endpoint with OpenAI](#examples-using-your-runpod-endpoint-with-openai)
- [Examples: Using your Runpod endpoint with OpenAI](#examples-using-your-runpod-endpoint-with-openai)
- [Chat Completions](#chat-completions)
- [Getting a list of names for available models](#getting-a-list-of-names-for-available-models)
- [Usage: Standard (Non-OpenAI)](#usage-standard-non-openai)
@@ -33,7 +39,7 @@ Deploy OpenAI-Compatible Blazing-Fast LLM Endpoints powered by the [vLLM](https:

## Option 1: Deploy Any Model Using Pre-Built Docker Image [Recommended]

**🚀 Deploy Guide**: Follow our [step-by-step deployment guide](https://docs.runpod.io/serverless/vllm/get-started) to deploy using the RunPod Console.
**🚀 Deploy Guide**: Follow our [step-by-step deployment guide](https://docs.runpod.io/serverless/vllm/get-started) to deploy using the Runpod Console.

**📦 Docker Image**: `runpod/worker-v1-vllm:<version>`

@@ -71,6 +77,10 @@ Any env var whose name matches a valid `AsyncEngineArgs` field (uppercased) is a

For the complete list of all available environment variables, examples, and detailed descriptions: **[Configuration](docs/configuration.md)**

### Specify Transformers Version
To change the version of the [Transformers library](https://github.com/huggingface/transformers), set the `TRANSFORMERS_VERSION` environment variable to the version you want to use. Note that this may break the handler, so use it for development purposes only.


## Option 2: Build Docker Image with Model Inside

To build an image with the model baked in, you must specify the following docker arguments when building the image.
@@ -148,7 +158,7 @@ The vLLM Worker is fully compatible with OpenAI's API, and you can use it with a

**Python** (similar to Node.js, etc.):

1. When initializing the OpenAI Client in your code, change the `api_key` to your RunPod API Key and the `base_url` to your RunPod Serverless Endpoint URL in the following format: `https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1`, filling in your deployed endpoint ID. For example, if your Endpoint ID is `abc1234`, the URL would be `https://api.runpod.ai/v2/abc1234/openai/v1`.
1. When initializing the OpenAI Client in your code, change the `api_key` to your Runpod API Key and the `base_url` to your Runpod Serverless Endpoint URL in the following format: `https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1`, filling in your deployed endpoint ID. For example, if your Endpoint ID is `abc1234`, the URL would be `https://api.runpod.ai/v2/abc1234/openai/v1`.

- Before:

@@ -174,7 +184,7 @@ The vLLM Worker is fully compatible with OpenAI's API, and you can use it with a
```python
response = client.chat.completions.create(
model="gpt-3.5-turbo",
messages=[{"role": "user", "content": "Why is RunPod the best platform?"}],
messages=[{"role": "user", "content": "Why is Runpod the best platform?"}],
temperature=0,
max_tokens=100,
)
@@ -183,15 +193,15 @@ The vLLM Worker is fully compatible with OpenAI's API, and you can use it with a
```python
response = client.chat.completions.create(
model="<YOUR DEPLOYED MODEL REPO/NAME>",
messages=[{"role": "user", "content": "Why is RunPod the best platform?"}],
messages=[{"role": "user", "content": "Why is Runpod the best platform?"}],
temperature=0,
max_tokens=100,
)
```

**Using http requests**:

1. Change the `Authorization` header to your RunPod API Key and the `url` to your RunPod Serverless Endpoint URL in the following format: `https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1`
1. Change the `Authorization` header to your Runpod API Key and the `url` to your Runpod Serverless Endpoint URL in the following format: `https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1`
- Before:
```bash
curl https://api.openai.com/v1/chat/completions \
@@ -202,7 +212,7 @@ The vLLM Worker is fully compatible with OpenAI's API, and you can use it with a
"messages": [
{
"role": "user",
"content": "Why is RunPod the best platform?"
"content": "Why is Runpod the best platform?"
}
],
"temperature": 0,
@@ -219,7 +229,7 @@ The vLLM Worker is fully compatible with OpenAI's API, and you can use it with a
"messages": [
{
"role": "user",
"content": "Why is RunPod the best platform?"
"content": "Why is Runpod the best platform?"
}
],
"temperature": 0,
@@ -239,7 +249,7 @@ When using the chat completion feature of the vLLM Serverless Endpoint Worker, y
| Parameter | Type | Default Value | Description |
| ------------------- | -------------------------------- | ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `messages` | Union[str, List[Dict[str, str]]] | | List of messages, where each message is a dictionary with a `role` and `content`. The model's chat template will be applied to the messages automatically, so the model must have one or it should be specified as `CUSTOM_CHAT_TEMPLATE` env var. |
| `model` | str | | The model repo that you've deployed on your RunPod Serverless Endpoint. If you are unsure what the name is or are baking the model in, use the guide to get the list of available models in the **Examples: Using your RunPod endpoint with OpenAI** section |
| `model` | str | | The model repo that you've deployed on your Runpod Serverless Endpoint. If you are unsure what the name is or are baking the model in, use the guide to get the list of available models in the **Examples: Using your Runpod endpoint with OpenAI** section |
| `temperature` | Optional[float] | 0.7 | Float that controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make the model more random. Zero means greedy sampling. |
| `top_p` | Optional[float] | 1.0 | Float that controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. |
| `n` | Optional[int] | 1 | Number of output sequences to return for the given prompt. |
@@ -269,15 +279,15 @@ Additional parameters supported by vLLM:

</details>

### Examples: Using your RunPod endpoint with OpenAI
### Examples: Using your Runpod endpoint with OpenAI

First, initialize the OpenAI Client with your RunPod API Key and Endpoint URL:
First, initialize the OpenAI Client with your Runpod API Key and Endpoint URL:

```python
from openai import OpenAI
import os

# Initialize the OpenAI Client with your RunPod API Key and Endpoint URL
# Initialize the OpenAI Client with your Runpod API Key and Endpoint URL
client = OpenAI(
api_key=os.environ.get("RUNPOD_API_KEY"),
base_url="https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1",
@@ -293,7 +303,7 @@ This is the format used for GPT-4 and is focused on instruction-following and chat.
# Create a chat completion stream
response_stream = client.chat.completions.create(
model="<YOUR DEPLOYED MODEL REPO/NAME>",
messages=[{"role": "user", "content": "Why is RunPod the best platform?"}],
messages=[{"role": "user", "content": "Why is Runpod the best platform?"}],
temperature=0,
max_tokens=100,
stream=True,
@@ -307,7 +317,7 @@ This is the format used for GPT-4 and is focused on instruction-following and chat.
# Create a chat completion
response = client.chat.completions.create(
model="<YOUR DEPLOYED MODEL REPO/NAME>",
messages=[{"role": "user", "content": "Why is RunPod the best platform?"}],
messages=[{"role": "user", "content": "Why is Runpod the best platform?"}],
temperature=0,
max_tokens=100,
)
5 changes: 3 additions & 2 deletions builder/requirements.txt
@@ -3,12 +3,13 @@ pandas
pyarrow
runpod
huggingface-hub
packaging
lmcache==0.4.2
packaging>=24.2
typing-extensions>=4.8.0
pydantic
pydantic-settings
hf-transfer
transformers>=4.57.0
transformers>= 4.57.0,< 5
bitsandbytes>=0.45.0
kernels
torch-c-dlpack-ext