
Commit a0fe1df

feat: better hub support & concise README for the main repo (#215)

* feat: moved config into docs; added banner; auto detect "messages" in input
* docs: moved config into docs
* chore: added .DS_Store
* chore: get the original stuff working again
* chore: remove all changes
* docs: reduced toc and added small config table

Co-authored-by: Tim Pietrusky <tim.pietrusky@runpod.io>

1 parent 0e0d6df commit a0fe1df

File tree

7 files changed: +444 −134 lines changed

.gitignore

Lines changed: 1 addition & 0 deletions

@@ -4,3 +4,4 @@ runpod.toml
 .env
 test/*
 vllm-base/vllm-*
+.DS_Store

.runpod/README.md

Lines changed: 260 additions & 0 deletions

@@ -0,0 +1,260 @@
![vLLM worker banner](https://cpjrphpz3t5wbwfe.public.blob.vercel-storage.com/worker-vllm_banner.jpeg)

Run LLMs using [vLLM](https://docs.vllm.ai) with an OpenAI-compatible API

---

[![RunPod](https://api.runpod.io/badge/runpod-workers/worker-vllm)](https://www.runpod.io/console/hub/runpod-workers/worker-vllm)

---

## Endpoint Configuration

All behaviour is controlled through environment variables:

| Environment Variable | Description | Default | Options |
| --- | --- | --- | --- |
| `MODEL_NAME` | Path of the model weights | "facebook/opt-125m" | Local folder or Hugging Face repo ID |
| `HF_TOKEN` | Hugging Face access token for gated/private models | | Your Hugging Face access token |
| `MAX_MODEL_LEN` | Model's maximum context length | | Integer (e.g., 4096) |
| `QUANTIZATION` | Quantization method | | "awq", "gptq", "squeezellm", "bitsandbytes" |
| `TENSOR_PARALLEL_SIZE` | Number of GPUs | 1 | Integer |
| `GPU_MEMORY_UTILIZATION` | Fraction of GPU memory to use | 0.95 | Float between 0.0 and 1.0 |
| `MAX_NUM_SEQS` | Maximum number of sequences per iteration | 256 | Integer |
| `CUSTOM_CHAT_TEMPLATE` | Custom chat template override | | Jinja2 template string |
| `ENABLE_AUTO_TOOL_CHOICE` | Enable automatic tool selection | false | Boolean (true or false) |
| `TOOL_CALL_PARSER` | Parser for tool calls | | "mistral", "hermes", "llama3_json", "granite", "deepseek_v3", etc. |
| `OPENAI_SERVED_MODEL_NAME_OVERRIDE` | Override the served model name in the API | | String |
| `MAX_CONCURRENCY` | Maximum concurrent requests | 300 | Integer |

For complete configuration options, see the [full configuration documentation](https://github.com/runpod-workers/worker-vllm/blob/main/docs/configuration.md).
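
As an illustration of the two tool-calling variables above: with `ENABLE_AUTO_TOOL_CHOICE=true` and a matching `TOOL_CALL_PARSER` set on the endpoint, requests can carry standard OpenAI-style `tools`. A minimal sketch (the `get_weather` schema is made up for illustration; client setup as in the Usage section below):

```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("RUNPOD_API_KEY"),
    base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",
)

# Hypothetical tool schema; the worker only parses tool calls when
# ENABLE_AUTO_TOOL_CHOICE and TOOL_CALL_PARSER are configured.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
```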

## API Usage

This worker supports two API formats: **RunPod native** and **OpenAI-compatible**.

### RunPod Native API

For testing directly in the RunPod UI, use these examples in your endpoint's request tab.

#### Chat Completions

```json
{
  "input": {
    "messages": [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "What is the capital of France?" }
    ],
    "sampling_params": {
      "max_tokens": 100,
      "temperature": 0.7
    }
  }
}
```

#### Chat Completions (Streaming)

```json
{
  "input": {
    "messages": [
      { "role": "user", "content": "Write a short story about a robot." }
    ],
    "sampling_params": {
      "max_tokens": 500,
      "temperature": 0.8
    },
    "stream": true
  }
}
```

#### Text Generation

For direct text generation without chat format:

```json
{
  "input": {
    "prompt": "The capital of France is",
    "sampling_params": {
      "max_tokens": 64,
      "temperature": 0.0
    }
  }
}
```

#### List Models

```json
{
  "input": {
    "openai_route": "/v1/models"
  }
}
```

---

### OpenAI-Compatible API

For external clients and SDKs, use the `/openai/v1` path prefix with your RunPod API key.

#### Chat Completions

**Path:** `/openai/v1/chat/completions`

```json
{
  "model": "meta-llama/Llama-2-7b-chat-hf",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "What is the capital of France?" }
  ],
  "max_tokens": 100,
  "temperature": 0.7
}
```
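
The same body can also be sent without an SDK; a minimal `requests` sketch (endpoint ID and API key are placeholders, as throughout this README):

```python
import requests

resp = requests.post(
    "https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1/chat/completions",
    headers={"Authorization": "Bearer <API_KEY>"},
    json={
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "max_tokens": 100,
        "temperature": 0.7,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```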

#### Chat Completions (Streaming)

```json
{
  "model": "meta-llama/Llama-2-7b-chat-hf",
  "messages": [
    { "role": "user", "content": "Write a short story about a robot." }
  ],
  "max_tokens": 500,
  "temperature": 0.8,
  "stream": true
}
```

#### Text Completions

**Path:** `/openai/v1/completions`

```json
{
  "model": "meta-llama/Llama-2-7b-chat-hf",
  "prompt": "The capital of France is",
  "max_tokens": 100,
  "temperature": 0.7
}
```

#### List Models

**Path:** `/openai/v1/models`

```json
{}
```
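
With the official `openai` SDK this route is a one-liner; a minimal sketch (same client setup as in the Usage section below):

```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("RUNPOD_API_KEY"),
    base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",
)

# Lists the models served by the endpoint
for model in client.models.list().data:
    print(model.id)
```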

#### Response Format

Both APIs return the same response format:

```json
{
  "choices": [
    {
      "index": 0,
      "message": { "role": "assistant", "content": "Paris." },
      "finish_reason": "stop"
    }
  ],
  "usage": { "prompt_tokens": 9, "completion_tokens": 1, "total_tokens": 10 }
}
```
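
For quick scripting, the useful fields are plain dictionary lookups; a small sketch over the example body above:

```python
import json

# The example response shown above, as a string for illustration
body = """
{
  "choices": [
    {
      "index": 0,
      "message": { "role": "assistant", "content": "Paris." },
      "finish_reason": "stop"
    }
  ],
  "usage": { "prompt_tokens": 9, "completion_tokens": 1, "total_tokens": 10 }
}
"""

data = json.loads(body)
print(data["choices"][0]["message"]["content"])  # "Paris."
print(data["usage"]["total_tokens"])             # 10
```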

---

## Usage

Below are minimal Python snippets you can copy-paste to get started quickly.

> Replace `<ENDPOINT_ID>` with your endpoint ID and `<API_KEY>` with a [RunPod API key](https://docs.runpod.io/get-started/api-keys).

### OpenAI compatible API

Minimal Python example using the official `openai` SDK:

```python
import os

from openai import OpenAI

# Initialize the OpenAI client with your RunPod API key and endpoint URL
client = OpenAI(
    api_key=os.getenv("RUNPOD_API_KEY"),
    base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",
)
```

`Chat Completions (Non-Streaming)`

```python
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Explain quantum computing in simple terms"}],
    temperature=0,
    max_tokens=100,
)
print(f"Response: {response.choices[0].message.content}")
```

`Chat Completions (Streaming)`

```python
response_stream = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Explain quantum computing in simple terms"}],
    temperature=0,
    max_tokens=100,
    stream=True,
)
for response in response_stream:
    print(response.choices[0].delta.content or "", end="", flush=True)
```

### RunPod Native API

```python
import requests

# /runsync waits for the job to complete and returns the output directly;
# see below for the asynchronous /run variant.
response = requests.post(
    "https://api.runpod.ai/v2/<ENDPOINT_ID>/runsync",
    headers={"Authorization": "Bearer <API_KEY>"},
    json={
        "input": {
            "messages": [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": "Explain quantum computing in simple terms"}
            ],
            "sampling_params": {
                "temperature": 0.7,
                "max_tokens": 150
            }
        }
    },
)

result = response.json()
print(result["output"])
```
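
`/runsync` blocks until the job finishes. The asynchronous `/run` route instead returns a job ID right away; a minimal polling sketch using the standard `/status/{id}` route (same placeholders as above; the terminal status set is a simplification):

```python
import time

import requests

headers = {"Authorization": "Bearer <API_KEY>"}
base = "https://api.runpod.ai/v2/<ENDPOINT_ID>"

# Submit the job; /run returns an ID immediately instead of the output
job = requests.post(f"{base}/run", headers=headers, json={
    "input": {
        "prompt": "The capital of France is",
        "sampling_params": {"max_tokens": 16}
    }
}).json()

# Poll until the job reaches a terminal state
while True:
    status = requests.get(f"{base}/status/{job['id']}", headers=headers).json()
    if status["status"] in ("COMPLETED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

print(status.get("output"))
```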

## Compatibility

For supported models, see the [vLLM supported models documentation](https://docs.vllm.ai/en/latest/models/supported_models.html).

Anything not recognized by worker-vllm is forwarded to vLLM's engine, so advanced options in the vLLM docs (guided generation, LoRA, speculative decoding, etc.) also work.
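For example, vLLM's guided decoding can be reached through the OpenAI-compatible route by passing extra fields. A sketch using the `openai` SDK's `extra_body` (assuming the deployed vLLM version supports the `guided_json` extension):

```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("RUNPOD_API_KEY"),
    base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",
)

# guided_json is a vLLM extension, not part of the OpenAI spec; it
# constrains the generated text to match the given JSON schema.
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Name a European capital as JSON."}],
    extra_body={"guided_json": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    }},
    max_tokens=50,
)
print(response.choices[0].message.content)
```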

## Documentation

- **[🚀 Deployment Guide](https://docs.runpod.io/serverless/vllm/get-started)** - Step-by-step setup
- **[📖 Configuration Reference](https://github.com/runpod-workers/worker-vllm/blob/main/docs/configuration.md)** - All environment variables
- **[🏗️ Advanced Deployment](https://github.com/runpod-workers/worker-vllm/blob/main/docs/deployment.md)** - Custom builds and strategies
- **[🔧 Development Guide](https://github.com/runpod-workers/worker-vllm/blob/main/docs/conventions.md)** - Architecture and patterns

.runpod/tests.json

Lines changed: 2 additions & 3 deletions

@@ -12,7 +12,6 @@
     "input": {
       "openai_route": "/v1/chat/completions",
       "openai_input": {
-        "model": "HuggingFaceTB/SmolLM2-135M-Instruct",
         "messages": [
           {
             "role": "system",
@@ -23,8 +22,8 @@
             "content": "Explain what a neural network is in one sentence."
           }
         ],
-        "max_tokens": 50,
-        "temperature": 0.7
+        "max_tokens": 200,
+        "temperature": 0.1
       }
     },
     "timeout": 30000
