- OPT by Facebook
- Qwen by Alibaba
- Granite by IBM
- Llama by Meta
- R1 by DeepSeek
Models larger than roughly 10B parameters do not fit on a single one of our GPUs (24 GB of VRAM), so they must be split across multiple cards!
| INX | Model | Access Link | Number of Parameters | Type |
|---|---|---|---|---|
| 1 | facebook/opt-125m | facebook/opt-125m | 125M | Text generation (base LLM) |
| 2 | facebook/opt-350m | facebook/opt-350m | 350M | Text generation |
| 3 | facebook/opt-1.3b | facebook/opt-1.3b | 1.3B | Text generation |
| 4 | facebook/opt-6.7b | facebook/opt-6.7b | 6.7B | Text generation |
| 5 | facebook/opt-13b | facebook/opt-13b | 13B | Text generation |
| 6 | facebook/opt-30b | facebook/opt-30b | 30B | Text generation |
| 7 | ibm-granite/granite-3.3-2b-instruct | ibm-granite/granite-3.3-2b-instruct | 2B | Instruction-following assistant |
| 8 | ibm-granite/granite-3.3-8b-instruct | ibm-granite/granite-3.3-8b-instruct | 8B | Instruction-following assistant |
| 9 | Qwen/Qwen3-0.6B | Qwen/Qwen3-0.6B | 0.6B | Text generation |
| 10 | Qwen/Qwen3-1.7B | Qwen/Qwen3-1.7B | 1.7B | Text generation |
| 11 | Qwen/Qwen3-4B | Qwen/Qwen3-4B | 4B | Text generation / reasoning |
| 12 | Qwen/Qwen3-8B | Qwen/Qwen3-8B | 8B | Text generation / reasoning |
| 13 | Qwen/Qwen3-14B | Qwen/Qwen3-14B | 14B | Text generation / reasoning |
| 14 | Qwen/Qwen3-32B | Qwen/Qwen3-32B | 32B | Text generation / reasoning |
| 15 | meta-llama/Llama-3.2-1B-Instruct | meta-llama/Llama-3.2-1B-Instruct | 1B | Instruction-following assistant (chat, code, reasoning) |
| 16 | meta-llama/Llama-3.2-3B-Instruct | meta-llama/Llama-3.2-3B-Instruct | 3B | Instruction-following assistant (chat, code, reasoning) |
| 17 | meta-llama/Llama-3.1-8B-Instruct | meta-llama/Llama-3.1-8B-Instruct | 8B | Instruction-following assistant (chat, code, reasoning) |
| 18 | deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | 1.5B | Text generation |
| 19 | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | 7B | Text generation |
| 20 | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | 14B | Text generation |
| 21 | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | 32B | Text generation |
Each model's GPU requirement is driven by its parameter count: the required GPU memory is roughly twice the number of parameters (in billions) in GB, i.e. about 2 bytes per parameter in FP16. The following table shows the minimum GPU memory utilization (the smallest value that still enables KV cache offloading) required to run each model at its maximum possible model length.
| INX | Model | Max Model Length | Model Size (GB) | GPU Memory Utilization | KV Cache (GB) | GPUs |
|---|---|---|---|---|---|---|
| 1 | facebook/opt-125m | 2048 | 0.24 | 10% | 1.62 | 1 |
| 2 | facebook/opt-350m | 2048 | 0.62 | 10% | 1.25 | 1 |
| 3 | facebook/opt-1.3b | 2048 | 2.45 | 20% | 1.77 | 1 |
| 4 | facebook/opt-6.7b | 2048 | 12.4 | 60% | 1.26 | 1 |
| 5 | facebook/opt-13b | 2048 | 12.6 | 60% | 1.80 | 2 |
| 6 | facebook/opt-30b | 2048 | 14.0 | 65% | 2.29 | 4 |
| 7 | ibm-granite/granite-3.3-2b-instruct | 36000 | 4.74 | 35% | 3.04 | 1 |
| 8 | ibm-granite/granite-3.3-8b-instruct | 36000 | 15.2 | 90% | 5.52 | 1 |
| 9 | Qwen/Qwen3-0.6B | 2048 | 1.12 | 15% | 1.00 | 1 |
| 10 | Qwen/Qwen3-1.7B | 2048 | 3.22 | 30% | 2.45 | 1 |
| 11 | Qwen/Qwen3-4B | 20480 | 7.56 | 50% | 2.83 | 1 |
| 12 | Qwen/Qwen3-8B | 20480 | 7.64 | 50% | 3.53 | 2 |
| 13 | Qwen/Qwen3-14B | 20480 | 13.8 | 70% | 1.86 | 2 |
| 14 | Qwen/Qwen3-32B | 20480 | 15.4 | 80% | 3.08 | 4 |
| 15 | meta-llama/Llama-3.2-1B-Instruct | 36000 | 2.32 | 20% | 1.20 | 1 |
| 16 | meta-llama/Llama-3.2-3B-Instruct | 36000 | 6.02 | 50% | 4.59 | 1 |
| 17 | meta-llama/Llama-3.1-8B-Instruct | 36000 | 15.0 | 90% | 5.06 | 1 |
| 18 | deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | 20480 | 3.35 | 30% | 2.32 | 1 |
| 19 | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | 20480 | 14.3 | 80% | 3.20 | 1 |
| 20 | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | 20480 | 14.0 | 80% | 4.16 | 2 |
| 21 | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | 20480 | 15.4 | 80% | 3.05 | 4 |
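A rough sketch of this sizing rule, assuming FP16 weights (about 2 GB per billion parameters, as noted above) and our 24 GB cards; the values in the table come from our actual runs and will not match this estimate exactly:

```python
def fp16_weight_gb(params_billions: float) -> float:
    """FP16 weights take about 2 bytes per parameter, i.e. ~2 GB per billion parameters."""
    return 2.0 * params_billions


def min_gpu_memory_utilization(params_billions: float, kv_cache_gb: float,
                               num_gpus: int = 1, vram_per_gpu_gb: float = 24.0) -> float:
    """Smallest fraction of each card needed to hold the weights plus the KV cache."""
    per_gpu_gb = (fp16_weight_gb(params_billions) + kv_cache_gb) / num_gpus
    return per_gpu_gb / vram_per_gpu_gb


# facebook/opt-6.7b: ~13.4 GB of weights + 1.26 GB of KV cache on one card -> ~0.61,
# in line with the 60% utilization listed in the table.
print(round(min_gpu_memory_utilization(6.7, 1.26), 2))
```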
Based on our GPU hardware (24 GB of VRAM per card), the rule for setting GPU Memory Utilization is to pick the smallest value that fits the model weights plus the KV cache on the assigned GPUs, as estimated in the sketch above. The vLLM engine options we tune are grouped as follows (an illustrative configuration sketch follows the list):
- Parallel Config:
  - tensor-parallel-size
- Model Config:
  - model
  - max-model-len
  - quantization
- Load Config:
  - download-dir
  - safetensors-load-strategy
- Cache Config:
  - gpu-memory-utilization
  - swap-space
  - enable-prefix-caching
  - cpu-offload-gb
  - block-size
  - kv-cache-memory-bytes
  - kv-cache-dtype
  - kv-offloading-size
  - kv-offloading-backend
- Compilation Config:
  - cudagraph-capture-sizes
  - max-cudagraph-capture-size
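As a minimal sketch of how these options come together, here is an offline-equivalent configuration using vLLM's Python entry point, with illustrative values for Qwen/Qwen3-8B taken from the tables above. In the cluster the same knobs are passed to the vLLM server (e.g. through the Helm values files under models/); the keyword arguments below mirror the corresponding CLI flags.

```python
from vllm import LLM, SamplingParams

# Illustrative values for Qwen/Qwen3-8B from the tables above (assumed, not the
# exact deployment configuration).
llm = LLM(
    model="Qwen/Qwen3-8B",        # --model
    max_model_len=20480,          # --max-model-len
    tensor_parallel_size=2,       # --tensor-parallel-size
    gpu_memory_utilization=0.5,   # --gpu-memory-utilization
    enable_prefix_caching=True,   # --enable-prefix-caching
    block_size=16,                # --block-size
    swap_space=4,                 # --swap-space, GB of CPU swap space for the KV cache
)

outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```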
We send our requests through the OpenAI-compatible API exposed by vLLM.
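A minimal client sketch, assuming the server is reachable at http://localhost:8000 (via port-forwarding or the NodePort shown later in this section); the model name and prompt are illustrative:

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server ignores the API key, so any placeholder works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.completions.create(
    model="Qwen/Qwen3-8B",                         # the model served by this instance
    prompt="Translate to English: Guten Morgen.",
    max_tokens=64,
)
print(response.choices[0].text)
```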
We evaluate on the following datasets (a loading sketch follows the list):
- Alpaca
  - A dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine.
- LongBench
  - The first benchmark for bilingual, multitask, and comprehensive assessment of the long-context understanding capabilities of large language models.
- WMT16
  - A German-to-English translation dataset, used here for shared-prefix tasks.
- ShareGPT
  - The ShareGPT-Chinese-English-90k bilingual human-machine QA dataset.
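A sketch of pulling two of these datasets from the Hugging Face Hub; the hub IDs below are assumptions for illustration and may differ from the exact repositories used in our runs:

```python
from datasets import load_dataset

# Assumed hub IDs: tatsu-lab/alpaca and the wmt16 de-en configuration.
alpaca = load_dataset("tatsu-lab/alpaca", split="train")      # 52k instruction pairs
wmt16_de_en = load_dataset("wmt16", "de-en", split="test")    # German-to-English pairs

print(alpaca[0]["instruction"])
print(wmt16_de_en[0]["translation"]["de"])
```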
Each workload type is paired with a dataset:
- Single Prompt Single Response
  - Alpaca
- Shared Prefix
  - WMT16
- Chatbot Evaluation
  - ShareGPT
- Question Answering
  - LongBench/NarrativeQA
- Summarization
  - LongBench/QMSum
We categorize our prompts into three buckets, in terms of size (see the sketch after this list):
- Not cached: prompts shorter than one block (16 tokens).
- Cached but lost: prompts longer than one block but shorter than the GPU KV cache.
- Cached and saved: prompts longer than both one block and the GPU KV cache.
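A minimal sketch of this bucketing, assuming the 16-token block size above; the GPU KV cache capacity (expressed in tokens) is a hypothetical input that in practice depends on the model and the GPU memory utilization:

```python
BLOCK_SIZE = 16  # KV-cache block size in tokens, as used above

def prompt_bucket(prompt_len_tokens: int, gpu_kv_cache_tokens: int) -> str:
    """Classify a prompt relative to the block size and the GPU KV cache capacity."""
    if prompt_len_tokens < BLOCK_SIZE:
        return "not cached"
    if prompt_len_tokens < gpu_kv_cache_tokens:
        return "cached but lost"
    return "cached and saved"
```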
We categorize our prompts into two buckets, in terms of prefix size:
- No prefix: Prompts with no common prefix.
- With prefix: Prompts with a shared prefix.
We categorize our clients into two buckets, in terms of number:
- Single client: Processing one request at a time.
- Multiple clients: Processing more than one client at a time.
We categorize our clients into two buckets, in terms of request type (both cases appear in the sketch below):
- Same request: All clients send the same request.
- Random requests: Each client sends a different, randomly chosen request.
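A minimal sketch of the multiple-clients case against the same OpenAI-compatible endpoint as above; the endpoint, model name, prompt, and client count are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed endpoint

def send(prompt: str) -> str:
    # One client = one completion request against the served model.
    resp = client.completions.create(model="Qwen/Qwen3-8B", prompt=prompt, max_tokens=64)
    return resp.choices[0].text

# "Same request": every client sends an identical prompt.
# For "random requests", give each client a different prompt instead.
prompts = ["Summarize the plot of Hamlet."] * 4
with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    outputs = list(pool.map(send, prompts))
```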
Deploy:
`helm install -f models/qwen/2b.yaml vllm-qwen3-2b models/`

List deployments:
`helm list`

Uninstall deployment:
`helm uninstall vllm-qwen3-2b`

List pods:
`kubectl -n llm-servings get pods`

Export pod logs:
`kubectl -n llm-servings logs $POD_NAME > "$POD_NAME.log"`

Extract timestamps:
`./scripts/extract_ts.sh "$POD_NAME.log"`

Get metrics (make sure to edit the values in the collect_all.sh script):
`./scripts/metrics/collect_all.sh`

Get your vLLM instance port (use the NodePort, not the container port):
`kubectl get svc -n llm-servings $POD_NAME -o wide`