- OPT by Facebook
- Qwen by Alibaba
- Granite by IBM
- Llama by Meta
- R1 by DeepSeek
Models larger than roughly 10B parameters do not fit on a single one of our GPUs (24 GB of VRAM), so they must be split across multiple cards!
| INX | Model | Access Link | Number of Parameters | Type |
|---|---|---|---|---|
| 1 | facebook/opt-125m | facebook/opt-125m | 125M | Text generation (base LLM) |
| 2 | facebook/opt-350m | facebook/opt-350m | 350M | Text generation |
| 3 | facebook/opt-1.3b | facebook/opt-1.3b | 1.3B | Text generation |
| 4 | facebook/opt-6.7b | facebook/opt-6.7b | 6.7B | Text generation |
| 5 | facebook/opt-13b | facebook/opt-13b | 13B | Text generation |
| 6 | facebook/opt-30b | facebook/opt-30b | 30B | Text generation |
| 7 | ibm-granite/granite-3.3-2b-instruct | ibm-granite/granite-3.3-2b-instruct | 2B | Instruction-following assistant |
| 8 | ibm-granite/granite-3.3-8b-instruct | ibm-granite/granite-3.3-8b-instruct | 8B | Instruction-following assistant |
| 9 | Qwen/Qwen3-0.6B | Qwen/Qwen3-0.6B | 0.6B | Text generation |
| 10 | Qwen/Qwen3-1.7B | Qwen/Qwen3-1.7B | 1.7B | Text generation |
| 11 | Qwen/Qwen3-4B | Qwen/Qwen3-4B | 4B | Text generation / reasoning |
| 12 | Qwen/Qwen3-8B | Qwen/Qwen3-8B | 8B | Text generation / reasoning |
| 13 | Qwen/Qwen3-14B | Qwen/Qwen3-14B | 14B | Text generation / reasoning |
| 14 | Qwen/Qwen3-32B | Qwen/Qwen3-32B | 32B | Text generation / reasoning |
| 15 | meta-llama/Llama-3.2-1B-Instruct | meta-llama/Llama-3.2-1B-Instruct | 1B | Instruction-following assistant (chat, code, reasoning) |
| 16 | meta-llama/Llama-3.2-3B-Instruct | meta-llama/Llama-3.2-3B-Instruct | 3B | Instruction-following assistant (chat, code, reasoning) |
| 17 | meta-llama/Llama-3.1-8B-Instruct | meta-llama/Llama-3.1-8B-Instruct | 8B | Instruction-following assistant (chat, code, reasoning) |
| 18 | deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | 1.5B | Text generation |
| 19 | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | 7B | Text generation |
| 20 | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | 14B | Text generation |
| 21 | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | 32B | Text generation |
Each model's GPU requirement is driven by its parameter count: the required GPU memory is roughly twice the number of parameters (in billions) in GB, i.e. about 2 bytes per parameter in FP16. The following table shows the minimum GPU memory utilization (the smallest value that still enables KV cache offloading) required to run each model at its maximum possible model length.
| INX | Model | Max Model Length | Model Size (GB) | GPU Memory Utilization | KV Cache (GB) | GPUs |
|---|---|---|---|---|---|---|
| 1 | facebook/opt-125m | 2048 | 0.24 | 10% | 1.62 | 1 |
| 2 | facebook/opt-350m | 2048 | 0.62 | 10% | 1.25 | 1 |
| 3 | facebook/opt-1.3b | 2048 | 2.45 | 20% | 1.77 | 1 |
| 4 | facebook/opt-6.7b | 2048 | 12.4 | 60% | 1.26 | 1 |
| 5 | facebook/opt-13b | 2048 | 12.6 | 60% | 1.80 | 2 |
| 6 | facebook/opt-30b | 2048 | 14.0 | 65% | 2.29 | 4 |
| 7 | ibm-granite/granite-3.3-2b-instruct | 36000 | 4.74 | 35% | 3.04 | 1 |
| 8 | ibm-granite/granite-3.3-8b-instruct | 36000 | 15.2 | 90% | 5.52 | 1 |
| 9 | Qwen/Qwen3-0.6B | 2048 | 1.12 | 15% | 1.00 | 1 |
| 10 | Qwen/Qwen3-1.7B | 2048 | 3.22 | 30% | 2.45 | 1 |
| 11 | Qwen/Qwen3-4B | 20480 | 7.56 | 50% | 2.83 | 1 |
| 12 | Qwen/Qwen3-8B | 20480 | 7.64 | 50% | 3.53 | 2 |
| 13 | Qwen/Qwen3-14B | 20480 | 13.8 | 70% | 1.86 | 2 |
| 14 | Qwen/Qwen3-32B | 20480 | 15.4 | 80% | 3.08 | 4 |
| 15 | meta-llama/Llama-3.2-1B-Instruct | 36000 | 2.32 | 20% | 1.20 | 1 |
| 16 | meta-llama/Llama-3.2-3B-Instruct | 36000 | 6.02 | 50% | 4.59 | 1 |
| 17 | meta-llama/Llama-3.1-8B-Instruct | 36000 | 15.0 | 90% | 5.06 | 1 |
| 18 | deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | 20480 | 3.35 | 30% | 2.32 | 1 |
| 19 | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | 20480 | 14.3 | 80% | 3.20 | 1 |
| 20 | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | 20480 | 14.0 | 80% | 4.16 | 2 |
| 21 | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | 20480 | 15.4 | 80% | 3.05 | 4 |
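A rough sketch of this sizing rule, assuming FP16 weights (about 2 GB per billion parameters, as noted above) and our 24 GB cards; the values in the table come from our actual runs and will not match this estimate exactly:

```python
def fp16_weight_gb(params_billions: float) -> float:
    """FP16 weights take about 2 bytes per parameter, i.e. ~2 GB per billion parameters."""
    return 2.0 * params_billions


def min_gpu_memory_utilization(params_billions: float, kv_cache_gb: float,
                               num_gpus: int = 1, vram_per_gpu_gb: float = 24.0) -> float:
    """Smallest fraction of each card needed to hold the weights plus the KV cache."""
    per_gpu_gb = (fp16_weight_gb(params_billions) + kv_cache_gb) / num_gpus
    return per_gpu_gb / vram_per_gpu_gb


# facebook/opt-6.7b: ~13.4 GB of weights + 1.26 GB of KV cache on one card -> ~0.61,
# in line with the 60% utilization listed in the table.
print(round(min_gpu_memory_utilization(6.7, 1.26), 2))
```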
Based on our GPU hardware (24 GB of VRAM per card), the rule for setting GPU Memory Utilization is to pick the smallest value that fits the model weights plus the KV cache on the assigned GPUs, as estimated in the sketch above. The vLLM engine options we tune are grouped as follows (an illustrative configuration sketch follows the list):
- Parallel Config:
  - tensor-parallel-size
- Model Config:
  - model
  - max-model-len
  - quantization
- Load Config:
  - download-dir
  - safetensors-load-strategy
- Cache Config:
  - gpu-memory-utilization
  - swap-space
  - enable-prefix-caching
  - cpu-offload-gb
  - block-size
  - kv-cache-memory-bytes
  - kv-cache-dtype
  - kv-offloading-size
  - kv-offloading-backend
- Compilation Config:
  - cudagraph-capture-sizes
  - max-cudagraph-capture-size
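As a minimal sketch of how these options come together, here is an offline-equivalent configuration using vLLM's Python entry point, with illustrative values for Qwen/Qwen3-8B taken from the tables above. In the cluster the same knobs are passed to the vLLM server (e.g. through the Helm values files under models/); the keyword arguments below mirror the corresponding CLI flags.

```python
from vllm import LLM, SamplingParams

# Illustrative values for Qwen/Qwen3-8B from the tables above (assumed, not the
# exact deployment configuration).
llm = LLM(
    model="Qwen/Qwen3-8B",        # --model
    max_model_len=20480,          # --max-model-len
    tensor_parallel_size=2,       # --tensor-parallel-size
    gpu_memory_utilization=0.5,   # --gpu-memory-utilization
    enable_prefix_caching=True,   # --enable-prefix-caching
    block_size=16,                # --block-size
    swap_space=4,                 # --swap-space, GB of CPU swap space for the KV cache
)

outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```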
We send our requests through the OpenAI-compatible API exposed by vLLM.
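A minimal client sketch, assuming the server is reachable at http://localhost:8000 (via port-forwarding or the NodePort shown later in this section); the model name and prompt are illustrative:

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server ignores the API key, so any placeholder works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.completions.create(
    model="Qwen/Qwen3-8B",                         # the model served by this instance
    prompt="Translate to English: Guten Morgen.",
    max_tokens=64,
)
print(response.choices[0].text)
```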
We evaluate on the following datasets (a loading sketch follows the list):
- Alpaca
  - A dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine.
- LongBench
  - The first benchmark for bilingual, multitask, and comprehensive assessment of the long-context understanding capabilities of large language models.
- WMT16
  - A German-to-English translation dataset, used here for shared-prefix tasks.
- ShareGPT
  - The ShareGPT-Chinese-English-90k bilingual human-machine QA dataset.
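A sketch of pulling two of these datasets from the Hugging Face Hub; the hub IDs below are assumptions for illustration and may differ from the exact repositories used in our runs:

```python
from datasets import load_dataset

# Assumed hub IDs: tatsu-lab/alpaca and the wmt16 de-en configuration.
alpaca = load_dataset("tatsu-lab/alpaca", split="train")      # 52k instruction pairs
wmt16_de_en = load_dataset("wmt16", "de-en", split="test")    # German-to-English pairs

print(alpaca[0]["instruction"])
print(wmt16_de_en[0]["translation"]["de"])
```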
Each workload type is paired with a dataset:
- Single Prompt Single Response
  - Alpaca
- Shared Prefix
  - WMT16
- Chatbot Evaluation
  - ShareGPT
- Question Answering
  - LongBench/NarrativeQA
- Summarization
  - LongBench/QMSum
We categorize our prompts into three buckets, in terms of size (see the sketch after this list):
- Not cached: prompts shorter than one block (16 tokens).
- Cached but lost: prompts longer than one block but shorter than the GPU KV cache.
- Cached and saved: prompts longer than both one block and the GPU KV cache.
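A minimal sketch of this bucketing, assuming the 16-token block size above; the GPU KV cache capacity (expressed in tokens) is a hypothetical input that in practice depends on the model and the GPU memory utilization:

```python
BLOCK_SIZE = 16  # KV-cache block size in tokens, as used above

def prompt_bucket(prompt_len_tokens: int, gpu_kv_cache_tokens: int) -> str:
    """Classify a prompt relative to the block size and the GPU KV cache capacity."""
    if prompt_len_tokens < BLOCK_SIZE:
        return "not cached"
    if prompt_len_tokens < gpu_kv_cache_tokens:
        return "cached but lost"
    return "cached and saved"
```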
We categorize our prompts into two buckets, in terms of prefix size:
- No prefix: Prompts with no common prefix.
- With prefix: Prompts with a shared prefix.
We categorize our clients into two buckets, in terms of number:
- Single client: Processing one request at a time.
- Multiple clients: Processing more than one client at a time.
We categorize our clients into two buckets, in terms of request type (both cases appear in the sketch below):
- Same request: All clients send the same request.
- Random requests: Each client sends a different, randomly chosen request.
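A minimal sketch of the multiple-clients case against the same OpenAI-compatible endpoint as above; the endpoint, model name, prompt, and client count are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed endpoint

def send(prompt: str) -> str:
    # One client = one completion request against the served model.
    resp = client.completions.create(model="Qwen/Qwen3-8B", prompt=prompt, max_tokens=64)
    return resp.choices[0].text

# "Same request": every client sends an identical prompt.
# For "random requests", give each client a different prompt instead.
prompts = ["Summarize the plot of Hamlet."] * 4
with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    outputs = list(pool.map(send, prompts))
```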
Deploy:
`helm install -f models/qwen/2b.yaml vllm-qwen3-2b models/`

List deployments:
`helm list`

Uninstall deployment:
`helm uninstall vllm-qwen3-2b`

List pods:
`kubectl -n llm-servings get pods`

Export pod logs:
`kubectl -n llm-servings logs $POD_NAME > "$POD_NAME.log"`

Extract timestamps:
`./scripts/extract_ts.sh "$POD_NAME.log"`

Get metrics (make sure to edit the values in the collect_all.sh script):
`./scripts/metrics/collect_all.sh`

Get your vLLM instance port (use the NodePort, not the container port):
`kubectl get svc -n llm-servings $POD_NAME -o wide`