vLLM study: I/O perspective

Models

  1. OPT by Meta (Facebook AI)
  2. Qwen by Alibaba
  3. Granite by IBM
  4. Llama by Meta
  5. R1 by DeepSeek

Models larger than ~10B parameters do not fit on a single one of our GPUs (24 GB of VRAM each)!

| INX | Model (Hugging Face ID) | Number of Parameters | Type |
|-----|-------------------------|----------------------|------|
| 1 | facebook/opt-125m | 125M | Text generation (base LLM) |
| 2 | facebook/opt-350m | 350M | Text generation |
| 3 | facebook/opt-1.3b | 1.3B | Text generation |
| 4 | facebook/opt-6.7b | 6.7B | Text generation |
| 5 | facebook/opt-13b | 13B | Text generation |
| 6 | facebook/opt-30b | 30B | Text generation |
| 7 | ibm-granite/granite-3.3-2b-instruct | 2B | Instruction-following assistant |
| 8 | ibm-granite/granite-3.3-8b-instruct | 8B | Instruction-following assistant |
| 9 | Qwen/Qwen3-0.6B | 800M | Text generation |
| 10 | Qwen/Qwen3-1.7B | 1.7B | Text generation |
| 11 | Qwen/Qwen3-4B | 4B | Text generation / reasoning |
| 12 | Qwen/Qwen3-8B | 8B | Text generation / reasoning |
| 13 | Qwen/Qwen3-14B | 14B | Text generation / reasoning |
| 14 | Qwen/Qwen3-32B | 32B | Text generation / reasoning |
| 15 | meta-llama/Llama-3.2-1B-Instruct | 1B | Instruction-following assistant (chat, code, reasoning) |
| 16 | meta-llama/Llama-3.2-3B-Instruct | 3B | Instruction-following assistant (chat, code, reasoning) |
| 17 | meta-llama/Llama-3.1-8B-Instruct | 8B | Instruction-following assistant (chat, code, reasoning) |
| 18 | deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | 1.5B | Text generation |
| 19 | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | 7B | Text generation |
| 20 | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | 14B | Text generation |
| 21 | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | 32B | Text generation |

Details

Each model requires GPU memory (and hence a number of GPU cards) proportional to its parameter count: with FP16 weights, roughly 2 GB per billion parameters. The following table shows the minimum GPU memory utilization (the smallest value that still enables KV cache offloading) required to run each model at its maximum model length.

| INX | Model | Max Model Length | Model Size (GB) | GPU Memory Utilization | KV Cache (GB) | GPUs |
|-----|-------|------------------|-----------------|------------------------|---------------|------|
| 1 | facebook/opt-125m | 2048 | 0.24 | 10% | 1.62 | 1 |
| 2 | facebook/opt-350m | 2048 | 0.62 | 10% | 1.25 | 1 |
| 3 | facebook/opt-1.3b | 2048 | 2.45 | 20% | 1.77 | 1 |
| 4 | facebook/opt-6.7b | 2048 | 12.4 | 60% | 1.26 | 1 |
| 5 | facebook/opt-13b | 2048 | 12.6 | 60% | 1.80 | 2 |
| 6 | facebook/opt-30b | 2048 | 14.0 | 65% | 2.29 | 4 |
| 7 | ibm-granite/granite-3.3-2b-instruct | 36000 | 4.74 | 35% | 3.04 | 1 |
| 8 | ibm-granite/granite-3.3-8b-instruct | 36000 | 15.2 | 90% | 5.52 | 1 |
| 9 | Qwen/Qwen3-0.6B | 2048 | 1.12 | 15% | 1.00 | 1 |
| 10 | Qwen/Qwen3-1.7B | 2048 | 3.22 | 30% | 2.45 | 1 |
| 11 | Qwen/Qwen3-4B | 20480 | 7.56 | 50% | 2.83 | 1 |
| 12 | Qwen/Qwen3-8B | 20480 | 7.64 | 50% | 3.53 | 2 |
| 13 | Qwen/Qwen3-14B | 20480 | 13.8 | 70% | 1.86 | 2 |
| 14 | Qwen/Qwen3-32B | 20480 | 15.4 | 80% | 3.08 | 4 |
| 15 | meta-llama/Llama-3.2-1B-Instruct | 36000 | 2.32 | 20% | 1.20 | 1 |
| 16 | meta-llama/Llama-3.2-3B-Instruct | 36000 | 6.02 | 50% | 4.59 | 1 |
| 17 | meta-llama/Llama-3.1-8B-Instruct | 36000 | 15.0 | 90% | 5.06 | 1 |
| 18 | deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | 20480 | 3.35 | 30% | 2.32 | 1 |
| 19 | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | 20480 | 14.3 | 80% | 3.20 | 1 |
| 20 | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | 20480 | 14.0 | 80% | 4.16 | 2 |
| 21 | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | 20480 | 15.4 | 80% | 3.05 | 4 |

Based on our GPU hardware (24 GB of VRAM per card), the rule for setting GPU Memory Utilization is:

$$ \text{GPU Memory Utilization} \geq \frac{2 \times \text{Model Size} + \text{KV Cache}}{24} $$
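
As a quick check for a single-GPU model, reading the model size as the parameter count in billions (so that 2 × model size approximates the FP16 weight footprint in GB, per the note above), ibm-granite/granite-3.3-8b-instruct gives:

$$ \frac{2 \times 8 + 5.52}{24} \approx 0.90 \Rightarrow 90\% $$

which matches the GPU Memory Utilization listed for it in the table.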

Parameters for study

  • Parallel Config:
    • tensor-parallel-size
  • Model Config:
    • model
    • max-model-len
    • quantization
  • Load Config:
    • download-dir
    • safetensors-load-strategy
  • Cache Config:
    • gpu-memory-utilization
    • swap-space
    • enable-prefix-caching
    • cpu-offload-gb
    • block-size
    • kv-cache-memory-bytes
    • kv-cache-dtype
    • kv-offloading-size
    • kv-offloading-backend
  • Compilation Config:
    • cudagraph-capture-sizes
    • max-cudagraph-capture-size
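
As a sketch, a subset of these parameters maps onto a vLLM server invocation like the following (the model, values, and download directory are illustrative placeholders, not the settings used in the study):

# Illustrative values only; see the tables above for per-model settings.
vllm serve Qwen/Qwen3-4B \
  --tensor-parallel-size 1 \
  --max-model-len 20480 \
  --gpu-memory-utilization 0.5 \
  --swap-space 4 \
  --enable-prefix-caching \
  --block-size 16 \
  --kv-cache-dtype auto \
  --download-dir /path/to/model/cache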

LLM Benchmarks

We send our requests through the OpenAI-compatible API exposed by vLLM.
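
For example, a single completion request can be sent with curl (a sketch; the host, port, and model name are placeholders for whichever deployment is being benchmarked):

# $NODE_IP and $NODE_PORT are placeholders for the service's node address and NodePort (see NOTES below).
curl http://$NODE_IP:$NODE_PORT/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-4B", "prompt": "Hello, vLLM!", "max_tokens": 64}'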

Datasets

  • Alpaca
    • A dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine.
  • LongBench
    • The first benchmark for bilingual, multitask, and comprehensive assessment of long context understanding capabilities of large language models.
  • WMT16
    • German to English translation dataset for shared-prefix tasks.
  • ShareGPT
    • ShareGPT-Chinese-English-90k bilingual human-machine QA dataset.

Tasks

  • Single Prompt Single Response
    • Alpaca
  • Shared Prefix
    • WMT16
  • Chatbot Evaluation
    • ShareGPT
  • Question Answering
    • LongBench/NarrativeQA
  • Summarization
    • LongBench/QMSum

Experiments

We categorize our prompts into three buckets, in terms of size:

  1. Not cached: Prompts shorter than one block (16 tokens).
  2. Cached but lost: Prompts longer than one block but smaller than the GPU KV cache.
  3. Cached and saved: Prompts longer than one block and larger than the GPU KV cache.

We categorize our prompts into two buckets, in terms of prefix size:

  1. No prefix: Prompts with no common prefix.
  2. With prefix: Prompts with a shared prefix.

We categorize our clients into two buckets, in terms of number:

  1. Single client: Processing one request at a time.
  2. Multiple clients: Processing requests from more than one client at a time.

We categorize our clients into two buckets, in terms of request type:

  1. Same request: All clients send the same request.
  2. Random requests: Each client sends a random request.
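
The simplest multi-client case can be reproduced with a handful of concurrent curl clients (a sketch; host, port, and model are placeholders as above):

# Launch 4 clients concurrently, each sending an identical request.
for i in $(seq 1 4); do
  curl -s http://$NODE_IP:$NODE_PORT/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen/Qwen3-4B", "prompt": "Hello, vLLM!", "max_tokens": 64}' &
done
wait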

NOTES

helm

Deploy:

helm install -f models/qwen/2b.yaml vllm-qwen3-2b models/

List deployments:

helm list

Uninstall deployment:

helm uninstall vllm-qwen3-2b

logs & metrics

List pods:

kubectl -n llm-servings get pods

Export pod logs:

kubectl -n llm-servings logs $POD_NAME > "$POD_NAME.log"

Extract timestamps:

./scripts/extract_ts.sh "$POD_NAME.log"

Get metrics (make sure to edit the values in the collect_all.sh script):

./scripts/metrics/collect_all.sh

Get your vLLM instance port (use the NodePort, not the container port):

kubectl get svc -n llm-servings $POD_NAME -o wide
