65 commits
1d097a8
support new model ovis2_5
myselvess Aug 4, 2025
a076041
Merge remote-tracking branch 'upstream/main' into ovis2_5
myselvess Aug 4, 2025
59e35e7
Update vllm/transformers_utils/configs/ovis2_5.py
myselvess Aug 4, 2025
7339792
Update vllm/model_executor/models/ovis2_5.py
myselvess Aug 4, 2025
86ee33c
Update vllm/model_executor/models/ovis2_5.py
myselvess Aug 4, 2025
f5f32fe
update ovis25
myselvess Aug 4, 2025
5fa4392
update ovis25: fix pre-commit error
myselvess Aug 5, 2025
a4aca4b
update ovis25: rm useless code & lazy import fa & add sdpa fallback
myselvess Aug 6, 2025
48bb18b
[Sampler] Support returning all logprobs or logits (#21792)
22quinn Aug 4, 2025
034a6e6
[Doc] Update pooling model docs (#22186)
DarkLight1337 Aug 4, 2025
2b6b1b0
Fix Arcee model weight loading: Add custom load_weights (#21725)
alyosha-swamy Aug 4, 2025
8aca87d
[Responses API] Ignore `store=True` and process the request by defaul…
WoosukKwon Aug 4, 2025
6c50176
[Bug] Update auto_tune.sh to separate benchmarking and profiling. (#2…
ericehanley Aug 4, 2025
e852784
[Bugfix][V1][P/D]Fix the uneven polling issue in the toy proxy for P2…
Abatom Aug 4, 2025
3990320
[NVIDIA] Auto detect modelopt quant and fix DSR1-FP4 weight loading (…
nvpohanh Aug 5, 2025
3da80f1
[Bugfix] V1 Fix the cursor leakage issue during request scheduling. (…
CLFutureX Aug 5, 2025
4da7635
Revert "[Bugfix] V1 Fix the cursor leakage issue during request sched…
WoosukKwon Aug 5, 2025
35f4186
[V1] reduce block size for tree attention correctness test to fix 'ou…
TheEpicDolphin Aug 5, 2025
a071de6
[V0 deprecation][P/D] Deprecate v0 `KVConnectorBase` code (1/2) (#21785)
lk-chen Aug 5, 2025
de61a28
[FEAT] Refactor ROPE into module (#22192)
tjtanaa Aug 5, 2025
700ef18
[ROCm][Bugfix] Compilation passes fix (#22202)
gshtras Aug 5, 2025
37bf7c9
self.gate dtype update for GLM-4.5 (#22203)
zRzRzRzRzRzRzR Aug 5, 2025
203763e
[Log] DeepGEMM Update Log for Unaligned Problem Size (#22208)
yewentao256 Aug 5, 2025
c201acd
fix: kimi_k2 return empty tool call list (#22149)
tlipoca9 Aug 5, 2025
a0e7148
[Misc] Remove pass_config from CompilationConfig dump_json excluded (…
elvischenv Aug 5, 2025
53839cc
[Doc] add backend to doc string of initialize_model_parallel (#22142)
andyxning Aug 5, 2025
b522428
[Misc] log more detailed message for ensure_model_parallel_initialize…
andyxning Aug 5, 2025
56f61a2
Optimize configuration access with LRU cache in custom ops (#22204)
skyloevil Aug 5, 2025
e08e133
[Bugfix] Misaligned params in TreeAttentionImpl (#22226)
DarkLight1337 Aug 5, 2025
cef7bc7
[UX] Fail if an invalid attention backend is specified (#22217)
mgoin Aug 5, 2025
2fae634
[Core] Factor out common logic for MM budget calculation (#22228)
DarkLight1337 Aug 5, 2025
288df33
[Model] Pooling model activation supports per request control by Pool…
noooop Aug 5, 2025
0f8fd2a
[Docs][TPU] Highlight TPU Software version selection (#22242)
NickLucche Aug 5, 2025
546d1fc
Migrate KimiVLImagePixelInputs to TensorSchema (#21769)
bbeckca Aug 5, 2025
cadd196
[Feature] Non-contiguous Support for FP8 Quantization (#21961)
yewentao256 Aug 5, 2025
8115c0b
[NVIDIA] Support Flashinfer TRT-LLM Prefill Attention Kernel (#22095)
elvischenv Aug 5, 2025
9484461
[Misc] correct static type check for GroupCoordinator (#21946)
andyxning Aug 5, 2025
d3a2319
[V0 Deprecation][TPU] Remove V1 flag check from tests (#22248)
NickLucche Aug 5, 2025
ef71436
Use UV_LINK_MODE=copy in Dockerfile to avoid hardlink fail (#22128)
mgoin Aug 5, 2025
16756f8
[CI/Build] Update flashinfer to 0.2.9 (#22233)
mgoin Aug 5, 2025
8deb9c6
[Refactor] Remove Unused Environment Variable `VLLM_NO_DEPRECATION_WA…
yewentao256 Aug 5, 2025
2aa4220
[V1] port xformers backend to v1 (#21342)
TheEpicDolphin Aug 5, 2025
8e5c32e
[bugfix] fix blackwell deepep installation (#22255)
youkaichao Aug 5, 2025
a3aa86d
[CI][TPU] Fix docker clean up (#22271)
lsy323 Aug 5, 2025
641e798
[Bugfix] Remove faulty test for oot attention backend (#22286)
mgoin Aug 6, 2025
5f60595
[Bugfix] Fix 3D input passed into cutlass_scaled_mm (#22278)
mgoin Aug 6, 2025
6f1c763
[Bugfix] Fix MoE BNB version (#22260)
jeejeelee Aug 6, 2025
b9da0de
[Perf] Parallelize fill_bitmask to accelerate high-throughput guided …
benchislett Aug 6, 2025
8616bab
[Bugfix] Skip dead and non-GPU nodes for Ray DP engine allocation (#2…
ruisearch42 Aug 6, 2025
f68f297
[Bugfix][CI/Build][ROCm] Make sure to use the headers from the build …
gshtras Aug 6, 2025
32d32a4
Upgrade FA3 for attention sink (#22313)
WoosukKwon Aug 6, 2025
ad71837
Increase openai-python version (#22316)
WoosukKwon Aug 6, 2025
0896a44
Add attention sink in attention backends (#22320)
WoosukKwon Aug 6, 2025
dd600d6
Update transformers to `v4.55` (#21931)
hmellor Aug 6, 2025
d79c7c9
Add GPT-OSS model code and config [1/N] (#22327)
WoosukKwon Aug 6, 2025
2eb3e32
[ROCm] Add attention sink to use_rocm_custom_paged_attention (#22329)
WoosukKwon Aug 6, 2025
4d6bd07
[GptOss] Add GptOss reasoning parser to support structure output (#22…
heheda12345 Aug 6, 2025
ce5af1d
[gpt-oss] flashinfer attention sink init (#22330)
zyongye Aug 6, 2025
7d60990
[gpt-oss] Add openai-harmony as default dependency (#22332)
WoosukKwon Aug 6, 2025
d9f0efc
[Misc] Clean up duplicated hf overrides (#22311)
Isotr0py Aug 6, 2025
80e78fe
[gpt-oss] Add Tool/ConversationContext classes and harmony_utils (#22…
WoosukKwon Aug 6, 2025
c8e4a3f
[gpt-oss] add model to supported models doc (#22336)
Aug 6, 2025
f6e9640
[gpt-oss] Support chat completion api (#22342)
WoosukKwon Aug 6, 2025
49c138b
[Minor] Fix type (#22347)
WoosukKwon Aug 6, 2025
c975712
update ovis25: rm useless code & fix get_image_size_with_most_feature…
myselvess Aug 7, 2025
3 changes: 1 addition & 2 deletions .buildkite/scripts/hardware_ci/run-tpu-v1-test-part2.sh
@@ -4,8 +4,7 @@ set -xu


remove_docker_container() {
docker rm -f tpu-test || true;
docker rm -f vllm-tpu || true;
docker rm -f tpu-test || true;
}

trap remove_docker_container EXIT
1 change: 0 additions & 1 deletion .buildkite/scripts/hardware_ci/run-tpu-v1-test.sh
@@ -5,7 +5,6 @@ set -xu

remove_docker_container() {
docker rm -f tpu-test || true;
docker rm -f vllm-tpu || true;
}

trap remove_docker_container EXIT
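Aside (not part of the PR diff): the TPU CI scripts above all use the same trap-on-EXIT cleanup idiom. A minimal standalone sketch of that pattern, using an illustrative container name:

#!/bin/bash
set -xu

# Force-remove the container left over from a previous run.
# `|| true` keeps the cleanup from changing the script's exit status
# when the container does not exist.
remove_docker_container() {
  docker rm -f tpu-test || true
}

# Run the cleanup on every exit path, normal or error.
trap remove_docker_container EXIT

# ... build the image, start the container, and run the tests here ...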
2 changes: 1 addition & 1 deletion .buildkite/scripts/tpu/config_v6e_1.env
@@ -1,6 +1,6 @@
# Environment config
TEST_NAME=llama8b
CONTAINER_NAME=vllm-tpu
CONTAINER_NAME=tpu-test

# vllm config
MODEL=meta-llama/Llama-3.1-8B-Instruct
2 changes: 0 additions & 2 deletions .buildkite/scripts/tpu/docker_run_bm.sh
@@ -12,8 +12,6 @@ source /etc/environment
source $ENV_FILE

remove_docker_container() {
docker rm -f tpu-test || true;
docker rm -f vllm-tpu || true;
docker rm -f $CONTAINER_NAME || true;
}

2 changes: 1 addition & 1 deletion .buildkite/scripts/tpu/quantized_v6e_1.env
@@ -1,6 +1,6 @@
# Environment config
TEST_NAME=llama8bw8a8
CONTAINER_NAME=vllm-tpu
CONTAINER_NAME=tpu-test

# vllm config
MODEL=RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8
3 changes: 1 addition & 2 deletions .buildkite/test-pipeline.yaml
@@ -664,7 +664,7 @@ steps:
# Attention
# num_heads2 broken by https://github.com/flashinfer-ai/flashinfer/issues/1353
- pytest -v -s tests/kernels/attention/test_flashinfer.py -k 'not num_heads2'
- pytest -v -s tests/kernels/attention/test_flashinfer_trtllm_decode_attention.py
- pytest -v -s tests/kernels/attention/test_flashinfer_trtllm_attention.py
- pytest -v -s tests/kernels/test_cutlass_mla_decode.py
# Quantization
- pytest -v -s tests/kernels/quantization/test_cutlass_scaled_mm.py -k 'fp8'
@@ -749,7 +749,6 @@ steps:
# this test fails consistently.
# TODO: investigate and fix
- VLLM_USE_V1=0 CUDA_VISIBLE_DEVICES=0,1 pytest -v -s test_sharded_state_loader.py
- VLLM_USE_V1=0 CUDA_VISIBLE_DEVICES=0,1 pytest -v -s kv_transfer/test_disagg.py
- CUDA_VISIBLE_DEVICES=0,1 pytest -v -s v1/shutdown
- pytest -v -s models/multimodal/generation/test_maverick.py

123 changes: 80 additions & 43 deletions benchmarks/auto_tune/auto_tune.sh
@@ -49,6 +49,7 @@ best_throughput=0
best_max_num_seqs=0
best_num_batched_tokens=0
best_goodput=0
best_request_rate=0

start_server() {
local gpu_memory_utilization=$1
@@ -57,18 +58,35 @@ start_server() {
local vllm_log=$4
local profile_dir=$5

pkill -f vllm
pkill -if vllm

VLLM_USE_V1=1 VLLM_SERVER_DEV_MODE=1 VLLM_TORCH_PROFILER_DIR=$profile_dir vllm serve $MODEL \
--port 8004 \
--gpu-memory-utilization $gpu_memory_utilization \
--max-num-seqs $max_num_seqs \
--max-num-batched-tokens $max_num_batched_tokens \
--tensor-parallel-size $TP \
--enable-prefix-caching \
--load-format dummy \
--download-dir "$DOWNLOAD_DIR" \
--max-model-len $MAX_MODEL_LEN > "$vllm_log" 2>&1 &
# Define the common arguments as a bash array.
# Each argument and its value are separate elements.
local common_args_array=(
"$MODEL"
"--disable-log-requests"
"--port" "8004"
"--gpu-memory-utilization" "$gpu_memory_utilization"
"--max-num-seqs" "$max_num_seqs"
"--max-num-batched-tokens" "$max_num_batched_tokens"
"--tensor-parallel-size" "$TP"
"--enable-prefix-caching"
"--load-format" "dummy"
"--download-dir" "$DOWNLOAD_DIR"
"--max-model-len" "$MAX_MODEL_LEN"
)

# Use the array expansion "${common_args_array[@]}"
# This correctly passes each element as a separate argument.
if [[ -n "$profile_dir" ]]; then
# Start server with profiling enabled
VLLM_USE_V1=1 VLLM_SERVER_DEV_MODE=1 VLLM_TORCH_PROFILER_DIR=$profile_dir \
vllm serve "${common_args_array[@]}" > "$vllm_log" 2>&1 &
else
# Start server without profiling
VLLM_USE_V1=1 VLLM_SERVER_DEV_MODE=1 \
vllm serve "${common_args_array[@]}" > "$vllm_log" 2>&1 &
fi

# wait for 10 minutes...
server_started=0
@@ -82,6 +100,7 @@ start_server() {
sleep 10
fi
done

if (( ! server_started )); then
echo "server did not start within 10 minutes. Please check server log at $vllm_log".
return 1
@@ -90,37 +109,20 @@ start_server() {
fi
}

update_best_profile() {
local profile_dir=$1
local profile_index=$2
sorted_paths=($(find "$profile_dir" -maxdepth 1 -not -path "$profile_dir" | sort))
selected_profile_file=
if [[ "$SYSTEM" == "TPU" ]]; then
selected_profile_file="${sorted_paths[$profile_index]}/*.xplane.pb"
fi
if [[ "$SYSTEM" == "GPU" ]]; then
selected_profile_file="${sorted_paths[$profile_index]}"
fi
rm -f $PROFILE_PATH/*
cp $selected_profile_file $PROFILE_PATH
}

run_benchmark() {
local max_num_seqs=$1
local max_num_batched_tokens=$2
local gpu_memory_utilization=$3
echo "max_num_seq: $max_num_seqs, max_num_batched_tokens: $max_num_batched_tokens"
local vllm_log="$LOG_FOLDER/vllm_log_${max_num_seqs}_${max_num_batched_tokens}.txt"
local profile_dir="$LOG_FOLDER/profile_${max_num_seqs}_${max_num_batched_tokens}"
echo "vllm_log: $vllm_log"
echo
rm -f $vllm_log
mkdir -p $profile_dir
pkill -f vllm
local profile_index=0
pkill -if vllm

echo "starting server..."
start_server $gpu_memory_utilization $max_num_seqs $max_num_batched_tokens $vllm_log $profile_dir
# Call start_server without a profile_dir to avoid profiling overhead
start_server $gpu_memory_utilization $max_num_seqs $max_num_batched_tokens $vllm_log ""
result=$?
if [[ "$result" -eq 1 ]]; then
echo "server failed to start. gpu_memory_utilization:$gpu_memory_utilization, max_num_seqs:$max_num_seqs, max_num_batched_tokens: $max_num_batched_tokens"
@@ -134,7 +136,8 @@ run_benchmark() {
# get a basic qps by using request-rate inf
bm_log="$LOG_FOLDER/bm_log_${max_num_seqs}_${max_num_batched_tokens}_requestrate_inf.txt"
prefix_len=$(( INPUT_LEN * MIN_CACHE_HIT_PCT / 100 ))
adjusted_input_len=$(( INPUT_LEN - prefix_len ))
adjusted_input_len=$(( INPUT_LEN - prefix_len ))
# --profile flag is removed from this call
vllm bench serve \
--backend vllm \
--model $MODEL \
@@ -148,8 +151,7 @@ adjusted_input_len=$(( INPUT_LEN - prefix_len ))
--goodput e2el:$MAX_LATENCY_ALLOWED_MS \
--num-prompts 1000 \
--random-prefix-len $prefix_len \
--port 8004 \
--profile &> "$bm_log"
--port 8004 &> "$bm_log"
throughput=$(grep "Request throughput (req/s):" "$bm_log" | sed 's/[^0-9.]//g')
e2el=$(grep "P99 E2EL (ms):" "$bm_log" | awk '{print $NF}')
goodput=$(grep "Request goodput (req/s):" "$bm_log" | sed 's/[^0-9.]//g')
@@ -163,7 +165,6 @@ adjusted_input_len=$(( INPUT_LEN - prefix_len ))
# start from request-rate as int(throughput) + 1
request_rate=$((${throughput%.*} + 1))
while ((request_rate > 0)); do
profile_index=$((profile_index+1))
# clear prefix cache
curl -X POST http://0.0.0.0:8004/reset_prefix_cache
sleep 5
@@ -201,12 +202,7 @@ adjusted_input_len=$(( INPUT_LEN - prefix_len ))
best_max_num_seqs=$max_num_seqs
best_num_batched_tokens=$max_num_batched_tokens
best_goodput=$goodput
if [[ "$SYSTEM" == "TPU" ]]; then
update_best_profile "$profile_dir/plugins/profile" $profile_index
fi
if [[ "$SYSTEM" == "GPU" ]]; then
update_best_profile "$profile_dir" $profile_index
fi
best_request_rate=$request_rate
fi
else
echo "max_num_seqs: $max_num_seqs, max_num_batched_tokens: $max_num_batched_tokens does not meet latency requirement ${MAX_LATENCY_ALLOWED_MS}"
@@ -215,7 +211,7 @@ adjusted_input_len=$(( INPUT_LEN - prefix_len ))

echo "best_max_num_seqs: $best_max_num_seqs, best_num_batched_tokens: $best_num_batched_tokens, best_throughput: $best_throughput"

pkill vllm
pkill -if vllm
sleep 10
printf '=%.0s' $(seq 1 20)
return 0
@@ -228,7 +224,8 @@ read -r -a num_batched_tokens_list <<< "$NUM_BATCHED_TOKENS_LIST"
gpu_memory_utilization=0.98
find_gpu_memory_utilization=0
while (( $(echo "$gpu_memory_utilization >= 0.9" | bc -l) )); do
start_server $gpu_memory_utilization "${num_seqs_list[-1]}" "${num_batched_tokens_list[-1]}" "$LOG_FOLDER/vllm_log_gpu_memory_utilization_$gpu_memory_utilization.log"
# Pass empty string for profile_dir argument
start_server $gpu_memory_utilization "${num_seqs_list[-1]}" "${num_batched_tokens_list[-1]}" "$LOG_FOLDER/vllm_log_gpu_memory_utilization_$gpu_memory_utilization.log" ""
result=$?
if [[ "$result" -eq 0 ]]; then
find_gpu_memory_utilization=1
@@ -251,5 +248,45 @@ for num_seqs in "${num_seqs_list[@]}"; do
done
done
echo "finish permutations"

# =================================================================================
# FINAL PROFILING RUN FOR THE BEST CONFIGURATION
# =================================================================================
if (( $(echo "$best_throughput > 0" | bc -l) )); then
echo
echo "Benchmark tuning finished. Now running profiling on the best configuration found..."
echo "Best config: max_num_seqs: $best_max_num_seqs, max_num_batched_tokens: $best_num_batched_tokens, throughput: $best_throughput"
echo

vllm_log="$LOG_FOLDER/vllm_log_BEST_PROFILE.txt"
bm_log="$LOG_FOLDER/bm_log_BEST_PROFILE.txt"

# Start server with the best params and profiling ENABLED
echo "Starting server for profiling..."
start_server $gpu_memory_utilization $best_max_num_seqs $best_num_batched_tokens "$vllm_log" "$PROFILE_PATH"

# Run benchmark with the best params and the --profile flag
echo "Running benchmark with profiling..."
prefix_len=$(( INPUT_LEN * MIN_CACHE_HIT_PCT / 100 ))
adjusted_input_len=$(( INPUT_LEN - prefix_len ))
vllm bench serve \
--backend vllm \
--model $MODEL \
--dataset-name random \
--random-input-len $adjusted_input_len \
--random-output-len $OUTPUT_LEN \
--ignore-eos \
--disable-tqdm \
--request-rate $best_request_rate \
--percentile-metrics ttft,tpot,itl,e2el \
--goodput e2el:$MAX_LATENCY_ALLOWED_MS \
--num-prompts 100 \
--random-prefix-len $prefix_len \
--port 8004 \
--profile &> "$bm_log"
else
echo "No configuration met the latency requirements. Skipping final profiling run."
fi
pkill -if vllm
echo "best_max_num_seqs: $best_max_num_seqs, best_num_batched_tokens: $best_num_batched_tokens, best_throughput: $best_throughput, profile saved in: $PROFILE_PATH"
echo "best_max_num_seqs: $best_max_num_seqs, best_num_batched_tokens: $best_num_batched_tokens, best_throughput: $best_throughput, profile saved in: $PROFILE_PATH" >> "$RESULT"
@@ -41,7 +41,6 @@ def benchmark_decode(
device = "cuda"
torch.manual_seed(0)

# Currently only HEAD_GRP_SIZE == 8 is supported
HEAD_GRP_SIZE = 8
MAX_SEQ_LEN = max_seq_len
