Releases: vllm-project/vllm
v0.10.0
Highlights
The v0.10.0 release includes 308 commits from 168 contributors (62 new!).
NOTE: This release begins the cleanup of the V0 engine codebase. We have removed the V0 CPU/XPU/TPU/HPU backends (#20412), long context LoRA (#21169), Prompt Adapters (#20588), Phi3-Small & BlockSparse Attention (#21217), and Spec Decode workers (#21152) so far, and we plan to continue deleting code that is no longer used.
Model Support
- New families: Llama 4 with EAGLE support (#20591), EXAONE 4.0 (#21060), Microsoft Phi-4-mini-flash-reasoning (#20702), Hunyuan V1 Dense + A13B with reasoning/tool parsing (#21368, #20625, #20820), Ling MoE models (#20680), JinaVL Reranker (#20260), Nemotron-Nano-VL-8B-V1 (#20349), Arcee (#21296), Voxtral (#20970).
- Enhanced compatibility: BERT/RoBERTa with AutoWeightsLoader (#20534), HF format support for MiniMax (#20211), Gemini configuration (#20971), GLM-4 updates (#20736).
- Architecture expansions: Attention-free model support (#20811), Hybrid SSM/Attention models on V1 (#20016), LlamaForSequenceClassification (#20807), expanded Mamba2 layer support (#20660).
- VLM improvements: VLM support with transformers backend (#20543), PrithviMAE on V1 engine (#20577).
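A minimal sketch of the transformers-backend VLM path above. The checkpoint name and image URL are placeholder assumptions, and `model_impl="transformers"` is the engine argument that opts into the Transformers modeling code; treat this as illustrative rather than an official recipe.

```python
# Minimal sketch: the checkpoint and image URL are placeholders;
# model_impl="transformers" opts into the Transformers backend.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-VL-3B-Instruct",  # example multimodal checkpoint
    model_impl="transformers",            # use the Transformers modeling code
)
messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
        {"type": "text", "text": "Describe this image in one sentence."},
    ],
}]
outputs = llm.chat(messages, SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```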
Engine Core
- Experimental async scheduling: `--async-scheduling` flag to overlap engine core scheduling with GPU runner (#19970).
- V1 engine improvements: backend-agnostic local attention (#21093), MLA FlashInfer ragged prefill (#20034), hybrid KV cache with local chunked attention (#19351).
- Multi-task support: models can now support multiple tasks (#20771), multiple poolers (#21227), and dynamic pooling parameter configuration (#21128).
- RLHF Support: new RPC methods for runtime weight reloading (#20096) and config updates (#20095), logprobs mode for selecting which stage of logprobs to return (#21398).
- Enhanced caching: multi-modal caching for transformers backend (#21358), reproducible prefix cache hashing using SHA-256 + CBOR (#20511); see the prefix-caching example after this list.
- Faster startup: CUDA graph capture is sped up by freezing the GC during capture (#21146).
- Elastic expert parallel for dynamic GPU scaling while preserving state (#20775).
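A minimal sketch of prefix caching in the offline API, assuming a small example model; the SHA-256 + CBOR hashing change itself requires no user-side code changes.

```python
# Minimal sketch: enable prefix caching in the offline API (example model name).
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-0.6B", enable_prefix_caching=True)
shared_prefix = "You are a concise assistant. Answer in one sentence.\n\n"
prompts = [shared_prefix + q for q in ("What is vLLM?", "What is a KV cache?")]
# Requests that share the same prompt prefix can reuse cached KV blocks.
for out in llm.generate(prompts, SamplingParams(max_tokens=32)):
    print(out.outputs[0].text)
```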
Hardware & Performance
- NVIDIA Blackwell/SM100 optimizations: CUTLASS block scaled group GEMM for smaller batches (#20640), FP8 groupGEMM support (#20447), DeepGEMM integration (#20087), FlashInfer MoE blockscale FP8 backend (#20645), CUDNN prefill API for MLA (#20411), Triton Fused MoE kernel config for FP8 E=16 on B200 (#20516).
- Performance improvements: 48% request duration reduction via microbatch tokenization for concurrent requests (#19334), fused MLA QKV + strided layernorm (#21116), Triton causal-conv1d for Mamba models (#18218).
- Hardware expansion: ARM CPU int8 quantization (#14129), PPC64LE/ARM V1 support (#20554), Intel XPU ray distributed execution (#20659), shared-memory pipeline parallel for CPU (#21289), FlashInfer ARM CUDA support (#21013).
Quantization
- New quantization support: MXFP4 for MoE models (#17888), BNB support for Mixtral and additional MoE models (#20893, #21100), in-flight quantization for MoE (#20061).
- Hardware-specific: FP8 KV cache quantization on TPU (#19292), FP8 support for BatchedTritonExperts (#18864), optimized INT8 vectorization kernels (#20331); see the FP8 example after this list.
- Performance optimizations: Triton backend for DeepGEMM per-token group quantization (#20841), CUDA kernel for per-token group quantization (#21083), CustomOp abstraction for FP8 (#19830).
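A minimal sketch of enabling FP8 at runtime, assuming an FP8-capable GPU and an example checkpoint. Note that `quantization="fp8"` and `kv_cache_dtype="fp8"` are existing engine arguments, shown here only to illustrate where the new kernels apply.

```python
# Minimal sketch (assumes an FP8-capable GPU and an example checkpoint).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model
    quantization="fp8",    # in-flight FP8 quantization of the weights
    kv_cache_dtype="fp8",  # store the KV cache in FP8
)
out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```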
API & Frontend
- OpenAI compatibility: Responses API implementation (#20504, #20975), image object support in llm.chat (#19635), tool calling with required choice and $defs (#20629); see the tool-calling example after this list.
- New endpoints: `get_tokenizer_info` for tokenizer/chat-template information (#20575), cache_salt support for completions/responses (#20981).
- Model loading: Tensorizer S3 integration with arbitrary arguments (#19619), HF repo paths & URLs for GGUF models (#20793), tokenization_kwargs for embedding truncation (#21033).
- CLI improvements: `--help=page` option for enhanced help documentation (#20961), default model changed to Qwen3-0.6B (#20335).
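A minimal sketch of `tool_choice="required"` against a running OpenAI-compatible server, assuming the server was launched with tool calling enabled (for example `vllm serve <model> --enable-auto-tool-choice --tool-call-parser hermes`) on localhost:8000; the tool definition below is purely illustrative.

```python
# Minimal sketch (assumes a vLLM server with tool calling enabled on localhost:8000).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # illustrative tool, not part of vLLM
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
resp = client.chat.completions.create(
    model=client.models.list().data[0].id,  # whichever model the server exposes
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    tools=tools,
    tool_choice="required",  # force the model to emit a tool call
)
print(resp.choices[0].message.tool_calls)
```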
Dependencies
What's Changed
- [Docs] Note that alternative structured output backends are supported by @russellb in #19426
- [ROCm][V1] Adding ROCm to the list of plaforms using V1 by default by @gshtras in #19440
- [Model] use AutoWeightsLoader for commandr by @py-andy-c in #19399
- Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B-FP8 by @Xu-Wenqing in #19401
- [BugFix] Allow use_cudagraph to work with dynamic VLLM_USE_V1 by @zou3519 in #19390
- [New Model]: Support Qwen3 Embedding & Reranker by @noooop in #19260
- [BugFix] Fix docker build cpu-dev image error by @2niuhe in #19394
- Fix test_max_model_len in tests/entrypoints/llm/test_generate.py by @houseroad in #19451
- [CI] Disable failing GGUF model test by @mgoin in #19454
- [Misc] Remove unused `MultiModalHasher.hash_prompt_mm_data` by @lgeiger in #19422
- Add fused MOE config for Qwen3 30B A3B on B200 by @0xjunhao in #19455
- Fix Typo in Documentation and Function Name by @leopardracer in #19442
- [ROCm] Add rules to automatically label ROCm related PRs by @houseroad in #19405
- [Kernel] Support deep_gemm for linear methods by @artetaout in #19085
- [Doc] Update V1 User Guide for Hardware and Models by @DarkLight1337 in #19474
- [Doc] Fix quantization link titles by @DarkLight1337 in #19478
- [Doc] Support "important" and "announcement" admonitions by @DarkLight1337 in #19479
- [Misc] Reduce warning message introduced in env_override by @houseroad in #19476
- Support non-string values in JSON keys from CLI by @DarkLight1337 in #19471
- Add cache to cuda get_device_capability by @mgoin in #19436
- Fix some typo by @Ximingwang-09 in #19475
- Support no privileged mode on CPU for docker and kubernetes deployments by @louie-tsai in #19241
- [Bugfix] Update the example code, make it work with the latest lmcache by @runzhen in #19453
- [CI] Update FlashInfer to 0.2.6.post1 by @mgoin in #19297
- [doc] fix "Other AI accelerators" getting started page by @davidxia in #19457
- [Misc] Fix misleading ROCm warning by @jeejeelee in #19486
- [Docs] Remove WIP features in V1 guide by @WoosukKwon in #19498
- [Kernels] Add activation chunking logic to FusedMoEModularKernel by @bnellnm in #19168
- [AMD] [Quantization] Add override flag for attention dtype instead of using kv_cache_dtype trigger by @rasmith in #17331
- [UX] Add Feedback During CUDAGraph Capture by @robertgshaw2-redhat in #19501
- [CI/Build] Fix torch nightly CI dependencies by @zou3519 in #19505
- [CI] change spell checker from codespell to typos by @andyxning in #18711
- [BugFix] Force registration of w8a8_block_fp8_matmul_deepgemm via lazy import by @varun-sundar-rabindranath in #19514
- Add Triton Fused MoE kernel config for E=16 on B200 by @b8zhong in #19518
- [Frontend] Improve error message in tool_choice validation by @22quinn in #19239
- [BugFix] Work-around incremental detokenization edge case error by @njhill in #19449
- [BugFix] Handle missing sep_token for Qwen3-Reranker in Score API by @strutive07 in #19522
- [AMD][Kernel][BugFix] fix test_rocm_compressed_tensors_w8a8 for rocm by @rasmith in #19509
- Fix typo by @2niuhe in #19525
- [Security] Prevent new imports of (cloud)pickle by @russellb in #18018
- [Bugfix][V1] Allow manual FlashAttention for Blackwell by @mgoin in #19492
- [Bugfix] Respect num-gpu-blocks-override in v1 by @jmswen in #19503
- [Quantization] Improve AWQ logic by @jeejeelee in #19431
- [Doc] Add V1 column to supported models list by @DarkLight1337 in #19523
- [NixlConnector] Drop `num_blocks` check by @NickLucche in #19532
- [Perf] Vectorize static / dynamic INT8 quant kernels by @yewentao256 in #19233
- Fix TorchAOConfig skip layers by @mobicham in #19265
- [torch.compile][ROCm] Fuse quantization onto attention using a torch.compile pass by @ProExpertProg in https://github.com/vllm-proj...
v0.10.0rc2
What's Changed
- [Model] use AutoWeightsLoader for bart by @calvin0327 in #18299
- [Model] Support VLMs with transformers backend by @zucchini-nlp in #20543
- [bugfix] fix syntax warning caused by backslash by @1195343015 in #21251
- [CI] Cleanup modelscope version constraint in Dockerfile by @yankay in #21243
- [Docs] Add RFC Meeting to Issue Template by @simon-mo in #21279
- Add the instruction to run e2e validation manually before release by @huydhn in #21023
- [Bugfix] Fix missing placeholder in logger debug by @DarkLight1337 in #21280
- [Model][1/N] Support multiple poolers at model level by @DarkLight1337 in #21227
- [Docs] Fix hardcoded links in docs by @hmellor in #21287
- [Docs] Make tables more space efficient in `supported_models.md` by @hmellor in #21291
- [Misc] unify variable for LLM instance by @andyxning in #20996
- Add Nvidia ModelOpt config adaptation by @Edwardf0t1 in #19815
- [Misc] Add sliding window to flashinfer test by @WoosukKwon in #21282
- [CPU] Enable shared-memory based pipeline parallel for CPU backend by @bigPYJ1151 in #21289
- [BugFix] make utils.current_stream thread-safety (#21252) by @simpx in #21253
- [Misc] Add dummy maverick test by @minosfuture in #21199
- [Attention] Clean up iRoPE in V1 by @LucasWilkinson in #21188
- [DP] Fix Prometheus Logging by @robertgshaw2-redhat in #21257
- Fix bad lm-eval fork by @mgoin in #21318
- [perf] Speed up align sum kernels by @hj-mistral in #21079
- [v1][sampler] Inplace logprobs comparison to get the token rank by @houseroad in #21283
- [XPU] Enable external_launcher to serve as an executor via torchrun by @chaojun-zhang in #21021
- [Doc] Fix CPU doc format by @bigPYJ1151 in #21316
- [Intel GPU] Ray Compiled Graph avoid NCCL for Intel GPU by @ratnampa in #21338
- Revert "[Performance] Performance improvements in non-blockwise fp8 CUTLASS MoE (#20762) by @minosfuture in #21334
- [Core] Minimize number of dict lookup in _maybe_evict_cached_block by @Jialin in #21281
- [V1] [Hybrid] Add new test to verify that hybrid views into KVCacheTensor are compatible by @tdoublep in #21300
- [Refactor] Fix Compile Warning #1444-D by @yewentao256 in #21208
- Fix kv_cache_dtype handling for out-of-tree HPU plugin by @kzawora-intel in #21302
- [Misc] DeepEPHighThroughtput - Enable Inductor pass by @varun-sundar-rabindranath in #21311
- [Bug] DeepGemm: Fix Cuda Init Error by @yewentao256 in #21312
- Update fp4 quantize API by @wenscarl in #21327
- [Feature][eplb] add verify ep or tp or dp by @lengrongfu in #21102
- Add arcee model by @alyosha-swamy in #21296
- [Bugfix] Fix eviction cached blocked logic by @simon-mo in #21357
- [Misc] Remove deprecated args in v0.10 by @kebe7jun in #21349
- [Core] Optimize update checks in LogitsProcessor by @Jialin in #21245
- [benchmark] Port benchmark request sent optimization to benchmark_serving by @Jialin in #21209
- [Core] Introduce popleft_n and append_n in FreeKVCacheBlockQueue to further optimize block_pool by @Jialin in #21222
- [Misc] unify variable for LLM instance v2 by @andyxning in #21356
- [perf] Add fused MLA QKV + strided layernorm by @mickaelseznec in #21116
- [feat]: add SM100 support for cutlass FP8 groupGEMM by @djmmoss in #20447
- [Perf] Cuda Kernel for Per Token Group Quant by @yewentao256 in #21083
- Adds parallel model weight loading for runai_streamer by @bbartels in #21330
- [feat] Enable mm caching for transformers backend by @zucchini-nlp in #21358
- Revert "[Refactor] Fix Compile Warning #1444-D (#21208)" by @yewentao256 in #21384
- Add tokenization_kwargs to encode for embedding model truncation by @Receiling in #21033
- [Bugfix] Decode Tokenized IDs to Strings for `hf_processor` in `llm.chat()` with `model_impl=transformers` by @ariG23498 in #21353
- [CI/Build] Fix test failure due to updated model repo by @DarkLight1337 in #21375
- Fix Flashinfer Allreduce+Norm enable disable calculation based on `fi_allreduce_fusion_max_token_num` by @xinli-git in #21325
- [Model] Add Qwen3CoderToolParser by @ranpox in #21396
- [Misc] Copy HF_TOKEN env var to Ray workers by @ruisearch42 in #21406
- [BugFix] Fix ray import error mem cleanup bug by @joerunde in #21381
- [CI/Build] Fix model executor tests by @DarkLight1337 in #21387
- [Bugfix][ROCm][Build] Fix build regression on ROCm by @gshtras in #21393
- Simplify weight loading in Transformers backend by @hmellor in #21382
- [BugFix] Update python to python3 calls for image; fix prefix & input calculations. by @ericehanley in #21391
- [BUGFIX] deepseek-v2-lite failed due to fused_qkv_a_proj name update by @xuechendi in #21414
- [Bugfix][CUDA] fixes CUDA FP8 kv cache dtype supported by @elvischenv in #21420
- Changing "amdproduction" allocation. by @Alexei-V-Ivanov-AMD in #21409
- [Bugfix] Fix nightly transformers CI failure by @Isotr0py in #21427
- [Core] Add basic unit test for maybe_evict_cached_block by @Jialin in #21400
- [Cleanup] Only log MoE DP setup warning if DP is enabled by @mgoin in #21315
- add clear messages for deprecated models by @youkaichao in #21424
- [Bugfix] ensure tool_choice is popped when `tool_choice:null` is passed in json payload by @gcalmettes in #19679
- Fixed typo in profiling logs by @sergiopaniego in #21441
- [Docs] Fix bullets and grammars in tool_calling.md by @windsonsea in #21440
- [Sampler] Introduce logprobs mode for logging by @houseroad in #21398
- Mamba V2 Test not Asserting Failures. by @fabianlim in #21379
- [Misc] fixed nvfp4_moe test failures due to invalid kwargs by @chenyang78 in #21246
- [Docs] Clean up v1/metrics.md by @windsonsea in #21449
- [Model] add Hunyuan V1 Dense Model support. by @kzjeef in #21368
- [V1] Check all pooling tasks during profiling by @DarkLight1337 in #21299
- [Bugfix][Qwen][DCA] fixes bug in dual-chunk-flash-attn backend for qwen 1m models. by @sighingnow in #21364
- [Tests] Add tests for headless internal DP LB by @njhill in #21450
- [Core][Model] PrithviMAE Enablement on vLLM v1 engine by @christian-pinto in #20577
- Add test case for compiling multiple graphs by @sarckk in #21044
- [TPU][TEST] Fix the downloading issue in TPU v1 test 11. by @QiliangCui in #21418
- [Core] Add `reload_weights` RPC method by @22quinn in #20096
- [V1] Fix local chunked attention always disabled by @sarckk in #21419
- [V0 Deprecation] Remove Prompt Adapters by @mgoin in #20588
- [Core] Freeze gc during cuda graph capture to speed up init by @mgoin in #21146
- feat(gguf_loader): accept HF repo paths & URLs for GGUF by @hardikkgupta in #20793
- [Frontend] Set MAX_AUDIO_CLI...
v0.10.0rc1
What's Changed
- [Kernel] Enable fp8 support for pplx and BatchedTritonExperts. by @bnellnm in #18864
- [Misc] Fix `Unable to detect current VLLM config. Defaulting to NHD kv cache layout` warning by @NickLucche in #20400
- [Bugfix] Register reducer even if transformers_modules not available by @eicherseiji in #19510
- Change warn_for_unimplemented_methods to debug by @mgoin in #20455
- [Platform] Add custom default max tokens by @gmarinho2 in #18557
- Add ignore consolidated file in mistral example code by @princepride in #20420
- [Misc] small update by @reidliu41 in #20462
- [Structured Outputs][V1] Skipping with models doesn't contain tokenizers by @aarnphm in #20365
- [Perf] Optimize Vectorization Utils for Int 8 Quantization Kernels by @yewentao256 in #20331
- [Misc] Add SPDX-FileCopyrightText by @jeejeelee in #20428
- Support Llama 4 for fused_marlin_moe by @mgoin in #20457
- [Bug][Frontend] Fix structure of transcription's decoder_prompt by @sangbumlikeagod in #18809
- [Model][3/N] Automatic conversion of CrossEncoding model by @noooop in #20168
- [Doc] Fix classification table in list of supported models by @DarkLight1337 in #20489
- [CI] add kvcache-connector dependency definition and add into CI build by @panpan0000 in #18193
- [Misc] Small: Remove global media connector. Each test should have its own test connector object. by @huachenheli in #20395
- Enable V1 for Hybrid SSM/Attention Models by @tdoublep in #20016
- [feat]: CUTLASS block scaled group gemm for SM100 by @djmmoss in #19757
- [CI Bugfix] Fix pre-commit failures on main by @mgoin in #20502
- [Doc] fix mutltimodal_inputs.md gh examples link by @GuyStone in #20497
- [Misc] Add security warning for development mode endpoints by @reidliu41 in #20508
- [doc] small fix by @reidliu41 in #20506
- [Misc] Remove the unused LoRA test code by @jeejeelee in #20494
- Fix unknown attribute of topk_indices_dtype in CompressedTensorsW8A8Fp8MoECutlassMethod by @luccafong in #20507
- [v1] Re-add fp32 support to v1 engine through FlexAttention by @Isotr0py in #19754
- [Misc] Add logger.exception for TPU information collection failures by @reidliu41 in #20510
- [Misc] remove unused import by @reidliu41 in #20517
- test_attention compat with coming xformers change by @bottler in #20487
- [BUG] Fix #20484. Support empty sequence in cuda penalty kernel by @vadiklyutiy in #20491
- [Bugfix] Fix missing per_act_token parameter in compressed_tensors_moe by @luccafong in #20509
- [BugFix] Fix: ImportError when building on hopper systems by @LucasWilkinson in #20513
- [TPU][Bugfix] fix the MoE OOM issue by @yaochengji in #20339
- [Frontend] Support image object in llm.chat by @sfeng33 in #19635
- [Benchmark] Add support for multiple batch size benchmark through CLI in `benchmark_moe.py` + Add Triton Fused MoE kernel config for FP8 E=16 on B200 by @b8zhong in #20516
- [Misc] call the pre-defined func by @reidliu41 in #20518
- [V0 deprecation] Remove V0 CPU/XPU/TPU backends by @WoosukKwon in #20412
- [V1] Support any head size for FlexAttention backend by @DarkLight1337 in #20467
- [BugFix][Spec Decode] Fix spec token ids in model runner by @WoosukKwon in #20530
- [Bugfix] Add `use_cross_encoder` flag to use correct activation in `ClassifierPooler` by @DarkLight1337 in #20527
- Implement OpenAI Responses API [1/N] by @WoosukKwon in #20504
- [Misc] add a tip for pre-commit by @reidliu41 in #20536
- [Refactor]Abstract Platform Interface for Distributed Backend and Add xccl Support for Intel XPU by @dbyoung18 in #19410
- [CI/Build] Enable phi2 lora test by @jeejeelee in #20540
- [XPU][CI] add v1/core test in xpu hardware ci by @Liangliang-Ma in #20537
- Add docstrings to url_schemes.py to improve readability by @windsonsea in #20545
- [XPU] log clean up for XPU platform by @yma11 in #20553
- [Docs] Clean up tables in supported_models.md by @windsonsea in #20552
- [Misc] remove unused jinaai_serving_reranking by @Abirdcfly in #18878
- [Misc] Set the minimum openai version by @jeejeelee in #20539
- [Doc] Remove extra whitespace from CI failures doc by @hmellor in #20565
- [Doc] Use `gh-pr` and `gh-issue` everywhere we can in the docs by @hmellor in #20564
- [Doc] Fix internal links so they don't always point to latest by @hmellor in #20563
- [Doc] Add outline for content tabs by @hmellor in #20571
- [Doc] Fix some MkDocs snippets used in the installation docs by @hmellor in #20572
- [Model][Last/4] Automatic conversion of CrossEncoding model by @noooop in #19675
- [Bugfix] Prevent IndexError for cached requests when pipeline parallelism is disabled by @panpan0000 in #20486
- [Feature] microbatch tokenization by @ztang2370 in #19334
- [DP] Copy environment variables to Ray DPEngineCoreActors by @ruisearch42 in #20344
- [Kernel] Optimize Prefill Attention in Unified Triton Attention Kernel by @jvlunteren in #20308
- [Misc] Add fully interleaved support for multimodal 'string' content format by @Dekakhrone in #14047
- [Misc] feat output content in stream response by @lengrongfu in #19608
- Fix links in multi-modal model contributing page by @hmellor in #18615
- [Config] Refactor mistral configs by @patrickvonplaten in #20570
- [Misc] Improve logging for dynamic shape cache compilation by @kyolebu in #20573
- [Bugfix] Fix Maverick correctness by filling zero to cache space in cutlass_moe by @minosfuture in #20167
- [Optimize] Don't send token ids when kv connector is not used by @WoosukKwon in #20586
- Make distinct `code` and `console` admonitions so readers are less likely to miss them by @hmellor in #20585
- [Bugfix]: Fix messy code when using logprobs by @chaunceyjiang in #19209
- [Doc] Syntax highlight request responses as JSON instead of bash by @hmellor in #20582
- [Docs] Rewrite offline inference guide by @crypdick in #20594
- [Docs] Improve docstring for ray data llm example by @crypdick in #20597
- [Docs] Add Ray Serve LLM section to openai compatible server guide by @crypdick in #20595
- [Docs] Add Anyscale to frameworks by @crypdick in #20590
- [Misc] improve error msg by @reidliu41 in #20604
- [CI/Build][CPU] Fix CPU CI and remove all CPU V0 files by @bigPYJ1151 in #20560
- [TPU] Temporary fix vmem oom for long model len by reducing page size by @Chenyaaang in #20278
- [Frontend] [Core] Integrate Tensorizer in to S3 loading machinery, allow passing arbitrary arguments during save/load by @sangstar in #19619
- [PD][Nixl] Remote consumer READ timeout for clearing request blocks by @NickLucche in #20139
- [Docs] Improve documentation for Deepseek R1 on Ray Serve LLM by @crypdick in #20601
- Remove unnecessary explicit title anchors and use relative links instead by @hmellor in #20620
- Stop using title frontmatter and fix doc that can only be ...
v0.9.2
Highlights
This release contains 452 commits from 167 contributors (31 new!)
NOTE: This is the last version where V0 engine code and features remain intact. We highly recommend migrating to the V1 engine.
Engine Core
- Priority scheduling is now implemented in the V1 engine (#19057; see the example after this list), along with embedding models in V1 (#16188) and Mamba2 in V1 (#19327).
- Full CUDA‑Graph execution is now available for all FlashAttention v3 (FA3) and FlashMLA paths, including prefix‑caching. CUDA graph capture now shows a live progress bar, which makes debugging easier (#20301, #18581, #19617, #19501).
- FlexAttention update – any head size, FP32 fallback (#20467, #19754).
- Shared `CachedRequestData` objects and cached sampler‑ID stores deliver perf enhancements (#20232, #20291).
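A minimal sketch of priority scheduling in the offline API, assuming an example model, the `scheduling_policy="priority"` engine argument, and the per-request `priority` list accepted by `LLM.generate`; lower values are assumed to be scheduled first (verify against the scheduler docs).

```python
# Minimal sketch (example model; confirm priority semantics in the docs).
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-0.6B", scheduling_policy="priority")
outputs = llm.generate(
    ["Summarize vLLM in one sentence.", "Write a haiku about GPUs."],
    SamplingParams(max_tokens=32),
    priority=[0, 1],  # one priority per prompt; 0 is scheduled ahead of 1
)
```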
Model Support
- New families: Ernie 4.5 (+MoE) (#20220), MiniMax‑M1 (#19677, #20297), Slim‑MoE “Phi‑tiny‑MoE‑instruct” (#20286), Tencent HunYuan‑MoE‑V1 (#20114), Keye‑VL‑8B‑Preview (#20126), GLM‑4.1 V (#19331), Gemma‑3 (text‑only, #20134), Tarsier 2 (#19887), Qwen 3 Embedding & Reranker (#19260; embedding example after this list), dots1 (#18254), GPT‑2 for Sequence Classification (#19663).
- Granite hybrid MoE configurations with shared experts are fully supported (#19652).
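A minimal sketch for the Qwen 3 Embedding item above, assuming the public Qwen/Qwen3-Embedding-0.6B checkpoint; `task="embed"` selects the pooling runner and `LLM.embed` returns one embedding per input prompt.

```python
# Minimal sketch (example embedding checkpoint).
from vllm import LLM

llm = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed")
outputs = llm.embed(["vLLM is a fast and easy-to-use library for LLM inference."])
print(len(outputs[0].outputs.embedding))  # embedding dimensionality
```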
Large‑Scale Serving & Engine Improvements
- Expert‑Parallel Load Balancer (EPLB) has been added! (#18343, #19790, #19885).
- Disaggregated serving enhancements: Avoid stranding blocks in P when aborted in D's waiting queue (#19223), let toy proxy handle /chat/completions (#19730)
- Native xPyD P2P NCCL transport as a base case for native PD without external dependency (#18242, #20246).
Hardware & Performance
- NVIDIA Blackwell
- Intel GPU (V1) backend with Flash‑Attention support (#19560).
- AMD ROCm: full‑graph capture for TritonAttention, quick All‑Reduce, and chunked pre‑fill (#19158, #19744, #18596).
- TPU: dynamic‑grid KV‑cache updates, head‑dim less than 128, tuned paged‑attention kernels, and KV‑padding fixes (#19928, #20235, #19620, #19813, #20048, #20339).
- Added a matrix of supported models and features (#20230).
Quantization
- Calibration‑free RTN INT4/INT8 pipeline for effortless, accurate compression (#18768).
- Compressed‑Tensor NVFP4 (including MoE) + emulation; FP4 emulation removed on < SM100 devices (#19879, #19990, #19563).
- Dynamic MoE‑layer quant (Marlin/GPTQ) and INT8 vectorization primitives (#19395, #20331, #19233).
- Bits‑and‑Bytes 0.45+ with improved double‑quant logic and AWQ quality (#20424, #20033, #19431, #20076).
API · CLI · Frontend
- API Server: Eliminate api_key and x_request_id headers middleware overhead (#19946)
- New OpenAI‑compatible endpoints: `/v1/audio/translations` & revamped `/v1/audio/transcriptions` (#19615, #20179, #19597); see the example after this list.
- Token‑level progress bar for `LLM.beam_search` and cached template‑resolution speed‑ups (#19301, #20065).
- Image‑object support in `llm.chat`, tool‑choice expansion, and custom‑arg passthroughs enrich multi‑modal agents (#19635, #17177, #16862).
- CLI QoL: better parsing for `-O/--compilation-config`, batch‑size‑sweep benchmarking, richer `--help`, faster startup (#20156, #20516, #20430, #19941).
- Metrics: deprecate metrics with the gpu_ prefix for non-GPU-specific metrics (#18354), export NaNs in logits to scheduler_stats if output is corrupted (#18777).
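A minimal sketch of the transcription endpoint through the OpenAI client, assuming a Whisper-style model is being served (for example `vllm serve openai/whisper-large-v3`) and that `audio.wav` exists locally.

```python
# Minimal sketch (assumes a Whisper-style model served on localhost:8000).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
with open("audio.wav", "rb") as f:
    transcription = client.audio.transcriptions.create(
        model="openai/whisper-large-v3",  # must match the served model
        file=f,
    )
print(transcription.text)
```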
Platform & Deployment
- No‑privileged CPU / Docker / K8s mode (#19241) and custom default max‑tokens for hosted platforms (#18557).
- Security hardening – runtime (cloud)pickle imports forbidden (#18018).
- Hermetic builds and wheel slimming (FA2 8.0 + PTX only) shrink supply‑chain surface (#18064, #19336).
What's Changed
- [Docs] Note that alternative structured output backends are supported by @russellb in #19426
- [ROCm][V1] Adding ROCm to the list of plaforms using V1 by default by @gshtras in #19440
- [Model] use AutoWeightsLoader for commandr by @py-andy-c in #19399
- Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B-FP8 by @Xu-Wenqing in #19401
- [BugFix] Allow use_cudagraph to work with dynamic VLLM_USE_V1 by @zou3519 in #19390
- [New Model]: Support Qwen3 Embedding & Reranker by @noooop in #19260
- [BugFix] Fix docker build cpu-dev image error by @2niuhe in #19394
- Fix test_max_model_len in tests/entrypoints/llm/test_generate.py by @houseroad in #19451
- [CI] Disable failing GGUF model test by @mgoin in #19454
- [Misc] Remove unused `MultiModalHasher.hash_prompt_mm_data` by @lgeiger in #19422
- Add fused MOE config for Qwen3 30B A3B on B200 by @0xjunhao in #19455
- Fix Typo in Documentation and Function Name by @leopardracer in #19442
- [ROCm] Add rules to automatically label ROCm related PRs by @houseroad in #19405
- [Kernel] Support deep_gemm for linear methods by @artetaout in #19085
- [Doc] Update V1 User Guide for Hardware and Models by @DarkLight1337 in #19474
- [Doc] Fix quantization link titles by @DarkLight1337 in #19478
- [Doc] Support "important" and "announcement" admonitions by @DarkLight1337 in #19479
- [Misc] Reduce warning message introduced in env_override by @houseroad in #19476
- Support non-string values in JSON keys from CLI by @DarkLight1337 in #19471
- Add cache to cuda get_device_capability by @mgoin in #19436
- Fix some typo by @Ximingwang-09 in #19475
- Support no privileged mode on CPU for docker and kubernetes deployments by @louie-tsai in #19241
- [Bugfix] Update the example code, make it work with the latest lmcache by @runzhen in #19453
- [CI] Update FlashInfer to 0.2.6.post1 by @mgoin in #19297
- [doc] fix "Other AI accelerators" getting started page by @davidxia in #19457
- [Misc] Fix misleading ROCm warning by @jeejeelee in #19486
- [Docs] Remove WIP features in V1 guide by @WoosukKwon in #19498
- [Kernels] Add activation chunking logic to FusedMoEModularKernel by @bnellnm in #19168
- [AMD] [Quantization] Add override flag for attention dtype instead of using kv_cache_dtype trigger by @rasmith in #17331
- [UX] Add Feedback During CUDAGraph Capture by @robertgshaw2-redhat in #19501
- [CI/Build] Fix torch nightly CI dependencies by @zou3519 in #19505
- [CI] change spell checker from codespell to typos by @andyxning in #18711
- [BugFix] Force registration of w8a8_block_fp8_matmul_deepgemm via lazy import by @varun-sundar-rabindranath in #19514
- Add Triton Fused MoE kernel config for E=16 on B200 by @b8zhong in #19518
- [Frontend] Improve error message in tool_choice validation by @22quinn in #19239
- [BugFix] Work-around incremental detokenization edge case error by @njhill in #19449
- [BugFix] Handle missing sep_token for Qwen3-Reranker in Score API by @strutive07 in #19522
- [AMD][Kernel][BugFix] fix test_rocm_compressed_tensors_w8a8 for rocm by @rasmith in #19509
- Fix typo by @2niuhe in #19525
- [Security] Prevent new imports of (cloud)pickle by @russellb in #18018
- [Bugfix][V1] Allow manual FlashAttention for Blackwell by @mgoin in #19492
- [Bugfix] Respect num-gpu-blocks-override in v1 by @jmswen in #19503
- [Quantization] Improve AWQ logic by @jeejeelee in #19431
- [Doc] Add V1 column to supported models list by @DarkLight1337 in #19523
- [NixlConnector] Drop `num_blocks` check by @NickLucche in #19532
- [Perf] Vectorize static / dynamic INT8 quant kernels by @yewentao256 in #19233
- Fix TorchAOConfig skip layers by @mobicham in #19265
- [torch.compile][ROCm] Fuse quantization onto attention using a torch.compile pass by @ProExpertProg in #16756
- [doc] Make top navigatio...
v0.9.2rc2
What's Changed
- [Kernel] Enable fp8 support for pplx and BatchedTritonExperts. by @bnellnm in #18864
- [Misc] Fix `Unable to detect current VLLM config. Defaulting to NHD kv cache layout` warning by @NickLucche in #20400
- [Bugfix] Register reducer even if transformers_modules not available by @eicherseiji in #19510
- Change warn_for_unimplemented_methods to debug by @mgoin in #20455
- [Platform] Add custom default max tokens by @gmarinho2 in #18557
- Add ignore consolidated file in mistral example code by @princepride in #20420
- [Misc] small update by @reidliu41 in #20462
- [Structured Outputs][V1] Skipping with models doesn't contain tokenizers by @aarnphm in #20365
- [Perf] Optimize Vectorization Utils for Int 8 Quantization Kernels by @yewentao256 in #20331
- [Misc] Add SPDX-FileCopyrightText by @jeejeelee in #20428
- Support Llama 4 for fused_marlin_moe by @mgoin in #20457
- [Bug][Frontend] Fix structure of transcription's decoder_prompt by @sangbumlikeagod in #18809
- [Model][3/N] Automatic conversion of CrossEncoding model by @noooop in #20168
- [Doc] Fix classification table in list of supported models by @DarkLight1337 in #20489
- [CI] add kvcache-connector dependency definition and add into CI build by @panpan0000 in #18193
- [Misc] Small: Remove global media connector. Each test should have its own test connector object. by @huachenheli in #20395
- Enable V1 for Hybrid SSM/Attention Models by @tdoublep in #20016
- [feat]: CUTLASS block scaled group gemm for SM100 by @djmmoss in #19757
- [CI Bugfix] Fix pre-commit failures on main by @mgoin in #20502
- [Doc] fix mutltimodal_inputs.md gh examples link by @GuyStone in #20497
- [Misc] Add security warning for development mode endpoints by @reidliu41 in #20508
- [doc] small fix by @reidliu41 in #20506
- [Misc] Remove the unused LoRA test code by @jeejeelee in #20494
- Fix unknown attribute of topk_indices_dtype in CompressedTensorsW8A8Fp8MoECutlassMethod by @luccafong in #20507
- [v1] Re-add fp32 support to v1 engine through FlexAttention by @Isotr0py in #19754
- [Misc] Add logger.exception for TPU information collection failures by @reidliu41 in #20510
- [Misc] remove unused import by @reidliu41 in #20517
- test_attention compat with coming xformers change by @bottler in #20487
- [BUG] Fix #20484. Support empty sequence in cuda penalty kernel by @vadiklyutiy in #20491
- [Bugfix] Fix missing per_act_token parameter in compressed_tensors_moe by @luccafong in #20509
- [BugFix] Fix: ImportError when building on hopper systems by @LucasWilkinson in #20513
- [TPU][Bugfix] fix the MoE OOM issue by @yaochengji in #20339
- [Frontend] Support image object in llm.chat by @sfeng33 in #19635
- [Benchmark] Add support for multiple batch size benchmark through CLI in `benchmark_moe.py` + Add Triton Fused MoE kernel config for FP8 E=16 on B200 by @b8zhong in #20516
- [Misc] call the pre-defined func by @reidliu41 in #20518
- [V0 deprecation] Remove V0 CPU/XPU/TPU backends by @WoosukKwon in #20412
- [V1] Support any head size for FlexAttention backend by @DarkLight1337 in #20467
- [BugFix][Spec Decode] Fix spec token ids in model runner by @WoosukKwon in #20530
- [Bugfix] Add `use_cross_encoder` flag to use correct activation in `ClassifierPooler` by @DarkLight1337 in #20527
New Contributors
- @sangbumlikeagod made their first contribution in #18809
- @djmmoss made their first contribution in #19757
- @GuyStone made their first contribution in #20497
- @bottler made their first contribution in #20487
Full Changelog: v0.9.2rc1...v0.9.2rc2
v0.9.2rc1
What's Changed
- [Docs] Note that alternative structured output backends are supported by @russellb in #19426
- [ROCm][V1] Adding ROCm to the list of plaforms using V1 by default by @gshtras in #19440
- [Model] use AutoWeightsLoader for commandr by @py-andy-c in #19399
- Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B-FP8 by @Xu-Wenqing in #19401
- [BugFix] Allow use_cudagraph to work with dynamic VLLM_USE_V1 by @zou3519 in #19390
- [New Model]: Support Qwen3 Embedding & Reranker by @noooop in #19260
- [BugFix] Fix docker build cpu-dev image error by @2niuhe in #19394
- Fix test_max_model_len in tests/entrypoints/llm/test_generate.py by @houseroad in #19451
- [CI] Disable failing GGUF model test by @mgoin in #19454
- [Misc] Remove unused `MultiModalHasher.hash_prompt_mm_data` by @lgeiger in #19422
- Add fused MOE config for Qwen3 30B A3B on B200 by @0xjunhao in #19455
- Fix Typo in Documentation and Function Name by @leopardracer in #19442
- [ROCm] Add rules to automatically label ROCm related PRs by @houseroad in #19405
- [Kernel] Support deep_gemm for linear methods by @artetaout in #19085
- [Doc] Update V1 User Guide for Hardware and Models by @DarkLight1337 in #19474
- [Doc] Fix quantization link titles by @DarkLight1337 in #19478
- [Doc] Support "important" and "announcement" admonitions by @DarkLight1337 in #19479
- [Misc] Reduce warning message introduced in env_override by @houseroad in #19476
- Support non-string values in JSON keys from CLI by @DarkLight1337 in #19471
- Add cache to cuda get_device_capability by @mgoin in #19436
- Fix some typo by @Ximingwang-09 in #19475
- Support no privileged mode on CPU for docker and kubernetes deployments by @louie-tsai in #19241
- [Bugfix] Update the example code, make it work with the latest lmcache by @runzhen in #19453
- [CI] Update FlashInfer to 0.2.6.post1 by @mgoin in #19297
- [doc] fix "Other AI accelerators" getting started page by @davidxia in #19457
- [Misc] Fix misleading ROCm warning by @jeejeelee in #19486
- [Docs] Remove WIP features in V1 guide by @WoosukKwon in #19498
- [Kernels] Add activation chunking logic to FusedMoEModularKernel by @bnellnm in #19168
- [AMD] [Quantization] Add override flag for attention dtype instead of using kv_cache_dtype trigger by @rasmith in #17331
- [UX] Add Feedback During CUDAGraph Capture by @robertgshaw2-redhat in #19501
- [CI/Build] Fix torch nightly CI dependencies by @zou3519 in #19505
- [CI] change spell checker from codespell to typos by @andyxning in #18711
- [BugFix] Force registration of w8a8_block_fp8_matmul_deepgemm via lazy import by @varun-sundar-rabindranath in #19514
- Add Triton Fused MoE kernel config for E=16 on B200 by @b8zhong in #19518
- [Frontend] Improve error message in tool_choice validation by @22quinn in #19239
- [BugFix] Work-around incremental detokenization edge case error by @njhill in #19449
- [BugFix] Handle missing sep_token for Qwen3-Reranker in Score API by @strutive07 in #19522
- [AMD][Kernel][BugFix] fix test_rocm_compressed_tensors_w8a8 for rocm by @rasmith in #19509
- Fix typo by @2niuhe in #19525
- [Security] Prevent new imports of (cloud)pickle by @russellb in #18018
- [Bugfix][V1] Allow manual FlashAttention for Blackwell by @mgoin in #19492
- [Bugfix] Respect num-gpu-blocks-override in v1 by @jmswen in #19503
- [Quantization] Improve AWQ logic by @jeejeelee in #19431
- [Doc] Add V1 column to supported models list by @DarkLight1337 in #19523
- [NixlConnector] Drop `num_blocks` check by @NickLucche in #19532
- [Perf] Vectorize static / dynamic INT8 quant kernels by @yewentao256 in #19233
- Fix TorchAOConfig skip layers by @mobicham in #19265
- [torch.compile][ROCm] Fuse quantization onto attention using a torch.compile pass by @ProExpertProg in #16756
- [doc] Make top navigation sticky by @reidliu41 in #19540
- [Spec Decode][Benchmark] Generalize spec decode offline benchmark to more methods and datasets by @ekagra-ranjan in #18847
- [Misc] Turn MOE_DP_CHUNK_SIZE into an env var by @varun-sundar-rabindranath in #19506
- [Bugfix] Enforce contiguous input for dynamic_per_token FP8/INT8 quant by @mgoin in #19452
- [Doc] Unify structured outputs examples by @aarnphm in #18196
- [V1] Resolve failed concurrent structred output requests by @russellb in #19565
- Revert "[Build/CI] Add tracing deps to vllm container image (#15224)" by @kouroshHakha in #19378
- [BugFix] : Fix Batched DeepGemm Experts by @varun-sundar-rabindranath in #19515
- [Bugfix] Fix EAGLE vocab embedding for multimodal target model by @zixi-qi in #19570
- [Doc] uses absolute links for structured outputs by @aarnphm in #19582
- [doc] fix incorrect link by @reidliu41 in #19586
- [Misc] Correct broken docs link by @Zerohertz in #19553
- [CPU] Refine default config for the CPU backend by @bigPYJ1151 in #19539
- [Fix] bump mistral common to support magistral by @princepride in #19533
- [Fix] The zip function in Python 3.9 does not have the strict argument by @princepride in #19549
- use base version for version comparison by @BoyuanFeng in #19587
- [torch.compile] reorganize the cache directory to support compiling multiple models by @youkaichao in #19064
- [BugFix] Honor `enable_caching` in connector-delayed kvcache load case by @njhill in #19435
- [Model] Fix minimax model cache & lm_head precision by @qscqesze in #19592
- [Refactor] Remove unused variables in `moe_permute_unpermute_kernel.inl` by @yewentao256 in #19573
- [doc][mkdocs] fix the duplicate Supported features sections in GPU docs by @reidliu41 in #19606
- [CUDA] Enable full cudagraph for FlashMLA by @ProExpertProg in #18581
- [Doc] Add troubleshooting section to k8s deployment by @annapendleton in #19377
- [torch.compile] Use custom ops when use_inductor=False by @WoosukKwon in #19618
- Adding "AMD: Multi-step Tests" to amdproduction. by @Concurrensee in #19508
- [BugFix] Fix DP Coordinator incorrect debug log message by @njhill in #19624
- [V1][Metrics] Deprecate metrics with gpu_ prefix for non GPU specific metrics. by @sahelib25 in #18354
- [Bugfix][1/n] Fix the speculative decoding test by setting the target dtype by @houseroad in #19633
- [Misc] Modularize CLI Argument Parsing in Benchmark Scripts by @reidliu41 in #19593
- [Bugfix] Fix auto dtype casting for BatchFeature by @Isotr0py in #19316
- [Hardware][NVIDIA][kernel] Fp4 MOE quant kernel optimization by @jiahanc in #19500
- Only build CUTLASS MoE kernels on Hopper by @huydhn in #19648
- [Bugfix] Don't attempt to use triton if no driver is active by @kzawora-intel in #19561
- [Fix] Convert kv_transfer_config from dict to KVTransferConfig by @maobaolong in #19262
- [Perf...
v0.9.1
Highlights
This release features 274 commits from 123 contributors (27 new contributors!)
- Progress in large scale serving
- DP Attention + Expert Parallelism: CUDA graph support (#18724), DeepEP dispatch-combine kernel (#18434), batched/masked DeepGEMM kernel (#19111), CUTLASS MoE kernel with PPLX (#18762)
- Heterogeneous TP (#18833), NixlConnector: enable FlashInfer backend (#19090)
- DP: API-server scaleout with many-to-many server-engine comms (#17546), Support DP with Ray (#18779), allow AsyncLLMEngine.generate to target a specific DP rank (#19102), data parallel rank to KVEventBatch (#18925)
- Tooling: Simplify EP kernels installation (#19412)
- RLHF workflow: Support inplace model weights loading (#18745)
- Initial full support for Hybrid Memory Allocator (#17996), support cross-layer KV sharing (#18212)
- Add FlexAttention to vLLM V1 (#16078)
- Various production hardening related to full cuda graph mode (#19171, #19106, #19321)
Model Support
- Support Magistral (#19193), LoRA support for InternVL (#18842), minicpm eagle support (#18943), NemotronH support (#18863, #19249)
- Enable data parallel for Llama4 vision encoder (#18368)
- Add DeepSeek-R1-0528 function call chat template (#18874)
Hardware Support & Performance Optimizations
- Add H20-3e fused MoE kernel tuning configs for DeepSeek-R1/V3 (#19205), Qwen3-235B-A22B (#19315)
- Blackwell: Add Cutlass MLA backend (#17625), Tunings for SM100 FP8 CUTLASS kernel (#18778), Use FlashInfer by default on Blackwell GPUs (#19118), Tune `scaled_fp8_quant` by increasing vectorization (#18844)
- FP4: Add compressed-tensors NVFP4 support (#18312), FP4 MoE kernel optimization (#19110)
- CPU: V1 support for the CPU backend (#16441)
- ROCm: Add AITER grouped topk for DeepSeekV2 (#18825)
- POWER: Add IBM POWER11 Support to CPU Extension Detection (#19082)
- TPU: Initial support of model parallelism with single worker using SPMD (#18011), Multi-LoRA Optimizations for the V1 TPU backend (#15655)
- Neuron: Add multi-LoRA support for Neuron. (#18284), Add Multi-Modal model support for Neuron (#18921), Support quantization on neuron (#18283)
- Platform: Make torch distributed process group extendable (#18763)
Engine features
- Add Lora Support to Beam Search (#18346)
- Add rerank support to run_batch endpoint (#16278)
- CLI: add run batch (#18804)
- Server: custom logging (#18403), allowed_token_ids in ChatCompletionRequest (#19143)
- `LLM` API: make use_tqdm accept a callable for custom progress bars (#19357)
- perf: [KERNEL] Sampler. CUDA kernel for applying repetition penalty (#18437)
API Deprecations
- Disallow pos-args other than `model` when initializing `LLM` (#18802); see the example after this list
- Remove `inputs` arg fallback in Engine classes (#18799)
- Remove fallbacks for Embeddings API (#18795)
- Remove mean pooling default for `Qwen2EmbeddingModel` (#18913)
- Require overriding `get_dummy_text` and `get_dummy_mm_data` (#18796)
- Remove metrics that were deprecated in 0.8 (#18837)
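A minimal sketch of the new `LLM` calling convention: only the model may still be passed positionally, everything else must be a keyword argument.

```python
# Minimal sketch of the keyword-only LLM constructor (example model).
from vllm import LLM

llm = LLM("facebook/opt-125m", dtype="auto", tensor_parallel_size=1)  # OK
# LLM("facebook/opt-125m", "auto")  # no longer allowed: extra positional argument
```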
Documentation
- Add CLI doc (#18871)
- Update SECURITY.md with link to our security guide (#18961), Add security warning to bug report template (#19365)
What's Changed
- [CI/Build] [TPU] Fix TPU CI exit code by @CAROLZXYZXY in #18282
- [Neuron] Support quantization on neuron by @aws-satyajith in #18283
- Support datasets in `vllm bench serve` and sync with benchmark_[serving,datasets].py by @mgoin in #18566
- [Bugfix] Disable prefix caching by default for benchmark by @cascade812 in #18771
- [Build] Fixes for CMake install by @ProExpertProg in #18570
- [Core] Improve Tensor serialisation by @lgeiger in #18774
- [rocm] Fix wrong attention log by @fxmarty-amd in #18764
- [Bugfix] Fix nomic max_model_len by @noooop in #18755
- [Bugfix]: correctly propagate errors message caught at the chat_templating step to the client by @gcalmettes in #18769
- [V1] fix torch profiling for V1 offline scenarios by @divakar-amd in #18445
- [V1] [Bugfix] eagle bugfix and enable correct lm_head for multimodal (2) by @RonaldBXu in #18781
- [Bugfix][FailingTest]Fix test_model_load_with_params.py by @rabi in #18758
- [Deprecation] Require overriding `get_dummy_text` and `get_dummy_mm_data` by @DarkLight1337 in #18796
- [Deprecation] Remove unused sync methods in `async_timeout` by @DarkLight1337 in #18792
- [Deprecation] Remove fallbacks for Embeddings API by @DarkLight1337 in #18795
- [CI] improve embed testing by @noooop in #18747
- Fix PiecewiseCompileInterpreter by @zou3519 in #17338
- [BugFix] FA2 MLA Accuracy Issue by @LucasWilkinson in #18807
- [Platform][Dist] Make torch distributed process group extendable by @MengqingCao in #18763
- Enable Pydantic mypy checks and convert configs to Pydantic dataclasses by @hmellor in #17599
- [Frontend] add run batch to CLI by @reidliu41 in #18804
- decrement server_load on listen for disconnect by @daniel-salib in #18784
- [Core] Add Lora Support to Beam Search by @alex-jw-brooks in #18346
- [Chore] update ty configuration by @aarnphm in #18839
- [Misc] fix olmoe model layer for TP > 1 by @lengrongfu in #18828
- [V1][Metrics] Remove metrics that were deprecated in 0.8 by @markmc in #18837
- [Chore][Spec Decode] Update check NoneType instead of assigning variables by @aarnphm in #18836
- [Hardware][TPU][V1] Multi-LoRA Optimisations for the V1 TPU backend by @Akshat-Tripathi in #15655
- Remove checks for `None` for fields which should never be `None` by @hmellor in #17985
- [Core] Enable CUDA graphs for DP + All2All kernels by @varun-sundar-rabindranath in #18724
- [Bugfix][ROCm] fix the power of 2 exception from triton_unified_attention.py when running llama4 models and unit test fix by @hongxiayang in #18100
- Prevent the cross-encoder logic from being applied to classification tasks by @maxdebayser in #18838
- Add ability to use CUDAGraphs with use_inductor=False by @zou3519 in #17345
- [Bugfix][TPU] fix moe custom kernel import by @yaochengji in #18853
- [Doc][Neuron] Update documentation for Neuron by @elaineyz in #18868
- Skip device and quant Pydantic validation to make plugin device work by @Yikun in #18843
- Fixes a dead link in nightly benchmark readme by @nerdalert in #18856
- [Neuron] Add multi-LoRA support for Neuron. by @aws-satyajith in #18284
- [LoRA] Add LoRA support for InternVL by @jeejeelee in #18842
- [Doc] Remove redundant spaces from compatibility_matrix.md by @windsonsea in #18891
- [doc] add CLI doc by @reidliu41 in #18871
- [Bugfix] Fix misleading information in the documentation by @jeejeelee in #18845
- [Misc] Replace TODO in serving transcription by @NickLucche in #18895
- [Bugfix] Ensure tensors are contiguous during serialisation by @lgeiger in #18860
- [BugFix] Update pydantic to fix error on python 3.10 by @ProExpertProg in #18852
- Fix an error in dummy weight loading for quantization models by @Chenyaaang in #18855
- [Misc][Tools][Benchmark] Add benchmark_serving supports for llama.cpp. by @Duyi-Wang in #18692
- [Doc] Fix codeblocks formatting in LoRA adapters documentation by @Zerohertz in #18907
- [Bugfix] Fix the failing gte embedding test by @Isotr0py in #18720
- [Attention][V1] Toggle for v1 attention backend by @gshtras in #18275
- [ROCm][V0][Attention] Revert to the previous FA triton kernel by @gshtras in #18226
- [Deprecation] Disallow pos-args other than `model` when initializing `LLM` by @DarkLight1337 in #18802
- [Misc] Remove duplicate init for self.vllm_config by @googs1025 in #18896
- [V1] Allocate kv_cache with stride order for V1 by @NickLucche in #18775
- [BugFix] Make DP work with connector-delayed new requests by @njhill in #18559
- [P/D] NixlConnector DP fixes by @wseaton...
v0.9.1rc1
What's Changed
- [CI/Build] [TPU] Fix TPU CI exit code by @CAROLZXYZXY in #18282
- [Neuron] Support quantization on neuron by @aws-satyajith in #18283
- Support datasets in `vllm bench serve` and sync with benchmark_[serving,datasets].py by @mgoin in #18566
- [Bugfix] Disable prefix caching by default for benchmark by @cascade812 in #18771
- [Build] Fixes for CMake install by @ProExpertProg in #18570
- [Core] Improve Tensor serialisation by @lgeiger in #18774
- [rocm] Fix wrong attention log by @fxmarty-amd in #18764
- [Bugfix] Fix nomic max_model_len by @noooop in #18755
- [Bugfix]: correctly propagate errors message caught at the chat_templating step to the client by @gcalmettes in #18769
- [V1] fix torch profiling for V1 offline scenarios by @divakar-amd in #18445
- [V1] [Bugfix] eagle bugfix and enable correct lm_head for multimodal (2) by @RonaldBXu in #18781
- [Bugfix][FailingTest]Fix test_model_load_with_params.py by @rabi in #18758
- [Deprecation] Require overriding `get_dummy_text` and `get_dummy_mm_data` by @DarkLight1337 in #18796
- [Deprecation] Remove unused sync methods in `async_timeout` by @DarkLight1337 in #18792
- [Deprecation] Remove fallbacks for Embeddings API by @DarkLight1337 in #18795
- [CI] improve embed testing by @noooop in #18747
- Fix PiecewiseCompileInterpreter by @zou3519 in #17338
- [BugFix] FA2 MLA Accuracy Issue by @LucasWilkinson in #18807
- [Platform][Dist] Make torch distributed process group extendable by @MengqingCao in #18763
- Enable Pydantic mypy checks and convert configs to Pydantic dataclasses by @hmellor in #17599
- [Frontend] add run batch to CLI by @reidliu41 in #18804
- decrement server_load on listen for disconnect by @daniel-salib in #18784
- [Core] Add Lora Support to Beam Search by @alex-jw-brooks in #18346
- [Chore] update ty configuration by @aarnphm in #18839
- [Misc] fix olmoe model layer for TP > 1 by @lengrongfu in #18828
- [V1][Metrics] Remove metrics that were deprecated in 0.8 by @markmc in #18837
- [Chore][Spec Decode] Update check NoneType instead of assigning variables by @aarnphm in #18836
- [Hardware][TPU][V1] Multi-LoRA Optimisations for the V1 TPU backend by @Akshat-Tripathi in #15655
- Remove checks for `None` for fields which should never be `None` by @hmellor in #17985
- [Core] Enable CUDA graphs for DP + All2All kernels by @varun-sundar-rabindranath in #18724
- [Bugfix][ROCm] fix the power of 2 exception from triton_unified_attention.py when running llama4 models and unit test fix by @hongxiayang in #18100
- Prevent the cross-encoder logic from being applied to classification tasks by @maxdebayser in #18838
- Add ability to use CUDAGraphs with use_inductor=False by @zou3519 in #17345
- [Bugfix][TPU] fix moe custom kernel import by @yaochengji in #18853
- [Doc][Neuron] Update documentation for Neuron by @elaineyz in #18868
- Skip device and quant Pydantic validation to make plugin device work by @Yikun in #18843
- Fixes a dead link in nightly benchmark readme by @nerdalert in #18856
- [Neuron] Add multi-LoRA support for Neuron. by @aws-satyajith in #18284
- [LoRA] Add LoRA support for InternVL by @jeejeelee in #18842
- [Doc] Remove redundant spaces from compatibility_matrix.md by @windsonsea in #18891
- [doc] add CLI doc by @reidliu41 in #18871
- [Bugfix] Fix misleading information in the documentation by @jeejeelee in #18845
- [Misc] Replace TODO in serving transcription by @NickLucche in #18895
- [Bugfix] Ensure tensors are contiguous during serialisation by @lgeiger in #18860
- [BugFix] Update pydantic to fix error on python 3.10 by @ProExpertProg in #18852
- Fix an error in dummy weight loading for quantization models by @Chenyaaang in #18855
- [Misc][Tools][Benchmark] Add benchmark_serving supports for llama.cpp. by @Duyi-Wang in #18692
- [Doc] Fix codeblocks formatting in LoRA adapters documentation by @Zerohertz in #18907
- [Bugfix] Fix the failing gte embedding test by @Isotr0py in #18720
- [Attention][V1] Toggle for v1 attention backend by @gshtras in #18275
- [ROCm][V0][Attention] Revert to the previous FA triton kernel by @gshtras in #18226
- [Deprecation] Disallow pos-args other than `model` when initializing `LLM` by @DarkLight1337 in #18802
- [Misc] Remove duplicate init for self.vllm_config by @googs1025 in #18896
- [V1] Allocate kv_cache with stride order for V1 by @NickLucche in #18775
- [BugFix] Make DP work with connector-delayed new requests by @njhill in #18559
- [P/D] NixlConnector DP fixes by @wseaton in #18903
- Use standalone_compile by default in torch >= 2.8.0 by @zou3519 in #18846
- [TPU] remove transpose ops in moe kernel by @yaochengji in #18923
- [Bugfix] Fix PP default fallback behavior for V1 by @mgoin in #18915
- [Misc] Update type annotation for rotary embedding `base` by @DarkLight1337 in #18914
- [TPU][CI/CD] Clean up docker for TPU tests. by @CAROLZXYZXY in #18926
- improve the robustness of parsing vlms config in AutoRound by @wenhuach21 in #18894
- [Bugfix] Consistent ascii handling in tool parsers by @chaunceyjiang in #18883
- [Model] Use AutoWeightsLoader for mamba2 by @jinyouzhi in #18918
- [docs] fix: fix markdown syntax by @eric-haibin-lin in #18927
- [ROCm] Remove unnecessary assertion of max_model_len in ROCM_AITER_MLA attention backend. by @vllmellm in #18938
- [Bugfix] Remove NVFP4 scales assertions to fix load_format=dummy by @mgoin in #18861
- [Deprecation] Remove mean pooling default for `Qwen2EmbeddingModel` by @DarkLight1337 in #18913
- [Misc]Fix benchmarks/README.md for speculative decoding by @rabi in #18897
- [doc] add mkdocs doc by @reidliu41 in #18930
- [Model] Use in-place adds in SigLIP by @lgeiger in #18922
- [Bugfix][Failing Test] Fix test_vllm_port.py by @rabi in #18618
- [Misc]Fix typo by @Always-Naive in #18947
- [Bugfix][TPU] Fix tpu model runner testcase failure by @CAROLZXYZXY in #18810
- [CI/Build] remove regex from build dependencies by @dtrifiro in #18945
- [Feature] minicpm eagle support by @huangyuxiang03 in #18943
- [doc] show the count for fork and watch by @reidliu41 in #18950
- [Docs] Update SECURITY.md with link to our security guide by @russellb in #18961
- Improve "failed to get the hash of the compiled graph" error by @zou3519 in #18956
- [Perf] API-server scaleout with many-to-many server-engine comms by @njhill in #17546
- Benchmark script for fp8 vs bf16 gemm by @mgoin in #17126
- [VLM] Add PP support and fix GPTQ inference for Ovis models by @Isotr0py in #18958
- [Misc] add group_size is -1 in awq quantization by @lengrongfu in #18910
- Tool parser regex timeout handling by @wseaton in https://github.com/vl...
v0.9.0.1
This patch release contains an important bugfix for the DeepSeek family of models on NVIDIA Ampere and below (#18807)
Full Changelog: v0.9.0...v0.9.0.1
v0.9.0
Highlights
This release features 649 commits from 215 contributors (82 new contributors!)
- vLLM has upgraded to PyTorch 2.7! (#16859) This is a breaking change for environment dependency.
- The default wheel has been upgraded from CUDA 12.4 to CUDA 12.8. We will distribute CUDA 12.6 wheel on GitHub artifact.
- As a general rule of thumb, our CUDA version policy follows PyTorch's CUDA version policy.
- Enhanced NVIDIA Blackwell support. vLLM now ships with an initial set of optimized kernels on NVIDIA Blackwell for both attention and MLP.
- You can use our docker image or install the FlashInfer nightly wheel (`pip install https://download.pytorch.org/whl/cu128/flashinfer/flashinfer_python-0.2.5%2Bcu128torch2.7-cp38-abi3-linux_x86_64.whl`), then set `VLLM_ATTENTION_BACKEND=FLASHINFER` for better performance; see the example after this list.
- Upgraded support for the new FlashInfer main branch (#15777). Please check out #18153 for the full roadmap.
- Initial DP, EP, PD support for large scale inference
- EP:
- DP: Decouple engine process management and comms (#15977)
- PD:
- Migrate docs from Sphinx to MkDocs (#18145, #18610, #18614, #18616, #18622, #18626, #18627, #18635, #18637, #18657, #18663, #18666, #18713)
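A minimal sketch of opting into the FlashInfer attention backend, assuming the wheel above is installed and an example Llama checkpoint; the environment variable must be set before vLLM initializes.

```python
# Minimal sketch (example model; requires the FlashInfer wheel to be installed).
import os
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"  # set before vLLM starts

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
print(llm.generate(["Hello"], SamplingParams(max_tokens=8))[0].outputs[0].text)
```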
Notable Changes
- Removal of CUDA 12.4 support due to PyTorch upgrade to 2.7.
- Change `top_k` to be disabled with `0` (still accept `-1` for now) (#17773)
- The seed is now set to `0` by default for the V1 engine, meaning that different vLLM runs now yield the same outputs even if `temperature > 0`. This does not modify the random state in user code, since workers are run in separate processes unless `VLLM_USE_V1_MULTIPROCESSING=0`. (#17929, #18741) See the example after this list.
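A minimal sketch of the new seeding behavior, assuming a small example model: with the default seed of 0, repeated runs of this script produce the same samples even at nonzero temperature.

```python
# Minimal sketch (example model). V1 now defaults to seed=0, so repeated runs
# of this script yield the same text despite temperature > 0.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, max_tokens=16)
print(llm.generate(["Once upon a time"], params)[0].outputs[0].text)
# llm = LLM(model="facebook/opt-125m", seed=1234)  # pick a different, explicit seed
```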
Model Enhancements
- Support MiMo-7B (#17433), MiniMax-VL-01 (#16328), Ovis 1.6 (#17861), Ovis 2 (#15826), GraniteMoeHybrid 4.0 (#17497), FalconH1* (#18406), LlamaGuard4 (#17315)
- Please install the development version of `transformers` (from source) to use Falcon-H1.
- Embedding models: nomic-embed-text-v2-moe (#17785), new class of gte models (#17986)
- Progress in Hybrid Memory Allocator (#17394, #17479, #17474, #17483, #17193, #17946, #17945, #17999, #18001, #18593)
- DeepSeek: perf enhancement by moving more calls into cuda-graph region (#17484, #17668), Function Call (#17784), MTP in V1 (#18435)
- Qwen2.5-1M: Implements dual-chunk-flash-attn backend for dual chunk attention with sparse attention support (#11844)
- Qwen2.5-VL speed enhancement via rotary_emb optimization (#17973)
- InternVL models with Qwen2.5 backbone now support video inputs (#18499)
Performance, Production and Scaling
- Support full cuda graph in v1 (#16072)
- Pipeline Parallelism: MultiprocExecutor support (#14219), `torchrun` (#17827)
- Support sequence parallelism combined with pipeline parallelism (#18243)
- Async tensor parallelism using compilation pass (#17882)
- Perf: Use small max_num_batched_tokens for A100 (#17885)
- Fast Model Loading: Tensorizer support for V1 and LoRA (#17926)
- Multi-modality: Automatically cast multi-modal input dtype before transferring device (#18756)
Security
- Prevent side-channel attacks via cache salting (#17045)
- Fix image hash collision in certain edge cases (#17378)
- Add `VLLM_ALLOW_INSECURE_SERIALIZATION` env var (#17490)
- Migrate to REGEX Library to prevent catastrophic backtracking (#18454, #18750)
Features
- CLI: `deprecated=True` (#17426)
- Frontend: progress bar for adding requests (#17525), `chat_template_kwargs` in `LLM.chat` (#17356; see the example after this list), `/classify` endpoint (#17032), truncation control for embedding models (#14776), `cached_tokens` in response usage (#18149)
- LoRA: default local directory LoRA resolver plugin (#16855)
- Metrics: kv event publishing (#16750), API for accessing in-memory Prometheus metrics (#17010)
- Quantization: `nvidia/DeepSeek-R1-FP4` (#16362), Quark MXFP4 format (#16943), AutoRound (#17850), torchao models with `AOPerModuleConfig` (#17826), CUDA Graph support for V1 GGUF (#18646)
- Reasoning: deprecate `--enable-reasoning` (#17452)
- Spec Decode: EAGLE share input embedding (#17326), torch.compile & cudagraph to EAGLE (#17211), EAGLE3 (#17504), log accumulated metrics (#17913), Medusa (#17956)
- Structured Outputs: Thinking compatibility (#16577), Spec Decoding (#14702), Qwen3 reasoning parser (#17466), `tool_choice: required` for Xgrammar (#17845), Structural Tag with Guidance backend (#17333)
- Transformers backend: named parameters (#16868), interleaved sliding window attention (#18494)
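A minimal sketch of `chat_template_kwargs` in `LLM.chat`, assuming a Qwen3 checkpoint whose chat template understands the `enable_thinking` flag.

```python
# Minimal sketch (example model; enable_thinking is interpreted by the chat template).
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-0.6B")
out = llm.chat(
    [{"role": "user", "content": "Give a one-line summary of vLLM."}],
    SamplingParams(max_tokens=64),
    chat_template_kwargs={"enable_thinking": False},  # forwarded to the template
)
print(out[0].outputs[0].text)
```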
Hardware
- NVIDIA: cutlass support for blackwell fp8 blockwise gemm (#14383)
- TPU: Multi-LoRA implementation(#14238), default max-num-batched-tokens (#17508), V1 backend by default (#17673), top-logprobs (#17072)
- Neuron: NeuronxDistributedInference support (#15970), Speculative Decoding, Dynamic on-device sampling (#16357), Mistral Model (#18222), Multi-LoRA (#18284)
- AMD: Enable FP8 KV cache on V1 (#17870), Tuned fused moe config for Qwen3 MoE on MI300X (#17535, #17530), AITER biased group topk (#17955), Block-Scaled GEMM (#14968), MLA (#17523), Radeon GPU use Custom Paged Attention (#17004), reduce the number of environment variables in command line (#17229)
- Extensibility: Make PiecewiseBackend pluggable and extendable (#18076)
Documentation
- Update quickstart and install for cu128 using `--torch-backend=auto` (#18505)
- NVIDIA TensorRT Model Optimizer (#17561)
- Usage of Qwen3 thinking (#18291)
Developer Facing
- Benchmark: Add single turn MTBench to Serving Bench (#17202)
- Usability: Decrease import time of `vllm.multimodal` (#18031)
- Code Format: Code formatting using `ruff format` (#17656, #18068, #18400)
- Readability:
- Process: Propose a deprecation policy for the project (#17063)
- Testing: expanding torch nightly tests (#18004)
What's Changed
- Support loading transformers models with named parameters by @wuisawesome in #16868
- Add tuned triton fused_moe configs for Qwen3Moe by @mgoin in #17328
- [Benchmark] Add single turn MTBench to Serving Bench by @ekagra-ranjan in #17202
- [Optim] Compute multimodal hash only once per item by @DarkLight1337 in #17314
- implement Structural Tag with Guidance backend by @mmoskal in #17333
- [V1][Spec Decode] Make Eagle model arch config driven by @ekagra-ranjan in #17323
- [model] make llama4 compatible with pure dense layers by @luccafong in #17315
- [Bugfix] Fix `numel()` downcast in fused_layernorm_dynamic_per_token_quant.cu by @r-barnes in #17316
- Ignore `'<string>'` filepath by @zou3519 in #17330
- [Bugfix] Add contiguous call inside rope kernel wrapper by @timzsu in #17091
- [Misc] Add a Jinja template to support Mistral3 function calling by @chaunceyjiang in #17195
- [Model] support MiniMax-VL-01 model by @qscqesze in #16328
- [Misc] Move config fields to MultiModalConfig by @DarkLight1337 in #17343
- [Misc]Use a platform independent interface to obtain the device attributes by @ponix-j in #17100
- [Fix] Documentation spacing in compilation config help text by @Zerohertz in #17342
- [Build][Bugfix] Restrict setuptools version to <80 by @gshtras in #17320
- [Model] Ignore rotary embed load for Cohere model by @ekagra-ranjan in #17319
- Update docs requirements by @hmellor in #17379
- [Doc] Fix QWen3MOE info by @jeejeelee in #17381
- [Bugfix] Clean up MiniMax-VL and fix processing by @DarkLight1337 in #17354
- `pre-commit autoupdate` by @hmellor in #17380
- [Frontend] Support `chat_template_kwargs` in `LLM.chat` by @DarkLight1337 in #17356
- Transformers backend tweaks by @hmellor in #17365
- Fix: Spelling of inference by @a2q1p in #17387
- Improve literal dataclass field conversion to argparse argument by @hmellor in #17391
- [V1] Remove num_input_tokens from attn_metadata by @heheda12345 in #17193
- [Bugfix] add qwen3 reasoning-parser fix content is None when disable … by @mofanke in #17369
- fix gemma3 results all zero by @mayuyuace in #17364
- [Misc][ROCm] Exclude `cutlass_mla_decode` for ROCm build by @tywuAMD in #17289
- Enabling multi-group kernel tests. by @Alexei-V-Ivanov-AMD in https://github.com/vllm-p...