Releases: vllm-project/vllm
v0.10.0
Highlights
The v0.10.0 release includes 308 commits from 168 contributors (62 new!).
NOTE: This release begins the cleanup of the V0 engine codebase. We have removed the V0 CPU/XPU/TPU/HPU backends (#20412), long context LoRA (#21169), Prompt Adapters (#20588), Phi3-Small & BlockSparse Attention (#21217), and Spec Decode workers (#21152) so far, and we plan to continue deleting code that is no longer used.
Model Support
- New families: Llama 4 with EAGLE support (#20591), EXAONE 4.0 (#21060), Microsoft Phi-4-mini-flash-reasoning (#20702), Hunyuan V1 Dense + A13B with reasoning/tool parsing (#21368, #20625, #20820), Ling MoE models (#20680), JinaVL Reranker (#20260), Nemotron-Nano-VL-8B-V1 (#20349), Arcee (#21296), Voxtral (#20970).
- Enhanced compatibility: BERT/RoBERTa with AutoWeightsLoader (#20534), HF format support for MiniMax (#20211), Gemini configuration (#20971), GLM-4 updates (#20736).
- Architecture expansions: Attention-free model support (#20811), Hybrid SSM/Attention models on V1 (#20016), LlamaForSequenceClassification (#20807), expanded Mamba2 layer support (#20660).
- VLM improvements: VLM support with transformers backend (#20543), PrithviMAE on V1 engine (#20577).
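A minimal sketch of the transformers-backend VLM path above. The checkpoint name and image URL are placeholder assumptions, and `model_impl="transformers"` is the engine argument that opts into the Transformers modeling code; treat this as illustrative rather than an official recipe.

```python
# Minimal sketch: the checkpoint and image URL are placeholders;
# model_impl="transformers" opts into the Transformers backend.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-VL-3B-Instruct",  # example multimodal checkpoint
    model_impl="transformers",            # use the Transformers modeling code
)
messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
        {"type": "text", "text": "Describe this image in one sentence."},
    ],
}]
outputs = llm.chat(messages, SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```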
Engine Core
- Experimental async scheduling: `--async-scheduling` flag to overlap engine core scheduling with GPU runner (#19970).
- V1 engine improvements: backend-agnostic local attention (#21093), MLA FlashInfer ragged prefill (#20034), hybrid KV cache with local chunked attention (#19351).
- Multi-task support: models can now support multiple tasks (#20771), multiple poolers (#21227), and dynamic pooling parameter configuration (#21128).
- RLHF Support: new RPC methods for runtime weight reloading (#20096) and config updates (#20095), logprobs mode for selecting which stage of logprobs to return (#21398).
- Enhanced caching: multi-modal caching for transformers backend (#21358), reproducible prefix cache hashing using SHA-256 + CBOR (#20511); see the prefix-caching example after this list.
- Faster startup: CUDA graph capture is sped up by freezing the GC during capture (#21146).
- Elastic expert parallel for dynamic GPU scaling while preserving state (#20775).
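A minimal sketch of prefix caching in the offline API, assuming a small example model; the SHA-256 + CBOR hashing change itself requires no user-side code changes.

```python
# Minimal sketch: enable prefix caching in the offline API (example model name).
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-0.6B", enable_prefix_caching=True)
shared_prefix = "You are a concise assistant. Answer in one sentence.\n\n"
prompts = [shared_prefix + q for q in ("What is vLLM?", "What is a KV cache?")]
# Requests that share the same prompt prefix can reuse cached KV blocks.
for out in llm.generate(prompts, SamplingParams(max_tokens=32)):
    print(out.outputs[0].text)
```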
Hardware & Performance
- NVIDIA Blackwell/SM100 optimizations: CUTLASS block scaled group GEMM for smaller batches (#20640), FP8 groupGEMM support (#20447), DeepGEMM integration (#20087), FlashInfer MoE blockscale FP8 backend (#20645), CUDNN prefill API for MLA (#20411), Triton Fused MoE kernel config for FP8 E=16 on B200 (#20516).
- Performance improvements: 48% request duration reduction via microbatch tokenization for concurrent requests (#19334), fused MLA QKV + strided layernorm (#21116), Triton causal-conv1d for Mamba models (#18218).
- Hardware expansion: ARM CPU int8 quantization (#14129), PPC64LE/ARM V1 support (#20554), Intel XPU ray distributed execution (#20659), shared-memory pipeline parallel for CPU (#21289), FlashInfer ARM CUDA support (#21013).
Quantization
- New quantization support: MXFP4 for MoE models (#17888), BNB support for Mixtral and additional MoE models (#20893, #21100), in-flight quantization for MoE (#20061).
- Hardware-specific: FP8 KV cache quantization on TPU (#19292), FP8 support for BatchedTritonExperts (#18864), optimized INT8 vectorization kernels (#20331); see the FP8 example after this list.
- Performance optimizations: Triton backend for DeepGEMM per-token group quantization (#20841), CUDA kernel for per-token group quantization (#21083), CustomOp abstraction for FP8 (#19830).
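A minimal sketch of enabling FP8 at runtime, assuming an FP8-capable GPU and an example checkpoint. Note that `quantization="fp8"` and `kv_cache_dtype="fp8"` are existing engine arguments, shown here only to illustrate where the new kernels apply.

```python
# Minimal sketch (assumes an FP8-capable GPU and an example checkpoint).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model
    quantization="fp8",    # in-flight FP8 quantization of the weights
    kv_cache_dtype="fp8",  # store the KV cache in FP8
)
out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```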
API & Frontend
- OpenAI compatibility: Responses API implementation (#20504, #20975), image object support in llm.chat (#19635), tool calling with required choice and $defs (#20629); see the tool-calling example after this list.
- New endpoints: `get_tokenizer_info` for tokenizer/chat-template information (#20575), cache_salt support for completions/responses (#20981).
- Model loading: Tensorizer S3 integration with arbitrary arguments (#19619), HF repo paths & URLs for GGUF models (#20793), tokenization_kwargs for embedding truncation (#21033).
- CLI improvements: `--help=page` option for enhanced help documentation (#20961), default model changed to Qwen3-0.6B (#20335).
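A minimal sketch of `tool_choice="required"` against a running OpenAI-compatible server, assuming the server was launched with tool calling enabled (for example `vllm serve <model> --enable-auto-tool-choice --tool-call-parser hermes`) on localhost:8000; the tool definition below is purely illustrative.

```python
# Minimal sketch (assumes a vLLM server with tool calling enabled on localhost:8000).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # illustrative tool, not part of vLLM
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
resp = client.chat.completions.create(
    model=client.models.list().data[0].id,  # whichever model the server exposes
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    tools=tools,
    tool_choice="required",  # force the model to emit a tool call
)
print(resp.choices[0].message.tool_calls)
```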
Dependencies
What's Changed
- [Docs] Note that alternative structured output backends are supported by @russellb in #19426
- [ROCm][V1] Adding ROCm to the list of plaforms using V1 by default by @gshtras in #19440
- [Model] use AutoWeightsLoader for commandr by @py-andy-c in #19399
- Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B-FP8 by @Xu-Wenqing in #19401
- [BugFix] Allow use_cudagraph to work with dynamic VLLM_USE_V1 by @zou3519 in #19390
- [New Model]: Support Qwen3 Embedding & Reranker by @noooop in #19260
- [BugFix] Fix docker build cpu-dev image error by @2niuhe in #19394
- Fix test_max_model_len in tests/entrypoints/llm/test_generate.py by @houseroad in #19451
- [CI] Disable failing GGUF model test by @mgoin in #19454
- [Misc] Remove unused `MultiModalHasher.hash_prompt_mm_data` by @lgeiger in #19422
- Add fused MOE config for Qwen3 30B A3B on B200 by @0xjunhao in #19455
- Fix Typo in Documentation and Function Name by @leopardracer in #19442
- [ROCm] Add rules to automatically label ROCm related PRs by @houseroad in #19405
- [Kernel] Support deep_gemm for linear methods by @artetaout in #19085
- [Doc] Update V1 User Guide for Hardware and Models by @DarkLight1337 in #19474
- [Doc] Fix quantization link titles by @DarkLight1337 in #19478
- [Doc] Support "important" and "announcement" admonitions by @DarkLight1337 in #19479
- [Misc] Reduce warning message introduced in env_override by @houseroad in #19476
- Support non-string values in JSON keys from CLI by @DarkLight1337 in #19471
- Add cache to cuda get_device_capability by @mgoin in #19436
- Fix some typo by @Ximingwang-09 in #19475
- Support no privileged mode on CPU for docker and kubernetes deployments by @louie-tsai in #19241
- [Bugfix] Update the example code, make it work with the latest lmcache by @runzhen in #19453
- [CI] Update FlashInfer to 0.2.6.post1 by @mgoin in #19297
- [doc] fix "Other AI accelerators" getting started page by @davidxia in #19457
- [Misc] Fix misleading ROCm warning by @jeejeelee in #19486
- [Docs] Remove WIP features in V1 guide by @WoosukKwon in #19498
- [Kernels] Add activation chunking logic to FusedMoEModularKernel by @bnellnm in #19168
- [AMD] [Quantization] Add override flag for attention dtype instead of using kv_cache_dtype trigger by @rasmith in #17331
- [UX] Add Feedback During CUDAGraph Capture by @robertgshaw2-redhat in #19501
- [CI/Build] Fix torch nightly CI dependencies by @zou3519 in #19505
- [CI] change spell checker from codespell to typos by @andyxning in #18711
- [BugFix] Force registration of w8a8_block_fp8_matmul_deepgemm via lazy import by @varun-sundar-rabindranath in #19514
- Add Triton Fused MoE kernel config for E=16 on B200 by @b8zhong in #19518
- [Frontend] Improve error message in tool_choice validation by @22quinn in #19239
- [BugFix] Work-around incremental detokenization edge case error by @njhill in #19449
- [BugFix] Handle missing sep_token for Qwen3-Reranker in Score API by @strutive07 in #19522
- [AMD][Kernel][BugFix] fix test_rocm_compressed_tensors_w8a8 for rocm by @rasmith in #19509
- Fix typo by @2niuhe in #19525
- [Security] Prevent new imports of (cloud)pickle by @russellb in #18018
- [Bugfix][V1] Allow manual FlashAttention for Blackwell by @mgoin in #19492
- [Bugfix] Respect num-gpu-blocks-override in v1 by @jmswen in #19503
- [Quantization] Improve AWQ logic by @jeejeelee in #19431
- [Doc] Add V1 column to supported models list by @DarkLight1337 in #19523
- [NixlConnector] Drop `num_blocks` check by @NickLucche in #19532
- [Perf] Vectorize static / dynamic INT8 quant kernels by @yewentao256 in #19233
- Fix TorchAOConfig skip layers by @mobicham in #19265
- [torch.compile][ROCm] Fuse quantization onto attention using a torch.compile pass by @ProExpertProg in https://github.com/vllm-proj...
v0.10.0rc2
What's Changed
- [Model] use AutoWeightsLoader for bart by @calvin0327 in #18299
- [Model] Support VLMs with transformers backend by @zucchini-nlp in #20543
- [bugfix] fix syntax warning caused by backslash by @1195343015 in #21251
- [CI] Cleanup modelscope version constraint in Dockerfile by @yankay in #21243
- [Docs] Add RFC Meeting to Issue Template by @simon-mo in #21279
- Add the instruction to run e2e validation manually before release by @huydhn in #21023
- [Bugfix] Fix missing placeholder in logger debug by @DarkLight1337 in #21280
- [Model][1/N] Support multiple poolers at model level by @DarkLight1337 in #21227
- [Docs] Fix hardcoded links in docs by @hmellor in #21287
- [Docs] Make tables more space efficient in `supported_models.md` by @hmellor in #21291
- [Misc] unify variable for LLM instance by @andyxning in #20996
- Add Nvidia ModelOpt config adaptation by @Edwardf0t1 in #19815
- [Misc] Add sliding window to flashinfer test by @WoosukKwon in #21282
- [CPU] Enable shared-memory based pipeline parallel for CPU backend by @bigPYJ1151 in #21289
- [BugFix] make utils.current_stream thread-safety (#21252) by @simpx in #21253
- [Misc] Add dummy maverick test by @minosfuture in #21199
- [Attention] Clean up iRoPE in V1 by @LucasWilkinson in #21188
- [DP] Fix Prometheus Logging by @robertgshaw2-redhat in #21257
- Fix bad lm-eval fork by @mgoin in #21318
- [perf] Speed up align sum kernels by @hj-mistral in #21079
- [v1][sampler] Inplace logprobs comparison to get the token rank by @houseroad in #21283
- [XPU] Enable external_launcher to serve as an executor via torchrun by @chaojun-zhang in #21021
- [Doc] Fix CPU doc format by @bigPYJ1151 in #21316
- [Intel GPU] Ray Compiled Graph avoid NCCL for Intel GPU by @ratnampa in #21338
- Revert "[Performance] Performance improvements in non-blockwise fp8 CUTLASS MoE (#20762) by @minosfuture in #21334
- [Core] Minimize number of dict lookup in _maybe_evict_cached_block by @Jialin in #21281
- [V1] [Hybrid] Add new test to verify that hybrid views into KVCacheTensor are compatible by @tdoublep in #21300
- [Refactor] Fix Compile Warning #1444-D by @yewentao256 in #21208
- Fix kv_cache_dtype handling for out-of-tree HPU plugin by @kzawora-intel in #21302
- [Misc] DeepEPHighThroughtput - Enable Inductor pass by @varun-sundar-rabindranath in #21311
- [Bug] DeepGemm: Fix Cuda Init Error by @yewentao256 in #21312
- Update fp4 quantize API by @wenscarl in #21327
- [Feature][eplb] add verify ep or tp or dp by @lengrongfu in #21102
- Add arcee model by @alyosha-swamy in #21296
- [Bugfix] Fix eviction cached blocked logic by @simon-mo in #21357
- [Misc] Remove deprecated args in v0.10 by @kebe7jun in #21349
- [Core] Optimize update checks in LogitsProcessor by @Jialin in #21245
- [benchmark] Port benchmark request sent optimization to benchmark_serving by @Jialin in #21209
- [Core] Introduce popleft_n and append_n in FreeKVCacheBlockQueue to further optimize block_pool by @Jialin in #21222
- [Misc] unify variable for LLM instance v2 by @andyxning in #21356
- [perf] Add fused MLA QKV + strided layernorm by @mickaelseznec in #21116
- [feat]: add SM100 support for cutlass FP8 groupGEMM by @djmmoss in #20447
- [Perf] Cuda Kernel for Per Token Group Quant by @yewentao256 in #21083
- Adds parallel model weight loading for runai_streamer by @bbartels in #21330
- [feat] Enable mm caching for transformers backend by @zucchini-nlp in #21358
- Revert "[Refactor] Fix Compile Warning #1444-D (#21208)" by @yewentao256 in #21384
- Add tokenization_kwargs to encode for embedding model truncation by @Receiling in #21033
- [Bugfix] Decode Tokenized IDs to Strings for `hf_processor` in `llm.chat()` with `model_impl=transformers` by @ariG23498 in #21353
- [CI/Build] Fix test failure due to updated model repo by @DarkLight1337 in #21375
- Fix Flashinfer Allreduce+Norm enable disable calculation based on `fi_allreduce_fusion_max_token_num` by @xinli-git in #21325
- [Model] Add Qwen3CoderToolParser by @ranpox in #21396
- [Misc] Copy HF_TOKEN env var to Ray workers by @ruisearch42 in #21406
- [BugFix] Fix ray import error mem cleanup bug by @joerunde in #21381
- [CI/Build] Fix model executor tests by @DarkLight1337 in #21387
- [Bugfix][ROCm][Build] Fix build regression on ROCm by @gshtras in #21393
- Simplify weight loading in Transformers backend by @hmellor in #21382
- [BugFix] Update python to python3 calls for image; fix prefix & input calculations. by @ericehanley in #21391
- [BUGFIX] deepseek-v2-lite failed due to fused_qkv_a_proj name update by @xuechendi in #21414
- [Bugfix][CUDA] fixes CUDA FP8 kv cache dtype supported by @elvischenv in #21420
- Changing "amdproduction" allocation. by @Alexei-V-Ivanov-AMD in #21409
- [Bugfix] Fix nightly transformers CI failure by @Isotr0py in #21427
- [Core] Add basic unit test for maybe_evict_cached_block by @Jialin in #21400
- [Cleanup] Only log MoE DP setup warning if DP is enabled by @mgoin in #21315
- add clear messages for deprecated models by @youkaichao in #21424
- [Bugfix] ensure tool_choice is popped when `tool_choice:null` is passed in json payload by @gcalmettes in #19679
- Fixed typo in profiling logs by @sergiopaniego in #21441
- [Docs] Fix bullets and grammars in tool_calling.md by @windsonsea in #21440
- [Sampler] Introduce logprobs mode for logging by @houseroad in #21398
- Mamba V2 Test not Asserting Failures. by @fabianlim in #21379
- [Misc] fixed nvfp4_moe test failures due to invalid kwargs by @chenyang78 in #21246
- [Docs] Clean up v1/metrics.md by @windsonsea in #21449
- [Model] add Hunyuan V1 Dense Model support. by @kzjeef in #21368
- [V1] Check all pooling tasks during profiling by @DarkLight1337 in #21299
- [Bugfix][Qwen][DCA] fixes bug in dual-chunk-flash-attn backend for qwen 1m models. by @sighingnow in #21364
- [Tests] Add tests for headless internal DP LB by @njhill in #21450
- [Core][Model] PrithviMAE Enablement on vLLM v1 engine by @christian-pinto in #20577
- Add test case for compiling multiple graphs by @sarckk in #21044
- [TPU][TEST] Fix the downloading issue in TPU v1 test 11. by @QiliangCui in #21418
- [Core] Add `reload_weights` RPC method by @22quinn in #20096
- [V1] Fix local chunked attention always disabled by @sarckk in #21419
- [V0 Deprecation] Remove Prompt Adapters by @mgoin in #20588
- [Core] Freeze gc during cuda graph capture to speed up init by @mgoin in #21146
- feat(gguf_loader): accept HF repo paths & URLs for GGUF by @hardikkgupta in #20793
- [Frontend] Set MAX_AUDIO_CLI...
v0.10.0rc1
What's Changed
- [Kernel] Enable fp8 support for pplx and BatchedTritonExperts. by @bnellnm in #18864
- [Misc] Fix `Unable to detect current VLLM config. Defaulting to NHD kv cache layout` warning by @NickLucche in #20400
- [Bugfix] Register reducer even if transformers_modules not available by @eicherseiji in #19510
- Change warn_for_unimplemented_methods to debug by @mgoin in #20455
- [Platform] Add custom default max tokens by @gmarinho2 in #18557
- Add ignore consolidated file in mistral example code by @princepride in #20420
- [Misc] small update by @reidliu41 in #20462
- [Structured Outputs][V1] Skipping with models doesn't contain tokenizers by @aarnphm in #20365
- [Perf] Optimize Vectorization Utils for Int 8 Quantization Kernels by @yewentao256 in #20331
- [Misc] Add SPDX-FileCopyrightText by @jeejeelee in #20428
- Support Llama 4 for fused_marlin_moe by @mgoin in #20457
- [Bug][Frontend] Fix structure of transcription's decoder_prompt by @sangbumlikeagod in #18809
- [Model][3/N] Automatic conversion of CrossEncoding model by @noooop in #20168
- [Doc] Fix classification table in list of supported models by @DarkLight1337 in #20489
- [CI] add kvcache-connector dependency definition and add into CI build by @panpan0000 in #18193
- [Misc] Small: Remove global media connector. Each test should have its own test connector object. by @huachenheli in #20395
- Enable V1 for Hybrid SSM/Attention Models by @tdoublep in #20016
- [feat]: CUTLASS block scaled group gemm for SM100 by @djmmoss in #19757
- [CI Bugfix] Fix pre-commit failures on main by @mgoin in #20502
- [Doc] fix mutltimodal_inputs.md gh examples link by @GuyStone in #20497
- [Misc] Add security warning for development mode endpoints by @reidliu41 in #20508
- [doc] small fix by @reidliu41 in #20506
- [Misc] Remove the unused LoRA test code by @jeejeelee in #20494
- Fix unknown attribute of topk_indices_dtype in CompressedTensorsW8A8Fp8MoECutlassMethod by @luccafong in #20507
- [v1] Re-add fp32 support to v1 engine through FlexAttention by @Isotr0py in #19754
- [Misc] Add logger.exception for TPU information collection failures by @reidliu41 in #20510
- [Misc] remove unused import by @reidliu41 in #20517
- test_attention compat with coming xformers change by @bottler in #20487
- [BUG] Fix #20484. Support empty sequence in cuda penalty kernel by @vadiklyutiy in #20491
- [Bugfix] Fix missing per_act_token parameter in compressed_tensors_moe by @luccafong in #20509
- [BugFix] Fix: ImportError when building on hopper systems by @LucasWilkinson in #20513
- [TPU][Bugfix] fix the MoE OOM issue by @yaochengji in #20339
- [Frontend] Support image object in llm.chat by @sfeng33 in #19635
- [Benchmark] Add support for multiple batch size benchmark through CLI in `benchmark_moe.py` + Add Triton Fused MoE kernel config for FP8 E=16 on B200 by @b8zhong in #20516
- [Misc] call the pre-defined func by @reidliu41 in #20518
- [V0 deprecation] Remove V0 CPU/XPU/TPU backends by @WoosukKwon in #20412
- [V1] Support any head size for FlexAttention backend by @DarkLight1337 in #20467
- [BugFix][Spec Decode] Fix spec token ids in model runner by @WoosukKwon in #20530
- [Bugfix] Add `use_cross_encoder` flag to use correct activation in `ClassifierPooler` by @DarkLight1337 in #20527
- Implement OpenAI Responses API [1/N] by @WoosukKwon in #20504
- [Misc] add a tip for pre-commit by @reidliu41 in #20536
- [Refactor]Abstract Platform Interface for Distributed Backend and Add xccl Support for Intel XPU by @dbyoung18 in #19410
- [CI/Build] Enable phi2 lora test by @jeejeelee in #20540
- [XPU][CI] add v1/core test in xpu hardware ci by @Liangliang-Ma in #20537
- Add docstrings to url_schemes.py to improve readability by @windsonsea in #20545
- [XPU] log clean up for XPU platform by @yma11 in #20553
- [Docs] Clean up tables in supported_models.md by @windsonsea in #20552
- [Misc] remove unused jinaai_serving_reranking by @Abirdcfly in #18878
- [Misc] Set the minimum openai version by @jeejeelee in #20539
- [Doc] Remove extra whitespace from CI failures doc by @hmellor in #20565
- [Doc] Use `gh-pr` and `gh-issue` everywhere we can in the docs by @hmellor in #20564
- [Doc] Fix internal links so they don't always point to latest by @hmellor in #20563
- [Doc] Add outline for content tabs by @hmellor in #20571
- [Doc] Fix some MkDocs snippets used in the installation docs by @hmellor in #20572
- [Model][Last/4] Automatic conversion of CrossEncoding model by @noooop in #19675
- [Bugfix] Prevent IndexError for cached requests when pipeline parallelism is disabled by @panpan0000 in #20486
- [Feature] microbatch tokenization by @ztang2370 in #19334
- [DP] Copy environment variables to Ray DPEngineCoreActors by @ruisearch42 in #20344
- [Kernel] Optimize Prefill Attention in Unified Triton Attention Kernel by @jvlunteren in #20308
- [Misc] Add fully interleaved support for multimodal 'string' content format by @Dekakhrone in #14047
- [Misc] feat output content in stream response by @lengrongfu in #19608
- Fix links in multi-modal model contributing page by @hmellor in #18615
- [Config] Refactor mistral configs by @patrickvonplaten in #20570
- [Misc] Improve logging for dynamic shape cache compilation by @kyolebu in #20573
- [Bugfix] Fix Maverick correctness by filling zero to cache space in cutlass_moe by @minosfuture in #20167
- [Optimize] Don't send token ids when kv connector is not used by @WoosukKwon in #20586
- Make distinct `code` and `console` admonitions so readers are less likely to miss them by @hmellor in #20585
- [Bugfix]: Fix messy code when using logprobs by @chaunceyjiang in #19209
- [Doc] Syntax highlight request responses as JSON instead of bash by @hmellor in #20582
- [Docs] Rewrite offline inference guide by @crypdick in #20594
- [Docs] Improve docstring for ray data llm example by @crypdick in #20597
- [Docs] Add Ray Serve LLM section to openai compatible server guide by @crypdick in #20595
- [Docs] Add Anyscale to frameworks by @crypdick in #20590
- [Misc] improve error msg by @reidliu41 in #20604
- [CI/Build][CPU] Fix CPU CI and remove all CPU V0 files by @bigPYJ1151 in #20560
- [TPU] Temporary fix vmem oom for long model len by reducing page size by @Chenyaaang in #20278
- [Frontend] [Core] Integrate Tensorizer in to S3 loading machinery, allow passing arbitrary arguments during save/load by @sangstar in #19619
- [PD][Nixl] Remote consumer READ timeout for clearing request blocks by @NickLucche in #20139
- [Docs] Improve documentation for Deepseek R1 on Ray Serve LLM by @crypdick in #20601
- Remove unnecessary explicit title anchors and use relative links instead by @hmellor in #20620
- Stop using title frontmatter and fix doc that can only be ...
v0.9.2
Highlights
This release contains 452 commits from 167 contributors (31 new!)
NOTE: This is the last version where V0 engine code and features remain intact. We highly recommend migrating to the V1 engine.
Engine Core
- Priority scheduling is now implemented in the V1 engine (#19057; see the example after this list), along with embedding models in V1 (#16188) and Mamba2 in V1 (#19327).
- Full CUDA‑Graph execution is now available for all FlashAttention v3 (FA3) and FlashMLA paths, including prefix‑caching. CUDA graph capture now shows a live progress bar, which makes debugging easier (#20301, #18581, #19617, #19501).
- FlexAttention update – any head size, FP32 fallback (#20467, #19754).
- Shared `CachedRequestData` objects and cached sampler‑ID stores deliver perf enhancements (#20232, #20291).
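A minimal sketch of priority scheduling in the offline API, assuming an example model, the `scheduling_policy="priority"` engine argument, and the per-request `priority` list accepted by `LLM.generate`; lower values are assumed to be scheduled first (verify against the scheduler docs).

```python
# Minimal sketch (example model; confirm priority semantics in the docs).
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-0.6B", scheduling_policy="priority")
outputs = llm.generate(
    ["Summarize vLLM in one sentence.", "Write a haiku about GPUs."],
    SamplingParams(max_tokens=32),
    priority=[0, 1],  # one priority per prompt; 0 is scheduled ahead of 1
)
```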
Model Support
- New families: Ernie 4.5 (+MoE) (#20220), MiniMax‑M1 (#19677, #20297), Slim‑MoE “Phi‑tiny‑MoE‑instruct” (#20286), Tencent HunYuan‑MoE‑V1 (#20114), Keye‑VL‑8B‑Preview (#20126), GLM‑4.1 V (#19331), Gemma‑3 (text‑only, #20134), Tarsier 2 (#19887), Qwen 3 Embedding & Reranker (#19260; embedding example after this list), dots1 (#18254), GPT‑2 for Sequence Classification (#19663).
- Granite hybrid MoE configurations with shared experts are fully supported (#19652).
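A minimal sketch for the Qwen 3 Embedding item above, assuming the public Qwen/Qwen3-Embedding-0.6B checkpoint; `task="embed"` selects the pooling runner and `LLM.embed` returns one embedding per input prompt.

```python
# Minimal sketch (example embedding checkpoint).
from vllm import LLM

llm = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed")
outputs = llm.embed(["vLLM is a fast and easy-to-use library for LLM inference."])
print(len(outputs[0].outputs.embedding))  # embedding dimensionality
```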
Large‑Scale Serving & Engine Improvements
- Expert‑Parallel Load Balancer (EPLB) has been added! (#18343, #19790, #19885).
- Disaggregated serving enhancements: Avoid stranding blocks in P when aborted in D's waiting queue (#19223), let toy proxy handle /chat/completions (#19730)
- Native xPyD P2P NCCL transport as a base case for native PD without external dependency (#18242, #20246).
Hardware & Performance
- NVIDIA Blackwell
- Intel GPU (V1) backend with Flash‑Attention support (#19560).
- AMD ROCm: full‑graph capture for TritonAttention, quick All‑Reduce, and chunked pre‑fill (#19158, #19744, #18596).
- TPU: dynamic‑grid KV‑cache updates, head‑dim less than 128, tuned paged‑attention kernels, and KV‑padding fixes (#19928, #20235, #19620, #19813, #20048, #20339).
- Added a matrix of supported models and features (#20230).
Quantization
- Calibration‑free RTN INT4/INT8 pipeline for effortless, accurate compression (#18768).
- Compressed‑Tensor NVFP4 (including MoE) + emulation; FP4 emulation removed on < SM100 devices (#19879, #19990, #19563).
- Dynamic MoE‑layer quant (Marlin/GPTQ) and INT8 vectorization primitives (#19395, #20331, #19233).
- Bits‑and‑Bytes 0.45+ with improved double‑quant logic and AWQ quality (#20424, #20033, #19431, #20076).
API · CLI · Frontend
- API Server: Eliminate api_key and x_request_id headers middleware overhead (#19946)
- New OpenAI‑compatible endpoints: `/v1/audio/translations` & revamped `/v1/audio/transcriptions` (#19615, #20179, #19597); see the example after this list.
- Token‑level progress bar for `LLM.beam_search` and cached template‑resolution speed‑ups (#19301, #20065).
- Image‑object support in `llm.chat`, tool‑choice expansion, and custom‑arg passthroughs enrich multi‑modal agents (#19635, #17177, #16862).
- CLI QoL: better parsing for `-O/--compilation-config`, batch‑size‑sweep benchmarking, richer `--help`, faster startup (#20156, #20516, #20430, #19941).
- Metrics: deprecate metrics with the gpu_ prefix for non-GPU-specific metrics (#18354), export NaNs in logits to scheduler_stats if output is corrupted (#18777).
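A minimal sketch of the transcription endpoint through the OpenAI client, assuming a Whisper-style model is being served (for example `vllm serve openai/whisper-large-v3`) and that `audio.wav` exists locally.

```python
# Minimal sketch (assumes a Whisper-style model served on localhost:8000).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
with open("audio.wav", "rb") as f:
    transcription = client.audio.transcriptions.create(
        model="openai/whisper-large-v3",  # must match the served model
        file=f,
    )
print(transcription.text)
```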
Platform & Deployment
- No‑privileged CPU / Docker / K8s mode (#19241) and custom default max‑tokens for hosted platforms (#18557).
- Security hardening – runtime (cloud)pickle imports forbidden (#18018).
- Hermetic builds and wheel slimming (FA2 8.0 + PTX only) shrink supply‑chain surface (#18064, #19336).
What's Changed
- [Docs] Note that alternative structured output backends are supported by @russellb in #19426
- [ROCm][V1] Adding ROCm to the list of plaforms using V1 by default by @gshtras in #19440
- [Model] use AutoWeightsLoader for commandr by @py-andy-c in #19399
- Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B-FP8 by @Xu-Wenqing in #19401
- [BugFix] Allow use_cudagraph to work with dynamic VLLM_USE_V1 by @zou3519 in #19390
- [New Model]: Support Qwen3 Embedding & Reranker by @noooop in #19260
- [BugFix] Fix docker build cpu-dev image error by @2niuhe in #19394
- Fix test_max_model_len in tests/entrypoints/llm/test_generate.py by @houseroad in #19451
- [CI] Disable failing GGUF model test by @mgoin in #19454
- [Misc] Remove unused `MultiModalHasher.hash_prompt_mm_data` by @lgeiger in #19422
- Add fused MOE config for Qwen3 30B A3B on B200 by @0xjunhao in #19455
- Fix Typo in Documentation and Function Name by @leopardracer in #19442
- [ROCm] Add rules to automatically label ROCm related PRs by @houseroad in #19405
- [Kernel] Support deep_gemm for linear methods by @artetaout in #19085
- [Doc] Update V1 User Guide for Hardware and Models by @DarkLight1337 in #19474
- [Doc] Fix quantization link titles by @DarkLight1337 in #19478
- [Doc] Support "important" and "announcement" admonitions by @DarkLight1337 in #19479
- [Misc] Reduce warning message introduced in env_override by @houseroad in #19476
- Support non-string values in JSON keys from CLI by @DarkLight1337 in #19471
- Add cache to cuda get_device_capability by @mgoin in #19436
- Fix some typo by @Ximingwang-09 in #19475
- Support no privileged mode on CPU for docker and kubernetes deployments by @louie-tsai in #19241
- [Bugfix] Update the example code, make it work with the latest lmcache by @runzhen in #19453
- [CI] Update FlashInfer to 0.2.6.post1 by @mgoin in #19297
- [doc] fix "Other AI accelerators" getting started page by @davidxia in #19457
- [Misc] Fix misleading ROCm warning by @jeejeelee in #19486
- [Docs] Remove WIP features in V1 guide by @WoosukKwon in #19498
- [Kernels] Add activation chunking logic to FusedMoEModularKernel by @bnellnm in #19168
- [AMD] [Quantization] Add override flag for attention dtype instead of using kv_cache_dtype trigger by @rasmith in #17331
- [UX] Add Feedback During CUDAGraph Capture by @robertgshaw2-redhat in #19501
- [CI/Build] Fix torch nightly CI dependencies by @zou3519 in #19505
- [CI] change spell checker from codespell to typos by @andyxning in #18711
- [BugFix] Force registration of w8a8_block_fp8_matmul_deepgemm via lazy import by @varun-sundar-rabindranath in #19514
- Add Triton Fused MoE kernel config for E=16 on B200 by @b8zhong in #19518
- [Frontend] Improve error message in tool_choice validation by @22quinn in #19239
- [BugFix] Work-around incremental detokenization edge case error by @njhill in #19449
- [BugFix] Handle missing sep_token for Qwen3-Reranker in Score API by @strutive07 in #19522
- [AMD][Kernel][BugFix] fix test_rocm_compressed_tensors_w8a8 for rocm by @rasmith in #19509
- Fix typo by @2niuhe in #19525
- [Security] Prevent new imports of (cloud)pickle by @russellb in #18018
- [Bugfix][V1] Allow manual FlashAttention for Blackwell by @mgoin in #19492
- [Bugfix] Respect num-gpu-blocks-override in v1 by @jmswen in #19503
- [Quantization] Improve AWQ logic by @jeejeelee in #19431
- [Doc] Add V1 column to supported models list by @DarkLight1337 in #19523
- [NixlConnector] Drop `num_blocks` check by @NickLucche in #19532
- [Perf] Vectorize static / dynamic INT8 quant kernels by @yewentao256 in #19233
- Fix TorchAOConfig skip layers by @mobicham in #19265
- [torch.compile][ROCm] Fuse quantization onto attention using a torch.compile pass by @ProExpertProg in #16756
- [doc] Make top navigatio...
v0.9.2rc2
What's Changed
- [Kernel] Enable fp8 support for pplx and BatchedTritonExperts. by @bnellnm in #18864
- [Misc] Fix `Unable to detect current VLLM config. Defaulting to NHD kv cache layout` warning by @NickLucche in #20400
- [Bugfix] Register reducer even if transformers_modules not available by @eicherseiji in #19510
- Change warn_for_unimplemented_methods to debug by @mgoin in #20455
- [Platform] Add custom default max tokens by @gmarinho2 in #18557
- Add ignore consolidated file in mistral example code by @princepride in #20420
- [Misc] small update by @reidliu41 in #20462
- [Structured Outputs][V1] Skipping with models doesn't contain tokenizers by @aarnphm in #20365
- [Perf] Optimize Vectorization Utils for Int 8 Quantization Kernels by @yewentao256 in #20331
- [Misc] Add SPDX-FileCopyrightText by @jeejeelee in #20428
- Support Llama 4 for fused_marlin_moe by @mgoin in #20457
- [Bug][Frontend] Fix structure of transcription's decoder_prompt by @sangbumlikeagod in #18809
- [Model][3/N] Automatic conversion of CrossEncoding model by @noooop in #20168
- [Doc] Fix classification table in list of supported models by @DarkLight1337 in #20489
- [CI] add kvcache-connector dependency definition and add into CI build by @panpan0000 in #18193
- [Misc] Small: Remove global media connector. Each test should have its own test connector object. by @huachenheli in #20395
- Enable V1 for Hybrid SSM/Attention Models by @tdoublep in #20016
- [feat]: CUTLASS block scaled group gemm for SM100 by @djmmoss in #19757
- [CI Bugfix] Fix pre-commit failures on main by @mgoin in #20502
- [Doc] fix mutltimodal_inputs.md gh examples link by @GuyStone in #20497
- [Misc] Add security warning for development mode endpoints by @reidliu41 in #20508
- [doc] small fix by @reidliu41 in #20506
- [Misc] Remove the unused LoRA test code by @jeejeelee in #20494
- Fix unknown attribute of topk_indices_dtype in CompressedTensorsW8A8Fp8MoECutlassMethod by @luccafong in #20507
- [v1] Re-add fp32 support to v1 engine through FlexAttention by @Isotr0py in #19754
- [Misc] Add logger.exception for TPU information collection failures by @reidliu41 in #20510
- [Misc] remove unused import by @reidliu41 in #20517
- test_attention compat with coming xformers change by @bottler in #20487
- [BUG] Fix #20484. Support empty sequence in cuda penalty kernel by @vadiklyutiy in #20491
- [Bugfix] Fix missing per_act_token parameter in compressed_tensors_moe by @luccafong in #20509
- [BugFix] Fix: ImportError when building on hopper systems by @LucasWilkinson in #20513
- [TPU][Bugfix] fix the MoE OOM issue by @yaochengji in #20339
- [Frontend] Support image object in llm.chat by @sfeng33 in #19635
- [Benchmark] Add support for multiple batch size benchmark through CLI in `benchmark_moe.py` + Add Triton Fused MoE kernel config for FP8 E=16 on B200 by @b8zhong in #20516
- [Misc] call the pre-defined func by @reidliu41 in #20518
- [V0 deprecation] Remove V0 CPU/XPU/TPU backends by @WoosukKwon in #20412
- [V1] Support any head size for FlexAttention backend by @DarkLight1337 in #20467
- [BugFix][Spec Decode] Fix spec token ids in model runner by @WoosukKwon in #20530
- [Bugfix] Add `use_cross_encoder` flag to use correct activation in `ClassifierPooler` by @DarkLight1337 in #20527
New Contributors
- @sangbumlikeagod made their first contribution in #18809
- @djmmoss made their first contribution in #19757
- @GuyStone made their first contribution in #20497
- @bottler made their first contribution in #20487
Full Changelog: v0.9.2rc1...v0.9.2rc2
v0.9.2rc1
What's Changed
- [Docs] Note that alternative structured output backends are supported by @russellb in #19426
- [ROCm][V1] Adding ROCm to the list of plaforms using V1 by default by @gshtras in #19440
- [Model] use AutoWeightsLoader for commandr by @py-andy-c in #19399
- Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B-FP8 by @Xu-Wenqing in #19401
- [BugFix] Allow use_cudagraph to work with dynamic VLLM_USE_V1 by @zou3519 in #19390
- [New Model]: Support Qwen3 Embedding & Reranker by @noooop in #19260
- [BugFix] Fix docker build cpu-dev image error by @2niuhe in #19394
- Fix test_max_model_len in tests/entrypoints/llm/test_generate.py by @houseroad in #19451
- [CI] Disable failing GGUF model test by @mgoin in #19454
- [Misc] Remove unused `MultiModalHasher.hash_prompt_mm_data` by @lgeiger in #19422
- Add fused MOE config for Qwen3 30B A3B on B200 by @0xjunhao in #19455
- Fix Typo in Documentation and Function Name by @leopardracer in #19442
- [ROCm] Add rules to automatically label ROCm related PRs by @houseroad in #19405
- [Kernel] Support deep_gemm for linear methods by @artetaout in #19085
- [Doc] Update V1 User Guide for Hardware and Models by @DarkLight1337 in #19474
- [Doc] Fix quantization link titles by @DarkLight1337 in #19478
- [Doc] Support "important" and "announcement" admonitions by @DarkLight1337 in #19479
- [Misc] Reduce warning message introduced in env_override by @houseroad in #19476
- Support non-string values in JSON keys from CLI by @DarkLight1337 in #19471
- Add cache to cuda get_device_capability by @mgoin in #19436
- Fix some typo by @Ximingwang-09 in #19475
- Support no privileged mode on CPU for docker and kubernetes deployments by @louie-tsai in #19241
- [Bugfix] Update the example code, make it work with the latest lmcache by @runzhen in #19453
- [CI] Update FlashInfer to 0.2.6.post1 by @mgoin in #19297
- [doc] fix "Other AI accelerators" getting started page by @davidxia in #19457
- [Misc] Fix misleading ROCm warning by @jeejeelee in #19486
- [Docs] Remove WIP features in V1 guide by @WoosukKwon in #19498
- [Kernels] Add activation chunking logic to FusedMoEModularKernel by @bnellnm in #19168
- [AMD] [Quantization] Add override flag for attention dtype instead of using kv_cache_dtype trigger by @rasmith in #17331
- [UX] Add Feedback During CUDAGraph Capture by @robertgshaw2-redhat in #19501
- [CI/Build] Fix torch nightly CI dependencies by @zou3519 in #19505
- [CI] change spell checker from codespell to typos by @andyxning in #18711
- [BugFix] Force registration of w8a8_block_fp8_matmul_deepgemm via lazy import by @varun-sundar-rabindranath in #19514
- Add Triton Fused MoE kernel config for E=16 on B200 by @b8zhong in #19518
- [Frontend] Improve error message in tool_choice validation by @22quinn in #19239
- [BugFix] Work-around incremental detokenization edge case error by @njhill in #19449
- [BugFix] Handle missing sep_token for Qwen3-Reranker in Score API by @strutive07 in #19522
- [AMD][Kernel][BugFix] fix test_rocm_compressed_tensors_w8a8 for rocm by @rasmith in #19509
- Fix typo by @2niuhe in #19525
- [Security] Prevent new imports of (cloud)pickle by @russellb in #18018
- [Bugfix][V1] Allow manual FlashAttention for Blackwell by @mgoin in #19492
- [Bugfix] Respect num-gpu-blocks-override in v1 by @jmswen in #19503
- [Quantization] Improve AWQ logic by @jeejeelee in #19431
- [Doc] Add V1 column to supported models list by @DarkLight1337 in #19523
- [NixlConnector] Drop `num_blocks` check by @NickLucche in #19532
- [Perf] Vectorize static / dynamic INT8 quant kernels by @yewentao256 in #19233
- Fix TorchAOConfig skip layers by @mobicham in #19265
- [torch.compile][ROCm] Fuse quantization onto attention using a torch.compile pass by @ProExpertProg in #16756
- [doc] Make top navigation sticky by @reidliu41 in #19540
- [Spec Decode][Benchmark] Generalize spec decode offline benchmark to more methods and datasets by @ekagra-ranjan in #18847
- [Misc] Turn MOE_DP_CHUNK_SIZE into an env var by @varun-sundar-rabindranath in #19506
- [Bugfix] Enforce contiguous input for dynamic_per_token FP8/INT8 quant by @mgoin in #19452
- [Doc] Unify structured outputs examples by @aarnphm in #18196
- [V1] Resolve failed concurrent structred output requests by @russellb in #19565
- Revert "[Build/CI] Add tracing deps to vllm container image (#15224)" by @kouroshHakha in #19378
- [BugFix] : Fix Batched DeepGemm Experts by @varun-sundar-rabindranath in #19515
- [Bugfix] Fix EAGLE vocab embedding for multimodal target model by @zixi-qi in #19570
- [Doc] uses absolute links for structured outputs by @aarnphm in #19582
- [doc] fix incorrect link by @reidliu41 in #19586
- [Misc] Correct broken docs link by @Zerohertz in #19553
- [CPU] Refine default config for the CPU backend by @bigPYJ1151 in #19539
- [Fix] bump mistral common to support magistral by @princepride in #19533
- [Fix] The zip function in Python 3.9 does not have the strict argument by @princepride in #19549
- use base version for version comparison by @BoyuanFeng in #19587
- [torch.compile] reorganize the cache directory to support compiling multiple models by @youkaichao in #19064
- [BugFix] Honor `enable_caching` in connector-delayed kvcache load case by @njhill in #19435
- [Model] Fix minimax model cache & lm_head precision by @qscqesze in #19592
- [Refactor] Remove unused variables in `moe_permute_unpermute_kernel.inl` by @yewentao256 in #19573
- [doc][mkdocs] fix the duplicate Supported features sections in GPU docs by @reidliu41 in #19606
- [CUDA] Enable full cudagraph for FlashMLA by @ProExpertProg in #18581
- [Doc] Add troubleshooting section to k8s deployment by @annapendleton in #19377
- [torch.compile] Use custom ops when use_inductor=False by @WoosukKwon in #19618
- Adding "AMD: Multi-step Tests" to amdproduction. by @Concurrensee in #19508
- [BugFix] Fix DP Coordinator incorrect debug log message by @njhill in #19624
- [V1][Metrics] Deprecate metrics with gpu_ prefix for non GPU specific metrics. by @sahelib25 in #18354
- [Bugfix][1/n] Fix the speculative decoding test by setting the target dtype by @houseroad in #19633
- [Misc] Modularize CLI Argument Parsing in Benchmark Scripts by @reidliu41 in #19593
- [Bugfix] Fix auto dtype casting for BatchFeature by @Isotr0py in #19316
- [Hardware][NVIDIA][kernel] Fp4 MOE quant kernel optimization by @jiahanc in #19500
- Only build CUTLASS MoE kernels on Hopper by @huydhn in #19648
- [Bugfix] Don't attempt to use triton if no driver is active by @kzawora-intel in #19561
- [Fix] Convert kv_transfer_config from dict to KVTransferConfig by @maobaolong in #19262
- [Perf...
v0.9.1
Highlights
This release features 274 commits from 123 contributors (27 new contributors!)
- Progress in large scale serving
- DP Attention + Expert Parallelism: CUDA graph support (#18724), DeepEP dispatch-combine kernel (#18434), batched/masked DeepGEMM kernel (#19111), CUTLASS MoE kernel with PPLX (#18762)
- Heterogeneous TP (#18833), NixlConnector: enable FlashInfer backend (#19090)
- DP: API-server scaleout with many-to-many server-engine comms (#17546), Support DP with Ray (#18779), allow AsyncLLMEngine.generate to target a specific DP rank (#19102), data parallel rank to KVEventBatch (#18925)
- Tooling: Simplify EP kernels installation (#19412)
- RLHF workflow: Support inplace model weights loading (#18745)
- Initial full support for Hybrid Memory Allocator (#17996), support cross-layer KV sharing (#18212)
- Add FlexAttention to vLLM V1 (#16078)
- Various production hardening related to full cuda graph mode (#19171, #19106, #19321)
Model Support
- Support Magistral (#19193), LoRA support for InternVL (#18842), minicpm eagle support (#18943), NemotronH support (#18863, #19249)
- Enable data parallel for Llama4 vision encoder (#18368)
- Add DeepSeek-R1-0528 function call chat template (#18874)
Hardware Support & Performance Optimizations
- Add H20-3e fused MoE kernel tuning configs for DeepSeek-R1/V3 (#19205), Qwen3-235B-A22B (#19315)
- Blackwell: Add Cutlass MLA backend (#17625), Tunings for SM100 FP8 CUTLASS kernel (#18778), Use FlashInfer by default on Blackwell GPUs (#19118), Tune `scaled_fp8_quant` by increasing vectorization (#18844)
- FP4: Add compressed-tensors NVFP4 support (#18312), FP4 MoE kernel optimization (#19110)
- CPU: V1 support for the CPU backend (#16441)
- ROCm: Add AITER grouped topk for DeepSeekV2 (#18825)
- POWER: Add IBM POWER11 Support to CPU Extension Detection (#19082)
- TPU: Initial support of model parallelism with single worker using SPMD (#18011), Multi-LoRA Optimizations for the V1 TPU backend (#15655)
- Neuron: Add multi-LoRA support for Neuron. (#18284), Add Multi-Modal model support for Neuron (#18921), Support quantization on neuron (#18283)
- Platform: Make torch distributed process group extendable (#18763)
Engine features
- Add Lora Support to Beam Search (#18346)
- Add rerank support to run_batch endpoint (#16278)
- CLI: add run batch (#18804)
- Server: custom logging (#18403), allowed_token_ids in ChatCompletionRequest (#19143)
- `LLM` API: make use_tqdm accept a callable for custom progress bars (#19357)
- perf: [KERNEL] Sampler. CUDA kernel for applying repetition penalty (#18437)
API Deprecations
- Disallow pos-args other than `model` when initializing `LLM` (#18802); see the example after this list
- Remove `inputs` arg fallback in Engine classes (#18799)
- Remove fallbacks for Embeddings API (#18795)
- Remove mean pooling default for `Qwen2EmbeddingModel` (#18913)
- Require overriding `get_dummy_text` and `get_dummy_mm_data` (#18796)
- Remove metrics that were deprecated in 0.8 (#18837)
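A minimal sketch of the new `LLM` calling convention: only the model may still be passed positionally, everything else must be a keyword argument.

```python
# Minimal sketch of the keyword-only LLM constructor (example model).
from vllm import LLM

llm = LLM("facebook/opt-125m", dtype="auto", tensor_parallel_size=1)  # OK
# LLM("facebook/opt-125m", "auto")  # no longer allowed: extra positional argument
```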
Documentation
- Add CLI doc (#18871)
- Update SECURITY.md with link to our security guide (#18961), Add security warning to bug report template (#19365)
What's Changed
- [CI/Build] [TPU] Fix TPU CI exit code by @CAROLZXYZXY in #18282
- [Neuron] Support quantization on neuron by @aws-satyajith in #18283
- Support datasets in `vllm bench serve` and sync with benchmark_[serving,datasets].py by @mgoin in #18566
- [Bugfix] Disable prefix caching by default for benchmark by @cascade812 in #18771
- [Build] Fixes for CMake install by @ProExpertProg in #18570
- [Core] Improve Tensor serialisation by @lgeiger in #18774
- [rocm] Fix wrong attention log by @fxmarty-amd in #18764
- [Bugfix] Fix nomic max_model_len by @noooop in #18755
- [Bugfix]: correctly propagate errors message caught at the chat_templating step to the client by @gcalmettes in #18769
- [V1] fix torch profiling for V1 offline scenarios by @divakar-amd in #18445
- [V1] [Bugfix] eagle bugfix and enable correct lm_head for multimodal (2) by @RonaldBXu in #18781
- [Bugfix][FailingTest]Fix test_model_load_with_params.py by @rabi in #18758
- [Deprecation] Require overriding `get_dummy_text` and `get_dummy_mm_data` by @DarkLight1337 in #18796
- [Deprecation] Remove unused sync methods in `async_timeout` by @DarkLight1337 in #18792
- [Deprecation] Remove fallbacks for Embeddings API by @DarkLight1337 in #18795
- [CI] improve embed testing by @noooop in #18747
- Fix PiecewiseCompileInterpreter by @zou3519 in #17338
- [BugFix] FA2 MLA Accuracy Issue by @LucasWilkinson in #18807
- [Platform][Dist] Make torch distributed process group extendable by @MengqingCao in #18763
- Enable Pydantic mypy checks and convert configs to Pydantic dataclasses by @hmellor in #17599
- [Frontend] add run batch to CLI by @reidliu41 in #18804
- decrement server_load on listen for disconnect by @daniel-salib in #18784
- [Core] Add Lora Support to Beam Search by @alex-jw-brooks in #18346
- [Chore] update ty configuration by @aarnphm in #18839
- [Misc] fix olmoe model layer for TP > 1 by @lengrongfu in #18828
- [V1][Metrics] Remove metrics that were deprecated in 0.8 by @markmc in #18837
- [Chore][Spec Decode] Update check NoneType instead of assigning variables by @aarnphm in #18836
- [Hardware][TPU][V1] Multi-LoRA Optimisations for the V1 TPU backend by @Akshat-Tripathi in #15655
- Remove checks for `None` for fields which should never be `None` by @hmellor in #17985
- [Core] Enable CUDA graphs for DP + All2All kernels by @varun-sundar-rabindranath in #18724
- [Bugfix][ROCm] fix the power of 2 exception from triton_unified_attention.py when running llama4 models and unit test fix by @hongxiayang in #18100
- Prevent the cross-encoder logic from being applied to classification tasks by @maxdebayser in #18838
- Add ability to use CUDAGraphs with use_inductor=False by @zou3519 in #17345
- [Bugfix][TPU] fix moe custom kernel import by @yaochengji in #18853
- [Doc][Neuron] Update documentation for Neuron by @elaineyz in #18868
- Skip device and quant Pydantic validation to make plugin device work by @Yikun in #18843
- Fixes a dead link in nightly benchmark readme by @nerdalert in #18856
- [Neuron] Add multi-LoRA support for Neuron. by @aws-satyajith in #18284
- [LoRA] Add LoRA support for InternVL by @jeejeelee in #18842
- [Doc] Remove redundant spaces from compatibility_matrix.md by @windsonsea in #18891
- [doc] add CLI doc by @reidliu41 in #18871
- [Bugfix] Fix misleading information in the documentation by @jeejeelee in #18845
- [Misc] Replace TODO in serving transcription by @NickLucche in #18895
- [Bugfix] Ensure tensors are contiguous during serialisation by @lgeiger in #18860
- [BugFix] Update pydantic to fix error on python 3.10 by @ProExpertProg in #18852
- Fix an error in dummy weight loading for quantization models by @Chenyaaang in #18855
- [Misc][Tools][Benchmark] Add benchmark_serving supports for llama.cpp. by @Duyi-Wang in #18692
- [Doc] Fix codeblocks formatting in LoRA adapters documentation by @Zerohertz in #18907
- [Bugfix] Fix the failing gte embedding test by @Isotr0py in #18720
- [Attention][V1] Toggle for v1 attention backend by @gshtras in #18275
- [ROCm][V0][Attention] Revert to the previous FA triton kernel by @gshtras in #18226
- [Deprecation] Disallow pos-args other than `model` when initializing `LLM` by @DarkLight1337 in #18802
- [Misc] Remove duplicate init for self.vllm_config by @googs1025 in #18896
- [V1] Allocate kv_cache with stride order for V1 by @NickLucche in #18775
- [BugFix] Make DP work with connector-delayed new requests by @njhill in #18559
- [P/D] NixlConnector DP fixes by @wseaton...
v0.9.1rc1
What's Changed
- [CI/Build] [TPU] Fix TPU CI exit code by @CAROLZXYZXY in #18282
- [Neuron] Support quantization on neuron by @aws-satyajith in #18283
- Support datasets in `vllm bench serve` and sync with benchmark_[serving,datasets].py by @mgoin in #18566
- [Bugfix] Disable prefix caching by default for benchmark by @cascade812 in #18771
- [Build] Fixes for CMake install by @ProExpertProg in #18570
- [Core] Improve Tensor serialisation by @lgeiger in #18774
- [rocm] Fix wrong attention log by @fxmarty-amd in #18764
- [Bugfix] Fix nomic max_model_len by @noooop in #18755
- [Bugfix]: correctly propagate errors message caught at the chat_templating step to the client by @gcalmettes in #18769
- [V1] fix torch profiling for V1 offline scenarios by @divakar-amd in #18445
- [V1] [Bugfix] eagle bugfix and enable correct lm_head for multimodal (2) by @RonaldBXu in #18781
- [Bugfix][FailingTest]Fix test_model_load_with_params.py by @rabi in #18758
- [Deprecation] Require overriding `get_dummy_text` and `get_dummy_mm_data` by @DarkLight1337 in #18796
- [Deprecation] Remove unused sync methods in `async_timeout` by @DarkLight1337 in #18792
- [Deprecation] Remove fallbacks for Embeddings API by @DarkLight1337 in #18795
- [CI] improve embed testing by @noooop in #18747
- Fix PiecewiseCompileInterpreter by @zou3519 in #17338
- [BugFix] FA2 MLA Accuracy Issue by @LucasWilkinson in #18807
- [Platform][Dist] Make torch distributed process group extendable by @MengqingCao in #18763
- Enable Pydantic mypy checks and convert configs to Pydantic dataclasses by @hmellor in #17599
- [Frontend] add run batch to CLI by @reidliu41 in #18804
- decrement server_load on listen for disconnect by @daniel-salib in #18784
- [Core] Add Lora Support to Beam Search by @alex-jw-brooks in #18346
- [Chore] update ty configuration by @aarnphm in #18839
- [Misc] fix olmoe model layer for TP > 1 by @lengrongfu in #18828
- [V1][Metrics] Remove metrics that were deprecated in 0.8 by @markmc in #18837
- [Chore][Spec Decode] Update check NoneType instead of assigning variables by @aarnphm in #18836
- [Hardware][TPU][V1] Multi-LoRA Optimisations for the V1 TPU backend by @Akshat-Tripathi in #15655
- Remove checks for `None` for fields which should never be `None` by @hmellor in #17985
- [Core] Enable CUDA graphs for DP + All2All kernels by @varun-sundar-rabindranath in #18724
- [Bugfix][ROCm] fix the power of 2 exception from triton_unified_attention.py when running llama4 models and unit test fix by @hongxiayang in #18100
- Prevent the cross-encoder logic from being applied to classification tasks by @maxdebayser in #18838
- Add ability to use CUDAGraphs with use_inductor=False by @zou3519 in #17345
- [Bugfix][TPU] fix moe custom kernel import by @yaochengji in #18853
- [Doc][Neuron] Update documentation for Neuron by @elaineyz in #18868
- Skip device and quant Pydantic validation to make plugin device work by @Yikun in #18843
- Fixes a dead link in nightly benchmark readme by @nerdalert in #18856
- [Neuron] Add multi-LoRA support for Neuron. by @aws-satyajith in #18284
- [LoRA] Add LoRA support for InternVL by @jeejeelee in #18842
- [Doc] Remove redundant spaces from compatibility_matrix.md by @windsonsea in #18891
- [doc] add CLI doc by @reidliu41 in #18871
- [Bugfix] Fix misleading information in the documentation by @jeejeelee in #18845
- [Misc] Replace TODO in serving transcription by @NickLucche in #18895
- [Bugfix] Ensure tensors are contiguous during serialisation by @lgeiger in #18860
- [BugFix] Update pydantic to fix error on python 3.10 by @ProExpertProg in #18852
- Fix an error in dummy weight loading for quantization models by @Chenyaaang in #18855
- [Misc][Tools][Benchmark] Add benchmark_serving supports for llama.cpp. by @Duyi-Wang in #18692
- [Doc] Fix codeblocks formatting in LoRA adapters documentation by @Zerohertz in #18907
- [Bugfix] Fix the failing gte embedding test by @Isotr0py in #18720
- [Attention][V1] Toggle for v1 attention backend by @gshtras in #18275
- [ROCm][V0][Attention] Revert to the previous FA triton kernel by @gshtras in #18226
- [Deprecation] Disallow pos-args other than `model` when initializing `LLM` by @DarkLight1337 in #18802
- [Misc] Remove duplicate init for self.vllm_config by @googs1025 in #18896
- [V1] Allocate kv_cache with stride order for V1 by @NickLucche in #18775
- [BugFix] Make DP work with connector-delayed new requests by @njhill in #18559
- [P/D] NixlConnector DP fixes by @wseaton in #18903
- Use standalone_compile by default in torch >= 2.8.0 by @zou3519 in #18846
- [TPU] remove transpose ops in moe kernel by @yaochengji in #18923
- [Bugfix] Fix PP default fallback behavior for V1 by @mgoin in #18915
- [Misc] Update type annotation for rotary embedding `base` by @DarkLight1337 in #18914
- [TPU][CI/CD] Clean up docker for TPU tests. by @CAROLZXYZXY in #18926
- improve the robustness of parsing vlms config in AutoRound by @wenhuach21 in #18894
- [Bugfix] Consistent ascii handling in tool parsers by @chaunceyjiang in #18883
- [Model] Use AutoWeightsLoader for mamba2 by @jinyouzhi in #18918
- [docs] fix: fix markdown syntax by @eric-haibin-lin in #18927
- [ROCm] Remove unnecessary assertion of max_model_len in ROCM_AITER_MLA attention backend. by @vllmellm in #18938
- [Bugfix] Remove NVFP4 scales assertions to fix load_format=dummy by @mgoin in #18861
- [Deprecation] Remove mean pooling default for `Qwen2EmbeddingModel` by @DarkLight1337 in #18913
- [Misc]Fix benchmarks/README.md for speculative decoding by @rabi in #18897
- [doc] add mkdocs doc by @reidliu41 in #18930
- [Model] Use in-place adds in SigLIP by @lgeiger in #18922
- [Bugfix][Failing Test] Fix test_vllm_port.py by @rabi in #18618
- [Misc]Fix typo by @Always-Naive in #18947
- [Bugfix][TPU] Fix tpu model runner testcase failure by @CAROLZXYZXY in #18810
- [CI/Build] remove regex from build dependencies by @dtrifiro in #18945
- [Feature] minicpm eagle support by @huangyuxiang03 in #18943
- [doc] show the count for fork and watch by @reidliu41 in #18950
- [Docs] Update SECURITY.md with link to our security guide by @russellb in #18961
- Improve "failed to get the hash of the compiled graph" error by @zou3519 in #18956
- [Perf] API-server scaleout with many-to-many server-engine comms by @njhill in #17546
- Benchmark script for fp8 vs bf16 gemm by @mgoin in #17126
- [VLM] Add PP support and fix GPTQ inference for Ovis models by @Isotr0py in #18958
- [Misc] add group_size is -1 in awq quantization by @lengrongfu in #18910
- Tool parser regex timeout handling by @wseaton in https://github.com/vl...
v0.9.0.1
This patch release contains an important bugfix for the DeepSeek family of models on NVIDIA Ampere and below (#18807)
Full Changelog: v0.9.0...v0.9.0.1
v0.9.0
Highlights
This release features 649 commits from 215 contributors (82 new contributors!)
- vLLM has upgraded to PyTorch 2.7! (#16859) This is a breaking change for environment dependency.
- The default wheel has been upgraded from CUDA 12.4 to CUDA 12.8. We will distribute CUDA 12.6 wheel on GitHub artifact.
- As a general rule of thumb, our CUDA version policy follows PyTorch's CUDA version policy.
- Enhanced NVIDIA Blackwell support. vLLM now ships with an initial set of optimized kernels on NVIDIA Blackwell for both attention and MLP.
- You can use our docker image or install the FlashInfer nightly wheel (`pip install https://download.pytorch.org/whl/cu128/flashinfer/flashinfer_python-0.2.5%2Bcu128torch2.7-cp38-abi3-linux_x86_64.whl`), then set `VLLM_ATTENTION_BACKEND=FLASHINFER` for better performance; see the example after this list.
- Upgraded support for the new FlashInfer main branch (#15777). Please check out #18153 for the full roadmap.
- Initial DP, EP, PD support for large scale inference
- EP:
- DP: Decouple engine process management and comms (#15977)
- PD:
- Migrate docs from Sphinx to MkDocs (#18145, #18610, #18614, #18616, #18622, #18626, #18627, #18635, #18637, #18657, #18663, #18666, #18713)
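A minimal sketch of opting into the FlashInfer attention backend, assuming the wheel above is installed and an example Llama checkpoint; the environment variable must be set before vLLM initializes.

```python
# Minimal sketch (example model; requires the FlashInfer wheel to be installed).
import os
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"  # set before vLLM starts

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
print(llm.generate(["Hello"], SamplingParams(max_tokens=8))[0].outputs[0].text)
```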
Notable Changes
- Removal of CUDA 12.4 support due to PyTorch upgrade to 2.7.
- Change `top_k` to be disabled with `0` (still accept `-1` for now) (#17773)
- The seed is now set to `0` by default for the V1 engine, meaning that different vLLM runs now yield the same outputs even if `temperature > 0`. This does not modify the random state in user code, since workers are run in separate processes unless `VLLM_USE_V1_MULTIPROCESSING=0`. (#17929, #18741) See the example after this list.
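A minimal sketch of the new seeding behavior, assuming a small example model: with the default seed of 0, repeated runs of this script produce the same samples even at nonzero temperature.

```python
# Minimal sketch (example model). V1 now defaults to seed=0, so repeated runs
# of this script yield the same text despite temperature > 0.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, max_tokens=16)
print(llm.generate(["Once upon a time"], params)[0].outputs[0].text)
# llm = LLM(model="facebook/opt-125m", seed=1234)  # pick a different, explicit seed
```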
Model Enhancements
- Support MiMo-7B (#17433), MiniMax-VL-01 (#16328), Ovis 1.6 (#17861), Ovis 2 (#15826), GraniteMoeHybrid 4.0 (#17497), FalconH1* (#18406), LlamaGuard4 (#17315)
- Please install the development version of `transformers` (from source) to use Falcon-H1.
- Embedding models: nomic-embed-text-v2-moe (#17785), new class of gte models (#17986)
- Progress in Hybrid Memory Allocator (#17394, #17479, #17474, #17483, #17193, #17946, #17945, #17999, #18001, #18593)
- DeepSeek: perf enhancement by moving more calls into cuda-graph region (#17484, #17668), Function Call (#17784), MTP in V1 (#18435)
- Qwen2.5-1M: Implements dual-chunk-flash-attn backend for dual chunk attention with sparse attention support (#11844)
- Qwen2.5-VL speed enhancement via rotary_emb optimization (#17973)
- InternVL models with Qwen2.5 backbone now support video inputs (#18499)
Performance, Production and Scaling
- Support full cuda graph in v1 (#16072)
- Pipeline Parallelism: MultiprocExecutor support (#14219), `torchrun` (#17827)
- Support sequence parallelism combined with pipeline parallelism (#18243)
- Async tensor parallelism using compilation pass (#17882)
- Perf: Use small max_num_batched_tokens for A100 (#17885)
- Fast Model Loading: Tensorizer support for V1 and LoRA (#17926)
- Multi-modality: Automatically cast multi-modal input dtype before transferring device (#18756)
Security
- Prevent side-channel attacks via cache salting (#17045)
- Fix image hash collision in certain edge cases (#17378)
- Add `VLLM_ALLOW_INSECURE_SERIALIZATION` env var (#17490)
- Migrate to REGEX Library to prevent catastrophic backtracking (#18454, #18750)
Features
- CLI: `deprecated=True` (#17426)
- Frontend: progress bar for adding requests (#17525), `chat_template_kwargs` in `LLM.chat` (#17356; see the example after this list), `/classify` endpoint (#17032), truncation control for embedding models (#14776), `cached_tokens` in response usage (#18149)
- LoRA: default local directory LoRA resolver plugin (#16855)
- Metrics: kv event publishing (#16750), API for accessing in-memory Prometheus metrics (#17010)
- Quantization: `nvidia/DeepSeek-R1-FP4` (#16362), Quark MXFP4 format (#16943), AutoRound (#17850), torchao models with `AOPerModuleConfig` (#17826), CUDA Graph support for V1 GGUF (#18646)
- Reasoning: deprecate `--enable-reasoning` (#17452)
- Spec Decode: EAGLE share input embedding (#17326), torch.compile & cudagraph to EAGLE (#17211), EAGLE3 (#17504), log accumulated metrics (#17913), Medusa (#17956)
- Structured Outputs: Thinking compatibility (#16577), Spec Decoding (#14702), Qwen3 reasoning parser (#17466), `tool_choice: required` for Xgrammar (#17845), Structural Tag with Guidance backend (#17333)
- Transformers backend: named parameters (#16868), interleaved sliding window attention (#18494)
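A minimal sketch of `chat_template_kwargs` in `LLM.chat`, assuming a Qwen3 checkpoint whose chat template understands the `enable_thinking` flag.

```python
# Minimal sketch (example model; enable_thinking is interpreted by the chat template).
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-0.6B")
out = llm.chat(
    [{"role": "user", "content": "Give a one-line summary of vLLM."}],
    SamplingParams(max_tokens=64),
    chat_template_kwargs={"enable_thinking": False},  # forwarded to the template
)
print(out[0].outputs[0].text)
```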
Hardware
- NVIDIA: cutlass support for blackwell fp8 blockwise gemm (#14383)
- TPU: Multi-LoRA implementation(#14238), default max-num-batched-tokens (#17508), V1 backend by default (#17673), top-logprobs (#17072)
- Neuron: NeuronxDistributedInference support (#15970), Speculative Decoding, Dynamic on-device sampling (#16357), Mistral Model (#18222), Multi-LoRA (#18284)
- AMD: Enable FP8 KV cache on V1 (#17870), Tuned fused moe config for Qwen3 MoE on MI300X (#17535, #17530), AITER biased group topk (#17955), Block-Scaled GEMM (#14968), MLA (#17523), Radeon GPU use Custom Paged Attention (#17004), reduce the number of environment variables in command line (#17229)
- Extensibility: Make PiecewiseBackend pluggable and extendable (#18076)
Documentation
- Update quickstart and install for cu128 using `--torch-backend=auto` (#18505)
- NVIDIA TensorRT Model Optimizer (#17561)
- Usage of Qwen3 thinking (#18291)
Developer Facing
- Benchmark: Add single turn MTBench to Serving Bench (#17202)
- Usability: Decrease import time of `vllm.multimodal` (#18031)
- Code Format: Code formatting using `ruff format` (#17656, #18068, #18400)
- Readability:
- Process: Propose a deprecation policy for the project (#17063)
- Testing: expanding torch nightly tests (#18004)
What's Changed
- Support loading transformers models with named parameters by @wuisawesome in #16868
- Add tuned triton fused_moe configs for Qwen3Moe by @mgoin in #17328
- [Benchmark] Add single turn MTBench to Serving Bench by @ekagra-ranjan in #17202
- [Optim] Compute multimodal hash only once per item by @DarkLight1337 in #17314
- implement Structural Tag with Guidance backend by @mmoskal in #17333
- [V1][Spec Decode] Make Eagle model arch config driven by @ekagra-ranjan in #17323
- [model] make llama4 compatible with pure dense layers by @luccafong in #17315
- [Bugfix] Fix `numel()` downcast in fused_layernorm_dynamic_per_token_quant.cu by @r-barnes in #17316
- Ignore `'<string>'` filepath by @zou3519 in #17330
- [Bugfix] Add contiguous call inside rope kernel wrapper by @timzsu in #17091
- [Misc] Add a Jinja template to support Mistral3 function calling by @chaunceyjiang in #17195
- [Model] support MiniMax-VL-01 model by @qscqesze in #16328
- [Misc] Move config fields to MultiModalConfig by @DarkLight1337 in #17343
- [Misc]Use a platform independent interface to obtain the device attributes by @ponix-j in #17100
- [Fix] Documentation spacing in compilation config help text by @Zerohertz in #17342
- [Build][Bugfix] Restrict setuptools version to <80 by @gshtras in #17320
- [Model] Ignore rotary embed load for Cohere model by @ekagra-ranjan in #17319
- Update docs requirements by @hmellor in #17379
- [Doc] Fix QWen3MOE info by @jeejeelee in #17381
- [Bugfix] Clean up MiniMax-VL and fix processing by @DarkLight1337 in #17354
- `pre-commit autoupdate` by @hmellor in #17380
- [Frontend] Support `chat_template_kwargs` in `LLM.chat` by @DarkLight1337 in #17356
- Transformers backend tweaks by @hmellor in #17365
- Fix: Spelling of inference by @a2q1p in #17387
- Improve literal dataclass field conversion to argparse argument by @hmellor in #17391
- [V1] Remove num_input_tokens from attn_metadata by @heheda12345 in #17193
- [Bugfix] add qwen3 reasoning-parser fix content is None when disable … by @mofanke in #17369
- fix gemma3 results all zero by @mayuyuace in #17364
- [Misc][ROCm] Exclude `cutlass_mla_decode` for ROCm build by @tywuAMD in #17289
- Enabling multi-group kernel tests. by @Alexei-V-Ivanov-AMD in https://github.com/vllm-p...