Releases: vllm-project/vllm
v0.11.0
Highlights
This release features 538 commits from 207 contributors (65 new contributors)!
- This release completes the removal of the V0 engine. All V0 engine code, including AsyncLLMEngine, LLMEngine, MQLLMEngine, the V0 attention backends, and related components, has been removed. V1 is now the only engine in the codebase.
- This release turns on FULL_AND_PIECEWISE as the default CUDA graph mode. This should provide better out-of-the-box performance for most models, particularly fine-grained MoEs, while preserving compatibility with models that only support PIECEWISE mode. A configuration sketch for overriding the mode follows this list.
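Below is a minimal sketch of overriding the default, e.g. forcing PIECEWISE for a model that misbehaves under the new mode. It assumes the `cudagraph_mode` field of `CompilationConfig` and that `compilation_config` accepts a plain dict; the model name is a placeholder.

```python
from vllm import LLM, SamplingParams

# Sketch only: pin the CUDA graph mode explicitly instead of relying on the
# new FULL_AND_PIECEWISE default (field name is an assumption, see lead-in).
llm = LLM(
    model="Qwen/Qwen3-0.6B",  # placeholder model
    compilation_config={"cudagraph_mode": "PIECEWISE"},
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=8))[0].outputs[0].text)
```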
Model Support
- New architectures: DeepSeek-V3.2-Exp (#25896), Qwen3-VL series (#24727), Qwen3-Next (#24526), OLMo3 (#24534), LongCat-Flash (#23991), Dots OCR (#24645), Ling2.0 (#24627), CWM (#25611).
- Encoders: RADIO encoder support (#24595), Transformers backend support for encoder-only models (#25174).
- Task expansion: BERT token classification/NER (#24872), multimodal models for pooling tasks (#24451).
- Data parallel for vision encoders: InternVL (#23909), Qwen2-VL (#25445), Qwen3-VL (#24955).
- Speculative decoding: EAGLE3 for MiniCPM3 (#24243) and GPT-OSS (#25246).
- Features: Qwen3-VL text-only mode (#26000), EVS video token pruning (#22980), Mamba2 TP+quantization (#24593), MRoPE + YaRN (#25384), Whisper on XPU (#25123), LongCat-Flash-Chat tool calling (#24083).
- Performance: GLM-4.1V 916ms TTFT reduction via fused RMSNorm (#24733), GLM-4 MoE SharedFusedMoE optimization (#24849), Qwen2.5-VL CUDA sync removal (#24741), Qwen3-VL Triton MRoPE kernel (#25055), FP8 checkpoints for Qwen3-Next (#25079).
- Reasoning: SeedOSS reasoning parser (#24263).
Engine Core
- KV cache offloading: CPU offloading with LRU management (#19848, #20075, #21448, #22595, #24251).
- V1 features: Prompt embeddings (#24278), sharded state loading (#25308), FlexAttention sliding window (#24089), LLM.apply_model (#18465).
- Hybrid allocator: Pipeline parallel (#23974), varying hidden sizes (#25101).
- Async scheduling: Uniprocessor executor support (#24219).
- Architecture: Tokenizer group removal (#24078), shared memory multimodal caching (#20452).
- Attention: Hybrid SSM/Attention in Triton (#21197), FlashAttention 3 for ViT (#24347).
- Performance: FlashInfer RoPE 2x speedup (#21126), fused Q/K RoPE 11% improvement (#24511, #25005), 8x spec decode overhead reduction (#24986), FlashInfer spec decode with 1.14x speedup (#25196), model info caching (#23558), inputs_embeds copy avoidance (#25739).
- LoRA: Optimized weight loading (#25403).
- Defaults: CUDA graph mode FULL_AND_PIECEWISE (#25444), Inductor standalone compile disabled (#25391).
- torch.compile: CUDA graph Inductor partition integration (#24281).
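For the `LLM.apply_model` hook listed under V1 features above (#18465), a minimal sketch, assuming the hook runs the callable against the loaded `torch.nn.Module` in each worker and returns the per-worker results; the model name is a placeholder.

```python
from vllm import LLM

llm = LLM(model="Qwen/Qwen3-0.6B")  # placeholder model

def count_params(model) -> int:
    # Runs inside the worker process with the actual nn.Module.
    return sum(p.numel() for p in model.parameters())

# Returns one result per worker (a single-element list for TP=1).
print(llm.apply_model(count_params))
```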
Hardware & Performance
- NVIDIA: FP8 FlashInfer MLA decode (#24705), BF16 fused MoE for Hopper/Blackwell expert parallel (#25503).
- DeepGEMM: Enabled by default (#24462), 5.5% throughput improvement (#24783).
- New architectures: RISC-V 64-bit (#22112), ARM non-x86 CPU (#25166), ARM 4-bit fused MoE (#23809).
- AMD: ROCm 7.0 (#25178), GLM-4.5 MI300X tuning (#25703).
- Intel XPU: MoE DP accuracy fix (#25465).
Large Scale Serving & Performance
- Dual-Batch Overlap (DBO): Overlapping computation mechanism (#23693), DeepEP high throughput + prefill (#24845).
- Data Parallelism: torchrun launcher (#24899), Ray placement groups (#25026), Triton DP/EP kernels (#24588).
- EPLB: Hunyuan V1 (#23078), Mixtral (#22842), static placement (#23745), reduced overhead (#24573).
- Disaggregated serving: KV transfer metrics (#22188), NIXL MLA latent dimension (#25902).
- MoE: Shared expert overlap optimization (#24254), SiLU kernel for DeepSeek-R1 (#24054), Enable Allgather/ReduceScatter backend for NaiveAllToAll (#23964).
- Distributed: NCCL symmetric memory with 3-4% throughput improvement (#24532), enabled by default for TP (#25070).
Quantization
- FP8: Per-token-group quantization (#24342), hardware-accelerated instructions (#24757), torch.compile KV cache (#22758), paged attention update (#22222).
- FP4: NVFP4 for dense models (#25609), Gemma3 (#22771), Llama 3.1 405B (#25135).
- W4A8: Faster preprocessing (#23972).
- Compressed tensors: Blocked FP8 for MoE (#25219).
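As a usage reminder for these FP8 paths, a hedged offline sketch combining on-the-fly FP8 weight quantization with an FP8 KV cache; the model name is a placeholder, and exact kernel coverage depends on the hardware backend.

```python
from vllm import LLM, SamplingParams

# Sketch only: request FP8 weight quantization and an FP8 KV cache via the
# long-standing engine arguments; the backend picks matching kernels.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    quantization="fp8",
    kv_cache_dtype="fp8",
)
out = llm.generate(["The capital of France is"], SamplingParams(max_tokens=4))
print(out[0].outputs[0].text)
```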
API & Frontend
- OpenAI: Prompt logprobs for all tokens (#24956), logprobs=-1 for full vocab (#25031), reasoning streaming events (#24938), Responses API MCP tools (#24628, #24985), health 503 on dead engine (#24897).
- Multimodal: Media UUID caching (#23950), image path format (#25081).
- Tool calling: XML parser for Qwen3-Coder (#25028), Hermes-style tokens (#25281).
- CLI: --enable-logging (#25610), improved --help (#24903).
- Config: Speculative model engine args (#25250), env validation (#24761), NVTX profiling (#25501), guided decoding backward compatibility (#25615, #25422).
- Metrics: V1 TPOT histogram (#24015), hidden deprecated gpu_ metrics (#24245), KV cache GiB units (#25204, #25479).
- UX: Removed misleading quantization warning (#25012).
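For the `/health` change noted above (#24897), a small liveness-probe sketch against a locally running `vllm serve` instance; the URL and port are assumptions.

```python
import requests

# After #24897 the OpenAI-compatible server answers /health with 503 once the
# engine has died, instead of appearing healthy; 200 means the engine is alive.
resp = requests.get("http://localhost:8000/health", timeout=5)
if resp.status_code == 200:
    print("engine healthy")
else:
    print(f"engine unhealthy (HTTP {resp.status_code}); restart the server")
```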
Security
Dependencies
- PyTorch 2.8 for CPU (#25652), FlashInfer 0.3.1 (#24470), CUDA 13 (#24599), ROCm 7.0 (#25178).
- Build requirements: C++17 now enforced globally (#24823).
- TPU: Deprecated `xm.mark_step` in favor of `torch_xla.sync` (#25254).
V0 Deprecation
- Engines: AsyncLLMEngine (#25025), LLMEngine (#25033), MQLLMEngine (#25019), core (#25321), model runner (#25328), MP executor (#25329).
- Components: Attention backends (#25351), encoder-decoder (#24907), output processor (#25320), sampling metadata (#25345), Sequence/Sampler (#25332).
- Interfaces: LoRA (#25686), async output processor (#25334), MultiModalPlaceholderMap (#25366), seq group methods (#25330), placeholder attention (#25510), input embeddings (#25242), multimodal registry (#25362), max_seq_len_to_capture (#25543), attention classes (#25541), hybrid models (#25400), backend suffixes (#25489), compilation fallbacks (#25675), default args (#25409).
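These removals do not change the supported V1 entry points. A minimal sketch of the offline path that replaces direct `LLMEngine`/`AsyncLLMEngine` construction (model name is a placeholder); online deployments should use the OpenAI-compatible server instead.

```python
from vllm import LLM, SamplingParams

# The high-level LLM class is backed by the V1 engine; code that previously
# built LLMEngine/AsyncLLMEngine by hand can typically move to this API.
llm = LLM(model="Qwen/Qwen3-0.6B")
for out in llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16)):
    print(out.outputs[0].text)
```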
What's Changed
- [Qwen3-Next] MoE configs for H20 TP=1,2,4,8 by @jeejeelee in #24707
- [DOCs] Update ROCm installation docs section by @gshtras in #24691
- Enable conversion of multimodal models to pooling tasks by @maxdebayser in #24451
- Fix implementation divergence for BLOOM models between vLLM and HuggingFace when using prompt embeds by @qthequartermasterman in #24686
- [Bugfix] Fix MRoPE dispatch on CPU by @bigPYJ1151 in #24712
- [BugFix] Fix Qwen3-Next PP by @njhill in #24709
- [CI] Fix flaky test v1/worker/test_gpu_model_runner.py::test_kv_cache_stride_order by @heheda12345 in #24640
- [CI] Add ci_envs for convenient local testing by @noooop in #24630
- [CI/Build] Skip prompt embeddings tests on V1-only CPU backend by @bigPYJ1151 in #24721
- [Misc][gpt-oss] Add gpt-oss label to PRs that mention harmony or related to builtin tool call by @heheda12345 in #24717
- [Bugfix] Fix BNB name match by @jeejeelee in #24735
- [Kernel] [CPU] refactor `cpu_attn.py:_run_sdpa_forward` for better memory access by @ignaciosica in #24701
- [sleep mode] save memory for on-the-fly quantization by @youkaichao in #24731
- [Multi Modal] Add FA3 in VIT by @wwl2755 in #24347
- [Multimodal] Remove legacy multimodal fields in favor of MultiModalFeatureSpec by @sfeng33 in #24548
- [Doc]: fix typos in various files by @didier-durand in #24726
- [Docs] Fix warnings in mkdocs build (continued) by @Zerohertz in #24740
- [Bugfix] Fix MRoPE dispatch on XPU by @yma11 in #24724
- [Qwen3-Next] MoE configs for H100 TP=1,2 and TP2/EP by @elvircrn in #24739
- [Core] Shared memory based object store for Multimodal data caching and IPC by @dongluw in #20452
- [Bugfix][Frontend] Fix `--enable-log-outputs` does not match the documentation by @kebe7jun in #24626
- [Models] Optimise and simplify `_validate_and_reshape_mm_tensor` by @lgeiger in #24742
- [Models] Prevent CUDA sync in Qwen2.5-VL by @lgeiger in #24741
- [Model] Switch to Fused RMSNorm in GLM-4.1V model by @SamitHuang in #24733
- [UX] Remove AsyncLLM torch profiler disabled log by @mgoin in #24609
- [CI] Speed up model unit tests in CI by @afeldman-nm in #24253
- [Bugfix] Fix incompatibility between #20452 and #24548 by @DarkLight1337 in #24754
- [CI] Trigger BC Linter when labels are added/removed by @zhewenl in #24767
- [Benchmark] Allow arbitrary headers to be passed to benchmarked endpoints by @smarterclayton in #23937
- [Compilation Bug] Fix Inductor Graph Output with Shape Issue by @yewentao256 in #24772
- Invert pattern order to make sure that out_proj layers are identified by @anmarques in #24781
- [Attention][Fl...
v0.10.2
Highlights
This release contains 740 commits from 266 contributors (97 new)!
Breaking Changes: This release includes the PyTorch 2.8.0 upgrade, V0 deprecations, and API changes; please review the changelog carefully.
aarch64 support: This release features native support for aarch64, allowing usage of vLLM on the GB200 platform. The Docker image `vllm/vllm-openai` should already be multiplatform. To install the wheels, you can download them from this release's artifacts or install via `uv pip install vllm==0.10.2 --extra-index-url https://wheels.vllm.ai/0.10.2/ --torch-backend=auto`.
Model Support
- New model families and enhancements: Apertus (#23068), LFM2 (#22845), MiDashengLM (#23652), Motif-1-Tiny (#23414), Seed-Oss (#23241), Google EmbeddingGemma-300m (#24318), GTE sequence classification (#23524), Donut OCR model (#23229), KeyeVL-1.5-8B (#23838), R-4B vision model (#23246), Ernie4.5 VL (#22514), MiniCPM-V 4.5 (#23586), Ovis2.5 (#23084), Qwen3-Next with hybrid attention (#24526), InternVL3.5 with video support (#23658), Qwen2Audio embeddings (#23625), NemotronH Nano VLM (#23644), BLOOM V1 engine support (#23488), and Whisper encoder-decoder for V1 (#21088).
- Pipeline parallelism expansion: Added PP support for Hunyuan (#24212), Ovis2.5 (#23405), GPT-OSS (#23680), and Kimi-VL-A3B-Thinking-2506 (#23114).
- Data parallelism for vision models: Enabled DP for ViT across Qwen2.5VL (#22742), MiniCPM-V (#23948, #23327), Kimi-VL (#23817), and GLM-4.5V (#23168).
- LoRA ecosystem expansion: Added LoRA support to Voxtral (#24517), Qwen-2.5-Omni (#24231), and DeepSeek models V2/V3/R1-0528 (#23971), with significantly faster LoRA startup performance (#23777).
- Classification and pooling enhancements: Multi-label classification support (#23173), logit bias and sigmoid normalization (#24031), and FP32 precision heads for pooling models (#23810).
- Performance optimizations: Removed unnecessary CUDA sync from GLM-4.1V (#24332) and Qwen2VL (#24334) preprocessing, eliminated redundant all-reduce in Qwen3 MoE (#23169), optimized InternVL CPU threading (#24519), and GLM4.5-V video frame decoding (#24161).
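For the expanded LoRA coverage above (#24517, #24231, #23971), a hedged offline sketch of attaching an adapter at request time; the base model and adapter path are placeholders.

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_lora=True)  # placeholder base model

# LoRARequest(adapter name, unique integer id, local path or HF repo of the adapter).
lora = LoRARequest("my-adapter", 1, "/path/to/lora_adapter")  # placeholder adapter

outputs = llm.generate(
    ["Summarize: vLLM is a fast and easy-to-use inference engine."],
    SamplingParams(max_tokens=32),
    lora_request=lora,
)
print(outputs[0].outputs[0].text)
```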
Engine Core
- V1 engine maturation: Extended V1 support to compute capability < 8.0 (#23614, #24022), added cross-attention KV cache for encoder-decoder models (#23664), request-level logits processor integration (#23656), and KV events from connectors (#19737).
- Backend expansion: Terratorch backend integration (#23513), enabling non-language model tasks like semantic segmentation and geospatial applications with `--model-impl terratorch` support.
- Hybrid and Mamba model improvements: Enabled full CUDA graphs by default for hybrid models (#22594), disabled prefix caching for hybrid/Mamba models (#23716), added FP32 SSM kernel support (#23506), full CUDA graph support for Mamba1 (#23035), and V1 as default for Mamba models (#23650).
- Performance core improvements: `--safetensors-load-strategy` for NFS-based file loading acceleration (#24469), critical CUDA graph capture throughput fix (#24128), scheduler optimization for single completions (#21917), multi-threaded model weight loading (#23928), and tensor core usage enforcement for FlashInfer decode (#23214).
- Multimodal enhancements: Multimodal cache tracking with mm_hash (#22711), UUID-based multimodal identifiers (#23394), improved V1 video embedding estimation (#24312), and simplified multimodal UUID handling (#24271).
- Sampling and structured outputs: Support for all prompt logprobs (#23868), final logprobs (#22387), grammar bitmask optimization (#23361), and user-configurable KV cache memory size (#21489).
- Distributed: Support Decode Context Parallel (DCP) for MLA (#23734)
Hardware & Performance
- NVIDIA Blackwell/SM100 generation: FP8 MLA support with CUTLASS backend (#23289), DeepGEMM Linear with 1.5% E2E throughput improvement (#23351), Hopper DeepGEMM E8M0 for DeepSeekV3.1 (#23666), SM100 FlashInfer CUTLASS MoE FP8 backend (#22357), MXFP4 fused CUTLASS MoE (#23696), default MXFP4 MoE on Blackwell (#23008), and GPT-OSS DP/EP support with 52,003 tokens/s throughput (#23608).
- Breaking change: FlashMLA disabled on Blackwell GPUs due to compatibility issues (#24521).
- Kernel and attention optimizations: FlashAttention MLA with CUDA graph support (#14258, #23958), V1 cross-attention support (#23297), FP8 support for FlashMLA (#22668), fused grouped TopK for MoE (#23274), Flash Linear Attention kernels (#24518), and W4A8 support on Hopper (#23198).
- Performance improvements: 13.7x speedup for token conversion (#20413), TTIT/TTFT improvements for disaggregated serving (#22760), symmetric memory all-reduce by default (#24111), FlashInfer warmup during startup (#23439), V1 model execution overlap (#23569), and various Triton configuration tuning (#23748, #23939).
- Platform expansion: Apple Silicon bfloat16 support for M2+ (#24129), IBM Z V1 engine support (#22725), Intel XPU torch.compile (#22609), XPU MoE data parallelism (#22887), XPU Triton attention (#24149), XPU FP8 quantization (#23148), and ROCm pipeline parallelism with Ray (#24275).
- Model-specific optimizations: Hardware-tuned MoE configurations for Qwen3-Next on B200/H200/H100 (#24698, #24688, #24699, #24695), GLM-4.5-Air-FP8 B200 configs (#23695), Kimi K2 optimization (#24597), and QWEN3 Coder/Thinking configs (#24266, #24330).
Quantization
- New quantization capabilities: Per-layer quantization routing (#23556), GGUF quantization with layer skipping (#23188), NFP4+FP8 MoE support (#22674), W4A8 channel scales (#23570), and AMD CDNA2/CDNA3 FP4 support (#22527).
- Advanced quantization infrastructure: Compressed tensors transforms for linear operations (#22486) enabling techniques like SpinQuantR1R2R4 and QuIP quantization methods.
- FlashInfer quantization integration: FP8 KV cache for TRTLLM prefill attention (#24197), FP8-qkv attention kernels (#23647), and FP8 per-tensor GEMMs (#22895).
- Platform-specific quantization: ROCm TorchAO quantization enablement (#24400) and TorchAO module swap configuration (#21982).
- Performance optimizations: MXFP4 MoE loading cache optimization (#24154) and compressed tensors version updates (#23202).
- Breaking change: Removed original Marlin quantization format (#23204).
API & Frontend
- OpenAI API enhancements: Gemma3n audio transcription/translation endpoints (#23735), transcription response usage statistics (#23576), and return_token_ids parameter (#22587).
- Response API improvements: Streaming support for non-harmony responses (#23741), non-streaming logprobs (#23319), MCP tool background mode (#23494), MCP streaming+background support (#23927), and tool output token reporting (#24285).
- Frontend optimizations: Error stack traces with --log-error-stack (#22960), collective RPC endpoint (#23075), beam search concurrency optimization (#23599), unnecessary detokenization skipping (#24236), and custom media UUIDs (#23449).
- Configuration enhancements: Formalized --mm-encoder-tp-mode flag (#23190), VLLM_DISABLE_PAD_FOR_CUDAGRAPH environment variable (#23595), EPLB configuration parameter (#20562), embedding endpoint chat request support (#23931), and LM Format Enforcer V1 integration (#22564).
Dependencies
- Major updates: PyTorch 2.8.0 upgrade (#20358) - breaking change requiring environment updates, FlashInfer v0.3.0 upgrade (#24086), and FlashInfer 0.2.14.post1 maintenance update (#23537).
- Supporting updates: XGrammar 0.1.23 (#22988), TPU core dump fix with tpu_info 0.4.0 (#23135), and compressed tensors version bump (#23202).
- Deployment improvements: FlashInfer cubin directory environment variable (#22675) for offline environments and pre-cached CUDA binaries.
V0 Deprecation
- Backend removals: V0 Neuron backend deprecation (#21159), V0 pooling model support removal (#23434), V0 FlashInfer attention backend removal (#22776), and V0 test cleanup (#23418, #23862).
- API breaking changes: prompt_token_ids fallback removal from LLM.generate and LLM.embed (#18800), LoRA extra vocab size deprecation warning (#23635), LoRA bias parameter deprecation (#24339), and metrics naming change from TPOT to ITL (#24110).
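For the `prompt_token_ids` fallback removal from `LLM.generate` and `LLM.embed` (#18800), a migration sketch, assuming the `TokensPrompt` input type exported from `vllm.inputs`; the model and token IDs are placeholders.

```python
from vllm import LLM, SamplingParams
from vllm.inputs import TokensPrompt

llm = LLM(model="Qwen/Qwen3-0.6B")  # placeholder model

# Before: llm.generate(prompt_token_ids=[...], ...)  -- removed in this release.
# After: wrap pre-tokenized input in a TokensPrompt.
token_ids = [151644, 872, 198]  # placeholder token ids
out = llm.generate(
    TokensPrompt(prompt_token_ids=token_ids),
    SamplingParams(max_tokens=8),
)
print(out[0].outputs[0].text)
```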
Breaking Changes
- PyTorch 2.8.0 upgrade - Environment dependency change requiring updated CUDA versions
- FlashMLA Blackwell restriction - FlashMLA disabled on Blackwell GPUs due to compatibility issues
- V0 feature removals - Neuron backend, pooling models, FlashInfer attention backend
- Quantizations - Removed quantized Mixtral hack implementation, and original Marlin format.
- Metrics renaming - TPOT deprecated in favor of ITL
What's Changed
- [Misc] Minor code cleanup for _get_prompt_logprobs_dict by @WoosukKwon in #23064
- [Misc] enhance static type hint by @andyxning in #23059
- [Bugfix] fix Qwen2.5-Omni processor output mapping by @DoubleVII in #23058
- [Bugfix][CI] Machete kernels: deterministic ordering for more cache hits by @andylolu2 in #23055
- [Misc] refactor function name by @andyxning in #23029
- [Misc] Fix backward compatibility from #23030 by @ywang96 in #23070
- [XPU] Fix compile size for xpu by @jikunshang in #23069
- [XPU][CI]add xpu env vars in CI scripts by @jikunshang in #22946
- [Refactor] Define MultiModalKwargsItems separate from MultiModalKwargs by @DarkLight1337 in #23053
- [Bugfix] fix IntermediateTensors equal method by @andyxning in https://github.com/vllm-project/...
v0.10.1.1
This is a critical bugfix and security release:
- Fix CUTLASS MLA Full CUDAGraph (#23200)
- Limit HTTP header count and size (#23267): GHSA-rxc4-3w6r-4v47
- Do not use `eval()` to convert unknown types (#23266): GHSA-79j6-g2m3-jgfw
Full Changelog: v0.10.1...v0.10.1.1
v0.10.1
Highlights
The v0.10.1 release includes 727 commits from 245 contributors (105 new)!
NOTE: This release deprecates V0 FlashAttention 3 (FA3) support; as a result, FP8 KV cache in V0 may have issues.
Model Support
- New model families: GPT-OSS with comprehensive tool calling and streaming support (#22327, #22330, #22332, #22335, #22339, #22340, #22342), Command-A-Vision (#22660), mBART (#22883), and SmolLM3 using Transformers backend (#22665).
- Vision-language models: Official Eagle multimodal support with Llama4 backend (#20788), Step3 vision-language models (#21998), Gemma3n multimodal (#20495), MiniCPM-V 4.0 (#22166), HyperCLOVAX-SEED-Vision-Instruct-3B (#20931), Emu3 with Transformers backend (#21319), Intern-S1 (#21628), and Prithvi in online serving mode (#21518).
- Enhanced existing models: NemotronH support (#22349), Ernie 4.5 Base 0.3B model name change (#21735), GLM-4.5 series improvements (#22215), Granite models with fused MoE configurations (#21332) and quantized checkpoint loading (#22925), Ultravox support for Llama 4 and Gemma 3 backends (#17818), Mamba1 and Jamba model support in V1 (without CUDA graphs) (#21249)
- Advanced model capabilities: Qwen3 EPLB (#20815) and dual-chunk attention support (#21924), Qwen native Eagle3 target support (#22333).
- Architecture expansions: Encoder-only models without KV-cache enabling BERT-style architectures (#21270), expanded tensor parallelism support in Transformers backend (#22651), tensor parallelism for Deepseek_vl2 vision transformer (#21494), and tensor/pipeline parallelism with Mamba2 kernel for PLaMo2 (#19674).
- V1 engine compatibility: Extended support for additional pooling models (#21747) and Step3VisionEncoder distributed processing option (#22697).
Engine Core
- CUDA graph performance: Full CUDA graph support with separate attention routines, adding FA2 and FlashInfer compatibility (#20059), plus 6% end-to-end throughput improvement from Cutlass MLA (#22763).
- Attention system advances: Multiple attention metadata builders per KV cache specification (#21588), tree attention backend for v1 engine (experimental) (#20401), FlexAttention encoder-only support (#22273), upgraded FlashAttention 3 with attention sink support (#22313), and multiple attention groups for KV sharing patterns (#22672).
- Speculative decoding optimizations: N-gram speculative decoding with single KMP token proposal algorithm (#22437), explicit EAGLE3 interface for enhanced compatibility (#22642).
- Default behavior improvements: Pooling models now default to chunked prefill and prefix caching (#20930), disabled chunked local attention by default for Llama4 for better performance (#21761).
- Extensibility and configuration: Model loader plugin system (#21067), custom operations support for FusedMoe (#22509), rate limiting with bucket algorithm for proxy server (#22643), torch.compile support for bailing MoE (#21664).
- Performance optimizations: Improved startup time by disabling C++ compilation of symbolic shapes (#20836), enhanced headless models for pooling in Transformers backend (#21767).
Hardware & Performance
- NVIDIA Blackwell (SM100) optimizations: CutlassMLA as default backend (#21626), FlashInfer MoE per-tensor scale FP8 backend (#21458), SM90 CUTLASS FP8 GEMM with kernel tuning and swap AB support (#20396).
- NVIDIA RTX 5090/RTX PRO 6000 (SM120) support: Block FP8 quantization (#22131) and CUTLASS NVFP4 4-bit weights/activations support (#21309).
- AMD ROCm platform enhancements: Flash Attention backend for Qwen-VL models (#22069), AITER HIP block quantization kernels (#21242), reduced device-to-host transfers (#22683), and optimized kernel performance for small batch sizes 1-4 (#21350).
- Attention and compute optimizations: FlashAttention 3 attention sinks performance boost (#22478), Triton-based multi-dimensional RoPE replacing PyTorch implementation (#22375), async tensor parallelism for scaled matrix multiplication (#20155), optimized FlashInfer metadata building (#21137).
- Memory and throughput improvements: Mamba2 reduced device-to-device copy overhead (#21075), fused Triton kernels for RMSNorm (#20839, #22184), improved multimodal hasher performance for repeated image prompts (#22825), multithreaded async multimodal loading (#22710).
- Parallelization and MoE optimizations: Guided decoding throughput improvements (#21862), balanced expert sharding for MoE models (#21497), expanded fused kernel support for topk softmax (#22211), fused MoE for nomic-embed-text-v2-moe (#18321).
- Hardware compatibility and kernels: ARM CPU build fixes for systems without BF16 support (#21848), Machete memory-bound performance improvements (#21556), FlashInfer TRT-LLM prefill attention kernel support (#22095), optimized `reshape_and_cache_flash` CUDA kernel (#22036), CPU transfer support in NixlConnector (#18293).
- Specialized CUDA kernels: GPT-OSS activation functions (#22538), RLHF weight loading acceleration (#21164).
Quantization
- Advanced quantization techniques: MXFP4 and bias support for Marlin kernel (#22428), NVFP4 GEMM FlashInfer backends (#22346), compressed-tensors mixed-precision model loading (#22468), FlashInfer MoE support for NVFP4 (#21639).
- Hardware-optimized quantization: Dynamic 4-bit quantization with Kleidiai kernels for CPU inference (#17112), TensorRT-LLM FP4 quantization optimized for MoE low-latency inference (#21331).
- Expanded model quantization support: BitsAndBytes quantization for InternS1 (#21953) and additional MoE models (#21370, #21548), Gemma3n quantization compatibility (#21974), calibration-free RTN quantization for MoE models (#20766), ModelOpt Qwen3 NVFP4 support (#20101).
- Performance and compatibility improvements: CUDA kernel optimization for Int8 per-token group quantization (#21476), non-contiguous tensor support in FP8 quantization (#21961), automatic detection of ModelOpt quantization formats (#22073).
- Breaking change: Removed AQLM quantization support (#22943) - users should migrate to alternative quantization methods.
API & Frontend
- OpenAI API compatibility: Unix domain socket support for local communication (#18097), improved error response format matching upstream specification (#22099), aligned tool_choice="required" behavior with OpenAI when tools list is empty (#21052).
- New API capabilities: Dedicated LLM.reward interface for reward models (#21720), chunked processing for long inputs in embedding models (#22280), AsyncLLM proper response handling for aborted requests (#22283).
- Configuration and environment: Multiple API keys support for enhanced authentication (#18548), custom vLLM tuned configuration paths (#22791), environment variable control for logging statistics (#22905), multimodal cache size (#22441), and DeepGEMM E8M0 scaling behavior (#21968).
- CLI and tooling improvements: V1 API support for run-batch command (#21541), custom process naming for better monitoring (#21445), improved help display showing available choices (#21760), optional memory profiling skip for multimodal models (#22950), enhanced logging of non-default arguments (#21680).
- Tool and parser support: HermesToolParser for models without special tokens (#16890), multi-turn conversation benchmarking tool (#20267).
- Distributed serving enhancements: Enhanced hybrid distributed serving with multiple API servers in load balancing mode (#21510), request_id support for external load balancers (#21009).
- User experience enhancements: Improved error messaging for multimodal items (#22114), per-request pooling control via PoolingParams (#20538).
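To illustrate the aligned `tool_choice="required"` behavior (#21052), a hedged client sketch against a vLLM OpenAI-compatible server started with tool calling enabled; the URL, served model name, and tool schema are placeholders.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",  # placeholder served model name
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    tool_choice="required",  # the model must answer with a tool call
)
print(resp.choices[0].message.tool_calls)
```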
Dependencies
- FlashInfer updates: Updated to v0.2.8 for improved performance (#21385), moved to optional dependency install with `pip install vllm[flashinfer]` for flexible installation (#21959).
- Mamba SSM restructuring: Updated to version 2.2.5 (#21421), removed from core requirements to reduce installation complexity (#22541).
- Docker and deployment: Docker-aware precompiled wheel support for easier containerized deployment (#21127, #22106).
- Python package updates: OpenAI Python dependency updated to latest version for API compatibility (#22316).
- Dependency optimizations: Removed xformers requirement for Mistral-format Pixtral and Mistral3 models (#21154), deprecation warnings added for old DeepGEMM version (#22194).
V0 Deprecation
Important: As part of the ongoing V0 engine cleanup, several breaking changes have been introduced:
- CLI flag updates: Replaced `--task` with `--runner` and `--convert` options (#21470), deprecated `--disable-log-requests` in favor of `--enable-log-requests` for clearer semantics (#21739), renamed `--expand-tools-even-if-tool-choice-none` to `--exclude-tools-when-tool-choice-none` for consistency (#20544).
- API cleanup: Removed previously deprecated arguments and methods as part of ongoing V0 engine codebase cleanup (#21907).
What's Changed
- Deduplicate Transformers backend code using inheritance by @hmellor in #21461
- [Bugfix][ROCm] Fix for warp_size uses on host by @gshtras in #21205
- [TPU][Bugfix] fix moe layer by @yaochengji in #21340
- [v1][Core] Clean up usages of `SpecializedManager` by @zhouwfang in #21407
- [Misc] Fix duplicate FusedMoEConfig debug messages by @njhill in #21455
- [Core] Support model loader plugins by @22quinn in #21067
- remove GLM-4 quantization wrong Code by @zRzRzRzRzRzRzR in #21435
- Replace `--expand-tools-even-if-tool-choice-none` with `--exclude-tools-when-tool-choice-none` ...
v0.10.1rc1
What's Changed
- Deduplicate Transformers backend code using inheritance by @hmellor in #21461
- [Bugfix][ROCm] Fix for warp_size uses on host by @gshtras in #21205
- [TPU][Bugfix] fix moe layer by @yaochengji in #21340
- [v1][Core] Clean up usages of `SpecializedManager` by @zhouwfang in #21407
- [Misc] Fix duplicate FusedMoEConfig debug messages by @njhill in #21455
- [Core] Support model loader plugins by @22quinn in #21067
- remove GLM-4 quantization wrong Code by @zRzRzRzRzRzRzR in #21435
- Replace `--expand-tools-even-if-tool-choice-none` with `--exclude-tools-when-tool-choice-none` for v0.10.0 by @okdshin in #20544
- [Misc] Improve comment for DPEngineCoreActor._set_cuda_visible_devices() by @ruisearch42 in #21501
- [Feat] Allow custom naming of vLLM processes by @chaunceyjiang in #21445
- bump `flashinfer` to `v0.2.8` by @cjackal in #21385
- [Attention] Optimize FlashInfer MetadataBuilder Build call by @LucasWilkinson in #21137
- [Model] Officially support Emu3 with Transformers backend by @hmellor in #21319
- [Bugfix] Fix CUDA arch flags for MoE permute by @minosfuture in #21426
- [Fix] Update mamba_ssm to 2.2.5 by @elvischenv in #21421
- [Docs] Update Tensorizer usage documentation by @sangstar in #21190
- [Docs] Rewrite Distributed Inference and Serving guide by @crypdick in #20593
- [Bug] Fix Compressed Tensor NVFP4 `cutlass_fp4_group_mm` illegal memory access by @yewentao256 in #21465
- Update flashinfer CUTLASS MoE Kernel by @wenscarl in #21408
- [XPU] Conditionally import CUDA-specific passes to avoid import errors on xpu platform by @chaojun-zhang in #21036
- [P/D] Move FakeNixlWrapper to test dir by @ruisearch42 in #21328
- [P/D] Support CPU Transfer in NixlConnector by @juncgu in #18293
- [Docs][minor] Fix broken gh-file link in distributed serving docs by @crypdick in #21543
- [Docs] Add Expert Parallelism Initial Documentation by @simon-mo in #21373
- update flashinfer to v0.2.9rc1 by @weireweire in #21485
- [TPU][TEST] HF_HUB_DISABLE_XET=1 the test 3. by @QiliangCui in #21539
- [MoE] More balanced expert sharding by @WoosukKwon in #21497
- [Frontend] `run-batch` supports V1 by @DarkLight1337 in #21541
- [Docs] Fix `site_url` for RunLLM by @hmellor in #21564
- [Bug] Fix DeepGemm Init Error by @yewentao256 in #21554
- Fix GLM-4 PP Missing Layer When using with PP. by @zRzRzRzRzRzRzR in #21531
- [Kernel] adding fused_moe configs for upcoming granite4 by @bringlein in #21332
- [Bugfix] DeepGemm utils : Fix hardcoded type-cast by @varun-sundar-rabindranath in #21517
- [DP] Support api-server-count > 0 in hybrid DP LB mode by @njhill in #21510
- [TPU][Test] Temporarily suspend this MoE model in test_basic.py. by @QiliangCui in #21560
- [Docs] Add `requirements/common.txt` to run unit tests by @zhouwfang in #21572
- Integrate TensorSchema with shape validation for Phi3VImagePixelInputs by @bbeckca in #21232
- [CI] Update CODEOWNERS for CPU and Intel GPU by @bigPYJ1151 in #21582
- [Bugfix] fix modelscope snapshot_download serialization by @andyxning in #21536
- [Model] Support tensor parallel for timm ViT in Deepseek_vl2 by @wzqd in #21494
- [Model] Fix a check for None but the return value was empty list in Gemma3 MM vision_embeddings by @hfan in #21479
- [Misc][Tools] make max-model-len a parameter in auto_tune script by @yaochengji in #21321
- [CI/Build] fix cpu_extension for apple silicon by @ignaciosica in #21195
- [Misc] Removed undefined cmake variables MOE_PERMUTE_ARCHS by @chenyang78 in #21262
- [TPU][Bugfix] fix OOM issue in CI test by @yaochengji in #21550
- [Tests] Harden DP tests by @njhill in #21508
- Add H20-3e fused MoE kernel tuning configs for Qwen3-Coder-480B-A35B-Instruct by @Xu-Wenqing in #21598
- [Bugfix] GGUF: fix AttributeError: 'PosixPath' object has no attribute 'startswith' by @kebe7jun in #21579
- [Quantization] Enable BNB support for more MoE models by @jeejeelee in #21370
- [V1] Get supported tasks from model runner instead of model config by @DarkLight1337 in #21585
- [Bugfix][Logprobs] Fix logprobs op to support more backend by @MengqingCao in #21591
- [Model] Fix Ernie4.5MoE e_score_correction_bias parameter by @xyxinyang in #21586
- [MODEL] New model support for naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B by @bigshanedogg in #20931
- [Frontend] Add request_id to the Request object so they can be controlled better via external load balancers by @kouroshHakha in #21009
- [Model] Replace Mamba2 RMSNorm Gated with Fused Triton Kernel by @cyang49 in #20839
- [ROCm][AITER] Enable fp8 kv cache on rocm aiter backend. by @fsx950223 in #20295
- [Kernel] Improve machete memory bound perf by @czhu-cohere in #21556
- Add support for Prithvi in Online serving mode by @mgazz in #21518
- [CI] Unifying Dockerfiles for ARM and X86 Builds by @kebe7jun in #21343
- [Docs] add auto-round quantization readme by @wenhuach21 in #21600
- [TPU][Test] Rollback PR-21550. by @QiliangCui in #21619
- Add Unsloth to RLHF.md by @danielhanchen in #21636
- [Perf] Cuda Kernel for Int8 Per Token Group Quant by @yewentao256 in #21476
- Add interleaved RoPE test for Llama4 (Maverick) by @sarckk in #21478
- [Bugfix] Fix sync_and_slice_intermediate_tensors by @ruisearch42 in #21537
- [Bugfix] Always set RAY_ADDRESS for Ray actor before spawn by @ruisearch42 in #21540
- [TPU] Update ptxla nightly version to 20250724 by @yaochengji in #21555
- [Feature] Add support for MoE models in the calibration-free RTN-based quantization by @sakogan in #20766
- [Model] Ultravox: Support Llama 4 and Gemma 3 backends by @farzadab in #17818
- [Docs] add offline serving multi-modal video input expamle Qwen2.5-VL by @david6666666 in #21530
- Correctly kill vLLM processes after finishing serving benchmarks by @huydhn in #21641
- [Bugfix] Fix isinstance check for tensor types in _load_prompt_embeds to use dtype comparison by @Mitix-EPI in #21612
- [TPU][Test] Divide TPU v1 Test into 2 parts. by @QiliangCui in #21431
- Support Intern-S1 by @lvhan028 in #21628
- [Misc] remove unused try-except in pooling config check by @reidliu41 in #21618
- [Take 2] Correctly kill vLLM processes after benchmarks by @huydhn in #21646
- Migrate AriaImagePixelInputs to TensorSchema for shape validation by @bbeckca in #21620
- Migrate AyaVisionImagePixelInputs to TensorSchema for shape validation by @bbeckca in #21622
- [Bugfix] Investigate Qwen2-VL failing test by @Isotr0py in #21527
- Support encoder-only models without KV-Cache by @maxdebayser in #21270
- [Bug] Fix `has_flashinfer_moe` Import Error when it is not installed by @yewentao256 in #21634
- [Misc] Improve memory profiling debug message by @yeqcharlotte in #21429
- [BugF...
v0.10.0
Highlights
v0.10.0 release includes 308 commits, 168 contributors (62 new!).
NOTE: This release begins the cleanup of the V0 engine codebase. We have removed the V0 CPU/XPU/TPU/HPU backends (#20412), long context LoRA (#21169), Prompt Adapters (#20588), Phi3-Small & BlockSparse Attention (#21217), and Spec Decode workers (#21152) so far, and plan to continue deleting code that is no longer used.
Model Support
- New families: Llama 4 with EAGLE support (#20591), EXAONE 4.0 (#21060), Microsoft Phi-4-mini-flash-reasoning (#20702), Hunyuan V1 Dense + A13B with reasoning/tool parsing (#21368, #20625, #20820), Ling MoE models (#20680), JinaVL Reranker (#20260), Nemotron-Nano-VL-8B-V1 (#20349), Arcee (#21296), Voxtral (#20970).
- Enhanced compatibility: BERT/RoBERTa with AutoWeightsLoader (#20534), HF format support for MiniMax (#20211), Gemini configuration (#20971), GLM-4 updates (#20736).
- Architecture expansions: Attention-free model support (#20811), Hybrid SSM/Attention models on V1 (#20016), LlamaForSequenceClassification (#20807), expanded Mamba2 layer support (#20660).
- VLM improvements: VLM support with transformers backend (#20543), PrithviMAE on V1 engine (#20577).
Engine Core
- Experimental async scheduling: `--async-scheduling` flag to overlap engine core scheduling with GPU runner (#19970).
- V1 engine improvements: backend-agnostic local attention (#21093), MLA FlashInfer ragged prefill (#20034), hybrid KV cache with local chunked attention (#19351).
- Multi-task support: models can now support multiple tasks (#20771), multiple poolers (#21227), and dynamic pooling parameter configuration (#21128).
- RLHF Support: new RPC methods for runtime weight reloading (#20096) and config updates (#20095), logprobs mode for selecting which stage of logprobs to return (#21398).
- Enhanced caching: multi-modal caching for transformers backend (#21358), reproducible prefix cache hashing using SHA-256 + CBOR (#20511).
- Startup time reduction via CUDA graph capture speedup via frozen GC (#21146).
- Elastic expert parallel for dynamic GPU scaling while preserving state (#20775).
Hardware & Performance
- NVIDIA Blackwell/SM100 optimizations: CUTLASS block scaled group GEMM for smaller batches (#20640), FP8 groupGEMM support (#20447), DeepGEMM integration (#20087), FlashInfer MoE blockscale FP8 backend (#20645), CUDNN prefill API for MLA (#20411), Triton Fused MoE kernel config for FP8 E=16 on B200 (#20516).
- Performance improvements: 48% request duration reduction via microbatch tokenization for concurrent requests (#19334), fused MLA QKV + strided layernorm (#21116), Triton causal-conv1d for Mamba models (#18218).
- Hardware expansion: ARM CPU int8 quantization (#14129), PPC64LE/ARM V1 support (#20554), Intel XPU ray distributed execution (#20659), shared-memory pipeline parallel for CPU (#21289), FlashInfer ARM CUDA support (#21013).
Quantization
- New quantization support: MXFP4 for MoE models (#17888), BNB support for Mixtral and additional MoE models (#20893, #21100), in-flight quantization for MoE (#20061).
- Hardware-specific: FP8 KV cache quantization on TPU (#19292), FP8 support for BatchedTritonExperts (#18864), optimized INT8 vectorization kernels (#20331).
- Performance optimizations: Triton backend for DeepGEMM per-token group quantization (#20841), CUDA kernel for per-token group quantization (#21083), CustomOp abstraction for FP8 (#19830).
API & Frontend
- OpenAI compatibility: Responses API implementation (#20504, #20975), image object support in llm.chat (#19635), tool calling with required choice and $defs (#20629).
- New endpoints: `get_tokenizer_info` for tokenizer/chat-template information (#20575), `cache_salt` support for completions/responses (#20981).
- Model loading: Tensorizer S3 integration with arbitrary arguments (#19619), HF repo paths & URLs for GGUF models (#20793), `tokenization_kwargs` for embedding truncation (#21033).
- CLI improvements: `--help=page` option for enhanced help documentation (#20961), default model changed to Qwen3-0.6B (#20335).
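For the `cache_salt` support on completions/responses (#20981), a hedged client sketch that scopes prefix-cache reuse by passing the field through `extra_body`; the URL, served model name, and salt value are placeholders.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# cache_salt partitions prefix-cache reuse so requests with different salts do
# not share cached KV blocks even for identical prompt prefixes (see #20981).
resp = client.completions.create(
    model="Qwen/Qwen3-0.6B",                 # placeholder served model name
    prompt="You are a helpful assistant. Summarize the following report:",
    max_tokens=32,
    extra_body={"cache_salt": "tenant-42"},  # placeholder salt value
)
print(resp.choices[0].text)
```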
Dependencies
What's Changed
- [Docs] Note that alternative structured output backends are supported by @russellb in #19426
- [ROCm][V1] Adding ROCm to the list of plaforms using V1 by default by @gshtras in #19440
- [Model] use AutoWeightsLoader for commandr by @py-andy-c in #19399
- Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B-FP8 by @Xu-Wenqing in #19401
- [BugFix] Allow use_cudagraph to work with dynamic VLLM_USE_V1 by @zou3519 in #19390
- [New Model]: Support Qwen3 Embedding & Reranker by @noooop in #19260
- [BugFix] Fix docker build cpu-dev image error by @2niuhe in #19394
- Fix test_max_model_len in tests/entrypoints/llm/test_generate.py by @houseroad in #19451
- [CI] Disable failing GGUF model test by @mgoin in #19454
- [Misc] Remove unused `MultiModalHasher.hash_prompt_mm_data` by @lgeiger in #19422
- Add fused MOE config for Qwen3 30B A3B on B200 by @0xjunhao in #19455
- Fix Typo in Documentation and Function Name by @leopardracer in #19442
- [ROCm] Add rules to automatically label ROCm related PRs by @houseroad in #19405
- [Kernel] Support deep_gemm for linear methods by @artetaout in #19085
- [Doc] Update V1 User Guide for Hardware and Models by @DarkLight1337 in #19474
- [Doc] Fix quantization link titles by @DarkLight1337 in #19478
- [Doc] Support "important" and "announcement" admonitions by @DarkLight1337 in #19479
- [Misc] Reduce warning message introduced in env_override by @houseroad in #19476
- Support non-string values in JSON keys from CLI by @DarkLight1337 in #19471
- Add cache to cuda get_device_capability by @mgoin in #19436
- Fix some typo by @Ximingwang-09 in #19475
- Support no privileged mode on CPU for docker and kubernetes deployments by @louie-tsai in #19241
- [Bugfix] Update the example code, make it work with the latest lmcache by @runzhen in #19453
- [CI] Update FlashInfer to 0.2.6.post1 by @mgoin in #19297
- [doc] fix "Other AI accelerators" getting started page by @davidxia in #19457
- [Misc] Fix misleading ROCm warning by @jeejeelee in #19486
- [Docs] Remove WIP features in V1 guide by @WoosukKwon in #19498
- [Kernels] Add activation chunking logic to FusedMoEModularKernel by @bnellnm in #19168
- [AMD] [Quantization] Add override flag for attention dtype instead of using kv_cache_dtype trigger by @rasmith in #17331
- [UX] Add Feedback During CUDAGraph Capture by @robertgshaw2-redhat in #19501
- [CI/Build] Fix torch nightly CI dependencies by @zou3519 in #19505
- [CI] change spell checker from codespell to typos by @andyxning in #18711
- [BugFix] Force registration of w8a8_block_fp8_matmul_deepgemm via lazy import by @varun-sundar-rabindranath in #19514
- Add Triton Fused MoE kernel config for E=16 on B200 by @b8zhong in #19518
- [Frontend] Improve error message in tool_choice validation by @22quinn in #19239
- [BugFix] Work-around incremental detokenization edge case error by @njhill in #19449
- [BugFix] Handle missing sep_token for Qwen3-Reranker in Score API by @strutive07 in #19522
- [AMD][Kernel][BugFix] fix test_rocm_compressed_tensors_w8a8 for rocm by @rasmith in #19509
- Fix typo by @2niuhe in #19525
- [Security] Prevent new imports of (cloud)pickle by @russellb in #18018
- [Bugfix][V1] Allow manual FlashAttention for Blackwell by @mgoin in #19492
- [Bugfix] Respect num-gpu-blocks-override in v1 by @jmswen in #19503
- [Quantization] Improve AWQ logic by @jeejeelee in #19431
- [Doc] Add V1 column to supported models list by @DarkLight1337 in #19523
- [NixlConnector] Drop `num_blocks` check by @NickLucche in #19532
- [Perf] Vectorize static / dynamic INT8 quant kernels by @yewentao256 in #19233
- Fix TorchAOConfig skip layers by @mobicham in #19265
- [torch.compile][ROCm] Fuse quantization onto attention using a torch.compile pass by @ProExpertProg in https://github.com/vllm-proj...
v0.10.0rc2
What's Changed
- [Model] use AutoWeightsLoader for bart by @calvin0327 in #18299
- [Model] Support VLMs with transformers backend by @zucchini-nlp in #20543
- [bugfix] fix syntax warning caused by backslash by @1195343015 in #21251
- [CI] Cleanup modelscope version constraint in Dockerfile by @yankay in #21243
- [Docs] Add RFC Meeting to Issue Template by @simon-mo in #21279
- Add the instruction to run e2e validation manually before release by @huydhn in #21023
- [Bugfix] Fix missing placeholder in logger debug by @DarkLight1337 in #21280
- [Model][1/N] Support multiple poolers at model level by @DarkLight1337 in #21227
- [Docs] Fix hardcoded links in docs by @hmellor in #21287
- [Docs] Make tables more space efficient in `supported_models.md` by @hmellor in #21291
- [Misc] unify variable for LLM instance by @andyxning in #20996
- Add Nvidia ModelOpt config adaptation by @Edwardf0t1 in #19815
- [Misc] Add sliding window to flashinfer test by @WoosukKwon in #21282
- [CPU] Enable shared-memory based pipeline parallel for CPU backend by @bigPYJ1151 in #21289
- [BugFix] make utils.current_stream thread-safety (#21252) by @simpx in #21253
- [Misc] Add dummy maverick test by @minosfuture in #21199
- [Attention] Clean up iRoPE in V1 by @LucasWilkinson in #21188
- [DP] Fix Prometheus Logging by @robertgshaw2-redhat in #21257
- Fix bad lm-eval fork by @mgoin in #21318
- [perf] Speed up align sum kernels by @hj-mistral in #21079
- [v1][sampler] Inplace logprobs comparison to get the token rank by @houseroad in #21283
- [XPU] Enable external_launcher to serve as an executor via torchrun by @chaojun-zhang in #21021
- [Doc] Fix CPU doc format by @bigPYJ1151 in #21316
- [Intel GPU] Ray Compiled Graph avoid NCCL for Intel GPU by @ratnampa in #21338
- Revert "[Performance] Performance improvements in non-blockwise fp8 CUTLASS MoE (#20762) by @minosfuture in #21334
- [Core] Minimize number of dict lookup in _maybe_evict_cached_block by @Jialin in #21281
- [V1] [Hybrid] Add new test to verify that hybrid views into KVCacheTensor are compatible by @tdoublep in #21300
- [Refactor] Fix Compile Warning #1444-D by @yewentao256 in #21208
- Fix kv_cache_dtype handling for out-of-tree HPU plugin by @kzawora-intel in #21302
- [Misc] DeepEPHighThroughtput - Enable Inductor pass by @varun-sundar-rabindranath in #21311
- [Bug] DeepGemm: Fix Cuda Init Error by @yewentao256 in #21312
- Update fp4 quantize API by @wenscarl in #21327
- [Feature][eplb] add verify ep or tp or dp by @lengrongfu in #21102
- Add arcee model by @alyosha-swamy in #21296
- [Bugfix] Fix eviction cached blocked logic by @simon-mo in #21357
- [Misc] Remove deprecated args in v0.10 by @kebe7jun in #21349
- [Core] Optimize update checks in LogitsProcessor by @Jialin in #21245
- [benchmark] Port benchmark request sent optimization to benchmark_serving by @Jialin in #21209
- [Core] Introduce popleft_n and append_n in FreeKVCacheBlockQueue to further optimize block_pool by @Jialin in #21222
- [Misc] unify variable for LLM instance v2 by @andyxning in #21356
- [perf] Add fused MLA QKV + strided layernorm by @mickaelseznec in #21116
- [feat]: add SM100 support for cutlass FP8 groupGEMM by @djmmoss in #20447
- [Perf] Cuda Kernel for Per Token Group Quant by @yewentao256 in #21083
- Adds parallel model weight loading for runai_streamer by @bbartels in #21330
- [feat] Enable mm caching for transformers backend by @zucchini-nlp in #21358
- Revert "[Refactor] Fix Compile Warning #1444-D (#21208)" by @yewentao256 in #21384
- Add tokenization_kwargs to encode for embedding model truncation by @Receiling in #21033
- [Bugfix] Decode Tokenized IDs to Strings for `hf_processor` in `llm.chat()` with `model_impl=transformers` by @ariG23498 in #21353
- [CI/Build] Fix test failure due to updated model repo by @DarkLight1337 in #21375
- Fix Flashinfer Allreduce+Norm enable disable calculation based on `fi_allreduce_fusion_max_token_num` by @xinli-git in #21325
- [Misc] Copy HF_TOKEN env var to Ray workers by @ruisearch42 in #21406
- [BugFix] Fix ray import error mem cleanup bug by @joerunde in #21381
- [CI/Build] Fix model executor tests by @DarkLight1337 in #21387
- [Bugfix][ROCm][Build] Fix build regression on ROCm by @gshtras in #21393
- Simplify weight loading in Transformers backend by @hmellor in #21382
- [BugFix] Update python to python3 calls for image; fix prefix & input calculations. by @ericehanley in #21391
- [BUGFIX] deepseek-v2-lite failed due to fused_qkv_a_proj name update by @xuechendi in #21414
- [Bugfix][CUDA] fixes CUDA FP8 kv cache dtype supported by @elvischenv in #21420
- Changing "amdproduction" allocation. by @Alexei-V-Ivanov-AMD in #21409
- [Bugfix] Fix nightly transformers CI failure by @Isotr0py in #21427
- [Core] Add basic unit test for maybe_evict_cached_block by @Jialin in #21400
- [Cleanup] Only log MoE DP setup warning if DP is enabled by @mgoin in #21315
- add clear messages for deprecated models by @youkaichao in #21424
- [Bugfix] ensure tool_choice is popped when `tool_choice:null` is passed in json payload by @gcalmettes in #19679
- Fixed typo in profiling logs by @sergiopaniego in #21441
- [Docs] Fix bullets and grammars in tool_calling.md by @windsonsea in #21440
- [Sampler] Introduce logprobs mode for logging by @houseroad in #21398
- Mamba V2 Test not Asserting Failures. by @fabianlim in #21379
- [Misc] fixed nvfp4_moe test failures due to invalid kwargs by @chenyang78 in #21246
- [Docs] Clean up v1/metrics.md by @windsonsea in #21449
- [Model] add Hunyuan V1 Dense Model support. by @kzjeef in #21368
- [V1] Check all pooling tasks during profiling by @DarkLight1337 in #21299
- [Bugfix][Qwen][DCA] fixes bug in dual-chunk-flash-attn backend for qwen 1m models. by @sighingnow in #21364
- [Tests] Add tests for headless internal DP LB by @njhill in #21450
- [Core][Model] PrithviMAE Enablement on vLLM v1 engine by @christian-pinto in #20577
- Add test case for compiling multiple graphs by @sarckk in #21044
- [TPU][TEST] Fix the downloading issue in TPU v1 test 11. by @QiliangCui in #21418
- [Core] Add `reload_weights` RPC method by @22quinn in #20096
- [V1] Fix local chunked attention always disabled by @sarckk in #21419
- [V0 Deprecation] Remove Prompt Adapters by @mgoin in #20588
- [Core] Freeze gc during cuda graph capture to speed up init by @mgoin in #21146
- feat(gguf_loader): accept HF repo paths & URLs for GGUF by @hardikkgupta in #20793
- [Frontend] Set MAX_AUDIO_CLI...
v0.10.0rc1
What's Changed
- [Kernel] Enable fp8 support for pplx and BatchedTritonExperts. by @bnellnm in #18864
- [Misc] Fix `Unable to detect current VLLM config. Defaulting to NHD kv cache layout` warning by @NickLucche in #20400
- [Bugfix] Register reducer even if transformers_modules not available by @eicherseiji in #19510
- Change warn_for_unimplemented_methods to debug by @mgoin in #20455
- [Platform] Add custom default max tokens by @gmarinho2 in #18557
- Add ignore consolidated file in mistral example code by @princepride in #20420
- [Misc] small update by @reidliu41 in #20462
- [Structured Outputs][V1] Skipping with models doesn't contain tokenizers by @aarnphm in #20365
- [Perf] Optimize Vectorization Utils for Int 8 Quantization Kernels by @yewentao256 in #20331
- [Misc] Add SPDX-FileCopyrightText by @jeejeelee in #20428
- Support Llama 4 for fused_marlin_moe by @mgoin in #20457
- [Bug][Frontend] Fix structure of transcription's decoder_prompt by @sangbumlikeagod in #18809
- [Model][3/N] Automatic conversion of CrossEncoding model by @noooop in #20168
- [Doc] Fix classification table in list of supported models by @DarkLight1337 in #20489
- [CI] add kvcache-connector dependency definition and add into CI build by @panpan0000 in #18193
- [Misc] Small: Remove global media connector. Each test should have its own test connector object. by @huachenheli in #20395
- Enable V1 for Hybrid SSM/Attention Models by @tdoublep in #20016
- [feat]: CUTLASS block scaled group gemm for SM100 by @djmmoss in #19757
- [CI Bugfix] Fix pre-commit failures on main by @mgoin in #20502
- [Doc] fix mutltimodal_inputs.md gh examples link by @GuyStone in #20497
- [Misc] Add security warning for development mode endpoints by @reidliu41 in #20508
- [doc] small fix by @reidliu41 in #20506
- [Misc] Remove the unused LoRA test code by @jeejeelee in #20494
- Fix unknown attribute of topk_indices_dtype in CompressedTensorsW8A8Fp8MoECutlassMethod by @luccafong in #20507
- [v1] Re-add fp32 support to v1 engine through FlexAttention by @Isotr0py in #19754
- [Misc] Add logger.exception for TPU information collection failures by @reidliu41 in #20510
- [Misc] remove unused import by @reidliu41 in #20517
- test_attention compat with coming xformers change by @bottler in #20487
- [BUG] Fix #20484. Support empty sequence in cuda penalty kernel by @vadiklyutiy in #20491
- [Bugfix] Fix missing per_act_token parameter in compressed_tensors_moe by @luccafong in #20509
- [BugFix] Fix: ImportError when building on hopper systems by @LucasWilkinson in #20513
- [TPU][Bugfix] fix the MoE OOM issue by @yaochengji in #20339
- [Frontend] Support image object in llm.chat by @sfeng33 in #19635
- [Benchmark] Add support for multiple batch size benchmark through CLI in `benchmark_moe.py` + Add Triton Fused MoE kernel config for FP8 E=16 on B200 by @b8zhong in #20516
- [Misc] call the pre-defined func by @reidliu41 in #20518
- [V0 deprecation] Remove V0 CPU/XPU/TPU backends by @WoosukKwon in #20412
- [V1] Support any head size for FlexAttention backend by @DarkLight1337 in #20467
- [BugFix][Spec Decode] Fix spec token ids in model runner by @WoosukKwon in #20530
- [Bugfix] Add `use_cross_encoder` flag to use correct activation in `ClassifierPooler` by @DarkLight1337 in #20527
- Implement OpenAI Responses API [1/N] by @WoosukKwon in #20504
- [Misc] add a tip for pre-commit by @reidliu41 in #20536
- [Refactor]Abstract Platform Interface for Distributed Backend and Add xccl Support for Intel XPU by @dbyoung18 in #19410
- [CI/Build] Enable phi2 lora test by @jeejeelee in #20540
- [XPU][CI] add v1/core test in xpu hardware ci by @Liangliang-Ma in #20537
- Add docstrings to url_schemes.py to improve readability by @windsonsea in #20545
- [XPU] log clean up for XPU platform by @yma11 in #20553
- [Docs] Clean up tables in supported_models.md by @windsonsea in #20552
- [Misc] remove unused jinaai_serving_reranking by @Abirdcfly in #18878
- [Misc] Set the minimum openai version by @jeejeelee in #20539
- [Doc] Remove extra whitespace from CI failures doc by @hmellor in #20565
- [Doc] Use `gh-pr` and `gh-issue` everywhere we can in the docs by @hmellor in #20564
- [Doc] Fix internal links so they don't always point to latest by @hmellor in #20563
- [Doc] Add outline for content tabs by @hmellor in #20571
- [Doc] Fix some MkDocs snippets used in the installation docs by @hmellor in #20572
- [Model][Last/4] Automatic conversion of CrossEncoding model by @noooop in #19675
- [Bugfix] Prevent IndexError for cached requests when pipeline parallelism is disabled by @panpan0000 in #20486
- [Feature] microbatch tokenization by @ztang2370 in #19334
- [DP] Copy environment variables to Ray DPEngineCoreActors by @ruisearch42 in #20344
- [Kernel] Optimize Prefill Attention in Unified Triton Attention Kernel by @jvlunteren in #20308
- [Misc] Add fully interleaved support for multimodal 'string' content format by @Dekakhrone in #14047
- [Misc] feat output content in stream response by @lengrongfu in #19608
- Fix links in multi-modal model contributing page by @hmellor in #18615
- [Config] Refactor mistral configs by @patrickvonplaten in #20570
- [Misc] Improve logging for dynamic shape cache compilation by @kyolebu in #20573
- [Bugfix] Fix Maverick correctness by filling zero to cache space in cutlass_moe by @minosfuture in #20167
- [Optimize] Don't send token ids when kv connector is not used by @WoosukKwon in #20586
- Make distinct `code` and `console` admonitions so readers are less likely to miss them by @hmellor in #20585
- [Bugfix]: Fix messy code when using logprobs by @chaunceyjiang in #19209
- [Doc] Syntax highlight request responses as JSON instead of bash by @hmellor in #20582
- [Docs] Rewrite offline inference guide by @crypdick in #20594
- [Docs] Improve docstring for ray data llm example by @crypdick in #20597
- [Docs] Add Ray Serve LLM section to openai compatible server guide by @crypdick in #20595
- [Docs] Add Anyscale to frameworks by @crypdick in #20590
- [Misc] improve error msg by @reidliu41 in #20604
- [CI/Build][CPU] Fix CPU CI and remove all CPU V0 files by @bigPYJ1151 in #20560
- [TPU] Temporary fix vmem oom for long model len by reducing page size by @Chenyaaang in #20278
- [Frontend] [Core] Integrate Tensorizer in to S3 loading machinery, allow passing arbitrary arguments during save/load by @sangstar in #19619
- [PD][Nixl] Remote consumer READ timeout for clearing request blocks by @NickLucche in #20139
- [Docs] Improve documentation for Deepseek R1 on Ray Serve LLM by @crypdick in #20601
- Remove unnecessary explicit title anchors and use relative links instead by @hmellor in #20620
- Stop using title frontmatter and fix doc that can only be ...
v0.9.2
Highlights
This release contains 452 commits from 167 contributors (31 new!)
NOTE: This is the last version in which V0 engine code and features remain intact. We highly recommend migrating to the V1 engine.
Engine Core
- Priority Scheduling is now implemented in V1 engine (#19057), embedding models in V1 (#16188), Mamba2 in V1 (#19327).
- Full CUDA‑Graph execution is now available for all FlashAttention v3 (FA3) and FlashMLA paths, including prefix‑caching. CUDA graph capture now shows a live progress bar, which makes debugging easier (#20301, #18581, #19617, #19501).
- FlexAttention update – any head size, FP32 fallback (#20467, #19754).
- Shared `CachedRequestData` objects and cached sampler‑ID stores deliver perf enhancements (#20232, #20291).
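For the new V1 priority scheduling (#19057), a hedged offline sketch, assuming the `scheduling_policy="priority"` engine argument and the per-request `priority` parameter of `LLM.generate` (lower values scheduled earlier); the model name is a placeholder.

```python
from vllm import LLM, SamplingParams

# Priority scheduling has to be selected at engine construction time.
llm = LLM(model="Qwen/Qwen3-0.6B", scheduling_policy="priority")  # placeholder model

prompts = ["Background batch-job prompt", "Latency-sensitive user prompt"]
params = SamplingParams(max_tokens=16)

# One priority per prompt; lower value = scheduled earlier (assumption).
for out in llm.generate(prompts, params, priority=[10, 0]):
    print(out.outputs[0].text)
```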
Model Support
- New families: Ernie 4.5 (+MoE) (#20220), MiniMax‑M1 (#19677, #20297), Slim‑MoE “Phi‑tiny‑MoE‑instruct” (#20286), Tencent HunYuan‑MoE‑V1 (#20114), Keye‑VL‑8B‑Preview (#20126), GLM‑4.1 V (#19331), Gemma‑3 (text‑only, #20134), Tarsier 2 (#19887), Qwen 3 Embedding & Reranker (#19260), dots1 (#18254), GPT‑2 for Sequence Classification (#19663).
- Granite hybrid MoE configurations with shared experts are fully supported (#19652).
Large‑Scale Serving & Engine Improvements
- Expert‑Parallel Load Balancer (EPLB) has been added! (#18343, #19790, #19885).
- Disaggregated serving enhancements: Avoid stranding blocks in P when aborted in D's waiting queue (#19223), let toy proxy handle /chat/completions (#19730)
- Native xPyD P2P NCCL transport as a base case for native PD without external dependency (#18242, #20246).
Hardware & Performance
- NVIDIA Blackwell
- Intel GPU (V1) backend with Flash‑Attention support (#19560).
- AMD ROCm: full‑graph capture for TritonAttention, quick All‑Reduce, and chunked pre‑fill (#19158, #19744, #18596).
- TPU: dynamic‑grid KV‑cache updates, head‑dim less than 128, tuned paged‑attention kernels, and KV‑padding fixes (#19928, #20235, #19620, #19813, #20048, #20339).
- Added a models and features support matrix (#20230).
Quantization
- Calibration‑free RTN INT4/INT8 pipeline for effortless, accurate compression (#18768); see the sketch after this list.
- Compressed‑Tensor NVFP4 (including MoE) + emulation; FP4 emulation removed on < SM100 devices (#19879, #19990, #19563).
- Dynamic MoE‑layer quant (Marlin/GPTQ) and INT8 vectorization primitives (#19395, #20331, #19233).
- Bits‑and‑Bytes 0.45+ support, with improved double‑quant logic and AWQ quality (#20424, #20033, #19431, #20076).
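
The calibration-free RTN pipeline above (#18768) quantizes weights at load time without calibration data. Below is a minimal sketch, assuming the method is exposed under the quantization name "rtn"; the exact identifier and the INT4/INT8 selection knobs may differ.

```python
from vllm import LLM, SamplingParams

# Minimal sketch of calibration-free RTN quantization (#18768).
# The method name "rtn" (and how INT4 vs. INT8 is selected) is an assumption;
# consult the quantization docs for the exact identifier and knobs.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", quantization="rtn")

out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```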
API · CLI · Frontend
- API Server: eliminate middleware overhead for the api_key and x_request_id headers (#19946).
- New OpenAI‑compatible endpoints: `/v1/audio/translations` & revamped `/v1/audio/transcriptions` (#19615, #20179, #19597); see the client example after this list.
- Token‑level progress bar for `LLM.beam_search` and cached template‑resolution speed‑ups (#19301, #20065).
- Image‑object support in `llm.chat`, tool‑choice expansion, and custom‑arg passthroughs enrich multi‑modal agents (#19635, #17177, #16862).
- CLI QoL: better parsing for `-O/--compilation-config`, batch‑size‑sweep benchmarking, richer `--help`, faster startup (#20156, #20516, #20430, #19941).
- Metrics: deprecate the gpu_ prefix for non‑GPU‑specific metrics (#18354); export NaNs in logits to scheduler_stats if output is corrupted (#18777).
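
The audio endpoints above are served in the usual OpenAI-compatible way, so the official `openai` client can call them. A minimal sketch follows, assuming a Whisper-style model is being served (the model name below is illustrative) and that the server runs on the default port.

```python
from openai import OpenAI

# Minimal sketch against the /v1/audio/transcriptions endpoint listed above.
# The served model name and host/port are assumptions; point the client at
# whatever `vllm serve` is actually running.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("sample.wav", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="openai/whisper-large-v3",
        file=audio_file,
    )
print(transcription.text)
```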
Platform & Deployment
- No‑privileged CPU / Docker / K8s mode (#19241) and custom default max‑tokens for hosted platforms (#18557).
- Security hardening – runtime (cloud)pickle imports forbidden (#18018).
- Hermetic builds and wheel slimming (FA2 8.0 + PTX only) shrink supply‑chain surface (#18064, #19336).
What's Changed
- [Docs] Note that alternative structured output backends are supported by @russellb in #19426
- [ROCm][V1] Adding ROCm to the list of plaforms using V1 by default by @gshtras in #19440
- [Model] use AutoWeightsLoader for commandr by @py-andy-c in #19399
- Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B-FP8 by @Xu-Wenqing in #19401
- [BugFix] Allow use_cudagraph to work with dynamic VLLM_USE_V1 by @zou3519 in #19390
- [New Model]: Support Qwen3 Embedding & Reranker by @noooop in #19260
- [BugFix] Fix docker build cpu-dev image error by @2niuhe in #19394
- Fix test_max_model_len in tests/entrypoints/llm/test_generate.py by @houseroad in #19451
- [CI] Disable failing GGUF model test by @mgoin in #19454
- [Misc] Remove unused `MultiModalHasher.hash_prompt_mm_data` by @lgeiger in #19422
- Add fused MOE config for Qwen3 30B A3B on B200 by @0xjunhao in #19455
- Fix Typo in Documentation and Function Name by @leopardracer in #19442
- [ROCm] Add rules to automatically label ROCm related PRs by @houseroad in #19405
- [Kernel] Support deep_gemm for linear methods by @artetaout in #19085
- [Doc] Update V1 User Guide for Hardware and Models by @DarkLight1337 in #19474
- [Doc] Fix quantization link titles by @DarkLight1337 in #19478
- [Doc] Support "important" and "announcement" admonitions by @DarkLight1337 in #19479
- [Misc] Reduce warning message introduced in env_override by @houseroad in #19476
- Support non-string values in JSON keys from CLI by @DarkLight1337 in #19471
- Add cache to cuda get_device_capability by @mgoin in #19436
- Fix some typo by @Ximingwang-09 in #19475
- Support no privileged mode on CPU for docker and kubernetes deployments by @louie-tsai in #19241
- [Bugfix] Update the example code, make it work with the latest lmcache by @runzhen in #19453
- [CI] Update FlashInfer to 0.2.6.post1 by @mgoin in #19297
- [doc] fix "Other AI accelerators" getting started page by @davidxia in #19457
- [Misc] Fix misleading ROCm warning by @jeejeelee in #19486
- [Docs] Remove WIP features in V1 guide by @WoosukKwon in #19498
- [Kernels] Add activation chunking logic to FusedMoEModularKernel by @bnellnm in #19168
- [AMD] [Quantization] Add override flag for attention dtype instead of using kv_cache_dtype trigger by @rasmith in #17331
- [UX] Add Feedback During CUDAGraph Capture by @robertgshaw2-redhat in #19501
- [CI/Build] Fix torch nightly CI dependencies by @zou3519 in #19505
- [CI] change spell checker from codespell to typos by @andyxning in #18711
- [BugFix] Force registration of w8a8_block_fp8_matmul_deepgemm via lazy import by @varun-sundar-rabindranath in #19514
- Add Triton Fused MoE kernel config for E=16 on B200 by @b8zhong in #19518
- [Frontend] Improve error message in tool_choice validation by @22quinn in #19239
- [BugFix] Work-around incremental detokenization edge case error by @njhill in #19449
- [BugFix] Handle missing sep_token for Qwen3-Reranker in Score API by @strutive07 in #19522
- [AMD][Kernel][BugFix] fix test_rocm_compressed_tensors_w8a8 for rocm by @rasmith in #19509
- Fix typo by @2niuhe in #19525
- [Security] Prevent new imports of (cloud)pickle by @russellb in #18018
- [Bugfix][V1] Allow manual FlashAttention for Blackwell by @mgoin in #19492
- [Bugfix] Respect num-gpu-blocks-override in v1 by @jmswen in #19503
- [Quantization] Improve AWQ logic by @jeejeelee in #19431
- [Doc] Add V1 column to supported models list by @DarkLight1337 in #19523
- [NixlConnector] Drop `num_blocks` check by @NickLucche in #19532
- [Perf] Vectorize static / dynamic INT8 quant kernels by @yewentao256 in #19233
- Fix TorchAOConfig skip layers by @mobicham in #19265
- [torch.compile][ROCm] Fuse quantization onto attention using a torch.compile pass by @ProExpertProg in #16756
- [doc] Make top navigatio...
v0.9.2rc2
What's Changed
- [Kernel] Enable fp8 support for pplx and BatchedTritonExperts. by @bnellnm in #18864
- [Misc] Fix `Unable to detect current VLLM config. Defaulting to NHD kv cache layout` warning by @NickLucche in #20400
- [Bugfix] Register reducer even if transformers_modules not available by @eicherseiji in #19510
- Change warn_for_unimplemented_methods to debug by @mgoin in #20455
- [Platform] Add custom default max tokens by @gmarinho2 in #18557
- Add ignore consolidated file in mistral example code by @princepride in #20420
- [Misc] small update by @reidliu41 in #20462
- [Structured Outputs][V1] Skipping with models doesn't contain tokenizers by @aarnphm in #20365
- [Perf] Optimize Vectorization Utils for Int 8 Quantization Kernels by @yewentao256 in #20331
- [Misc] Add SPDX-FileCopyrightText by @jeejeelee in #20428
- Support Llama 4 for fused_marlin_moe by @mgoin in #20457
- [Bug][Frontend] Fix structure of transcription's decoder_prompt by @sangbumlikeagod in #18809
- [Model][3/N] Automatic conversion of CrossEncoding model by @noooop in #20168
- [Doc] Fix classification table in list of supported models by @DarkLight1337 in #20489
- [CI] add kvcache-connector dependency definition and add into CI build by @panpan0000 in #18193
- [Misc] Small: Remove global media connector. Each test should have its own test connector object. by @huachenheli in #20395
- Enable V1 for Hybrid SSM/Attention Models by @tdoublep in #20016
- [feat]: CUTLASS block scaled group gemm for SM100 by @djmmoss in #19757
- [CI Bugfix] Fix pre-commit failures on main by @mgoin in #20502
- [Doc] fix mutltimodal_inputs.md gh examples link by @GuyStone in #20497
- [Misc] Add security warning for development mode endpoints by @reidliu41 in #20508
- [doc] small fix by @reidliu41 in #20506
- [Misc] Remove the unused LoRA test code by @jeejeelee in #20494
- Fix unknown attribute of topk_indices_dtype in CompressedTensorsW8A8Fp8MoECutlassMethod by @luccafong in #20507
- [v1] Re-add fp32 support to v1 engine through FlexAttention by @Isotr0py in #19754
- [Misc] Add logger.exception for TPU information collection failures by @reidliu41 in #20510
- [Misc] remove unused import by @reidliu41 in #20517
- test_attention compat with coming xformers change by @bottler in #20487
- [BUG] Fix #20484. Support empty sequence in cuda penalty kernel by @vadiklyutiy in #20491
- [Bugfix] Fix missing per_act_token parameter in compressed_tensors_moe by @luccafong in #20509
- [BugFix] Fix: ImportError when building on hopper systems by @LucasWilkinson in #20513
- [TPU][Bugfix] fix the MoE OOM issue by @yaochengji in #20339
- [Frontend] Support image object in llm.chat by @sfeng33 in #19635
- [Benchmark] Add support for multiple batch size benchmark through CLI in `benchmark_moe.py` + Add Triton Fused MoE kernel config for FP8 E=16 on B200 by @b8zhong in #20516
- [Misc] call the pre-defined func by @reidliu41 in #20518
- [V0 deprecation] Remove V0 CPU/XPU/TPU backends by @WoosukKwon in #20412
- [V1] Support any head size for FlexAttention backend by @DarkLight1337 in #20467
- [BugFix][Spec Decode] Fix spec token ids in model runner by @WoosukKwon in #20530
- [Bugfix] Add `use_cross_encoder` flag to use correct activation in `ClassifierPooler` by @DarkLight1337 in #20527
New Contributors
- @sangbumlikeagod made their first contribution in #18809
- @djmmoss made their first contribution in #19757
- @GuyStone made their first contribution in #20497
- @bottler made their first contribution in #20487
Full Changelog: v0.9.2rc1...v0.9.2rc2