
Releases: vllm-project/vllm

v0.11.0

02 Oct 19:17

Highlights

This release features 538 commits from 207 contributors (65 new contributors)!

  • This release completes the removal of the V0 engine. All V0 engine code, including AsyncLLMEngine, LLMEngine, MQLLMEngine, all attention backends, and related components, has been removed. V1 is now the only engine in the codebase.
  • This release turns on FULL_AND_PIECEWISE as the default CUDA graph mode. This should provide better out-of-the-box performance for most models, particularly fine-grained MoEs, while preserving compatibility with models that only support PIECEWISE mode; a sketch for overriding the default follows.
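
For readers who need to opt out of the new default, here is a minimal sketch. It assumes the compilation_config argument of LLM accepts a plain dict and that cudagraph_mode takes the mode names quoted above; the exact spelling of both is an assumption, not something confirmed by these notes.

```python
# Minimal sketch: force PIECEWISE-only CUDA graphs for a model that misbehaves
# under the new FULL_AND_PIECEWISE default. The dict form of compilation_config
# and the string mode name are assumptions.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3-0.6B",
    compilation_config={"cudagraph_mode": "PIECEWISE"},
)
print(llm.generate(["Hello, my name is"])[0].outputs[0].text)
```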

Model Support

  • New architectures: DeepSeek-V3.2-Exp (#25896), Qwen3-VL series (#24727), Qwen3-Next (#24526), OLMo3 (#24534), LongCat-Flash (#23991), Dots OCR (#24645), Ling2.0 (#24627), CWM (#25611).
  • Encoders: RADIO encoder support (#24595), Transformers backend support for encoder-only models (#25174).
  • Task expansion: BERT token classification/NER (#24872), multimodal models for pooling tasks (#24451).
  • Data parallel for vision encoders: InternVL (#23909), Qwen2-VL (#25445), Qwen3-VL (#24955).
  • Speculative decoding: EAGLE3 for MiniCPM3 (#24243) and GPT-OSS (#25246).
  • Features: Qwen3-VL text-only mode (#26000), EVS video token pruning (#22980), Mamba2 TP+quantization (#24593), MRoPE + YaRN (#25384), Whisper on XPU (#25123), LongCat-Flash-Chat tool calling (#24083).
  • Performance: GLM-4.1V 916ms TTFT reduction via fused RMSNorm (#24733), GLM-4 MoE SharedFusedMoE optimization (#24849), Qwen2.5-VL CUDA sync removal (#24741), Qwen3-VL Triton MRoPE kernel (#25055), FP8 checkpoints for Qwen3-Next (#25079).
  • Reasoning: SeedOSS reasoning parser (#24263).

Engine Core

  • KV cache offloading: CPU offloading with LRU management (#19848, #20075, #21448, #22595, #24251).
  • V1 features: Prompt embeddings (#24278), sharded state loading (#25308), FlexAttention sliding window (#24089), LLM.apply_model (#18465); an apply_model sketch follows this list.
  • Hybrid allocator: Pipeline parallel (#23974), varying hidden sizes (#25101).
  • Async scheduling: Uniprocessor executor support (#24219).
  • Architecture: Tokenizer group removal (#24078), shared memory multimodal caching (#20452).
  • Attention: Hybrid SSM/Attention in Triton (#21197), FlashAttention 3 for ViT (#24347).
  • Performance: FlashInfer RoPE 2x speedup (#21126), fused Q/K RoPE 11% improvement (#24511, #25005), 8x spec decode overhead reduction (#24986), FlashInfer spec decode with 1.14x speedup (#25196), model info caching (#23558), inputs_embeds copy avoidance (#25739).
  • LoRA: Optimized weight loading (#25403).
  • Defaults: CUDA graph mode FULL_AND_PIECEWISE (#25444), Inductor standalone compile disabled (#25391).
  • torch.compile: CUDA graph Inductor partition integration (#24281).
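
A minimal sketch of LLM.apply_model (#18465): it assumes the method takes a callable that receives the underlying torch.nn.Module inside each worker, and that the results come back as one entry per worker.

```python
# Minimal sketch of LLM.apply_model (#18465): run a callable against the
# underlying torch.nn.Module where the weights actually live. Returning one
# result per worker is an assumption.
from vllm import LLM

llm = LLM(model="Qwen/Qwen3-0.6B")

def count_params(model):
    return sum(p.numel() for p in model.parameters())

print(llm.apply_model(count_params))
```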

Hardware & Performance

  • NVIDIA: FP8 FlashInfer MLA decode (#24705), BF16 fused MoE for Hopper/Blackwell expert parallel (#25503).
  • DeepGEMM: Enabled by default (#24462), 5.5% throughput improvement (#24783).
  • New architectures: RISC-V 64-bit (#22112), ARM non-x86 CPU (#25166), ARM 4-bit fused MoE (#23809).
  • AMD: ROCm 7.0 (#25178), GLM-4.5 MI300X tuning (#25703).
  • Intel XPU: MoE DP accuracy fix (#25465).

Large Scale Serving & Performance

  • Dual-Batch Overlap (DBO): Overlapping computation mechanism (#23693), DeepEP high throughput + prefill (#24845).
  • Data Parallelism: torchrun launcher (#24899), Ray placement groups (#25026), Triton DP/EP kernels (#24588).
  • EPLB: Hunyuan V1 (#23078), Mixtral (#22842), static placement (#23745), reduced overhead (#24573).
  • Disaggregated serving: KV transfer metrics (#22188), NIXL MLA latent dimension (#25902).
  • MoE: Shared expert overlap optimization (#24254), SiLU kernel for DeepSeek-R1 (#24054), Allgather/ReduceScatter backend enabled for NaiveAllToAll (#23964).
  • Distributed: NCCL symmetric memory with 3-4% throughput improvement (#24532), enabled by default for TP (#25070).

Quantization

  • FP8: Per-token-group quantization (#24342), hardware-accelerated instructions (#24757), torch.compile KV cache (#22758), paged attention update (#22222); an FP8 KV cache sketch follows this list.
  • FP4: NVFP4 for dense models (#25609), Gemma3 (#22771), Llama 3.1 405B (#25135).
  • W4A8: Faster preprocessing (#23972).
  • Compressed tensors: Blocked FP8 for MoE (#25219).
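
As a concrete illustration related to the FP8 KV cache work (#22758), the sketch below uses the long-standing kv_cache_dtype knob; whether a given model and attention backend actually exercise the new torch.compile path is not asserted here.

```python
# Minimal sketch: request an FP8 KV cache via kv_cache_dtype. Which attention
# backend is picked, and whether the torch.compile path from #22758 is hit,
# depends on the model and hardware and is not asserted here.
from vllm import LLM

llm = LLM(model="Qwen/Qwen3-0.6B", kv_cache_dtype="fp8")
print(llm.generate(["FP8 KV cache smoke test:"])[0].outputs[0].text)
```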

API & Frontend

  • OpenAI: Prompt logprobs for all tokens (#24956), logprobs=-1 for full vocab (#25031), reasoning streaming events (#24938), Responses API MCP tools (#24628, #24985), health 503 on dead engine (#24897); a logprobs request sketch follows this list.
  • Multimodal: Media UUID caching (#23950), image path format (#25081).
  • Tool calling: XML parser for Qwen3-Coder (#25028), Hermes-style tokens (#25281).
  • CLI: --enable-logging (#25610), improved --help (#24903).
  • Config: Speculative model engine args (#25250), env validation (#24761), NVTX profiling (#25501), guided decoding backward compatibility (#25615, #25422).
  • Metrics: V1 TPOT histogram (#24015), hidden deprecated gpu_ metrics (#24245), KV cache GiB units (#25204, #25479).
  • UX: Removed misleading quantization warning (#25012).
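
A hedged request sketch against a locally running vLLM OpenAI-compatible server: it assumes logprobs=-1 is accepted on the completions endpoint as described above, and that all-token prompt logprobs are requested through vLLM's extra_body passthrough (the prompt_logprobs field name is an assumption).

```python
# Hedged sketch: full-vocab logprobs (logprobs=-1, #25031) plus all-token
# prompt logprobs (#24956) from a local vLLM OpenAI-compatible server.
# The prompt_logprobs passthrough via extra_body is an assumption.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.completions.create(
    model="Qwen/Qwen3-0.6B",
    prompt="vLLM is",
    max_tokens=4,
    logprobs=-1,                         # full vocabulary per sampled token
    extra_body={"prompt_logprobs": -1},  # logprobs for every prompt token
)
print(resp.choices[0].logprobs)
```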

Security

Dependencies

  • PyTorch 2.8 for CPU (#25652), FlashInfer 0.3.1 (#24470), CUDA 13 (#24599), ROCm 7.0 (#25178).
  • Build requirements: C++17 now enforced globally (#24823).
  • TPU: Deprecated xm.mark_step in favor of torch_xla.sync (#25254).

V0 Deprecation

What's Changed


v0.10.2

13 Sep 06:37

Highlights

This release contains 740 commits from 266 contributors (97 new)!

Breaking Changes: This release includes the PyTorch 2.8.0 upgrade, V0 deprecations, and API changes; please review the changelog carefully.

aarch64 support: This release features native aarch64 support, allowing vLLM to run on the GB200 platform. The Docker image vllm/vllm-openai should already be multiplatform. To install the wheels, download them from this release's artifacts or install via

uv pip install vllm==0.10.2 --extra-index-url https://wheels.vllm.ai/0.10.2/ --torch-backend=auto

Model Support

  • New model families and enhancements: Apertus (#23068), LFM2 (#22845), MiDashengLM (#23652), Motif-1-Tiny (#23414), Seed-Oss (#23241), Google EmbeddingGemma-300m (#24318), GTE sequence classification (#23524), Donut OCR model (#23229), KeyeVL-1.5-8B (#23838), R-4B vision model (#23246), Ernie4.5 VL (#22514), MiniCPM-V 4.5 (#23586), Ovis2.5 (#23084), Qwen3-Next with hybrid attention (#24526), InternVL3.5 with video support (#23658), Qwen2Audio embeddings (#23625), NemotronH Nano VLM (#23644), BLOOM V1 engine support (#23488), and Whisper encoder-decoder for V1 (#21088).
  • Pipeline parallelism expansion: Added PP support for Hunyuan (#24212), Ovis2.5 (#23405), GPT-OSS (#23680), and Kimi-VL-A3B-Thinking-2506 (#23114).
  • Data parallelism for vision models: Enabled DP for ViT across Qwen2.5VL (#22742), MiniCPM-V (#23948, #23327), Kimi-VL (#23817), and GLM-4.5V (#23168).
  • LoRA ecosystem expansion: Added LoRA support to Voxtral (#24517), Qwen-2.5-Omni (#24231), and DeepSeek models V2/V3/R1-0528 (#23971), with significantly faster LoRA startup performance (#23777).
  • Classification and pooling enhancements: Multi-label classification support (#23173), logit bias and sigmoid normalization (#24031), and FP32 precision heads for pooling models (#23810).
  • Performance optimizations: Removed unnecessary CUDA sync from GLM-4.1V (#24332) and Qwen2VL (#24334) preprocessing, eliminated redundant all-reduce in Qwen3 MoE (#23169), optimized InternVL CPU threading (#24519), and GLM4.5-V video frame decoding (#24161).

Engine Core

  • V1 engine maturation: Extended V1 support to compute capability < 8.0 (#23614, #24022), added cross-attention KV cache for encoder-decoder models (#23664), request-level logits processor integration (#23656), and KV events from connectors (#19737).
  • Backend expansion: Terratorch backend integration (#23513) enabling non-language model tasks like semantic segmentation and geospatial applications with --model-impl terratorch support.
  • Hybrid and Mamba model improvements: Enabled full CUDA graphs by default for hybrid models (#22594), disabled prefix caching for hybrid/Mamba models (#23716), added FP32 SSM kernel support (#23506), full CUDA graph support for Mamba1 (#23035), and V1 as default for Mamba models (#23650).
  • Performance core improvements: --safetensors-load-strategy for NFS based file loading acceleration (#24469), critical CUDA graph capture throughput fix (#24128), scheduler optimization for single completions (#21917), multi-threaded model weight loading (#23928), and tensor core usage enforcement for FlashInfer decode (#23214).
  • Multimodal enhancements: Multimodal cache tracking with mm_hash (#22711), UUID-based multimodal identifiers (#23394), improved V1 video embedding estimation (#24312), and simplified multimodal UUID handling (#24271).
  • Sampling and structured outputs: Support for all prompt logprobs (#23868), final logprobs (#22387), grammar bitmask optimization (#23361), and user-configurable KV cache memory size (#21489).
  • Distributed: Decode Context Parallel (DCP) support for MLA (#23734).

Hardware & Performance

  • NVIDIA Blackwell/SM100 generation: FP8 MLA support with CUTLASS backend (#23289), DeepGEMM Linear with 1.5% E2E throughput improvement (#23351), Hopper DeepGEMM E8M0 for DeepSeekV3.1 (#23666), SM100 FlashInfer CUTLASS MoE FP8 backend (#22357), MXFP4 fused CUTLASS MoE (#23696), default MXFP4 MoE on Blackwell (#23008), and GPT-OSS DP/EP support with 52,003 tokens/s throughput (#23608).
  • Breaking change: FlashMLA disabled on Blackwell GPUs due to compatibility issues (#24521).
  • Kernel and attention optimizations: FlashAttention MLA with CUDA graph support (#14258, #23958), V1 cross-attention support (#23297), FP8 support for FlashMLA (#22668), fused grouped TopK for MoE (#23274), Flash Linear Attention kernels (#24518), and W4A8 support on Hopper (#23198).
  • Performance improvements: 13.7x speedup for token conversion (#20413), TTIT/TTFT improvements for disaggregated serving (#22760), symmetric memory all-reduce by default (#24111), FlashInfer warmup during startup (#23439), V1 model execution overlap (#23569), and various Triton configuration tuning (#23748, #23939).
  • Platform expansion: Apple Silicon bfloat16 support for M2+ (#24129), IBM Z V1 engine support (#22725), Intel XPU torch.compile (#22609), XPU MoE data parallelism (#22887), XPU Triton attention (#24149), XPU FP8 quantization (#23148), and ROCm pipeline parallelism with Ray (#24275).
  • Model-specific optimizations: Hardware-tuned MoE configurations for Qwen3-Next on B200/H200/H100 (#24698, #24688, #24699, #24695), GLM-4.5-Air-FP8 B200 configs (#23695), Kimi K2 optimization (#24597), and QWEN3 Coder/Thinking configs (#24266, #24330).

Quantization

  • New quantization capabilities: Per-layer quantization routing (#23556), GGUF quantization with layer skipping (#23188), NFP4+FP8 MoE support (#22674), W4A8 channel scales (#23570), and AMD CDNA2/CDNA3 FP4 support (#22527).
  • Advanced quantization infrastructure: Compressed tensors transforms for linear operations (#22486) enabling techniques like SpinQuantR1R2R4 and QuIP quantization methods.
  • FlashInfer quantization integration: FP8 KV cache for TRTLLM prefill attention (#24197), FP8-qkv attention kernels (#23647), and FP8 per-tensor GEMMs (#22895).
  • Platform-specific quantization: ROCm TorchAO quantization enablement (#24400) and TorchAO module swap configuration (#21982).
  • Performance optimizations: MXFP4 MoE loading cache optimization (#24154) and compressed tensors version updates (#23202).
  • Breaking change: Removed original Marlin quantization format (#23204).

API & Frontend

  • OpenAI API enhancements: Gemma3n audio transcription/translation endpoints (#23735), transcription response usage statistics (#23576), and return_token_ids parameter (#22587).
  • Response API improvements: Streaming support for non-harmony responses (#23741), non-streaming logprobs (#23319), MCP tool background mode (#23494), MCP streaming+background support (#23927), and tool output token reporting (#24285).
  • Frontend optimizations: Error stack traces with --log-error-stack (#22960), collective RPC endpoint (#23075), beam search concurrency optimization (#23599), unnecessary detokenization skipping (#24236), and custom media UUIDs (#23449).
  • Configuration enhancements: Formalized --mm-encoder-tp-mode flag (#23190), VLLM_DISABLE_PAD_FOR_CUDAGRAPH environment variable (#23595), EPLB configuration parameter (#20562), embedding endpoint chat request support (#23931), and LM Format Enforcer V1 integration (#22564).

Dependencies

  • Major updates: PyTorch 2.8.0 upgrade (#20358) - breaking change requiring environment updates, FlashInfer v0.3.0 upgrade (#24086), and FlashInfer 0.2.14.post1 maintenance update (#23537).
  • Supporting updates: XGrammar 0.1.23 (#22988), TPU core dump fix with tpu_info 0.4.0 (#23135), and compressed tensors version bump (#23202).
  • Deployment improvements: FlashInfer cubin directory environment variable (#22675) for offline environments and pre-cached CUDA binaries.

V0 Deprecation

  • Backend removals: V0 Neuron backend deprecation (#21159), V0 pooling model support removal (#23434), V0 FlashInfer attention backend removal (#22776), and V0 test cleanup (#23418, #23862).
  • API breaking changes: prompt_token_ids fallback removal from LLM.generate and LLM.embed (#18800), LoRA extra vocab size deprecation warning (#23635), LoRA bias parameter deprecation (#24339), and metrics naming change from TPOT to ITL (#24110).

Breaking Changes

  1. PyTorch 2.8.0 upgrade - Environment dependency change requiring updated CUDA versions
  2. FlashMLA Blackwell restriction - FlashMLA disabled on Blackwell GPUs due to compatibility issues
  3. V0 feature removals - Neuron backend, pooling models, FlashInfer attention backend
  4. Quantization removals - Removed the quantized Mixtral hack implementation and the original Marlin format
  5. Metrics renaming - TPOT deprecated in favor of ITL

What's Changed


v0.10.1.1

20 Aug 21:20

This is a critical bugfix and security release:

Full Changelog: v0.10.1...v0.10.1.1

v0.10.1

18 Aug 04:39

Highlights

The v0.10.1 release includes 727 commits from 245 contributors (105 new)!

NOTE: This release deprecates V0 FlashAttention 3 (FA3) support; as a result, the FP8 KV cache in V0 may have issues.

Model Support

  • New model families: GPT-OSS with comprehensive tool calling and streaming support (#22327, #22330, #22332, #22335, #22339, #22340, #22342), Command-A-Vision (#22660), mBART (#22883), and SmolLM3 using Transformers backend (#22665).
  • Vision-language models: Official Eagle multimodal support with Llama4 backend (#20788), Step3 vision-language models (#21998), Gemma3n multimodal (#20495), MiniCPM-V 4.0 (#22166), HyperCLOVAX-SEED-Vision-Instruct-3B (#20931), Emu3 with Transformers backend (#21319), Intern-S1 (#21628), and Prithvi in online serving mode (#21518).
  • Enhanced existing models: NemotronH support (#22349), Ernie 4.5 Base 0.3B model name change (#21735), GLM-4.5 series improvements (#22215), Granite models with fused MoE configurations (#21332) and quantized checkpoint loading (#22925), Ultravox support for Llama 4 and Gemma 3 backends (#17818), and Mamba1 and Jamba model support in V1 (without CUDA graphs) (#21249).
  • Advanced model capabilities: Qwen3 EPLB (#20815) and dual-chunk attention support (#21924), Qwen native Eagle3 target support (#22333).
  • Architecture expansions: Encoder-only models without KV-cache enabling BERT-style architectures (#21270), expanded tensor parallelism support in Transformers backend (#22651), tensor parallelism for Deepseek_vl2 vision transformer (#21494), and tensor/pipeline parallelism with Mamba2 kernel for PLaMo2 (#19674).
  • V1 engine compatibility: Extended support for additional pooling models (#21747) and Step3VisionEncoder distributed processing option (#22697).

Engine Core

  • CUDA graph performance: Full CUDA graph support with separate attention routines, adding FA2 and FlashInfer compatibility (#20059), plus 6% end-to-end throughput improvement from Cutlass MLA (#22763).
  • Attention system advances: Multiple attention metadata builders per KV cache specification (#21588), tree attention backend for v1 engine (experimental) (#20401), FlexAttention encoder-only support (#22273), upgraded FlashAttention 3 with attention sink support (#22313), and multiple attention groups for KV sharing patterns (#22672).
  • Speculative decoding optimizations: N-gram speculative decoding with single KMP token proposal algorithm (#22437), explicit EAGLE3 interface for enhanced compatibility (#22642).
  • Default behavior improvements: Pooling models now default to chunked prefill and prefix caching (#20930), disabled chunked local attention by default for Llama4 for better performance (#21761).
  • Extensibility and configuration: Model loader plugin system (#21067), custom operations support for FusedMoe (#22509), rate limiting with bucket algorithm for proxy server (#22643), torch.compile support for bailing MoE (#21664).
  • Performance optimizations: Improved startup time by disabling C++ compilation of symbolic shapes (#20836), enhanced headless models for pooling in Transformers backend (#21767).

Hardware & Performance

  • NVIDIA Blackwell (SM100) optimizations: CutlassMLA as default backend (#21626), FlashInfer MoE per-tensor scale FP8 backend (#21458), SM90 CUTLASS FP8 GEMM with kernel tuning and swap AB support (#20396).
  • NVIDIA RTX 5090/RTX PRO 6000 (SM120) support: Block FP8 quantization (#22131) and CUTLASS NVFP4 4-bit weights/activations support (#21309).
  • AMD ROCm platform enhancements: Flash Attention backend for Qwen-VL models (#22069), AITER HIP block quantization kernels (#21242), reduced device-to-host transfers (#22683), and optimized kernel performance for small batch sizes 1-4 (#21350).
  • Attention and compute optimizations: FlashAttention 3 attention sinks performance boost (#22478), Triton-based multi-dimensional RoPE replacing PyTorch implementation (#22375), async tensor parallelism for scaled matrix multiplication (#20155), optimized FlashInfer metadata building (#21137).
  • Memory and throughput improvements: Mamba2 reduced device-to-device copy overhead (#21075), fused Triton kernels for RMSNorm (#20839, #22184), improved multimodal hasher performance for repeated image prompts (#22825), multithreaded async multimodal loading (#22710).
  • Parallelization and MoE optimizations: Guided decoding throughput improvements (#21862), balanced expert sharding for MoE models (#21497), expanded fused kernel support for topk softmax (#22211), fused MoE for nomic-embed-text-v2-moe (#18321).
  • Hardware compatibility and kernels: ARM CPU build fixes for systems without BF16 support (#21848), Machete memory-bound performance improvements (#21556), FlashInfer TRT-LLM prefill attention kernel support (#22095), optimized reshape_and_cache_flash CUDA kernel (#22036), CPU transfer support in NixlConnector (#18293).
  • Specialized CUDA kernels: GPT-OSS activation functions (#22538), RLHF weight loading acceleration (#21164).

Quantization

  • Advanced quantization techniques: MXFP4 and bias support for Marlin kernel (#22428), NVFP4 GEMM FlashInfer backends (#22346), compressed-tensors mixed-precision model loading (#22468), FlashInfer MoE support for NVFP4 (#21639).
  • Hardware-optimized quantization: Dynamic 4-bit quantization with Kleidiai kernels for CPU inference (#17112), TensorRT-LLM FP4 quantization optimized for MoE low-latency inference (#21331).
  • Expanded model quantization support: BitsAndBytes quantization for InternS1 (#21953) and additional MoE models (#21370, #21548), Gemma3n quantization compatibility (#21974), calibration-free RTN quantization for MoE models (#20766), ModelOpt Qwen3 NVFP4 support (#20101).
  • Performance and compatibility improvements: CUDA kernel optimization for Int8 per-token group quantization (#21476), non-contiguous tensor support in FP8 quantization (#21961), automatic detection of ModelOpt quantization formats (#22073).
  • Breaking change: Removed AQLM quantization support (#22943) - users should migrate to alternative quantization methods.

API & Frontend

  • OpenAI API compatibility: Unix domain socket support for local communication (#18097), improved error response format matching upstream specification (#22099), aligned tool_choice="required" behavior with OpenAI when tools list is empty (#21052).
  • New API capabilities: Dedicated LLM.reward interface for reward models (#21720; see the sketch after this list), chunked processing for long inputs in embedding models (#22280), AsyncLLM proper response handling for aborted requests (#22283).
  • Configuration and environment: Multiple API keys support for enhanced authentication (#18548), custom vLLM tuned configuration paths (#22791), environment variable control for logging statistics (#22905), multimodal cache size (#22441), and DeepGEMM E8M0 scaling behavior (#21968).
  • CLI and tooling improvements: V1 API support for run-batch command (#21541), custom process naming for better monitoring (#21445), improved help display showing available choices (#21760), optional memory profiling skip for multimodal models (#22950), enhanced logging of non-default arguments (#21680).
  • Tool and parser support: HermesToolParser for models without special tokens (#16890), multi-turn conversation benchmarking tool (#20267).
  • Distributed serving enhancements: Enhanced hybrid distributed serving with multiple API servers in load balancing mode (#21510), request_id support for external load balancers (#21009).
  • User experience enhancements: Improved error messaging for multimodal items (#22114), per-request pooling control via PoolingParams (#20538).
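
A minimal sketch of the dedicated LLM.reward interface (#21720): the model name is a placeholder, and both the runner="pooling" argument and the structure of the returned outputs are assumptions modeled on the other pooling helpers.

```python
# Hedged sketch of LLM.reward (#21720). The model name is a placeholder;
# runner="pooling" and the shape of the returned outputs are assumptions.
from vllm import LLM

llm = LLM(model="my-org/my-reward-model", runner="pooling")
outputs = llm.reward(["The assistant's answer was helpful and correct."])
for out in outputs:
    print(out)  # one pooling/reward output per prompt
```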

Dependencies

  • FlashInfer updates: Updated to v0.2.8 for improved performance (#21385), moved to optional dependency install with pip install vllm[flashinfer] for flexible installation (#21959).
  • Mamba SSM restructuring: Updated to version 2.2.5 (#21421), removed from core requirements to reduce installation complexity (#22541).
  • Docker and deployment: Docker-aware precompiled wheel support for easier containerized deployment (#21127, #22106).
  • Python package updates: OpenAI Python dependency updated to latest version for API compatibility (#22316).
  • Dependency optimizations: Removed xformers requirement for Mistral-format Pixtral and Mistral3 models (#21154), deprecation warnings added for old DeepGEMM version (#22194).

V0 Deprecation

Important: As part of the ongoing V0 engine cleanup, several breaking changes have been introduced:

  • CLI flag updates: Replaced --task with --runner and --convert options (#21470), deprecated --disable-log-requests in favor of --enable-log-requests for clearer semantics (#21739), renamed --expand-tools-even-if-tool-choice-none to --exclude-tools-when-tool-choice-none for consistency (#20544).
  • API cleanup: Removed previously deprecated arguments and methods as part of ongoing V0 engine codebase cleanup (#21907).

What's Changed

  • Deduplicate Transformers backend code using inheritance by @hmellor in #21461
  • [Bugfix][ROCm] Fix for warp_size uses on host by @gshtras in #21205
  • [TPU][Bugfix] fix moe layer by @yaochengji in #21340
  • [v1][Core] Clean up usages of SpecializedManager by @zhouwfang in #21407
  • [Misc] Fix duplicate FusedMoEConfig debug messages by @njhill in #21455
  • [Core] Support model loader plugins by @22quinn in #21067
  • remove GLM-4 quantization wrong Code by @zRzRzRzRzRzRzR in #21435
  • Replace --expand-tools-even-if-tool-choice-none with --exclude-tools-when-tool-choice-none ...

v0.10.1rc1

17 Aug 22:57
0fc8fa7
Pre-release

What's Changed


v0.10.0

24 Jul 22:43
6d8d0a2

Highlights

The v0.10.0 release includes 308 commits from 168 contributors (62 new!).

NOTE: This release begins the cleanup of the V0 engine codebase. We have removed the V0 CPU/XPU/TPU/HPU backends (#20412), long-context LoRA (#21169), Prompt Adapters (#20588), Phi3-Small & BlockSparse Attention (#21217), and Spec Decode workers (#21152) so far, and we plan to continue deleting code that is no longer used.

Model Support

  • New families: Llama 4 with EAGLE support (#20591), EXAONE 4.0 (#21060), Microsoft Phi-4-mini-flash-reasoning (#20702), Hunyuan V1 Dense + A13B with reasoning/tool parsing (#21368, #20625, #20820), Ling MoE models (#20680), JinaVL Reranker (#20260), Nemotron-Nano-VL-8B-V1 (#20349), Arcee (#21296), Voxtral (#20970).
  • Enhanced compatibility: BERT/RoBERTa with AutoWeightsLoader (#20534), HF format support for MiniMax (#20211), Gemini configuration (#20971), GLM-4 updates (#20736).
  • Architecture expansions: Attention-free model support (#20811), Hybrid SSM/Attention models on V1 (#20016), LlamaForSequenceClassification (#20807), expanded Mamba2 layer support (#20660).
  • VLM improvements: VLM support with transformers backend (#20543), PrithviMAE on V1 engine (#20577).

Engine Core

  • Experimental async scheduling --async-scheduling flag to overlap engine core scheduling with GPU runner (#19970).
  • V1 engine improvements: backend-agnostic local attention (#21093), MLA FlashInfer ragged prefill (#20034), hybrid KV cache with local chunked attention (#19351).
  • Multi-task support: models can now support multiple tasks (#20771), multiple poolers (#21227), and dynamic pooling parameter configuration (#21128).
  • RLHF Support: new RPC methods for runtime weight reloading (#20096) and config updates (#20095), logprobs mode for selecting which stage of logprobs to return (#21398).
  • Enhanced caching: multi-modal caching for transformers backend (#21358), reproducible prefix cache hashing using SHA-256 + CBOR (#20511).
  • Startup time reduction via CUDA graph capture speedup via frozen GC (#21146).
  • Elastic expert parallel for dynamic GPU scaling while preserving state (#20775).

Hardware & Performance

  • NVIDIA Blackwell/SM100 optimizations: CUTLASS block scaled group GEMM for smaller batches (#20640), FP8 groupGEMM support (#20447), DeepGEMM integration (#20087), FlashInfer MoE blockscale FP8 backend (#20645), CUDNN prefill API for MLA (#20411), Triton Fused MoE kernel config for FP8 E=16 on B200 (#20516).
  • Performance improvements: 48% request duration reduction via microbatch tokenization for concurrent requests (#19334), fused MLA QKV + strided layernorm (#21116), Triton causal-conv1d for Mamba models (#18218).
  • Hardware expansion: ARM CPU int8 quantization (#14129), PPC64LE/ARM V1 support (#20554), Intel XPU ray distributed execution (#20659), shared-memory pipeline parallel for CPU (#21289), FlashInfer ARM CUDA support (#21013).

Quantization

  • New quantization support: MXFP4 for MoE models (#17888), BNB support for Mixtral and additional MoE models (#20893, #21100), in-flight quantization for MoE (#20061).
  • Hardware-specific: FP8 KV cache quantization on TPU (#19292), FP8 support for BatchedTritonExperts (#18864), optimized INT8 vectorization kernels (#20331).
  • Performance optimizations: Triton backend for DeepGEMM per-token group quantization (#20841), CUDA kernel for per-token group quantization (#21083), CustomOp abstraction for FP8 (#19830).

API & Frontend

  • OpenAI compatibility: Responses API implementation (#20504, #20975), image object support in llm.chat (#19635), tool calling with required choice and $defs (#20629).
  • New endpoints: get_tokenizer_info for tokenizer/chat-template information (#20575), cache_salt support for completions/responses (#20981); a cache_salt sketch follows this list.
  • Model loading: Tensorizer S3 integration with arbitrary arguments (#19619), HF repo paths & URLs for GGUF models (#20793), tokenization_kwargs for embedding truncation (#21033).
  • CLI improvements: --help=page option for enhanced help documentation (#20961), default model changed to Qwen3-0.6B (#20335).
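
A hedged sketch of cache_salt (#20981): it assumes the field is a vLLM-specific extension passed through the OpenAI client's extra_body, so prefix-cache entries are only reused by requests carrying the same salt.

```python
# Hedged sketch of cache_salt (#20981): a vLLM-specific request field that
# partitions prefix-cache reuse (e.g. per tenant). Passing it via extra_body
# is an assumption.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.completions.create(
    model="Qwen/Qwen3-0.6B",
    prompt="Summarize the incident report:",
    max_tokens=32,
    extra_body={"cache_salt": "tenant-a"},  # isolate cached prefixes per salt
)
print(resp.choices[0].text)
```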

Dependencies

  • Updated PyTorch to 2.7.1 for CUDA (#21011)
  • FlashInfer updated to v0.2.8rc1 (#20718)

What's Changed


v0.10.0rc2

24 Jul 05:04
6d8d0a2
Pre-release

What's Changed


v0.10.0rc1

20 Jul 05:17
d1fb65b
Pre-release

What's Changed

  • [Kernel] Enable fp8 support for pplx and BatchedTritonExperts. by @bnellnm in #18864
  • [Misc] Fix Unable to detect current VLLM config. Defaulting to NHD kv cache layout warning by @NickLucche in #20400
  • [Bugfix] Register reducer even if transformers_modules not available by @eicherseiji in #19510
  • Change warn_for_unimplemented_methods to debug by @mgoin in #20455
  • [Platform] Add custom default max tokens by @gmarinho2 in #18557
  • Add ignore consolidated file in mistral example code by @princepride in #20420
  • [Misc] small update by @reidliu41 in #20462
  • [Structured Outputs][V1] Skipping with models doesn't contain tokenizers by @aarnphm in #20365
  • [Perf] Optimize Vectorization Utils for Int 8 Quantization Kernels by @yewentao256 in #20331
  • [Misc] Add SPDX-FileCopyrightText by @jeejeelee in #20428
  • Support Llama 4 for fused_marlin_moe by @mgoin in #20457
  • [Bug][Frontend] Fix structure of transcription's decoder_prompt by @sangbumlikeagod in #18809
  • [Model][3/N] Automatic conversion of CrossEncoding model by @noooop in #20168
  • [Doc] Fix classification table in list of supported models by @DarkLight1337 in #20489
  • [CI] add kvcache-connector dependency definition and add into CI build by @panpan0000 in #18193
  • [Misc] Small: Remove global media connector. Each test should have its own test connector object. by @huachenheli in #20395
  • Enable V1 for Hybrid SSM/Attention Models by @tdoublep in #20016
  • [feat]: CUTLASS block scaled group gemm for SM100 by @djmmoss in #19757
  • [CI Bugfix] Fix pre-commit failures on main by @mgoin in #20502
  • [Doc] fix mutltimodal_inputs.md gh examples link by @GuyStone in #20497
  • [Misc] Add security warning for development mode endpoints by @reidliu41 in #20508
  • [doc] small fix by @reidliu41 in #20506
  • [Misc] Remove the unused LoRA test code by @jeejeelee in #20494
  • Fix unknown attribute of topk_indices_dtype in CompressedTensorsW8A8Fp8MoECutlassMethod by @luccafong in #20507
  • [v1] Re-add fp32 support to v1 engine through FlexAttention by @Isotr0py in #19754
  • [Misc] Add logger.exception for TPU information collection failures by @reidliu41 in #20510
  • [Misc] remove unused import by @reidliu41 in #20517
  • test_attention compat with coming xformers change by @bottler in #20487
  • [BUG] Fix #20484. Support empty sequence in cuda penalty kernel by @vadiklyutiy in #20491
  • [Bugfix] Fix missing per_act_token parameter in compressed_tensors_moe by @luccafong in #20509
  • [BugFix] Fix: ImportError when building on hopper systems by @LucasWilkinson in #20513
  • [TPU][Bugfix] fix the MoE OOM issue by @yaochengji in #20339
  • [Frontend] Support image object in llm.chat by @sfeng33 in #19635
  • [Benchmark] Add support for multiple batch size benchmark through CLI in benchmark_moe.py + Add Triton Fused MoE kernel config for FP8 E=16 on B200 by @b8zhong in #20516
  • [Misc] call the pre-defined func by @reidliu41 in #20518
  • [V0 deprecation] Remove V0 CPU/XPU/TPU backends by @WoosukKwon in #20412
  • [V1] Support any head size for FlexAttention backend by @DarkLight1337 in #20467
  • [BugFix][Spec Decode] Fix spec token ids in model runner by @WoosukKwon in #20530
  • [Bugfix] Add use_cross_encoder flag to use correct activation in ClassifierPooler by @DarkLight1337 in #20527
  • Implement OpenAI Responses API [1/N] by @WoosukKwon in #20504
  • [Misc] add a tip for pre-commit by @reidliu41 in #20536
  • [Refactor]Abstract Platform Interface for Distributed Backend and Add xccl Support for Intel XPU by @dbyoung18 in #19410
  • [CI/Build] Enable phi2 lora test by @jeejeelee in #20540
  • [XPU][CI] add v1/core test in xpu hardware ci by @Liangliang-Ma in #20537
  • Add docstrings to url_schemes.py to improve readability by @windsonsea in #20545
  • [XPU] log clean up for XPU platform by @yma11 in #20553
  • [Docs] Clean up tables in supported_models.md by @windsonsea in #20552
  • [Misc] remove unused jinaai_serving_reranking by @Abirdcfly in #18878
  • [Misc] Set the minimum openai version by @jeejeelee in #20539
  • [Doc] Remove extra whitespace from CI failures doc by @hmellor in #20565
  • [Doc] Use gh-pr and gh-issue everywhere we can in the docs by @hmellor in #20564
  • [Doc] Fix internal links so they don't always point to latest by @hmellor in #20563
  • [Doc] Add outline for content tabs by @hmellor in #20571
  • [Doc] Fix some MkDocs snippets used in the installation docs by @hmellor in #20572
  • [Model][Last/4] Automatic conversion of CrossEncoding model by @noooop in #19675
  • [Bugfix] Prevent IndexError for cached requests when pipeline parallelism is disabled by @panpan0000 in #20486
  • [Feature] microbatch tokenization by @ztang2370 in #19334
  • [DP] Copy environment variables to Ray DPEngineCoreActors by @ruisearch42 in #20344
  • [Kernel] Optimize Prefill Attention in Unified Triton Attention Kernel by @jvlunteren in #20308
  • [Misc] Add fully interleaved support for multimodal 'string' content format by @Dekakhrone in #14047
  • [Misc] feat output content in stream response by @lengrongfu in #19608
  • Fix links in multi-modal model contributing page by @hmellor in #18615
  • [Config] Refactor mistral configs by @patrickvonplaten in #20570
  • [Misc] Improve logging for dynamic shape cache compilation by @kyolebu in #20573
  • [Bugfix] Fix Maverick correctness by filling zero to cache space in cutlass_moe by @minosfuture in #20167
  • [Optimize] Don't send token ids when kv connector is not used by @WoosukKwon in #20586
  • Make distinct code and console admonitions so readers are less likely to miss them by @hmellor in #20585
  • [Bugfix]: Fix messy code when using logprobs by @chaunceyjiang in #19209
  • [Doc] Syntax highlight request responses as JSON instead of bash by @hmellor in #20582
  • [Docs] Rewrite offline inference guide by @crypdick in #20594
  • [Docs] Improve docstring for ray data llm example by @crypdick in #20597
  • [Docs] Add Ray Serve LLM section to openai compatible server guide by @crypdick in #20595
  • [Docs] Add Anyscale to frameworks by @crypdick in #20590
  • [Misc] improve error msg by @reidliu41 in #20604
  • [CI/Build][CPU] Fix CPU CI and remove all CPU V0 files by @bigPYJ1151 in #20560
  • [TPU] Temporary fix vmem oom for long model len by reducing page size by @Chenyaaang in #20278
  • [Frontend] [Core] Integrate Tensorizer in to S3 loading machinery, allow passing arbitrary arguments during save/load by @sangstar in #19619
  • [PD][Nixl] Remote consumer READ timeout for clearing request blocks by @NickLucche in #20139
  • [Docs] Improve documentation for Deepseek R1 on Ray Serve LLM by @crypdick in #20601
  • Remove unnecessary explicit title anchors and use relative links instead by @hmellor in #20620
  • Stop using title frontmatter and fix doc that can only be ...

v0.9.2

07 Jul 17:05

Highlights

This release contains 452 commits from 167 contributors (31 new!)

NOTE: This is the last version where V0 engine code and features remain intact. We strongly recommend migrating to the V1 engine.

Engine Core

  • Priority scheduling is now implemented in the V1 engine (#19057), along with embedding models in V1 (#16188) and Mamba2 in V1 (#19327); a priority-scheduling sketch follows this list.
  • Full CUDA-graph execution is now available for all FlashAttention v3 (FA3) and FlashMLA paths, including prefix caching. CUDA graph capture now shows a live progress bar, which makes debugging easier (#20301, #18581, #19617, #19501).
  • FlexAttention update – any head size, FP32 fallback (#20467, #19754).
  • Shared CachedRequestData objects and cached sampler‑ID stores deliver perf enhancements (#20232, #20291).
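
A sketch of priority scheduling through the offline API: scheduling_policy="priority" and the per-request priority argument to LLM.generate (with lower values scheduled earlier) are assumptions about the exact knob names, not something these notes confirm.

```python
# Hedged sketch of V1 priority scheduling (#19057). scheduling_policy and the
# priority argument (lower = scheduled earlier) are assumptions about the
# exact knob names.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-0.6B", scheduling_policy="priority")
params = SamplingParams(max_tokens=16)

prompts = ["urgent: summarize this alert", "background: write a limerick"]
outputs = llm.generate(prompts, params, priority=[0, 10])
for out in outputs:
    print(out.outputs[0].text)
```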

Model Support

  • New families: Ernie 4.5 (+MoE) (#20220), MiniMax‑M1 (#19677, #20297), Slim‑MoE “Phi‑tiny‑MoE‑instruct” (#20286), Tencent HunYuan‑MoE‑V1 (#20114), Keye‑VL‑8B‑Preview (#20126), GLM‑4.1 V (#19331), Gemma‑3 (text‑only, #20134), Tarsier 2 (#19887), Qwen 3 Embedding & Reranker (#19260), dots1 (#18254), GPT‑2 for Sequence Classification (#19663).
  • Granite hybrid MoE configurations with shared experts are fully supported (#19652).

Large‑Scale Serving & Engine Improvements

  • Expert‑Parallel Load Balancer (EPLB) has been added! (#18343, #19790, #19885).
  • Disaggregated serving enhancements: avoid stranding blocks in the prefill instance when a request is aborted in the decode instance's waiting queue (#19223), and let the toy proxy handle /chat/completions (#19730).
  • Native xPyD P2P NCCL transport as a base case for native PD without external dependency (#18242, #20246).

Hardware & Performance

  • NVIDIA Blackwell
    • SM120: CUTLASS W8A8/FP8 kernels and related tuning, added to Dockerfile (#17280, #19566, #20071, #19794)
    • SM100: block‑scaled‑group GEMM, INT8/FP8 vectorization, deep‑GEMM kernels, activation‑chunking for MoE, and group‑size 64 for Machete (#19757, #19572, #19168, #19085, #20290, #20331).
  • Intel GPU (V1) backend with Flash‑Attention support (#19560).
  • AMD ROCm: full‑graph capture for TritonAttention, quick All‑Reduce, and chunked pre‑fill (#19158, #19744, #18596).
    • Split‑KV support landed in the unified Triton Attention kernel, boosting long‑context throughput (#19152).
    • Full‑graph mode enabled in ROCm AITER MLA V1 decode path (#20254).
  • TPU: dynamic‑grid KV‑cache updates, head‑dim less than 128, tuned paged‑attention kernels, and KV‑padding fixes (#19928, #20235, #19620, #19813, #20048, #20339).
    • Added a supported models and features matrix (#20230).

Quantization

  • Calibration‑free RTN INT4/INT8 pipeline for effortless, accurate compression (#18768).
  • Compressed‑Tensor NVFP4 (including MoE) + emulation; FP4 emulation removed on < SM100 devices (#19879, #19990, #19563).
  • Dynamic MoE‑layer quant (Marlin/GPTQ) and INT8 vectorization primitives (#19395, #20331, #19233).
  • Bits‑and‑Bytes 0.45+ with improved double‑quant logic and AWQ quality (#20424, #20033, #19431, #20076).

API · CLI · Frontend

  • API Server: Eliminated the api_key and x_request_id header middleware overhead (#19946).
  • New OpenAI‑compatible endpoints: /v1/audio/translations & revamped /v1/audio/transcriptions (#19615, #20179, #19597).
  • Token‑level progress bar for LLM.beam_search and cached template‑resolution speed‑ups (#19301, #20065).
  • Image‑object support in llm.chat, tool‑choice expansion, and custom‑arg passthroughs enrich multi‑modal agents (#19635, #17177, #16862).
  • CLI QoL: better parsing for -O/--compilation-config, batch‑size‑sweep benchmarking, richer --help, faster startup (#20156, #20516, #20430, #19941).
  • Metrics: Deprecated metrics with the gpu_ prefix for non-GPU-specific metrics (#18354); export NaNs in logits to scheduler_stats if output is corrupted (#18777).

Platform & Deployment

  • No‑privileged CPU / Docker / K8s mode (#19241) and custom default max‑tokens for hosted platforms (#18557).
  • Security hardening – runtime (cloud)pickle imports forbidden (#18018).
  • Hermetic builds and wheel slimming (FA2 8.0 + PTX only) shrink supply‑chain surface (#18064, #19336).

What's Changed


v0.9.2rc2

06 Jul 21:03
Pre-release

What's Changed

New Contributors

Full Changelog: v0.9.2rc1...v0.9.2rc2