
Releases: vllm-project/vllm

v0.10.0

24 Jul 22:43
6d8d0a2

Highlights

The v0.10.0 release includes 308 commits from 168 contributors (62 new!).

NOTE: This release begins the cleanup of the V0 engine codebase. We have removed the V0 CPU/XPU/TPU/HPU backends (#20412), long context LoRA (#21169), Prompt Adapters (#20588), Phi3-Small & BlockSparse Attention (#21217), and Spec Decode workers (#21152) so far, and plan to continue deleting code that is no longer used.

Model Support

  • New families: Llama 4 with EAGLE support (#20591), EXAONE 4.0 (#21060), Microsoft Phi-4-mini-flash-reasoning (#20702), Hunyuan V1 Dense + A13B with reasoning/tool parsing (#21368, #20625, #20820), Ling MoE models (#20680), JinaVL Reranker (#20260), Nemotron-Nano-VL-8B-V1 (#20349), Arcee (#21296), Voxtral (#20970).
  • Enhanced compatibility: BERT/RoBERTa with AutoWeightsLoader (#20534), HF format support for MiniMax (#20211), Gemini configuration (#20971), GLM-4 updates (#20736).
  • Architecture expansions: Attention-free model support (#20811), Hybrid SSM/Attention models on V1 (#20016), LlamaForSequenceClassification (#20807), expanded Mamba2 layer support (#20660).
  • VLM improvements: VLM support with transformers backend (#20543), PrithviMAE on V1 engine (#20577).

Engine Core

  • Experimental async scheduling: the --async-scheduling flag overlaps engine core scheduling with the GPU runner (#19970); see the sketch after this list.
  • V1 engine improvements: backend-agnostic local attention (#21093), MLA FlashInfer ragged prefill (#20034), hybrid KV cache with local chunked attention (#19351).
  • Multi-task support: models can now support multiple tasks (#20771), multiple poolers (#21227), and dynamic pooling parameter configuration (#21128).
  • RLHF Support: new RPC methods for runtime weight reloading (#20096) and config updates (#20095), logprobs mode for selecting which stage of logprobs to return (#21398).
  • Enhanced caching: multi-modal caching for transformers backend (#21358), reproducible prefix cache hashing using SHA-256 + CBOR (#20511).
  • Startup time reduction: CUDA graph capture is faster thanks to a frozen garbage collector (#21146).
  • Elastic expert parallel for dynamic GPU scaling while preserving state (#20775).
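
As a rough illustration, here is a minimal sketch of enabling the experimental async scheduler from the offline Python API, assuming the --async-scheduling flag maps to an engine argument of the same name (the model name is illustrative):

```python
# A minimal sketch (not the official example): enable experimental async
# scheduling from the offline API, assuming the --async-scheduling CLI flag
# maps to an engine argument of the same name.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-0.6B",   # illustrative model choice
    async_scheduling=True,     # assumed kwarg mirroring --async-scheduling
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```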

Hardware & Performance

  • NVIDIA Blackwell/SM100 optimizations: CUTLASS block scaled group GEMM for smaller batches (#20640), FP8 groupGEMM support (#20447), DeepGEMM integration (#20087), FlashInfer MoE blockscale FP8 backend (#20645), CUDNN prefill API for MLA (#20411), Triton Fused MoE kernel config for FP8 E=16 on B200 (#20516).
  • Performance improvements: 48% request duration reduction via microbatch tokenization for concurrent requests (#19334), fused MLA QKV + strided layernorm (#21116), Triton causal-conv1d for Mamba models (#18218).
  • Hardware expansion: ARM CPU int8 quantization (#14129), PPC64LE/ARM V1 support (#20554), Intel XPU ray distributed execution (#20659), shared-memory pipeline parallel for CPU (#21289), FlashInfer ARM CUDA support (#21013).

Quantization

  • New quantization support: MXFP4 for MoE models (#17888), BNB support for Mixtral and additional MoE models (#20893, #21100), in-flight quantization for MoE (#20061).
  • Hardware-specific: FP8 KV cache quantization on TPU (#19292), FP8 support for BatchedTritonExperts (#18864), optimized INT8 vectorization kernels (#20331).
  • Performance optimizations: Triton backend for DeepGEMM per-token group quantization (#20841), CUDA kernel for per-token group quantization (#21083), CustomOp abstraction for FP8 (#19830).

API & Frontend

  • OpenAI compatibility: Responses API implementation (#20504, #20975), image object support in llm.chat (#19635; see the sketch after this list), tool calling with required choice and $defs (#20629).
  • New endpoints: get_tokenizer_info for tokenizer/chat-template information (#20575), cache_salt support for completions/responses (#20981).
  • Model loading: Tensorizer S3 integration with arbitrary arguments (#19619), HF repo paths & URLs for GGUF models (#20793), tokenization_kwargs for embedding truncation (#21033).
  • CLI improvements: --help=page option for enhanced help documentation (#20961), default model changed to Qwen3-0.6B (#20335).
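
A minimal sketch of the multi-modal llm.chat path using the standard OpenAI-style message schema; the model name and image URL are illustrative, and the exact content key for passing in-memory image objects may differ by version:

```python
# A minimal sketch of multi-modal chat via LLM.chat with OpenAI-style messages.
# The model name and image URL are illustrative; the content key for passing
# in-memory image objects (the new capability) may differ by version, so the
# conservative image_url form is shown.
from vllm import LLM

llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct")
messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
    ],
}]
outputs = llm.chat(messages)
print(outputs[0].outputs[0].text)
```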

Dependencies

  • Updated PyTorch to 2.7.1 for CUDA (#21011)
  • FlashInfer updated to v0.2.8rc1 (#20718)

What's Changed


v0.10.0rc2

24 Jul 05:04
6d8d0a2
Pre-release

What's Changed


v0.10.0rc1

20 Jul 05:17
d1fb65b
Pre-release

What's Changed

  • [Kernel] Enable fp8 support for pplx and BatchedTritonExperts. by @bnellnm in #18864
  • [Misc] Fix Unable to detect current VLLM config. Defaulting to NHD kv cache layout warning by @NickLucche in #20400
  • [Bugfix] Register reducer even if transformers_modules not available by @eicherseiji in #19510
  • Change warn_for_unimplemented_methods to debug by @mgoin in #20455
  • [Platform] Add custom default max tokens by @gmarinho2 in #18557
  • Add ignore consolidated file in mistral example code by @princepride in #20420
  • [Misc] small update by @reidliu41 in #20462
  • [Structured Outputs][V1] Skipping with models doesn't contain tokenizers by @aarnphm in #20365
  • [Perf] Optimize Vectorization Utils for Int 8 Quantization Kernels by @yewentao256 in #20331
  • [Misc] Add SPDX-FileCopyrightText by @jeejeelee in #20428
  • Support Llama 4 for fused_marlin_moe by @mgoin in #20457
  • [Bug][Frontend] Fix structure of transcription's decoder_prompt by @sangbumlikeagod in #18809
  • [Model][3/N] Automatic conversion of CrossEncoding model by @noooop in #20168
  • [Doc] Fix classification table in list of supported models by @DarkLight1337 in #20489
  • [CI] add kvcache-connector dependency definition and add into CI build by @panpan0000 in #18193
  • [Misc] Small: Remove global media connector. Each test should have its own test connector object. by @huachenheli in #20395
  • Enable V1 for Hybrid SSM/Attention Models by @tdoublep in #20016
  • [feat]: CUTLASS block scaled group gemm for SM100 by @djmmoss in #19757
  • [CI Bugfix] Fix pre-commit failures on main by @mgoin in #20502
  • [Doc] fix mutltimodal_inputs.md gh examples link by @GuyStone in #20497
  • [Misc] Add security warning for development mode endpoints by @reidliu41 in #20508
  • [doc] small fix by @reidliu41 in #20506
  • [Misc] Remove the unused LoRA test code by @jeejeelee in #20494
  • Fix unknown attribute of topk_indices_dtype in CompressedTensorsW8A8Fp8MoECutlassMethod by @luccafong in #20507
  • [v1] Re-add fp32 support to v1 engine through FlexAttention by @Isotr0py in #19754
  • [Misc] Add logger.exception for TPU information collection failures by @reidliu41 in #20510
  • [Misc] remove unused import by @reidliu41 in #20517
  • test_attention compat with coming xformers change by @bottler in #20487
  • [BUG] Fix #20484. Support empty sequence in cuda penalty kernel by @vadiklyutiy in #20491
  • [Bugfix] Fix missing per_act_token parameter in compressed_tensors_moe by @luccafong in #20509
  • [BugFix] Fix: ImportError when building on hopper systems by @LucasWilkinson in #20513
  • [TPU][Bugfix] fix the MoE OOM issue by @yaochengji in #20339
  • [Frontend] Support image object in llm.chat by @sfeng33 in #19635
  • [Benchmark] Add support for multiple batch size benchmark through CLI in benchmark_moe.py + Add Triton Fused MoE kernel config for FP8 E=16 on B200 by @b8zhong in #20516
  • [Misc] call the pre-defined func by @reidliu41 in #20518
  • [V0 deprecation] Remove V0 CPU/XPU/TPU backends by @WoosukKwon in #20412
  • [V1] Support any head size for FlexAttention backend by @DarkLight1337 in #20467
  • [BugFix][Spec Decode] Fix spec token ids in model runner by @WoosukKwon in #20530
  • [Bugfix] Add use_cross_encoder flag to use correct activation in ClassifierPooler by @DarkLight1337 in #20527
  • Implement OpenAI Responses API [1/N] by @WoosukKwon in #20504
  • [Misc] add a tip for pre-commit by @reidliu41 in #20536
  • [Refactor]Abstract Platform Interface for Distributed Backend and Add xccl Support for Intel XPU by @dbyoung18 in #19410
  • [CI/Build] Enable phi2 lora test by @jeejeelee in #20540
  • [XPU][CI] add v1/core test in xpu hardware ci by @Liangliang-Ma in #20537
  • Add docstrings to url_schemes.py to improve readability by @windsonsea in #20545
  • [XPU] log clean up for XPU platform by @yma11 in #20553
  • [Docs] Clean up tables in supported_models.md by @windsonsea in #20552
  • [Misc] remove unused jinaai_serving_reranking by @Abirdcfly in #18878
  • [Misc] Set the minimum openai version by @jeejeelee in #20539
  • [Doc] Remove extra whitespace from CI failures doc by @hmellor in #20565
  • [Doc] Use gh-pr and gh-issue everywhere we can in the docs by @hmellor in #20564
  • [Doc] Fix internal links so they don't always point to latest by @hmellor in #20563
  • [Doc] Add outline for content tabs by @hmellor in #20571
  • [Doc] Fix some MkDocs snippets used in the installation docs by @hmellor in #20572
  • [Model][Last/4] Automatic conversion of CrossEncoding model by @noooop in #19675
  • [Bugfix] Prevent IndexError for cached requests when pipeline parallelism is disabled by @panpan0000 in #20486
  • [Feature] microbatch tokenization by @ztang2370 in #19334
  • [DP] Copy environment variables to Ray DPEngineCoreActors by @ruisearch42 in #20344
  • [Kernel] Optimize Prefill Attention in Unified Triton Attention Kernel by @jvlunteren in #20308
  • [Misc] Add fully interleaved support for multimodal 'string' content format by @Dekakhrone in #14047
  • [Misc] feat output content in stream response by @lengrongfu in #19608
  • Fix links in multi-modal model contributing page by @hmellor in #18615
  • [Config] Refactor mistral configs by @patrickvonplaten in #20570
  • [Misc] Improve logging for dynamic shape cache compilation by @kyolebu in #20573
  • [Bugfix] Fix Maverick correctness by filling zero to cache space in cutlass_moe by @minosfuture in #20167
  • [Optimize] Don't send token ids when kv connector is not used by @WoosukKwon in #20586
  • Make distinct code and console admonitions so readers are less likely to miss them by @hmellor in #20585
  • [Bugfix]: Fix messy code when using logprobs by @chaunceyjiang in #19209
  • [Doc] Syntax highlight request responses as JSON instead of bash by @hmellor in #20582
  • [Docs] Rewrite offline inference guide by @crypdick in #20594
  • [Docs] Improve docstring for ray data llm example by @crypdick in #20597
  • [Docs] Add Ray Serve LLM section to openai compatible server guide by @crypdick in #20595
  • [Docs] Add Anyscale to frameworks by @crypdick in #20590
  • [Misc] improve error msg by @reidliu41 in #20604
  • [CI/Build][CPU] Fix CPU CI and remove all CPU V0 files by @bigPYJ1151 in #20560
  • [TPU] Temporary fix vmem oom for long model len by reducing page size by @Chenyaaang in #20278
  • [Frontend] [Core] Integrate Tensorizer in to S3 loading machinery, allow passing arbitrary arguments during save/load by @sangstar in #19619
  • [PD][Nixl] Remote consumer READ timeout for clearing request blocks by @NickLucche in #20139
  • [Docs] Improve documentation for Deepseek R1 on Ray Serve LLM by @crypdick in #20601
  • Remove unnecessary explicit title anchors and use relative links instead by @hmellor in #20620
  • Stop using title frontmatter and fix doc that can only be ...

v0.9.2

07 Jul 17:05

Highlights

This release contains 452 commits from 167 contributors (31 new!).

NOTE: This is the last version where V0 engine code and features stay intact. We highly recommend migrating to the V1 engine.

Engine Core

  • Priority scheduling is now implemented in the V1 engine (#19057; see the sketch after this list), embedding models in V1 (#16188), Mamba2 in V1 (#19327).
  • Full CUDA-Graph execution is now available for all FlashAttention v3 (FA3) and FlashMLA paths, including prefix caching. CUDA graph capture now shows a live progress bar, which makes debugging easier (#20301, #18581, #19617, #19501).
  • FlexAttention update – any head size, FP32 fallback (#20467, #19754).
  • Shared CachedRequestData objects and cached sampler‑ID stores deliver perf enhancements (#20232, #20291).
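
A minimal sketch of V1 priority scheduling from the offline API; the scheduling_policy engine argument and the per-request priority argument to generate() are assumptions, so check the scheduler docs for the exact names and ordering semantics:

```python
# A minimal sketch of V1 priority scheduling from the offline API. Both the
# scheduling_policy engine argument and the per-request priority argument to
# generate() are assumptions; check the scheduler docs for exact names and
# ordering semantics in your version.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", scheduling_policy="priority")  # assumed kwarg

prompts = ["Summarize vLLM in one sentence.", "Explain paged attention briefly."]
outputs = llm.generate(
    prompts,
    SamplingParams(max_tokens=32),
    priority=[1, 0],  # assumed per-request priorities, one per prompt
)
```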

Model Support

  • New families: Ernie 4.5 (+MoE) (#20220), MiniMax‑M1 (#19677, #20297), Slim‑MoE “Phi‑tiny‑MoE‑instruct” (#20286), Tencent HunYuan‑MoE‑V1 (#20114), Keye‑VL‑8B‑Preview (#20126), GLM‑4.1 V (#19331), Gemma‑3 (text‑only, #20134), Tarsier 2 (#19887), Qwen 3 Embedding & Reranker (#19260), dots1 (#18254), GPT‑2 for Sequence Classification (#19663).
  • Granite hybrid MoE configurations with shared experts are fully supported (#19652).

Large‑Scale Serving & Engine Improvements

  • Expert‑Parallel Load Balancer (EPLB) has been added! (#18343, #19790, #19885).
  • Disaggregated serving enhancements: avoid stranding blocks in the prefill instance (P) when a request is aborted in the decode instance's (D) waiting queue (#19223), and let the toy proxy handle /chat/completions (#19730).
  • Native xPyD P2P NCCL transport as a base case for native prefill/decode (PD) disaggregation without external dependencies (#18242, #20246).

Hardware & Performance

  • NVIDIA Blackwell
    • SM120: CUTLASS W8A8/FP8 kernels and related tuning, added to Dockerfile (#17280, #19566, #20071, #19794)
    • SM100: block‑scaled‑group GEMM, INT8/FP8 vectorization, deep‑GEMM kernels, activation‑chunking for MoE, and group‑size 64 for Machete (#19757, #19572, #19168, #19085, #20290, #20331).
  • Intel GPU (V1) backend with Flash‑Attention support (#19560).
  • AMD ROCm: full‑graph capture for TritonAttention, quick All‑Reduce, and chunked pre‑fill (#19158, #19744, #18596).
    • Split‑KV support landed in the unified Triton Attention kernel, boosting long‑context throughput (#19152).
    • Full‑graph mode enabled in ROCm AITER MLA V1 decode path (#20254).
  • TPU: dynamic‑grid KV‑cache updates, head‑dim less than 128, tuned paged‑attention kernels, and KV‑padding fixes (#19928, #20235, #19620, #19813, #20048, #20339).
    • Added a supported models and features matrix (#20230).

Quantization

  • Calibration‑free RTN INT4/INT8 pipeline for effortless, accurate compression (#18768).
  • Compressed‑Tensor NVFP4 (including MoE) + emulation; FP4 emulation removed on < SM100 devices (#19879, #19990, #19563).
  • Dynamic MoE‑layer quant (Marlin/GPTQ) and INT8 vectorization primitives (#19395, #20331, #19233).
  • Bits-and-Bytes 0.45+ with improved double-quant logic and AWQ quality (#20424, #20033, #19431, #20076).

API · CLI · Frontend

  • API Server: eliminate middleware overhead from the api_key and x_request_id headers (#19946).
  • New OpenAI-compatible endpoints: /v1/audio/translations & a revamped /v1/audio/transcriptions (#19615, #20179, #19597); see the sketch after this list.
  • Token‑level progress bar for LLM.beam_search and cached template‑resolution speed‑ups (#19301, #20065).
  • Image‑object support in llm.chat, tool‑choice expansion, and custom‑arg passthroughs enrich multi‑modal agents (#19635, #17177, #16862).
  • CLI QoL: better parsing for -O/--compilation-config, batch‑size‑sweep benchmarking, richer --help, faster startup (#20156, #20516, #20430, #19941).
  • Metrics: deprecate metrics with the gpu_ prefix for non-GPU-specific metrics (#18354); export NaNs in logits to scheduler_stats if output is corrupted (#18777).
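
A minimal sketch of the OpenAI-compatible audio transcription endpoint against a running vLLM server; the server address, model name, and audio file are illustrative assumptions:

```python
# A minimal sketch of the OpenAI-compatible audio endpoints against a running
# vLLM server (e.g. one serving a Whisper-style model). The server address,
# model name, and audio file are illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("sample.wav", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="openai/whisper-large-v3",
        file=audio_file,
    )
print(transcription.text)
```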

Platform & Deployment

  • Non-privileged CPU / Docker / K8s mode (#19241) and custom default max-tokens for hosted platforms (#18557).
  • Security hardening – runtime (cloud)pickle imports forbidden (#18018).
  • Hermetic builds and wheel slimming (FA2 8.0 + PTX only) shrink supply‑chain surface (#18064, #19336).

What's Changed


v0.9.2rc2

06 Jul 21:03
Pre-release

What's Changed

New Contributors

Full Changelog: v0.9.2rc1...v0.9.2rc2

v0.9.2rc1

03 Jul 21:54
2f2fcb3
Pre-release

What's Changed


v0.9.1

10 Jun 18:30
b6553be

Highlights

This release features 274 commits from 123 contributors (27 new contributors!)

  • Progress in large scale serving
    • DP Attention + Expert Parallelism: CUDA graph support (#18724), DeepEP dispatch-combine kernel (#18434), batched/masked DeepGEMM kernel (#19111), CUTLASS MoE kernel with PPLX (#18762)
    • Heterogeneous TP (#18833), NixlConnector: enable the FlashInfer backend (#19090)
    • DP: API-server scaleout with many-to-many server-engine comms (#17546), Support DP with Ray (#18779), allow AsyncLLMEngine.generate to target a specific DP rank (#19102), data parallel rank to KVEventBatch (#18925)
    • Tooling: Simplify EP kernels installation (#19412)
  • RLHF workflow: Support inplace model weights loading (#18745)
  • Initial full support for Hybrid Memory Allocator (#17996), support cross-layer KV sharing (#18212)
  • Add FlexAttention to vLLM V1 (#16078)
  • Various production hardening fixes related to full CUDA graph mode (#19171, #19106, #19321)

Model Support

  • Support Magistral (#19193), LoRA support for InternVL (#18842), MiniCPM EAGLE support (#18943), NemotronH support (#18863, #19249)
  • Enable data parallel for Llama4 vision encoder (#18368)
  • Add DeepSeek-R1-0528 function call chat template (#18874)

Hardware Support & Performance Optimizations

  • Add H20-3e fused MoE kernel tuning configs for DeepSeek-R1/V3 (#19205), Qwen3-235B-A22B (#19315)
  • Blackwell: Add Cutlass MLA backend (#17625), Tunings for SM100 FP8 CUTLASS kernel (#18778), Use FlashInfer by default on Blackwell GPUs (#19118), Tune scaled_fp8_quant by increasing vectorization (#18844)
  • FP4: Add compressed-tensors NVFP4 support (#18312), FP4 MoE kernel optimization (#19110)
  • CPU: V1 support for the CPU backend (#16441)
  • ROCm: Add AITER grouped topk for DeepSeekV2 (#18825)
  • POWER: Add IBM POWER11 Support to CPU Extension Detection (#19082)
  • TPU: Initial support of model parallelism with single worker using SPMD (#18011), Multi-LoRA Optimizations for the V1 TPU backend (#15655)
  • Neuron: Add multi-LoRA support for Neuron (#18284), Add Multi-Modal model support for Neuron (#18921), Support quantization on Neuron (#18283)
  • Platform: Make torch distributed process group extendable (#18763)

Engine features

  • Add Lora Support to Beam Search (#18346)
  • Add rerank support to run_batch endpoint (#16278)
  • CLI: add run batch (#18804)
  • Server: custom logging (#18403), allowed_token_ids in ChatCompletionRequest (#19143)
  • LLM API: make use_tqdm accept a callable for custom progress bars (#19357); see the sketch after this list.
  • Perf: CUDA kernel for applying repetition penalty in the sampler (#18437)
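
A minimal sketch of use_tqdm accepting a callable; the partial shown is an illustrative customization and any tqdm-compatible factory should work:

```python
# A minimal sketch of use_tqdm accepting a callable (#19357): pass a factory
# that customizes the progress bar. The partial below is illustrative; any
# tqdm-compatible factory should work.
from functools import partial

from tqdm import tqdm
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # illustrative model choice
prompts = [f"Tell me a fact about the number {i}." for i in range(8)]

outputs = llm.generate(
    prompts,
    SamplingParams(max_tokens=32),
    use_tqdm=partial(tqdm, desc="generating", mininterval=0.5),
)
```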

API Deprecations

  • Disallow pos-args other than model when initializing LLM (#18802)
  • Remove inputs arg fallback in Engine classes (#18799)
  • Remove fallbacks for Embeddings API (#18795)
  • Remove mean pooling default for Qwen2EmbeddingModel (#18913)
  • Require overriding get_dummy_text and get_dummy_mm_data (#18796)
  • Remove metrics that were deprecated in 0.8 (#18837)

Documentation

  • Add CLI doc (#18871)
  • Update SECURITY.md with link to our security guide (#18961), Add security warning to bug report template (#19365)

What's Changed


v0.9.1rc1

09 Jun 23:48
3a7cd62
Pre-release

What's Changed


v0.9.0.1

30 May 16:11

This patch release contains an important bugfix for the DeepSeek family of models on NVIDIA Ampere and earlier GPUs (#18807).

Full Changelog: v0.9.0...v0.9.0.1

v0.9.0

15 May 03:38
5873877

Highlights

This release features 649 commits from 215 contributors (82 new contributors!)

  • vLLM has upgraded to PyTorch 2.7! (#16859) This is a breaking change for environment dependencies.
    • The default wheel has been upgraded from CUDA 12.4 to CUDA 12.8. We will distribute the CUDA 12.6 wheel as a GitHub artifact.
    • As a general rule of thumb, our CUDA version policy follows PyTorch's CUDA version policy.
  • Enhanced NVIDIA Blackwell support. vLLM now ships with an initial set of optimized kernels for NVIDIA Blackwell, covering both attention and MLP.
    • You can use our Docker image, or install the FlashInfer nightly wheel (pip install https://download.pytorch.org/whl/cu128/flashinfer/flashinfer_python-0.2.5%2Bcu128torch2.7-cp38-abi3-linux_x86_64.whl) and then set VLLM_ATTENTION_BACKEND=FLASHINFER for better performance.
    • Upgraded support for the new FlashInfer main branch. (#15777)
    • Please check out #18153 for the full roadmap
  • Initial DP, EP, PD support for large scale inference
    • EP:
      • Permute and unpermute kernel for moe optimization (#14568)
      • Modularize fused experts and integrate PPLX kernels (#15956)
      • Refactor pplx init logic to make it modular (prepare for deepep) (#18200)
      • Add ep group and all2all interface (#18077)
    • DP:
      • Decouple engine process management and comms (#15977)
    • PD:
      • NIXL Integration (#17751)
      • Local attention optimization for NIXL (#18170)
      • Support multiple kv connectors (#17564)
  • Migrate docs from Sphinx to MkDocs (#18145, #18610, #18614, #18616, #18622, #18626, #18627, #18635, #18637, #18657, #18663, #18666, #18713)

Notable Changes

  • Removal of CUDA 12.4 support due to PyTorch upgrade to 2.7.
  • top_k is now disabled with 0 (-1 is still accepted for now) (#17773)
  • The seed is now set to 0 by default for the V1 engine, meaning that different vLLM runs now yield the same outputs even if temperature > 0. This does not modify the random state in user code, since workers run in separate processes unless VLLM_USE_V1_MULTIPROCESSING=0. (#17929, #18741) See the sketch after this list.
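
A minimal sketch of both behavior changes from the offline API; the model name is illustrative:

```python
# A minimal sketch of both behavior changes from the offline API; the model
# name is illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", seed=1234)  # override the new default seed of 0

params = SamplingParams(
    temperature=0.8,
    top_k=0,       # 0 now means "top-k disabled" (-1 is still accepted for now)
    max_tokens=64,
)
print(llm.generate(["Write a haiku about GPUs."], params)[0].outputs[0].text)
```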

Model Enhancements

  • Support MiMo-7B (#17433), MiniMax-VL-01 (#16328), Ovis 1.6 (#17861), Ovis 2 (#15826), GraniteMoeHybrid 4.0 (#17497), FalconH1* (#18406), LlamaGuard4 (#17315)
    • Please install the development version of transformers (from source) to use Falcon-H1.
  • Embedding models: nomic-embed-text-v2-moe (#17785), new class of gte models (#17986)
  • Progress in Hybrid Memory Allocator (#17394, #17479, #17474, #17483, #17193, #17946, #17945, #17999, #18001, #18593)
  • DeepSeek: perf enhancement by moving more calls into the cuda-graph region (#17484, #17668), Function Call (#17784), MTP in V1 (#18435)
  • Qwen2.5-1M: Implements dual-chunk-flash-attn backend for dual chunk attention with sparse attention support (#11844)
  • Qwen2.5-VL speed enhancement via rotary_emb optimization (#17973)
  • InternVL models with Qwen2.5 backbone now support video inputs (#18499)

Performance, Production and Scaling

  • Support full cuda graph in v1 (#16072)
  • Pipeline Parallelism: MultiprocExecutor support (#14219), torchrun (#17827)
  • Support sequence parallelism combined with pipeline parallelism (#18243)
  • Async tensor parallelism using compilation pass (#17882)
  • Perf: Use small max_num_batched_tokens for A100 (#17885)
  • Fast Model Loading: Tensorizer support for V1 and LoRA (#17926)
  • Multi-modality: Automatically cast multi-modal input dtype before transferring to device (#18756)

Security

  • Prevent side-channel attacks via cache salting (#17045); see the sketch after this list.
  • Fix image hash collision in certain edge cases (#17378)
  • Add VLLM_ALLOW_INSECURE_SERIALIZATION env var (#17490)
  • Migrate to REGEX Library to prevent catastrophic backtracking (#18454, #18750)
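
A minimal sketch of cache salting through the OpenAI-compatible server; passing the salt via extra_body and the cache_salt field name are assumptions consistent with the cache_salt support noted in the v0.10.0 notes:

```python
# A minimal sketch of cache salting against the OpenAI-compatible server.
# Passing the salt via extra_body and the cache_salt field name are assumptions
# (consistent with the cache_salt support noted in the v0.10.0 release notes);
# requests sharing a salt can share cached prefixes, others cannot.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",            # illustrative model name
    messages=[{"role": "user", "content": "Hello!"}],
    extra_body={"cache_salt": "tenant-a-secret-salt"},   # per-tenant isolation
)
print(resp.choices[0].message.content)
```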

Features

  • CLI: deprecated=True (#17426)
  • Frontend: progress bar for adding requests (#17525), chat_template_kwargs in LLM.chat (#17356; see the sketch after this list), /classify endpoint (#17032), truncation control for embedding models (#14776), cached_tokens in response usage (#18149)
  • LoRA: default local directory LoRA resolver plugin. (#16855)
  • Metrics: kv event publishing (#16750), API for accessing in-memory Prometheus metrics (#17010)
  • Quantization: nvidia/DeepSeek-R1-FP4 (#16362), Quark MXFP4 format (#16943), AutoRound (#17850), torchao models with AOPerModuleConfig (#17826), CUDA Graph support for GGUF models on V1 (#18646)
  • Reasoning: deprecate --enable-reasoning (#17452)
  • Spec Decode: EAGLE share input embedding (#17326), torch.compile & cudagraph to EAGLE (#17211), EAGLE3 (#17504), log accumulated metrics (#17913), Medusa (#17956)
  • Structured Outputs: Thinking compatibility (#16577), Spec Decoding (#14702), Qwen3 reasoning parser (#17466), tool_choice: required for Xgrammar (#17845), Structural Tag with Guidance backend (#17333)
  • Transformers backend: named parameters (#16868), interleaved sliding window attention (#18494)
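
A minimal sketch of chat_template_kwargs in LLM.chat; enable_thinking is an example key used by Qwen3-style chat templates, and accepted keys depend on the model's chat template:

```python
# A minimal sketch of chat_template_kwargs in LLM.chat (#17356). The
# enable_thinking key is an example used by Qwen3-style chat templates; accepted
# keys depend entirely on the model's chat template.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-8B")  # illustrative model with a thinking-mode template
messages = [{"role": "user", "content": "Give me a one-line summary of vLLM."}]

outputs = llm.chat(
    messages,
    SamplingParams(max_tokens=128),
    chat_template_kwargs={"enable_thinking": False},  # forwarded to the chat template
)
print(outputs[0].outputs[0].text)
```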

Hardware

  • NVIDIA: cutlass support for blackwell fp8 blockwise gemm (#14383)
  • TPU: Multi-LoRA implementation (#14238), default max-num-batched-tokens (#17508), V1 backend by default (#17673), top-logprobs (#17072)
  • Neuron: NeuronxDistributedInference support (#15970), Speculative Decoding, Dynamic on-device sampling (#16357), Mistral Model (#18222), Multi-LoRA (#18284)
  • AMD: Enable FP8 KV cache on V1 (#17870), Tuned fused moe config for Qwen3 MoE on MI300X (#17535, #17530), AITER biased group topk (#17955), Block-Scaled GEMM (#14968), MLA (#17523), Radeon GPU use Custom Paged Attention (#17004), reduce the number of environment variables in command line (#17229)
  • Extensibility: Make PiecewiseBackend pluggable and extendable (#18076)

Documentation

  • Update quickstart and install for cu128 using --torch-backend=auto (#18505)
  • NVIDIA TensorRT Model Optimizer (#17561)
  • Usage of Qwen3 thinking (#18291)

Developer Facing

What's Changed
