
Releases: vllm-project/vllm

v0.6.6

27 Dec 00:12
f49777b

Highlights

  • Support for the DeepSeek-V3 model (#11523, #11502).

    • On 8xH200s or MI300x: vllm serve deepseek-ai/DeepSeek-V3 --tensor-parallel-size 8 --trust-remote-code --max-model-len 8192. The context length can be increased to about 32K before running into memory issues.
    • For other devices, follow our distributed inference guide to enable tensor parallel and/or pipeline parallel inference.
    • We are just getting started on enhancing the support and unlocking more performance. See #11539 for planned work.
  • Last mile stretch for V1 engine refactoring: API Server (#11529, #11530), penalties for sampler (#10681), prefix caching for vision language models (#11187, #11305), TP Ray executor (#11107, #11472)

  • Breaking change: X-Request-ID echoing is now opt-in instead of on by default, for performance reasons. Set --enable-request-id-headers to enable it.
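
A minimal sketch of the new opt-in behavior, assuming an OpenAI-compatible server started locally with vllm serve <model> --enable-request-id-headers; the model name below is a placeholder:

    # Sketch: the server only echoes X-Request-ID back when it was started
    # with --enable-request-id-headers.
    import requests

    resp = requests.post(
        "http://localhost:8000/v1/completions",
        headers={"X-Request-ID": "my-trace-id-123"},
        json={
            "model": "meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
            "prompt": "Hello",
            "max_tokens": 8,
        },
    )
    print(resp.headers.get("X-Request-ID"))  # present only when the flag is set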

Model Support

  • IBM Granite 3.1 (#11307), JambaForSequenceClassification model (#10860)
  • Add QVQ and QwQ to the list of supported models (#11509)

Performance

  • Cutlass 2:4 Sparsity + FP8/INT8 Quant Support (#10995)

Production Engine

  • Support streaming models from S3 using the RunAI Model Streamer as an optional loader (#10192)
  • Online Pooling API (#11457)
  • Load video from base64 (#11492)
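
A minimal sketch of sending a base64-encoded video through the OpenAI-compatible chat API. This assumes a video-capable model is already being served on localhost:8000 and that the server accepts video_url content parts with data: URIs; the model name and file name below are placeholders:

    # Sketch: base64-encode a local video and pass it as a data: URI.
    # Assumptions: a video-capable model is served on localhost:8000 and the
    # video_url content type is accepted; model name and file are placeholders.
    import base64
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    with open("demo.mp4", "rb") as f:
        video_b64 = base64.b64encode(f.read()).decode("utf-8")

    resp = client.chat.completions.create(
        model="Qwen/Qwen2-VL-7B-Instruct",  # placeholder model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this clip."},
                {"type": "video_url",
                 "video_url": {"url": f"data:video/mp4;base64,{video_b64}"}},
            ],
        }],
    )
    print(resp.choices[0].message.content)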

Others

  • Add pypi index for every commit and nightly build (#11404)

What's Changed


v0.6.5

17 Dec 23:10
2d1b9ba

Highlights

Model Support

Hardware Support

Performance & Scheduling

  • Prefix-cache aware scheduling (#10128), sliding window support (#10462), disaggregated prefill enhancements (#10502, #10884), evictor optimization (#7209).

Benchmark & Frontend

Documentation & Plugins

Bugfixes & Misc

What's Changed


v0.6.4.post1

15 Nov 17:50
a6221a1

This patch release covers bug fixes (#10347, #10349, #10348, #10352, #10363) and keeps compatibility for vLLMConfig usage in out-of-tree models (#10356).

What's Changed

New Contributors

Full Changelog: v0.6.4...v0.6.4.post1

v0.6.4

15 Nov 07:32
02dbf30

Highlights

Model Support

  • New LLMs and VLMs: Idefics3 (#9767), H2OVL-Mississippi (#9747), Qwen2-Audio (#9248), Pixtral models in the HF Transformers format (#9036), FalconMamba (#9325), Florence-2 language backbone (#9555)
  • New encoder-decoder embedding models: BERT (#9056), RoBERTa & XLM-RoBERTa (#9387)
  • Expanded task support: Llama embeddings (#9806), Math-Shepherd (Mistral reward modeling) (#9697), Qwen2 classification (#9704), Qwen2 embeddings (#10184), VLM2Vec (Phi-3-Vision embeddings) (#9303), E5-V (LLaVA-NeXT embeddings) (#9576), Qwen2-VL embeddings (#9944)
    • Add user-configurable --task parameter for models that support both generation and embedding (#9424); see the example after this list
    • Chat-based Embeddings API (#9759)
  • Tool calling parser for Granite 3.0 (#9027), Jamba (#9154), granite-20b-functioncalling (#8339)
  • LoRA support for Granite 3.0 MoE (#9673), Idefics3 (#10281), Llama embeddings (#10071), Qwen (#9622), Qwen2-VL (#10022)
  • BNB quantization support for Idefics3 (#10310), Mllama (#9720), Qwen2 (#9467, #9574), MiniCPMV (#9891)
  • Unified multi-modal processor for VLM (#10040, #10044)
  • Simplify model interface (#9933, #10237, #9938, #9958, #10007, #9978, #9983, #10205)
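
A minimal sketch of the user-configurable task selection mentioned above: start the server with, for example, vllm serve intfloat/e5-mistral-7b-instruct --task embedding (the model name is a placeholder for any checkpoint that supports embedding), then query the OpenAI-compatible embeddings endpoint:

    # Sketch: assumes the embedding server above is running on localhost:8000;
    # the model name is a placeholder.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    out = client.embeddings.create(
        model="intfloat/e5-mistral-7b-instruct",
        input=["vLLM can now run the same checkpoint as a generator or an embedder."],
    )
    print(len(out.data[0].embedding))  # embedding dimensionality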

Hardware Support

  • Gaudi: Add Intel Gaudi (HPU) inference backend (#6143)
  • CPU: Add embedding models support for CPU backend (#10193)
  • TPU: Correctly profile peak memory usage & Upgrade PyTorch XLA (#9438)
  • Triton: Add Triton implementation for scaled_mm_triton to support fp8 and int8 SmoothQuant, symmetric case (#9857)

Performance

  • Combine chunked prefill with speculative decoding (#9291)
  • fused_moe Performance Improvement (#9384)

Engine Core

  • Override HF config.json via CLI (#5836)
  • Add goodput metric support (#9338)
  • Move parallel sampling out from the vLLM core, paving the way for the V1 engine (#9302)
  • Add stateless process group for easier integration with RLHF and disaggregated prefill (#10216, #10072)

Others

  • Improvements to the pull request experience with DCO, mergify, stale bot, etc. (#9436, #9512, #9513, #9259, #10082, #10285, #9803)
  • Dropped support for Python 3.8 (#10038, #8464)
  • Basic Integration Test For TPU (#9968)
  • Document the class hierarchy in vLLM (#10240), explain the integration with Hugging Face (#10173).
  • Benchmark throughput now supports image input (#9851)

What's Changed

  • [TPU] Fix TPU SMEM OOM by Pallas paged attention kernel by @WoosukKwon in #9350
  • [Frontend] merge beam search implementations by @LunrEclipse in #9296
  • [Model] Make llama3.2 support multiple and interleaved images by @xiangxu-google in #9095
  • [Bugfix] Clean up some cruft in mamba.py by @tlrmchlsmth in #9343
  • [Frontend] Clarify model_type error messages by @stevegrubb in #9345
  • [Doc] Fix code formatting in spec_decode.rst by @mgoin in #9348
  • [Bugfix] Update InternVL input mapper to support image embeds by @hhzhang16 in #9351
  • [BugFix] Fix chat API continuous usage stats by @njhill in #9357
  • pass ignore_eos parameter to all benchmark_serving calls by @gracehonv in #9349
  • [Misc] Directly use compressed-tensors for checkpoint definitions by @mgoin in #8909
  • [Bugfix] Fix vLLM UsageInfo and logprobs None AssertionError with empty token_ids by @CatherineSue in #9034
  • [Bugfix][CI/Build] Fix CUDA 11.8 Build by @LucasWilkinson in #9386
  • [Bugfix] Molmo text-only input bug fix by @mrsalehi in #9397
  • [Misc] Standardize RoPE handling for Qwen2-VL by @DarkLight1337 in #9250
  • [Model] VLM2Vec, the first multimodal embedding model in vLLM by @DarkLight1337 in #9303
  • [CI/Build] Test VLM embeddings by @DarkLight1337 in #9406
  • [Core] Rename input data types by @DarkLight1337 in #8688
  • [Misc] Consolidate example usage of OpenAI client for multimodal models by @ywang96 in #9412
  • [Model] Support SDPA attention for Molmo vision backbone by @Isotr0py in #9410
  • Support mistral interleaved attn by @patrickvonplaten in #9414
  • [Kernel][Model] Improve continuous batching for Jamba and Mamba by @mzusman in #9189
  • [Model][Bugfix] Add FATReLU activation and support for openbmb/MiniCPM-S-1B-sft by @streaver91 in #9396
  • [Performance][Spec Decode] Optimize ngram lookup performance by @LiuXiaoxuanPKU in #9333
  • [CI/Build] mypy: Resolve some errors from checking vllm/engine by @russellb in #9267
  • [Bugfix][Kernel] Prevent integer overflow in fp8 dynamic per-token quantize kernel by @tlrmchlsmth in #9425
  • [BugFix] [Kernel] Fix GPU SEGV occurring in int8 kernels by @rasmith in #9391
  • Add notes on the use of Slack by @terrytangyuan in #9442
  • [Kernel] Add Exllama as a backend for compressed-tensors by @LucasWilkinson in #9395
  • [Misc] Print stack trace using logger.exception by @DarkLight1337 in #9461
  • [misc] CUDA Time Layerwise Profiler by @LucasWilkinson in #8337
  • [Bugfix] Allow prefill of assistant response when using mistral_common by @sasha0552 in #9446
  • [TPU] Call torch._sync(param) during weight loading by @WoosukKwon in #9437
  • [Hardware][CPU] compressed-tensor INT8 W8A8 AZP support by @bigPYJ1151 in #9344
  • [Core] Deprecating block manager v1 and make block manager v2 default by @KuntaiDu in #8704
  • [CI/Build] remove .github from .dockerignore, add dirty repo check by @dtrifiro in #9375
  • [Misc] Remove commit id file by @DarkLight1337 in #9470
  • [torch.compile] Fine-grained CustomOp enabling mechanism by @ProExpertProg in #9300
  • [Bugfix] Fix support for dimension like integers and ScalarType by @bnellnm in #9299
  • [Bugfix] Add random_seed to sample_hf_requests in benchmark_serving script by @wukaixingxp in #9013
  • [Bugfix] Print warnings related to mistral_common tokenizer only once by @sasha0552 in #9468
  • [Hardwware][Neuron] Simplify model load for transformers-neuronx library by @sssrijan-amazon in #9380
  • Support BERTModel (first encoder-only embedding model) by @robertgshaw2-neuralmagic in #9056
  • [BugFix] Stop silent failures on compressed-tensors parsing by @dsikka in #9381
  • [Bugfix][Core] Use torch.cuda.memory_stats() to profile peak memory usage by @joerunde in #9352
  • [Qwen2.5] Support bnb quant for Qwen2.5 by @blueyo0 in #9467
  • [CI/Build] Use commit hash references for github actions by @russellb in #9430
  • [BugFix] Typing fixes to RequestOutput.prompt and beam search by @njhill in #9473
  • [Frontend][Feature] Add jamba tool parser by @tomeras91 in #9154
  • [BugFix] Fix and simplify completion API usage streaming by @njhill in #9475
  • [CI/Build] Fix lint errors in mistral tokenizer by @DarkLight1337 in #9504
  • [Bugfix] Fix offline_inference_with_prefix.py by @tlrmchlsmth in #9505
  • [Misc] benchmark: Add option to set max concurrency by @russellb in #9390
  • [Model] Add user-configurable task for models that support both generation and embedding by @DarkLight1337 in #9424
  • [CI/Build] Add error matching config for mypy by @russellb in #9512
  • [Model] Support Pixtral models ...

v0.6.3.post1

17 Oct 17:26
a2c71c5

Highlights

New Models

  • Support Ministral 3B and Ministral 8B via interleaved attention (#9414)
  • Support multiple and interleaved images for Llama3.2 (#9095)
  • Support VLM2Vec, the first multimodal embedding model in vLLM (#9303)

Important bug fixes

  • Fix chat API continuous usage stats (#9357)
  • Fix vLLM UsageInfo and logprobs None AssertionError with empty token_ids (#9034)
  • Fix Molmo text-only input bug (#9397)
  • Fix CUDA 11.8 Build (#9386)
  • Fix _version.py not found issue (#9375)

Other Enhancements

  • Remove block manager v1 and make block manager v2 default (#8704)
  • Optimize ngram lookup performance in spec decode (#9333)

What's Changed

New Contributors

  • @gracehonv made their first contribution in #9349
  • @streaver91 made their first contribution in #9396

Full Changelog: v0.6.3...v0.6.3.post1

v0.6.3

14 Oct 20:20
fd47e57

Highlights

Model Support

  • New Models:
  • Expansion in functionality:
    • Add Gemma2 embedding model (#9004)
    • Support input embeddings for qwen2vl (#8856), minicpmv (#9237)
    • LoRA:
      • LoRA support for MiniCPMV2.5 (#7199), MiniCPMV2.6 (#8943)
      • Expand lora modules for mixtral (#9008)
    • Pipeline parallelism support to remaining text and embedding models (#7168, #9090)
    • Expanded bitsandbytes quantization support for Falcon, OPT, Gemma, Gemma2, and Phi (#9148)
    • Tool use:
      • Add support for Llama 3.1 and 3.2 tool use (#8343)
      • Support tool calling for InternLM2.5 (#8405)
  • Out-of-tree support enhancements: explicit interface for vLLM models and support for OOT embedding models (#9108)

Documentation

  • New compatibility matrix for mutually exclusive features (#8512)
  • Reorganized installation docs; note that we publish a per-commit Docker image (#8931)

Hardware Support

  • Cross-attention and Encoder-Decoder models support on x86 CPU backend (#9089)
  • Support AWQ for CPU backend (#7515)
  • Add async output processor for xpu (#8897)
  • Add on-device sampling support for Neuron (#8746)

Architectural Enhancements

  • Progress in the refactoring of vLLM's core engine:
    • Spec decode removing batch expansion (#8839, #9298).
    • We have made block manager v2 the default. This is an internal refactoring for a cleaner and better-tested code path (#8678).
    • Moving beam search from the core to the API level (#9105, #9087, #9117, #8928)
    • Move guided decoding params into sampling params (#8252)
  • Torch Compile:
    • You can now set the env var VLLM_TORCH_COMPILE_LEVEL to control the level of torch.compile integration (#9058). Along with various improvements (#8982, #9258, #906, #8875), setting VLLM_TORCH_COMPILE_LEVEL=3 turns on Inductor's full-graph compilation without vLLM's custom ops.
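
A minimal offline sketch of the compilation knob mentioned above (the model name is a placeholder; equivalently, export the variable in the shell before launching vllm serve). Setting the variable in-process before the engine is created is assumed to be sufficient:

    # Sketch: pick the torch.compile level before constructing the engine.
    # Level 3 turns on Inductor full-graph compilation without vLLM's custom ops.
    import os
    os.environ["VLLM_TORCH_COMPILE_LEVEL"] = "3"

    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model name
    out = llm.generate(["Hello"], SamplingParams(max_tokens=8))
    print(out[0].outputs[0].text)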

Others

  • Performance enhancements to turn on multi-step scheduling by default (#8804, #8645, #8378)
  • Enhancements towards priority scheduling (#8965, #8956, #8850)

What's Changed


v0.6.2

25 Sep 21:50
7193774

Highlights

Model Support

  • Support Llama 3.2 models (#8811, #8822); see the query sketch after this list

     vllm serve meta-llama/Llama-3.2-11B-Vision-Instruct --enforce-eager --max-num-seqs 16
    
  • Beam search has been soft-deprecated. We are moving towards a version of beam search that is more performant and also simplifies vLLM's core. (#8684, #8763, #8713)

    • ⚠️ You will now see the following error; this is a breaking change!

      Using beam search as a sampling parameter is deprecated, and will be removed in the future release. Please use the vllm.LLM.use_beam_search method for dedicated beam search instead, or set the environment variable VLLM_ALLOW_DEPRECATED_BEAM_SEARCH=1 to suppress this error. For more details, see #8306

  • Support for Solar Model (#8386), minicpm3 (#8297), LLaVA-Onevision model support (#8486)

  • Enhancements: pp for qwen2-vl (#8696), multiple images for qwen-vl (#8247), mistral function calling (#8515), bitsandbytes support for Gemma2 (#8338), tensor parallelism with bitsandbytes quantization (#8434)
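
A minimal sketch of querying the Llama 3.2 vision server started with the command above (assumes it is running on localhost:8000; the image URL is a placeholder):

    # Sketch: standard OpenAI-style chat request with an image_url content part.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.2-11B-Vision-Instruct",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/cat.png"}},  # placeholder URL
            ],
        }],
    )
    print(resp.choices[0].message.content)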

Hardware Support

  • TPU: implement multi-step scheduling (#8489), use Ray for default distributed backend (#8389)
  • CPU: Enable mrope and support Qwen2-VL on CPU backend (#8770)
  • AMD: custom paged attention kernel for rocm (#8310), and fp8 kv cache support (#8577)

Production Engine

  • Initial support for priority scheduling (#5958)
  • Support LoRA lineage and base model metadata management (#6315)
  • Batch inference for llm.chat() API (#8648)
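
A minimal sketch of batched llm.chat(): passing a list of conversations runs them as a single batch (the model name is a placeholder for any chat model):

    # Sketch: batch inference with llm.chat(); the model name is a placeholder.
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
    conversations = [
        [{"role": "user", "content": "Write a haiku about GPUs."}],
        [{"role": "user", "content": "Explain paged attention in one sentence."}],
    ]
    for out in llm.chat(conversations, SamplingParams(max_tokens=64)):
        print(out.outputs[0].text)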

Performance

  • Introduce MQLLMEngine for the API Server, boosting throughput by 30% in single-step and 7% in multi-step scheduling (#8157, #8761, #8584)
  • Multi-step scheduling enhancements
    • Prompt logprobs support in Multi-step (#8199)
    • Add output streaming support to multi-step + async (#8335)
    • Add flashinfer backend (#7928)
  • Add cuda graph support during decoding for encoder-decoder models (#7631)

Others

  • Support sampling from HF datasets and image input for benchmark_serving (#8495)
  • Progress in torch.compile integration (#8488, #8480, #8384, #8526, #8445)

What's Changed


v0.6.1.post2

13 Sep 18:35
9ba0817

Highlights

  • This release contains an important bugfix related to token streaming combined with stop string (#8468)

What's Changed

Full Changelog: v0.6.1.post1...v0.6.1.post2

v0.6.1.post1

13 Sep 04:40
acda0b3

Highlights

This release features important bug fixes and enhancements for

  • Pixtral models. (#8415, #8425, #8399, #8431)
    • Chunked scheduling has been turned off for vision models. Please replace --max_num_batched_tokens 16384 with --max-model-len 16384
  • Multistep scheduling. (#8417, #7928, #8427)
  • Tool use. (#8423, #8366)

Also

  • Support multiple images for Qwen-VL (#8247)
  • Remove engine_use_ray (#8126)
  • Add an engine option to return only deltas or final output (#7381)
  • Add bitsandbytes support for Gemma2 (#8338)

What's Changed

New Contributors

Full Changelog: v0.6.1...v0.6.1.post1

v0.6.1

11 Sep 21:44
3fd2b0d

Highlights

Model Support

  • Added support for Pixtral (mistralai/Pixtral-12B-2409). (#8377, #8168)
  • Added support for Llava-Next-Video (#7559), Qwen-VL (#8029), Qwen2-VL (#7905)
  • Multi-input support for LLaVA (#8238), InternVL2 models (#8201)

Performance Enhancements

  • Memory optimization for awq_gemm and awq_dequantize, 2x throughput (#8248)

Production Engine

  • Support loading and unloading LoRA adapters in the API server (#6566); see the sketch after this list
  • Add progress reporting to batch runner (#8060)
  • Add support for NVIDIA ModelOpt static scaling checkpoints. (#6112)
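
A rough sketch of runtime LoRA loading in the API server, referenced above. The endpoint names and payload fields shown here are assumptions (they may differ by version), the adapter name and path are placeholders, and the server is assumed to have been started with LoRA support enabled:

    # Rough sketch; endpoint names and payload fields are assumptions, and the
    # adapter name/path are placeholders.
    import requests

    base = "http://localhost:8000"
    requests.post(f"{base}/v1/load_lora_adapter", json={
        "lora_name": "my_adapter",
        "lora_path": "/path/to/lora_adapter",
    })
    # ... send completion requests with model="my_adapter" ...
    requests.post(f"{base}/v1/unload_lora_adapter", json={"lora_name": "my_adapter"})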

Others

  • Update the docker image to use Python 3.12 for a small performance bump. (#8133)
  • Added CODE_OF_CONDUCT.md (#8161)

What's Changed

  • [Doc] [Misc] Create CODE_OF_CONDUCT.md by @mmcelaney in #8161
  • [bugfix] Upgrade minimum OpenAI version by @SolitaryThinker in #8169
  • [Misc] Clean up RoPE forward_native by @WoosukKwon in #8076
  • [ci] Mark LoRA test as soft-fail by @khluu in #8160
  • [Core/Bugfix] Add query dtype as per FlashInfer API requirements. by @elfiegg in #8173
  • [Doc] Add multi-image input example and update supported models by @DarkLight1337 in #8181
  • Inclusion of InternVLChatModel In PP_SUPPORTED_MODELS(Pipeline Parallelism) by @Manikandan-Thangaraj-ZS0321 in #7860
  • [MODEL] Qwen Multimodal Support (Qwen-VL / Qwen-VL-Chat) by @alex-jw-brooks in #8029
  • Move verify_marlin_supported to GPTQMarlinLinearMethod by @mgoin in #8165
  • [Documentation][Spec Decode] Add documentation about lossless guarantees in Speculative Decoding in vLLM by @sroy745 in #7962
  • [Core] Support load and unload LoRA in api server by @Jeffwan in #6566
  • [BugFix] Fix Granite model configuration by @njhill in #8216
  • [Frontend] Add --logprobs argument to benchmark_serving.py by @afeldman-nm in #8191
  • [Misc] Use ray[adag] dependency instead of cuda by @ruisearch42 in #7938
  • [CI/Build] Increasing timeout for multiproc worker tests by @alexeykondrat in #8203
  • [Kernel] [Triton] Memory optimization for awq_gemm and awq_dequantize, 2x throughput by @rasmith in #8248
  • [Misc] Remove SqueezeLLM by @dsikka in #8220
  • [Model] Allow loading from original Mistral format by @patrickvonplaten in #8168
  • [misc] [doc] [frontend] LLM torch profiler support by @SolitaryThinker in #7943
  • [Bugfix] Fix Hermes tool call chat template bug by @K-Mistele in #8256
  • [Model] Multi-input support for LLaVA and fix embedding inputs for multi-image models by @DarkLight1337 in #8238
  • Enable Random Prefix Caching in Serving Profiling Tool (benchmark_serving.py) by @wschin in #8241
  • [tpu][misc] fix typo by @youkaichao in #8260
  • [Bugfix] Fix broken OpenAI tensorizer test by @DarkLight1337 in #8258
  • [Model][VLM] Support multi-images inputs for InternVL2 models by @Isotr0py in #8201
  • [Model][VLM] Decouple weight loading logic for Paligemma by @Isotr0py in #8269
  • ppc64le: Dockerfile fixed, and a script for buildkite by @sumitd2 in #8026
  • [CI/Build] Use python 3.12 in cuda image by @joerunde in #8133
  • [Bugfix] Fix async postprocessor in case of preemption by @alexm-neuralmagic in #8267
  • [Bugfix] Streamed tool calls now more strictly follow OpenAI's format; ensures Vercel AI SDK compatibility by @K-Mistele in #8272
  • [Frontend] Add progress reporting to run_batch.py by @alugowski in #8060
  • [Bugfix] Correct adapter usage for cohere and jamba by @vladislavkruglikov in #8292
  • [Misc] GPTQ Activation Ordering by @kylesayrs in #8135
  • [Misc] Fused MoE Marlin support for GPTQ by @dsikka in #8217
  • Add NVIDIA Meetup slides, announce AMD meetup, and add contact info by @simon-mo in #8319
  • [Bugfix] Fix missing post_layernorm in CLIP by @DarkLight1337 in #8155
  • [CI/Build] enable ccache/scccache for HIP builds by @dtrifiro in #8327
  • [Frontend] Clean up type annotations for mistral tokenizer by @DarkLight1337 in #8314
  • [CI/Build] Enabling kernels tests for AMD, ignoring some of then that fail by @alexeykondrat in #8130
  • Fix ppc64le buildkite job by @sumitd2 in #8309
  • [Spec Decode] Move ops.advance_step to flash attn advance_step by @kevin314 in #8224
  • [Misc] remove peft as dependency for prompt models by @prashantgupta24 in #8162
  • [MISC] Keep chunked prefill enabled by default with long context when prefix caching is enabled by @comaniac in #8342
  • [Bugfix] Ensure multistep lookahead allocation is compatible with cuda graph max capture by @alexm-neuralmagic in #8340
  • [Core/Bugfix] pass VLLM_ATTENTION_BACKEND to ray workers by @SolitaryThinker in #8172
  • [CI/Build][Kernel] Update CUTLASS to 3.5.1 tag by @tlrmchlsmth in #8043
  • [Misc] Skip loading extra bias for Qwen2-MOE GPTQ models by @jeejeelee in #8329
  • [Bugfix] Fix InternVL2 vision embeddings process with pipeline parallel by @Isotr0py in #8299
  • [Hardware][NV] Add support for ModelOpt static scaling checkpoints. by @pavanimajety in #6112
  • [model] Support for Llava-Next-Video model by @TKONIY in #7559
  • [Frontend] Create ErrorResponse instead of raising exceptions in run_batch by @pooyadavoodi in #8347
  • [Model][VLM] Add Qwen2-VL model support by @fyabc in #7905
  • [Hardware][Intel] Support compressed-tensor W8A8 for CPU backend by @bigPYJ1151 in #7257
  • [CI/Build] Excluding test_moe.py from AMD Kernels tests for investigation by @alexeykondrat in #8373
  • [Bugfix] Add missing attributes in mistral tokenizer by @DarkLight1337 in #8364
  • [Kernel][Misc] Add meta functions for ops to prevent graph breaks by @bnellnm in #6917
  • [Misc] Move device options to a single place by @akx in #8322
  • [Speculative Decoding] Test refactor by @LiuXiaoxuanPKU in #8317
  • Pixtral by @patrickvonplaten in #8377
  • Bump version to v0.6.1 by @simon-mo in #8379

New Contributors

Full Changelog: v0.6.0...v0.6.1