From 7275d55c926217be373ffe3047757cf886f8a2ac Mon Sep 17 00:00:00 2001 From: ariG23498 Date: Mon, 21 Jul 2025 18:21:36 +0530 Subject: [PATCH 1/9] chore: update blog with vlm support Signed-off-by: ariG23498 --- _posts/2025-04-11-transformers-backend.md | 63 +++++++++++++++++++++++ 1 file changed, 63 insertions(+) diff --git a/_posts/2025-04-11-transformers-backend.md b/_posts/2025-04-11-transformers-backend.md index 84a08c3..5c25fdc 100644 --- a/_posts/2025-04-11-transformers-backend.md +++ b/_posts/2025-04-11-transformers-backend.md @@ -20,6 +20,69 @@ vLLM will therefore optimize throughput/latency on top of existing transformers In this post, we’ll explore how vLLM leverages the transformers backend to combine **flexibility** with **efficiency**, enabling you to deploy state-of-the-art models faster and smarter. +## Updates + +This section will hold all the updates that have been taken place over the course of the first release of the blog psot (11th April 2025). + +### Support for Vision Language Models (21st July 2025) + +vLLM with the transformers backend now supports Vision Langauge Models. Here is how one would use +the API. + +```python +from vllm import LLM, SamplingParams +from PIL import Image +import requests +from transformers import AutoProcessor + +model_id = "llava-hf/llava-onevision-qwen2-0.5b-ov-hf" +hf_processor = AutoProcessor.from_pretrained(model_id) # required to dynamically update the chat template + +messages = [ + { + "role": "user", + "content": [ + {"type": "image", "url": "dummy_image.jpg"}, + {"type": "text", "text": "What is the content of this image?"}, + ], + }, +] +prompt = hf_processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) +image = Image.open( + requests.get( + "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True + ).raw +) + +# initialize the vlm using the `model_impl="transformers"` +vlm = LLM( + model="llava-hf/llava-onevision-qwen2-0.5b-ov-hf", + model_impl="transformers", + disable_mm_preprocessor_cache=True, # we disable the mm preprocessor cache for the time being + enable_prefix_caching=False, + enable_chunked_prefill=False +) + +outputs = vlm.generate( + { + "prompt": prompt, + "multi_modal_data": {"image": image}, + }, + sampling_params=SamplingParams(max_tokens=100) +) + +for o in outputs: + generated_text = o.outputs[0].text + print(generated_text) + +# OUTPUTS: +# In the tranquil setting of this image, two feline companions are enjoying a peaceful slumber on a +# cozy pink couch. The couch, adorned with a plush red fabric across the seating area, serves as their perfect resting place. +# +# On the left side of the couch, a gray tabby cat is curled up at rest, its body relaxed in a display +# of feline serenity. One paw playfully stretches out, perhaps in mid-jump or simply exploring its surroundings. 
+``` + ## Transformers and vLLM: Inference in Action Let’s start with a simple text generation task using the `meta-llama/Llama-3.2-1B` model to see how From 09e614d217c9fe2ebf3351caee15b925c80653b8 Mon Sep 17 00:00:00 2001 From: Aritra Roy Gosthipaty Date: Tue, 22 Jul 2025 10:23:32 +0530 Subject: [PATCH 2/9] Update _posts/2025-04-11-transformers-backend.md Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: ariG23498 --- _posts/2025-04-11-transformers-backend.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_posts/2025-04-11-transformers-backend.md b/_posts/2025-04-11-transformers-backend.md index 5c25fdc..2063371 100644 --- a/_posts/2025-04-11-transformers-backend.md +++ b/_posts/2025-04-11-transformers-backend.md @@ -22,7 +22,7 @@ with **efficiency**, enabling you to deploy state-of-the-art models faster and s ## Updates -This section will hold all the updates that have been taken place over the course of the first release of the blog psot (11th April 2025). +This section will hold all the updates that have been taken place over the course of the first release of the blog post (11th April 2025). ### Support for Vision Language Models (21st July 2025) From 402cad6afd1687407cd03e228d84baad83099087 Mon Sep 17 00:00:00 2001 From: Aritra Roy Gosthipaty Date: Tue, 22 Jul 2025 10:24:00 +0530 Subject: [PATCH 3/9] Update _posts/2025-04-11-transformers-backend.md Co-authored-by: Sergio Paniego Blanco Signed-off-by: ariG23498 --- _posts/2025-04-11-transformers-backend.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_posts/2025-04-11-transformers-backend.md b/_posts/2025-04-11-transformers-backend.md index 2063371..5123909 100644 --- a/_posts/2025-04-11-transformers-backend.md +++ b/_posts/2025-04-11-transformers-backend.md @@ -26,7 +26,7 @@ This section will hold all the updates that have been taken place over the cours ### Support for Vision Language Models (21st July 2025) -vLLM with the transformers backend now supports Vision Langauge Models. Here is how one would use +vLLM with the transformers backend now supports Vision Language Models. Here is how one would use the API. ```python From d22035abc1cf2a90a22a0470f2c8fcdeaa6720ad Mon Sep 17 00:00:00 2001 From: ariG23498 Date: Tue, 22 Jul 2025 10:25:33 +0530 Subject: [PATCH 4/9] review suggestions Signed-off-by: ariG23498 --- _posts/2025-04-11-transformers-backend.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_posts/2025-04-11-transformers-backend.md b/_posts/2025-04-11-transformers-backend.md index 5123909..41823b4 100644 --- a/_posts/2025-04-11-transformers-backend.md +++ b/_posts/2025-04-11-transformers-backend.md @@ -22,7 +22,7 @@ with **efficiency**, enabling you to deploy state-of-the-art models faster and s ## Updates -This section will hold all the updates that have been taken place over the course of the first release of the blog post (11th April 2025). +This section will hold all the updates that have taken place since the blog post was first released (11th April 2025). 
 ### Support for Vision Language Models (21st July 2025)

From 3a0c98b7ace199d21c0ea918018679c041e9a622 Mon Sep 17 00:00:00 2001
From: ariG23498 
Date: Tue, 22 Jul 2025 10:30:10 +0530
Subject: [PATCH 5/9] vb's suggestions

Signed-off-by: ariG23498 
---
 _posts/2025-04-11-transformers-backend.md | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/_posts/2025-04-11-transformers-backend.md b/_posts/2025-04-11-transformers-backend.md
index 41823b4..3f13b40 100644
--- a/_posts/2025-04-11-transformers-backend.md
+++ b/_posts/2025-04-11-transformers-backend.md
@@ -26,8 +26,10 @@ This section will hold all the updates that have taken place since the blog post

 ### Support for Vision Language Models (21st July 2025)

-vLLM with the transformers backend now supports Vision Language Models. Here is how one would use
-the API.
+vLLM with the transformers backend now supports **Vision Language Models**. When the user adds `model_impl="transformers"`,
+the correct class for text-only or multimodal models will be deduced and loaded.
+
+Here is how one would use the API.

 ```python
 from vllm import LLM, SamplingParams

From e7178f6803ec13354c0544db47548c1fabb860f2 Mon Sep 17 00:00:00 2001
From: ariG23498 
Date: Tue, 22 Jul 2025 16:58:59 +0530
Subject: [PATCH 6/9] adding openai consumption and serving

Signed-off-by: ariG23498 
---
 _posts/2025-04-11-transformers-backend.md | 39 ++++++++++++++++++++++-
 1 file changed, 38 insertions(+), 1 deletion(-)

diff --git a/_posts/2025-04-11-transformers-backend.md b/_posts/2025-04-11-transformers-backend.md
index 3f13b40..88ab936 100644
--- a/_posts/2025-04-11-transformers-backend.md
+++ b/_posts/2025-04-11-transformers-backend.md
@@ -29,7 +29,44 @@ This section will hold all the updates that have taken place since the blog post
 vLLM with the transformers backend now supports **Vision Language Models**. When the user adds `model_impl="transformers"`,
 the correct class for text-only or multimodal models will be deduced and loaded.

-Here is how one would use the API.
+Here is how one can serve a multimodal model using the transformers backend.
+```bash
+vllm serve llava-hf/llava-onevision-qwen2-0.5b-ov-hf \
+--model_impl transformers \
+--disable-mm-preprocessor-cache \
+--no-enable-prefix-caching \
+--no-enable-chunked-prefill
+```
+
+To consume the model, one can use the `openai` client like so:
+```python
+from openai import OpenAI
+openai_api_key = "EMPTY"
+openai_api_base = "http://localhost:8000/v1"
+client = OpenAI(
+    api_key=openai_api_key,
+    base_url=openai_api_base,
+)
+chat_response = client.chat.completions.create(
+    model="llava-hf/llava-onevision-qwen2-0.5b-ov-hf",
+    messages=[{
+        "role": "user",
+        "content": [
+            {"type": "text", "text": "What's in this image?"},
+            {
+                "type": "image_url",
+                "image_url": {
+                    "url": "http://images.cocodataset.org/val2017/000000039769.jpg",
+                },
+            },
+        ],
+    }],
+)
+print("Chat response:", chat_response)
+```
+
+You can also directly initialize the vLLM engine using the `LLM` API. Here is the same model being
+served that way.
```python from vllm import LLM, SamplingParams From c6bef35f709e22d0be3f09643703371cc3e6f5f2 Mon Sep 17 00:00:00 2001 From: Aritra Roy Gosthipaty Date: Wed, 23 Jul 2025 07:16:32 +0530 Subject: [PATCH 7/9] Update _posts/2025-04-11-transformers-backend.md Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: ariG23498 --- _posts/2025-04-11-transformers-backend.md | 3 --- 1 file changed, 3 deletions(-) diff --git a/_posts/2025-04-11-transformers-backend.md b/_posts/2025-04-11-transformers-backend.md index 88ab936..9eeaaf2 100644 --- a/_posts/2025-04-11-transformers-backend.md +++ b/_posts/2025-04-11-transformers-backend.md @@ -97,9 +97,6 @@ image = Image.open( vlm = LLM( model="llava-hf/llava-onevision-qwen2-0.5b-ov-hf", model_impl="transformers", - disable_mm_preprocessor_cache=True, # we disable the mm preprocessor cache for the time being - enable_prefix_caching=False, - enable_chunked_prefill=False ) outputs = vlm.generate( From 58ca04c741b5ba76b4140bad6c2a17e2a703e342 Mon Sep 17 00:00:00 2001 From: Aritra Roy Gosthipaty Date: Wed, 23 Jul 2025 07:16:38 +0530 Subject: [PATCH 8/9] Update _posts/2025-04-11-transformers-backend.md Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: ariG23498 --- _posts/2025-04-11-transformers-backend.md | 3 --- 1 file changed, 3 deletions(-) diff --git a/_posts/2025-04-11-transformers-backend.md b/_posts/2025-04-11-transformers-backend.md index 9eeaaf2..88691b9 100644 --- a/_posts/2025-04-11-transformers-backend.md +++ b/_posts/2025-04-11-transformers-backend.md @@ -33,9 +33,6 @@ Here is how one can serve a multimodal model using the transformers backend. ```bash vllm serve llava-hf/llava-onevision-qwen2-0.5b-ov-hf \ --model_impl transformers \ ---disable-mm-preprocessor-cache \ ---no-enable-prefix-caching \ ---no-enable-chunked-prefill ``` To consume the model one can use the `openai` API like so: From 5492bbbc5a7982f391749b7eb077e2af1172299a Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Tue, 22 Jul 2025 12:37:14 +0100 Subject: [PATCH 9/9] Bump nokogiri from 1.18.8 to 1.18.9 (#62) Bumps [nokogiri](https://github.com/sparklemotion/nokogiri) from 1.18.8 to 1.18.9. - [Release notes](https://github.com/sparklemotion/nokogiri/releases) - [Changelog](https://github.com/sparklemotion/nokogiri/blob/main/CHANGELOG.md) - [Commits](https://github.com/sparklemotion/nokogiri/compare/v1.18.8...v1.18.9) --- updated-dependencies: - dependency-name: nokogiri dependency-version: 1.18.9 dependency-type: indirect ... Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Signed-off-by: ariG23498 --- Gemfile.lock | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/Gemfile.lock b/Gemfile.lock index 1c3abb9..6e67a2f 100644 --- a/Gemfile.lock +++ b/Gemfile.lock @@ -240,9 +240,9 @@ GEM minitest (5.25.4) net-http (0.6.0) uri - nokogiri (1.18.8-arm64-darwin) + nokogiri (1.18.9-arm64-darwin) racc (~> 1.4) - nokogiri (1.18.8-x86_64-linux-gnu) + nokogiri (1.18.9-x86_64-linux-gnu) racc (~> 1.4) octicons (19.15.1) octokit (4.25.1)