
Commit 1b57f32

Authored by ariG23498, hmellor, sergiopaniego, and dependabot[bot]
[Update] transformers backend with VLM support (#61)
Signed-off-by: ariG23498 <[email protected]>
Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Harry Mellor <[email protected]>
Co-authored-by: Sergio Paniego Blanco <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
1 parent cd09940 commit 1b57f32

File tree


_posts/2025-04-11-transformers-backend.md

Lines changed: 96 additions & 0 deletions
@@ -20,6 +20,102 @@ vLLM will therefore optimize throughput/latency on top of existing transformers
In this post, we’ll explore how vLLM leverages the transformers backend to combine **flexibility**
with **efficiency**, enabling you to deploy state-of-the-art models faster and smarter.

## Updates

This section collects all the updates that have taken place since the blog post was first released (11th April 2025).

### Support for Vision Language Models (21st July 2025)

vLLM with the transformers backend now supports **Vision Language Models**. When a user adds `model_impl="transformers"`,
the correct class for text-only or multimodal models is deduced and loaded automatically.

Here is how one can serve a multimodal model using the transformers backend:

```bash
vllm serve llava-hf/llava-onevision-qwen2-0.5b-ov-hf \
  --model_impl transformers
```
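
Once the server is up, you can sanity-check that the model has been registered by querying the OpenAI-compatible `/v1/models` endpoint. A minimal check, assuming the server runs on the default host and port:

```python
import requests

# List the models exposed by the OpenAI-compatible server (default host/port assumed).
response = requests.get("http://localhost:8000/v1/models")
response.raise_for_status()

for model in response.json()["data"]:
    print(model["id"])  # expected to include llava-hf/llava-onevision-qwen2-0.5b-ov-hf
```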

To consume the model, one can use the `openai` Python client like so:

```python
from openai import OpenAI

# A placeholder key; the local vLLM server does not require a real OpenAI key by default.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="llava-hf/llava-onevision-qwen2-0.5b-ov-hf",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": "http://images.cocodataset.org/val2017/000000039769.jpg",
                },
            },
        ],
    }],
)
print("Chat response:", chat_response)
```
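
If you only need the generated text rather than the full response object, it sits on the first choice of the chat completion (standard `openai` client response shape):

```python
# Extract just the assistant's reply from the chat completion.
print(chat_response.choices[0].message.content)
```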

You can also initialize the vLLM engine directly using the `LLM` API. Here is the same model being
served offline through that API.

```python
from vllm import LLM, SamplingParams
from PIL import Image
import requests
from transformers import AutoProcessor

model_id = "llava-hf/llava-onevision-qwen2-0.5b-ov-hf"
hf_processor = AutoProcessor.from_pretrained(model_id)  # required to dynamically update the chat template

messages = [
    {
        "role": "user",
        "content": [
            # The image entry only places an image token in the rendered prompt;
            # the real image is passed to vLLM below via `multi_modal_data`.
            {"type": "image", "url": "dummy_image.jpg"},
            {"type": "text", "text": "What is the content of this image?"},
        ],
    },
]
prompt = hf_processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image = Image.open(
    requests.get(
        "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True
    ).raw
)

# Initialize the VLM using `model_impl="transformers"`.
vlm = LLM(
    model=model_id,
    model_impl="transformers",
)

outputs = vlm.generate(
    {
        "prompt": prompt,
        "multi_modal_data": {"image": image},
    },
    sampling_params=SamplingParams(max_tokens=100),
)

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)

# OUTPUTS:
# In the tranquil setting of this image, two feline companions are enjoying a peaceful slumber on a
# cozy pink couch. The couch, adorned with a plush red fabric across the seating area, serves as their perfect resting place.
#
# On the left side of the couch, a gray tabby cat is curled up at rest, its body relaxed in a display
# of feline serenity. One paw playfully stretches out, perhaps in mid-jump or simply exploring its surroundings.
```
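
Because the backend deduces the correct class from the model itself, the same `model_impl="transformers"` flag also covers text-only models. A minimal sketch, using `meta-llama/Llama-3.2-1B` (the model featured in the next section) as an assumed example:

```python
from vllm import LLM, SamplingParams

# Text-only model served through the transformers backend; the text-generation
# class is deduced automatically, just like the multimodal case above.
llm = LLM(model="meta-llama/Llama-3.2-1B", model_impl="transformers")

outputs = llm.generate(
    "The future of AI is",
    sampling_params=SamplingParams(max_tokens=30),
)
print(outputs[0].outputs[0].text)
```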

## Transformers and vLLM: Inference in Action

Let’s start with a simple text generation task using the `meta-llama/Llama-3.2-1B` model to see how
