[Update] transformers backend with VLM support #61

Merged
_posts/2025-04-11-transformers-backend.md (96 additions, 0 deletions)

@@ -20,6 +20,102 @@ vLLM will therefore optimize throughput/latency on top of existing transformers
In this post, we’ll explore how vLLM leverages the transformers backend to combine **flexibility**
with **efficiency**, enabling you to deploy state-of-the-art models faster and smarter.

## Updates

This section collects all the updates made since the blog post was first published (11th April 2025).

### Support for Vision Language Models (21st July 2025)

vLLM with the transformers backend now supports **Vision Language Models**. When a user adds `model_impl="transformers"`,
the correct class for text-only or multimodal models is deduced and loaded automatically.
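
The same flag also covers text-only checkpoints. Here is a minimal sketch (using the `meta-llama/Llama-3.2-1B` model that appears later in this post), where vLLM falls back to the text-only Transformers modeling code:

```python
from vllm import LLM, SamplingParams

# model_impl="transformers" also works for text-only models; vLLM picks the
# appropriate Transformers class behind the scenes.
llm = LLM(model="meta-llama/Llama-3.2-1B", model_impl="transformers")

outputs = llm.generate("The capital of France is", SamplingParams(max_tokens=10))
print(outputs[0].outputs[0].text)
```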

Here is how one can serve a multimodal model using the transformers backend.
```bash
vllm serve llava-hf/llava-onevision-qwen2-0.5b-ov-hf \
  --model_impl transformers
```

To query the served model, you can use the `openai` Python client like so:
```python
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="llava-hf/llava-onevision-qwen2-0.5b-ov-hf",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": "http://images.cocodataset.org/val2017/000000039769.jpg",
                },
            },
        ],
    }],
)
print("Chat response:", chat_response)
```
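
If you only need the generated text rather than the full response object, it can be read from the first choice, following standard `openai` client usage:

```python
# Extract just the assistant's reply from the chat completion
print(chat_response.choices[0].message.content)
```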

You can also directly initialize the vLLM engine using the `LLM` API. Here is the same model being used for offline inference:

```python
from vllm import LLM, SamplingParams
from PIL import Image
import requests
from transformers import AutoProcessor

model_id = "llava-hf/llava-onevision-qwen2-0.5b-ov-hf"
hf_processor = AutoProcessor.from_pretrained(model_id)  # required to dynamically update the chat template

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "dummy_image.jpg"},
            {"type": "text", "text": "What is the content of this image?"},
        ],
    },
]
prompt = hf_processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image = Image.open(
    requests.get(
        "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True
    ).raw
)

# initialize the VLM using `model_impl="transformers"`
vlm = LLM(
    model=model_id,
    model_impl="transformers",
)

outputs = vlm.generate(
    {
        "prompt": prompt,
        "multi_modal_data": {"image": image},
    },
    sampling_params=SamplingParams(max_tokens=100),
)

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)

# OUTPUTS:
# In the tranquil setting of this image, two feline companions are enjoying a peaceful slumber on a
# cozy pink couch. The couch, adorned with a plush red fabric across the seating area, serves as their perfect resting place.
#
# On the left side of the couch, a gray tabby cat is curled up at rest, its body relaxed in a display
# of feline serenity. One paw playfully stretches out, perhaps in mid-jump or simply exploring its surroundings.
```
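
`LLM.generate` also accepts a list of such prompt dictionaries, so several multimodal requests can be batched in one call. A minimal sketch, reusing the `vlm`, `prompt`, and `image` objects from the snippet above:

```python
# Batch several multimodal requests in a single generate() call
# (reuses `vlm`, `prompt`, and `image` from the example above).
batched_outputs = vlm.generate(
    [
        {"prompt": prompt, "multi_modal_data": {"image": image}},
        {"prompt": prompt, "multi_modal_data": {"image": image}},
    ],
    sampling_params=SamplingParams(max_tokens=100),
)

for o in batched_outputs:
    print(o.outputs[0].text)
```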

## Transformers and vLLM: Inference in Action

Let’s start with a simple text generation task using the `meta-llama/Llama-3.2-1B` model to see how