vLLM will therefore optimize throughput/latency on top of existing transformers.

In this post, we’ll explore how vLLM leverages the transformers backend to combine **flexibility**
with **efficiency**, enabling you to deploy state-of-the-art models faster and smarter.

## Updates

This section holds all the updates that have taken place since the blog post was first released (11th April 2025).

### Support for Vision Language Models (21st July 2025)

vLLM with the transformers backend now supports **Vision Language Models**. When a user adds `model_impl="transformers"`,
the correct class for text-only or multimodal models is deduced and loaded.

Here is how one can serve a multimodal model using the transformers backend.

```bash
vllm serve llava-hf/llava-onevision-qwen2-0.5b-ov-hf \
    --model_impl transformers
```
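
Once the server is up, you can quickly sanity-check that the model is registered by querying the OpenAI-compatible `/v1/models` endpoint. A minimal sketch, assuming the default port 8000:

```python
import requests

# List the models exposed by the running vLLM server
models = requests.get("http://localhost:8000/v1/models").json()
print(models)
```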

To consume the model, one can use the `openai` API like so:

```python
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
chat_response = client.chat.completions.create(
    model="llava-hf/llava-onevision-qwen2-0.5b-ov-hf",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": "http://images.cocodataset.org/val2017/000000039769.jpg",
                },
            },
        ],
    }],
)
print("Chat response:", chat_response)
```
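
If you only need the generated text rather than the full response object, it can be read from the first choice of the completion. A small follow-up to the snippet above, using the standard fields of the `openai` client response:

```python
# Print just the assistant's reply instead of the whole response object
print("Generated text:", chat_response.choices[0].message.content)
```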

You can also directly initialize the vLLM engine using the `LLM` API. Here is the same model served that way.

```python
from vllm import LLM, SamplingParams
from PIL import Image
import requests
from transformers import AutoProcessor

model_id = "llava-hf/llava-onevision-qwen2-0.5b-ov-hf"
hf_processor = AutoProcessor.from_pretrained(model_id)  # required to dynamically update the chat template

messages = [
    {
        "role": "user",
        "content": [
            # the "url" here is only a placeholder so the chat template inserts an image token;
            # the actual image is passed separately via `multi_modal_data`
            {"type": "image", "url": "dummy_image.jpg"},
            {"type": "text", "text": "What is the content of this image?"},
        ],
    },
]
prompt = hf_processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image = Image.open(
    requests.get(
        "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True
    ).raw
)

# initialize the VLM using `model_impl="transformers"`
vlm = LLM(
    model="llava-hf/llava-onevision-qwen2-0.5b-ov-hf",
    model_impl="transformers",
)

outputs = vlm.generate(
    {
        "prompt": prompt,
        "multi_modal_data": {"image": image},
    },
    sampling_params=SamplingParams(max_tokens=100),
)

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)

# OUTPUTS:
# In the tranquil setting of this image, two feline companions are enjoying a peaceful slumber on a
# cozy pink couch. The couch, adorned with a plush red fabric across the seating area, serves as their perfect resting place.
#
# On the left side of the couch, a gray tabby cat is curled up at rest, its body relaxed in a display
# of feline serenity. One paw playfully stretches out, perhaps in mid-jump or simply exploring its surroundings.
```
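
Because `LLM.generate` also accepts a list of prompts, several multimodal requests can be batched into a single call and scheduled together by vLLM. A minimal sketch reusing the `vlm`, `prompt`, and `image` objects defined above (the repeated request is purely illustrative):

```python
# Batch two identical image+text requests in one call; vLLM handles the scheduling
batched_outputs = vlm.generate(
    [
        {"prompt": prompt, "multi_modal_data": {"image": image}},
        {"prompt": prompt, "multi_modal_data": {"image": image}},
    ],
    sampling_params=SamplingParams(max_tokens=100),
)
for o in batched_outputs:
    print(o.outputs[0].text)
```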
## Transformers and vLLM: Inference in Action
Let’s start with a simple text generation task using the `meta-llama/Llama-3.2-1B` model to see how