@@ -20,6 +20,69 @@ vLLM will therefore optimize throughput/latency on top of existing transformers
In this post, we’ll explore how vLLM leverages the transformers backend to combine **flexibility**
with **efficiency**, enabling you to deploy state-of-the-art models faster and smarter.

+ ## Updates
+
+ This section holds all the updates that have taken place since the first release of this blog post (11th April 2025).
+
+ ### Support for Vision Language Models (21st July 2025)
+
+ vLLM with the transformers backend now supports Vision Language Models. Here is how one would use
+ the API.
+
+ ```python
+ from vllm import LLM, SamplingParams
+ from PIL import Image
+ import requests
+ from transformers import AutoProcessor
+
+ model_id = "llava-hf/llava-onevision-qwen2-0.5b-ov-hf"
+ hf_processor = AutoProcessor.from_pretrained(model_id)  # required to dynamically update the chat template
+
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image", "url": "dummy_image.jpg"},
+             {"type": "text", "text": "What is the content of this image?"},
+         ],
+     },
+ ]
+ prompt = hf_processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ image = Image.open(
+     requests.get(
+         "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True
+     ).raw
+ )
+
+ # initialize the VLM using `model_impl="transformers"`
+ vlm = LLM(
+     model=model_id,
+     model_impl="transformers",
+     disable_mm_preprocessor_cache=True,  # disable the multi-modal preprocessor cache for the time being
+     enable_prefix_caching=False,
+     enable_chunked_prefill=False,
+ )
+
+ outputs = vlm.generate(
+     {
+         "prompt": prompt,
+         "multi_modal_data": {"image": image},
+     },
+     sampling_params=SamplingParams(max_tokens=100),
+ )
+
+ for o in outputs:
+     generated_text = o.outputs[0].text
+     print(generated_text)
+
+ # OUTPUTS:
+ # In the tranquil setting of this image, two feline companions are enjoying a peaceful slumber on a
+ # cozy pink couch. The couch, adorned with a plush red fabric across the seating area, serves as their perfect resting place.
+ #
+ # On the left side of the couch, a gray tabby cat is curled up at rest, its body relaxed in a display
+ # of feline serenity. One paw playfully stretches out, perhaps in mid-jump or simply exploring its surroundings.
+ ```
+
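+ Since `LLM.generate` also accepts a list of prompt dicts, several image+text requests can be
+ batched into a single call. Here is a minimal sketch that reuses the `vlm`, `hf_processor`, and
+ `image` objects from the snippet above; the `questions` list and `requests_batch` variable are
+ illustrative names, not part of the original example.
+
+ ```python
+ questions = [
+     "What is the content of this image?",
+     "How many animals are in this image?",
+ ]
+
+ requests_batch = []
+ for q in questions:
+     messages = [
+         {
+             "role": "user",
+             "content": [
+                 {"type": "image", "url": "dummy_image.jpg"},
+                 {"type": "text", "text": q},
+             ],
+         },
+     ]
+     # render each conversation into a prompt string with the model's chat template
+     prompt = hf_processor.apply_chat_template(
+         messages, tokenize=False, add_generation_prompt=True
+     )
+     requests_batch.append({"prompt": prompt, "multi_modal_data": {"image": image}})
+
+ # vLLM schedules all requests together in a single generate() call
+ outputs = vlm.generate(
+     requests_batch,
+     sampling_params=SamplingParams(max_tokens=100),
+ )
+ for o in outputs:
+     print(o.outputs[0].text)
+ ```
+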
## Transformers and vLLM: Inference in Action

Let’s start with a simple text generation task using the `meta-llama/Llama-3.2-1B` model to see how