[Feature] support qwen2.5-vl for pytorch engine (#3194)

CUHKSZzxy · web-flow · commit 1fab2f5e54f9 · 2025-03-03T15:11:08.000+08:00
* support qwen2.5-vl for pytorch engine

* reuse qwen2 code

* update doc

* reuse vl qwen2 code

* update doc
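
For orientation, the core of the change is a name-based registration: the architecture string `Qwen2_5_VLForConditionalGeneration` is added to the VL check in `lmdeploy/archs.py` and mapped to a dotted class path in `lmdeploy/pytorch/models/module_map.py`. The snippet below is a minimal, self-contained sketch of that mapping idea. The `split_target` helper is hypothetical, and the value of `LMDEPLOY_PYTORCH_MODEL_PATH` is an assumption inferred from the file paths in this diff, not a copy of the engine's loader.

```python
from typing import Tuple

# Assumption: inferred from the file paths in this diff; the real constant
# is defined inside lmdeploy/pytorch/models/module_map.py.
LMDEPLOY_PYTORCH_MODEL_PATH = 'lmdeploy.pytorch.models'

MODULE_MAP = {}

# Existing entry (context lines of the module_map.py hunk).
MODULE_MAP.update({
    'Qwen2VLForConditionalGeneration':
    f'{LMDEPLOY_PYTORCH_MODEL_PATH}.qwen2_vl.Qwen2VLForConditionalGeneration',
})

# Entry added by this commit.
MODULE_MAP.update({
    'Qwen2_5_VLForConditionalGeneration':
    f'{LMDEPLOY_PYTORCH_MODEL_PATH}.qwen2_5_vl.Qwen2_5_VLForConditionalGeneration',
})


def split_target(arch: str) -> Tuple[str, str]:
    """Hypothetical helper: split a registered dotted path into
    (module_path, class_name) without importing anything."""
    qualified = MODULE_MAP[arch]
    module_path, _, class_name = qualified.rpartition('.')
    return module_path, class_name


module_path, class_name = split_target('Qwen2_5_VLForConditionalGeneration')
print(module_path)
print(class_name)
```

Keeping the map as plain strings means a new architecture can be registered without importing its model file up front; how the engine actually resolves the string to a class is outside this sketch.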
diff --git a/README.md b/README.md
@@ -153,6 +153,7 @@ LMDeploy is a toolkit for compressing, deploying, and serving LLM, developed by
   <li>InternLM-XComposer2.5 (7B)</li>
   <li>Qwen-VL (7B)</li>
   <li>Qwen2-VL (2B, 7B, 72B)</li>
+  <li>Qwen2.5-VL (3B, 7B, 72B)</li>
   <li>DeepSeek-VL (7B)</li>
   <li>DeepSeek-VL2 (3B, 16B, 27B)</li>
   <li>InternVL-Chat (v1.1-v1.5)</li>
diff --git a/README_ja.md b/README_ja.md
@@ -150,6 +150,8 @@ LMDeploy TurboMindエンジンは卓越した推論能力を持ち、さまざ
   <li>InternLM-XComposer2 (7B, 4khd-7B)</li>
   <li>InternLM-XComposer2.5 (7B)</li>
   <li>Qwen-VL (7B)</li>
+  <li>Qwen2-VL (2B, 7B, 72B)</li>
+  <li>Qwen2.5-VL (3B, 7B, 72B)</li>
   <li>DeepSeek-VL (7B)</li>
   <li>DeepSeek-VL2 (3B, 16B, 27B)</li>
   <li>InternVL-Chat (v1.1-v1.5)</li>
diff --git a/README_zh-CN.md b/README_zh-CN.md
@@ -155,6 +155,7 @@ LMDeploy TurboMind 引擎拥有卓越的推理能力，在各种规模的模型
   <li>InternLM-XComposer2.5 (7B)</li>
   <li>Qwen-VL (7B)</li>
   <li>Qwen2-VL (2B, 7B, 72B)</li>
+  <li>Qwen2.5-VL (3B, 7B, 72B)</li>
   <li>DeepSeek-VL (7B)</li>
   <li>DeepSeek-VL2 (3B, 16B, 27B)</li>
   <li>InternVL-Chat (v1.1-v1.5)</li>
diff --git a/docs/en/multi_modal/index.rst b/docs/en/multi_modal/index.rst
@@ -14,4 +14,5 @@ Vision-Language Models
    phi3.md
    mllama.md
    qwen2_vl.md
+   qwen2_5_vl.md
    molmo.md
diff --git a/docs/en/multi_modal/qwen2_5_vl.md b/docs/en/multi_modal/qwen2_5_vl.md
@@ -0,0 +1,156 @@
+# Qwen2.5-VL
+
+LMDeploy supports the following Qwen2.5-VL models:
+
+|   Model    |    Size     | Supported Inference Engine |
+| :--------: | :---------: | :------------------------: |
+| Qwen2.5-VL | 3B, 7B, 72B |          PyTorch           |
+
+The following sections demonstrate how to deploy a Qwen2.5-VL model using LMDeploy, with [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) as an example.
+
+## Installation
+
+Please install LMDeploy by following the [installation guide](../get_started/installation.md), then install the additional packages that Qwen2.5-VL needs:
+
+```shell
+# Qwen2.5-VL requires the latest transformers (transformers >= 4.49.0)
+pip install git+https://github.com/huggingface/transformers
+# It's highly recommended to use `[decord]` feature for faster video loading.
+pip install qwen-vl-utils[decord]==0.0.8
+```
+
+## Offline inference
+
+The following sample code shows the basic usage of the VLM pipeline. For detailed information, please refer to [VLM Offline Inference Pipeline](./vl_pipeline.md).
+
+```python
+from lmdeploy import pipeline
+from lmdeploy.vl import load_image
+
+pipe = pipeline('Qwen/Qwen2.5-VL-7B-Instruct')
+
+image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
+response = pipe(('describe this image', image))
+print(response)
+```
+
+More examples are listed below:
+
+<details>
+  <summary>
+    <b>multi-image multi-round conversation, combined images</b>
+  </summary>
+
+```python
+from lmdeploy import pipeline, GenerationConfig
+
+pipe = pipeline('Qwen/Qwen2.5-VL-7B-Instruct', log_level='INFO')
+messages = [
+    dict(role='user', content=[
+        dict(type='text', text='Describe the two images in detail.'),
+        dict(type='image_url', image_url=dict(url='https://raw.githubusercontent.com/QwenLM/Qwen-VL/master/assets/mm_tutorial/Beijing_Small.jpeg')),
+        dict(type='image_url', image_url=dict(url='https://raw.githubusercontent.com/QwenLM/Qwen-VL/master/assets/mm_tutorial/Chongqing_Small.jpeg'))
+    ])
+]
+out = pipe(messages, gen_config=GenerationConfig(top_k=1))
+
+messages.append(dict(role='assistant', content=out.text))
+messages.append(dict(role='user', content='What are the similarities and differences between these two images?'))
+out = pipe(messages, gen_config=GenerationConfig(top_k=1))
+```
+
+</details>
+
+<details>
+  <summary>
+    <b>image resolution for performance boost</b>
+  </summary>
+
+```python
+from lmdeploy import pipeline, GenerationConfig
+
+pipe = pipeline('Qwen/Qwen2.5-VL-7B-Instruct', log_level='INFO')
+
+min_pixels = 64 * 28 * 28
+max_pixels = 64 * 28 * 28
+messages = [
+    dict(role='user', content=[
+        dict(type='text', text='Describe the two images in detail.'),
+        dict(type='image_url', image_url=dict(min_pixels=min_pixels, max_pixels=max_pixels, url='https://raw.githubusercontent.com/QwenLM/Qwen-VL/master/assets/mm_tutorial/Beijing_Small.jpeg')),
+        dict(type='image_url', image_url=dict(min_pixels=min_pixels, max_pixels=max_pixels, url='https://raw.githubusercontent.com/QwenLM/Qwen-VL/master/assets/mm_tutorial/Chongqing_Small.jpeg'))
+    ])
+]
+out = pipe(messages, gen_config=GenerationConfig(top_k=1))
+
+messages.append(dict(role='assistant', content=out.text))
+messages.append(dict(role='user', content='What are the similarities and differences between these two images?'))
+out = pipe(messages, gen_config=GenerationConfig(top_k=1))
+```
+
+</details>
+
+<details>
+  <summary>
+    <b>video multi-round conversation</b>
+  </summary>
+
+```python
+import numpy as np
+from lmdeploy import pipeline, GenerationConfig
+from decord import VideoReader, cpu
+from lmdeploy.vl.constants import IMAGE_TOKEN
+from lmdeploy.vl.utils import encode_image_base64
+from PIL import Image
+pipe = pipeline('Qwen/Qwen2.5-VL-7B-Instruct', log_level='INFO')
+
+
+def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
+    if bound:
+        start, end = bound[0], bound[1]
+    else:
+        start, end = -100000, 100000
+    start_idx = max(first_idx, round(start * fps))
+    end_idx = min(round(end * fps), max_frame)
+    seg_size = float(end_idx - start_idx) / num_segments
+    frame_indices = np.array([
+        int(start_idx + (seg_size / 2) + np.round(seg_size * idx))
+        for idx in range(num_segments)
+    ])
+    return frame_indices
+
+
+def load_video(video_path, bound=None, num_segments=32):
+    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
+    max_frame = len(vr) - 1
+    fps = float(vr.get_avg_fps())
+    # Sample `num_segments` frame indices evenly across the requested time span.
+    frame_indices = get_index(bound, fps, max_frame, first_idx=0, num_segments=num_segments)
+    imgs = []
+    for frame_index in frame_indices:
+        img = Image.fromarray(vr[frame_index].asnumpy()).convert('RGB')
+        imgs.append(img)
+    return imgs
+
+
+video_path = 'red-panda.mp4'
+imgs = load_video(video_path, num_segments=8)
+
+question = ''
+for i in range(len(imgs)):
+    question = question + f'Frame{i+1}: {IMAGE_TOKEN}\n'
+
+question += 'What is the red panda doing?'
+
+content = [{'type': 'text', 'text': question}]
+for img in imgs:
+    content.append({'type': 'image_url', 'image_url': {'max_dynamic_patch': 1, 'url': f'data:image/jpeg;base64,{encode_image_base64(img)}'}})
+
+messages = [dict(role='user', content=content)]
+out = pipe(messages, gen_config=GenerationConfig(top_k=1))
+
+messages.append(dict(role='assistant', content=out.text))
+messages.append(dict(role='user', content='Describe this video in detail. Don\'t repeat.'))
+out = pipe(messages, gen_config=GenerationConfig(top_k=1))
+```
+
+</details>
diff --git a/docs/en/supported_models/supported_models.md b/docs/en/supported_models/supported_models.md
@@ -78,6 +78,7 @@ The following tables detail the models supported by LMDeploy's TurboMind engine
 |             QWen2              | 0.5B - 72B  | LLM  |    Yes    |   Yes   |   No    | Yes  |  Yes  |
 |            Qwen2.5             | 0.5B - 72B  | LLM  |    Yes    |   Yes   |   No    | Yes  |  Yes  |
 |            QWen2-VL            |   2B, 7B    | MLLM |    Yes    |   Yes   |   No    |  No  |  Yes  |
+|           QWen2.5-VL           |  3B - 72B   | MLLM |    Yes    |   No    |   No    |  No  |  No   |
 |          DeepSeek-MoE          |     16B     | LLM  |    Yes    |   No    |   No    |  No  |  No   |
 |          DeepSeek-V2           |  16B, 236B  | LLM  |    Yes    |   No    |   No    |  No  |  No   |
 |         DeepSeek-V2.5          |    236B     | LLM  |    Yes    |   No    |   No    |  No  |  No   |
diff --git a/docs/zh_cn/multi_modal/index.rst b/docs/zh_cn/multi_modal/index.rst
@@ -14,4 +14,5 @@
    phi3.md
    mllama.md
    qwen2_vl.md
+   qwen2_5_vl.md
    molmo.md
diff --git a/docs/zh_cn/multi_modal/qwen2_5_vl.md b/docs/zh_cn/multi_modal/qwen2_5_vl.md
@@ -0,0 +1,156 @@
+# Qwen2.5-VL
+
+LMDeploy supports the following Qwen2.5-VL models:
+
+|   Model    |    Size     | Supported Inference Engine |
+| :--------: | :---------: | :------------------------: |
+| Qwen2.5-VL | 3B, 7B, 72B |          PyTorch           |
+
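+This document uses [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) as an example to demonstrate how to deploy Qwen2.5-VL models with LMDeploy.
+
+## Installation
+
+Please install LMDeploy by following the [installation guide](../get_started/installation.md), then install the dependencies required by the upstream Qwen2.5-VL model repository.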
+
+```shell
+# Qwen2.5-VL requires the latest transformers (transformers >= 4.49.0)
+pip install git+https://github.com/huggingface/transformers
+# It's highly recommended to use `[decord]` feature for faster video loading.
+pip install qwen-vl-utils[decord]==0.0.8
+```
+
+## Offline inference
+
+The following example runs offline inference with the pipeline. For more usage, see [VLM Offline Inference Pipeline](./vl_pipeline.md).
+
+```python
+from lmdeploy import pipeline
+from lmdeploy.vl import load_image
+
+pipe = pipeline('Qwen/Qwen2.5-VL-7B-Instruct')
+
+image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
+response = pipe(('describe this image', image))
+print(response)
+```
+
+More examples are listed below:
+
+<details>
+  <summary>
+    <b>multi-image multi-round conversation</b>
+  </summary>
+
+```python
+from lmdeploy import pipeline, GenerationConfig
+
+pipe = pipeline('Qwen/Qwen2.5-VL-7B-Instruct', log_level='INFO')
+messages = [
+    dict(role='user', content=[
+        dict(type='text', text='Describe the two images in detail.'),
+        dict(type='image_url', image_url=dict(url='https://raw.githubusercontent.com/QwenLM/Qwen-VL/master/assets/mm_tutorial/Beijing_Small.jpeg')),
+        dict(type='image_url', image_url=dict(url='https://raw.githubusercontent.com/QwenLM/Qwen-VL/master/assets/mm_tutorial/Chongqing_Small.jpeg'))
+    ])
+]
+out = pipe(messages, gen_config=GenerationConfig(top_k=1))
+
+messages.append(dict(role='assistant', content=out.text))
+messages.append(dict(role='user', content='What are the similarities and differences between these two images?'))
+out = pipe(messages, gen_config=GenerationConfig(top_k=1))
+```
+
+</details>
+
+<details>
+  <summary>
+    <b>control image resolution to speed up inference</b>
+  </summary>
+
+```python
+from lmdeploy import pipeline, GenerationConfig
+
+pipe = pipeline('Qwen/Qwen2.5-VL-7B-Instruct', log_level='INFO')
+
+min_pixels = 64 * 28 * 28
+max_pixels = 64 * 28 * 28
+messages = [
+    dict(role='user', content=[
+        dict(type='text', text='Describe the two images in detail.'),
+        dict(type='image_url', image_url=dict(min_pixels=min_pixels, max_pixels=max_pixels, url='https://raw.githubusercontent.com/QwenLM/Qwen-VL/master/assets/mm_tutorial/Beijing_Small.jpeg')),
+        dict(type='image_url', image_url=dict(min_pixels=min_pixels, max_pixels=max_pixels, url='https://raw.githubusercontent.com/QwenLM/Qwen-VL/master/assets/mm_tutorial/Chongqing_Small.jpeg'))
+    ])
+]
+out = pipe(messages, gen_config=GenerationConfig(top_k=1))
+
+messages.append(dict(role='assistant', content=out.text))
+messages.append(dict(role='user', content='What are the similarities and differences between these two images?'))
+out = pipe(messages, gen_config=GenerationConfig(top_k=1))
+```
+
+</details>
+
+<details>
+  <summary>
+    <b>video multi-round conversation</b>
+  </summary>
+
+```python
+import numpy as np
+from lmdeploy import pipeline, GenerationConfig
+from decord import VideoReader, cpu
+from lmdeploy.vl.constants import IMAGE_TOKEN
+from lmdeploy.vl.utils import encode_image_base64
+from PIL import Image
+pipe = pipeline('Qwen/Qwen2.5-VL-7B-Instruct', log_level='INFO')
+
+
+def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
+    if bound:
+        start, end = bound[0], bound[1]
+    else:
+        start, end = -100000, 100000
+    start_idx = max(first_idx, round(start * fps))
+    end_idx = min(round(end * fps), max_frame)
+    seg_size = float(end_idx - start_idx) / num_segments
+    frame_indices = np.array([
+        int(start_idx + (seg_size / 2) + np.round(seg_size * idx))
+        for idx in range(num_segments)
+    ])
+    return frame_indices
+
+
+def load_video(video_path, bound=None, num_segments=32):
+    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
+    max_frame = len(vr) - 1
+    fps = float(vr.get_avg_fps())
+    # Sample `num_segments` frame indices evenly across the requested time span.
+    frame_indices = get_index(bound, fps, max_frame, first_idx=0, num_segments=num_segments)
+    imgs = []
+    for frame_index in frame_indices:
+        img = Image.fromarray(vr[frame_index].asnumpy()).convert('RGB')
+        imgs.append(img)
+    return imgs
+
+
+video_path = 'red-panda.mp4'
+imgs = load_video(video_path, num_segments=8)
+
+question = ''
+for i in range(len(imgs)):
+    question = question + f'Frame{i+1}: {IMAGE_TOKEN}\n'
+
+question += 'What is the red panda doing?'
+
+content = [{'type': 'text', 'text': question}]
+for img in imgs:
+    content.append({'type': 'image_url', 'image_url': {'max_dynamic_patch': 1, 'url': f'data:image/jpeg;base64,{encode_image_base64(img)}'}})
+
+messages = [dict(role='user', content=content)]
+out = pipe(messages, gen_config=GenerationConfig(top_k=1))
+
+messages.append(dict(role='assistant', content=out.text))
+messages.append(dict(role='user', content='Describe this video in detail. Don\'t repeat.'))
+out = pipe(messages, gen_config=GenerationConfig(top_k=1))
+```
+
+</details>
diff --git a/docs/zh_cn/supported_models/supported_models.md b/docs/zh_cn/supported_models/supported_models.md
@@ -78,6 +78,7 @@
 |             QWen2              | 0.5B - 72B  | LLM  |    Yes    |   Yes   |   No    | Yes  |  Yes  |
 |            Qwen2.5             | 0.5B - 72B  | LLM  |    Yes    |   Yes   |   No    | Yes  |  Yes  |
 |            QWen2-VL            |   2B, 7B    | MLLM |    Yes    |   Yes   |   No    |  No  |  Yes  |
+|           QWen2.5-VL           |  3B - 72B   | MLLM |    Yes    |   No    |   No    |  No  |  No   |
 |          DeepSeek-MoE          |     16B     | LLM  |    Yes    |   No    |   No    |  No  |  No   |
 |          DeepSeek-V2           |  16B, 236B  | LLM  |    Yes    |   No    |   No    |  No  |  No   |
 |         DeepSeek-V2.5          |    236B     | LLM  |    Yes    |   No    |   No    |  No  |  No   |
diff --git a/lmdeploy/archs.py b/lmdeploy/archs.py
@@ -119,7 +119,8 @@ def check_vl_llm(config: dict) -> bool:
         'LlavaLlamaForCausalLM', 'LlavaMistralForCausalLM', 'CogVLMForCausalLM', 'InternLMXComposer2ForCausalLM',
         'InternVLChatModel', 'MiniGeminiLlamaForCausalLM', 'MGMLlamaForCausalLM', 'MiniCPMV',
         'LlavaForConditionalGeneration', 'LlavaNextForConditionalGeneration', 'Phi3VForCausalLM',
-        'Qwen2VLForConditionalGeneration', 'MllamaForConditionalGeneration', 'MolmoForCausalLM'
+        'Qwen2VLForConditionalGeneration', 'Qwen2_5_VLForConditionalGeneration', 'MllamaForConditionalGeneration',
+        'MolmoForCausalLM'
     ])
     if arch == 'QWenLMHeadModel' and 'visual' in config:
         return True
diff --git a/lmdeploy/pytorch/models/module_map.py b/lmdeploy/pytorch/models/module_map.py
@@ -103,6 +103,12 @@
     f'{LMDEPLOY_PYTORCH_MODEL_PATH}.qwen2_vl.Qwen2VLForConditionalGeneration',
 })
 
+# qwen2_5_vl
+MODULE_MAP.update({
+    'Qwen2_5_VLForConditionalGeneration':
+    f'{LMDEPLOY_PYTORCH_MODEL_PATH}.qwen2_5_vl.Qwen2_5_VLForConditionalGeneration',
+})
+
 # dbrx
 MODULE_MAP.update({
     'DbrxForCausalLM': f'{LMDEPLOY_PYTORCH_MODEL_PATH}.dbrx.DbrxForCausalLM',
diff --git a/lmdeploy/pytorch/models/qwen2_5_vl.py b/lmdeploy/pytorch/models/qwen2_5_vl.py
diff --git a/lmdeploy/pytorch/supported_models.py b/lmdeploy/pytorch/supported_models.py
diff --git a/lmdeploy/vl/model/qwen2.py b/lmdeploy/vl/model/qwen2.py