
Commit f2855b0

committed
update documentation of multi-modal chat templating with extra information about including video object in chat template.
1 parent 42e7a34 commit f2855b0

File tree

1 file changed: +67 −0 lines changed

docs/source/en/chat_templating_multimodal.md

Lines changed: 67 additions & 0 deletions
@@ -137,6 +137,73 @@ messages = [
]
```

### Passing decoded video objects

In addition to loading videos from a URL or file path, you can also pass decoded video data directly.

This is useful if you've already preprocessed or decoded video frames elsewhere in memory (for example, with OpenCV, Decord, or torchvision), since you don't need to save the frames to disk or host them at a URL.

- Use the `"video"` type with a dictionary that includes:
  - `"frames"` (`np.ndarray` or `torch.Tensor`): a 4D array of shape `(num_frames, channels, height, width)` containing the decoded video frames.
  - `"metadata"` (`VideoMetadata` or `dict`): describes metadata for the video. If you provide a dictionary, it must include at least one of:
    - `"fps"` (frames per second)
    - `"duration"` (video duration in seconds)

  If both `"fps"` and `"duration"` are provided, `"fps"` takes priority and `"duration"` is recalculated from it.
```python
import numpy as np

video_object1 = {
    "frames": np.random.randint(0, 255, size=(16, 3, 224, 224), dtype=np.uint8),
    "metadata": {"fps": 16, "duration": 2.0},
}

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a friendly chatbot who always responds in the style of a pirate"}],
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "video": video_object1},
            {"type": "text", "text": "What do you see in this video?"}
        ],
    },
]
```
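As a quick illustration of the priority rule described above, here is a sketch of how the effective duration can be derived when both keys are present. The helper `resolve_duration` is hypothetical, and the exact recomputation (`duration = num_frames / fps`) is an assumption:

```python
import numpy as np

# Hypothetical helper illustrating the documented priority rule:
# when both "fps" and "duration" are given, "fps" wins and the
# duration is derived from the frame count and fps (an assumption).
def resolve_duration(frames, metadata):
    num_frames = frames.shape[0]
    if "fps" in metadata:
        return num_frames / metadata["fps"]  # "fps" takes priority
    return metadata["duration"]

frames = np.zeros((16, 3, 224, 224), dtype=np.uint8)
# The provided duration (2.0 s) is ignored because "fps" is present:
print(resolve_duration(frames, {"fps": 16, "duration": 2.0}))  # 1.0
```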

You can also use the existing `load_video()` function to load a video, edit the frames in memory, and pass the result in the messages.
```python
# Make sure a video backend library (pyav, decord, or torchvision) is available.
from transformers.video_utils import load_video

# Load a video file in memory for testing
frames, metadata = load_video(
    "https://test-videos.co.uk/vids/bigbuckbunny/mp4/h264/720/Big_Buck_Bunny_720_10s_10MB.mp4"
)

video_object2 = {
    "frames": frames,
    "metadata": metadata,
}

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a friendly chatbot who always responds in the style of a pirate"}],
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "video": video_object2},
            {"type": "text", "text": "What do you see in this video?"}
        ],
    },
]
```
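Editing the frames in memory can be as simple as slicing the array before building the message. The sketch below uses NumPy stand-ins rather than a real `load_video()` result, and halving `fps` alongside the frame count is an assumption made to keep the metadata consistent:

```python
import numpy as np

# Stand-in for frames returned by load_video(): 32 RGB frames at 30 fps.
frames = np.zeros((32, 3, 224, 224), dtype=np.uint8)
metadata = {"fps": 30}

# Keep every other frame to halve the frame count...
edited_frames = frames[::2]
# ...and halve the fps so the metadata still matches (an assumption:
# playback speed is preserved when frame count and fps scale together).
edited_metadata = {"fps": metadata["fps"] / 2}

video_object = {"frames": edited_frames, "metadata": edited_metadata}
print(edited_frames.shape[0], edited_metadata["fps"])  # 16 15.0
```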

Pass `messages` to [`~ProcessorMixin.apply_chat_template`] to tokenize the input content. There are a few extra parameters to include in [`~ProcessorMixin.apply_chat_template`] that control the sampling process.

The `video_load_backend` parameter refers to a specific framework to load a video. It supports [PyAV](https://pyav.basswood-io.com/docs/stable/), [Decord](https://github.com/dmlc/decord), [OpenCV](https://github.com/opencv/opencv), and [torchvision](https://pytorch.org/vision/stable/index.html).
