Skip to content

Conversation

akibjawad
Copy link
Contributor

@akibjawad akibjawad commented Jul 18, 2025

What does this PR do?

Fixes #36560, This PR allows inclusion of in-memory video objects, as dictionary of frames and metadata, in the chat template.

Previously:
Chat template accepted only file-paths or urls in the chat_template. If user (a developer using transformers library) collected videos from a continuous stream or any input devices, user had to store the video in a file and provide file path in chat messages.

Now (after this PR):
Users can collect video frames from streams or devices, provide metadata (describing fps), and directly pass those in the chat_template as a dictionary object. It frees the user from saving the video in files, and increases efficiency by reducing extra IO operation to reload the video again from files.

Notes:
Additionally, this PR also fixes hardcoded values used for testing (in assertions) apply_chat_template_videos for models like internvl, qwen2_vl, qwen2_5_vl, qwen2_5_omni.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). No
  • Did you read the contributor guideline,
    Pull Request section? Yes
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
    Yes. Issue link: Allow video objects (np array etc.) in apply_chat_template (not just paths or urls) #36560
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?
    • in tests/test_processing_common.py file: I added a new type of video input which will be included in the chat messages while testing functionality of apply_chat_template.
    • Added a new test with batchsize 3 for testing in-memory video objects in chat_template. Additionally updated hardcoded assertion (video_len check) for testing with increased batch_size in 4 models:
      • tests/models/internvl/test_processor_internvl.py
      • tests/models/qwen2_vl/test_processor_qwen2_vl.py
      • tests/models/qwen2_5_vl/test_processor_qwen2_5_vl.py
      • tests/models/qwen2_5_omni/test_processor_qwen2_5_omni.py
      • tests/models/smolvlm/test_processor_smolvlm.py (skip testing smolvlm with list of frames)

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Specifically mentioning @zucchini-nlp for review. Feel free to tag other members/contributors who may be interested to review this PR.

@akibjawad akibjawad marked this pull request as draft July 18, 2025 03:38
@Rocketknight1
Copy link
Member

cc @zucchini-nlp

@akibjawad akibjawad force-pushed the video_object_in_apply_chat_template branch 2 times, most recently from 7e0880b to a3d74ed Compare July 27, 2025 17:37
@akibjawad akibjawad force-pushed the video_object_in_apply_chat_template branch 4 times, most recently from dacb18e to c203e9f Compare July 29, 2025 00:09
@akibjawad akibjawad changed the title [WIP] Add support for including video object in apply_chat_template function [WIP] Add support for including in-memory videos (not just files/urls) in apply_chat_template Jul 29, 2025
@akibjawad akibjawad marked this pull request as ready for review July 29, 2025 01:12
@akibjawad akibjawad changed the title [WIP] Add support for including in-memory videos (not just files/urls) in apply_chat_template Add support for including in-memory videos (not just files/urls) in apply_chat_template Jul 29, 2025
@akibjawad akibjawad force-pushed the video_object_in_apply_chat_template branch 6 times, most recently from f2855b0 to c7142ea Compare July 30, 2025 04:53
@akibjawad
Copy link
Contributor Author

requesting review @zucchini-nlp @Rocketknight1 @ArthurZucker @FredrikNoren

Copy link
Member

@zucchini-nlp zucchini-nlp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the PR @akibjawad ! I feel like this is increasing LOC unnecessarily and could be done with less changes from our side. What if we update load_video to early exit when an array is found instead of trying to decode

The only constraint would be that users have to be consistent with video type within one conversation. So if one started using decoded frames in convo, they have to use decoded frames format for subsequent videos in the chat. Otherwise handling video_metadata can become hard

@akibjawad akibjawad force-pushed the video_object_in_apply_chat_template branch from c7142ea to b4b905e Compare July 30, 2025 19:46
@akibjawad
Copy link
Contributor Author

akibjawad commented Jul 31, 2025

Thanks a lot for the PR @akibjawad ! I feel like this is increasing LOC unnecessarily and could be done with less changes from our side. What if we update load_video to early exit when an array is found instead of trying to decode

The only constraint would be that users have to be consistent with video type within one conversation. So if one started using decoded frames in convo, they have to use decoded frames format for subsequent videos in the chat. Otherwise handling video_metadata can become hard

@zucchini-nlp, thank you very much for reviewing. I do agree with your notion to keep the library lean. In fact, my initial implementation was exactly what you mentioned. Later I changed the design, because load_video() is meant to collect video_frames and metadata from a source (such as file, url). When a user provides video frames (ndarray/tensor) in a conversation, user already loaded the video some way (either from a file, livestream, screen record, camera devices, or randomly generated etc.) and user do not want to save the frames to a file. While collecting frames, user might also collect metadata. That is why I kept option for both frames & metadata. Additionally, without any metadata, frame_sampling with fps is not possible. As you mentioned earlier, user must be consistent and cannot use fps parameter for sampling while using decoded_frames or video as a list of image file names. Because in those cases metadata will be none. Although we can provide a default metadata for consistent sampling.

To accept video frames as array, do we actually need to modify load_video() function? Because, if we are returning early from the function, we can simply detect video type is an array and collect the video frames from the if else block in apply_chat_template function of the processor, saving an extra step of calling the function. However load_video() is an utility function and I assume it is used in many parts of the code-base, If we include array handling in the load_video() function, it will be useful for other places also. Although load_video() has other parameters (fps, num_frames) for sampling. From the current implementation, it looks like sampling is done at the video processor class, not at the apply_chat_template phase.

I have noticed you have been working on video_processor for a long time and you have better idea about the complete pipeline and future of this ever changing code. Let me know, which solution would you prefer.

@zucchini-nlp
Copy link
Member

zucchini-nlp commented Jul 31, 2025

To accept video frames as array, do we actually need to modify load_video() function?

Yes, it currently has a bug because it checks for isinstance(video, array) later and thus fails. We just need to change the order of conditional checks

The user is still free to pass metadata as kwargs to apply_chat_template and it should be picked up, I am doing another update here #39600 and prob that will fix it. I am currently stuck on different task but will continue on video decoding soon

@akibjawad akibjawad force-pushed the video_object_in_apply_chat_template branch 2 times, most recently from 03e400b to ef6ce1f Compare July 31, 2025 19:37
@akibjawad
Copy link
Contributor Author

@zucchini-nlp Thank you for the clarification, I updated the code of load_video() and handled decoded frames same as handling a list of image file names. I kept everything else (metadata handling) same so that this changes will not create too much conflict with your PR (#39600). Please review again and let me know if I need to change anything else.

@akibjawad akibjawad force-pushed the video_object_in_apply_chat_template branch from 377acc9 to cf3df35 Compare July 31, 2025 21:02
Copy link
Member

@zucchini-nlp zucchini-nlp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for iterating on this, a few comments and if tests are passing, let's merge

@akibjawad akibjawad force-pushed the video_object_in_apply_chat_template branch from cf3df35 to 08612aa Compare August 1, 2025 10:39
@akibjawad
Copy link
Contributor Author

@zucchini-nlp Thank you very much for the review. I addressed your reviews in the most recent commit. As a maintainer, you need to initiate some github workflows for complete testing. Let me know, if there is any remaining issues with current implementation.

@akibjawad akibjawad force-pushed the video_object_in_apply_chat_template branch from 7429b30 to db06f5b Compare August 2, 2025 12:27
Copy link
Contributor

github-actions bot commented Aug 2, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: internvl, qwen2_5_omni, qwen2_5_vl, qwen2_vl, smolvlm

Copy link
Member

@zucchini-nlp zucchini-nlp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for iterating! LGTM

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp zucchini-nlp merged commit 2a9febd into huggingface:main Aug 4, 2025
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow video objects (np array etc.) in apply_chat_template (not just paths or urls)
4 participants