-
Notifications
You must be signed in to change notification settings - Fork 30.3k
Add support for including in-memory videos (not just files/urls) in apply_chat_template #39494
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for including in-memory videos (not just files/urls) in apply_chat_template #39494
Conversation
7e0880b
to
a3d74ed
Compare
dacb18e
to
c203e9f
Compare
f2855b0
to
c7142ea
Compare
requesting review @zucchini-nlp @Rocketknight1 @ArthurZucker @FredrikNoren |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for the PR @akibjawad ! I feel like this is increasing LOC unnecessarily and could be done with less changes from our side. What if we update load_video
to early exit when an array
is found instead of trying to decode
The only constraint would be that users have to be consistent with video type within one conversation. So if one started using decoded frames
in convo, they have to use decoded frames
format for subsequent videos in the chat. Otherwise handling video_metadata
can become hard
c7142ea
to
b4b905e
Compare
@zucchini-nlp, thank you very much for reviewing. I do agree with your notion to keep the library lean. In fact, my initial implementation was exactly what you mentioned. Later I changed the design, because To accept video frames as array, do we actually need to modify I have noticed you have been working on video_processor for a long time and you have better idea about the complete pipeline and future of this ever changing code. Let me know, which solution would you prefer. |
Yes, it currently has a bug because it checks for The user is still free to pass metadata as kwargs to |
03e400b
to
ef6ce1f
Compare
@zucchini-nlp Thank you for the clarification, I updated the code of |
377acc9
to
cf3df35
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for iterating on this, a few comments and if tests are passing, let's merge
cf3df35
to
08612aa
Compare
@zucchini-nlp Thank you very much for the review. I addressed your reviews in the most recent commit. As a maintainer, you need to initiate some github workflows for complete testing. Let me know, if there is any remaining issues with current implementation. |
…adata, in chat template
…tadata) in the chat template
… number of tests cases.
…ation about including video object in chat template.
7429b30
to
db06f5b
Compare
[For maintainers] Suggested jobs to run (before merge) run-slow: internvl, qwen2_5_omni, qwen2_5_vl, qwen2_vl, smolvlm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for iterating! LGTM
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
What does this PR do?
Fixes #36560, This PR allows inclusion of in-memory video objects, as dictionary of frames and metadata, in the chat template.
Previously:
Chat template accepted only file-paths or urls in the chat_template. If user (a developer using transformers library) collected videos from a continuous stream or any input devices, user had to store the video in a file and provide file path in chat messages.
Now (after this PR):
Users can collect video frames from streams or devices, provide metadata (describing fps), and directly pass those in the chat_template as a dictionary object. It frees the user from saving the video in files, and increases efficiency by reducing extra IO operation to reload the video again from files.
Notes:
Additionally, this PR also fixes hardcoded values used for testing (in assertions) apply_chat_template_videos for models like internvl, qwen2_vl, qwen2_5_vl, qwen2_5_omni.
Before submitting
Pull Request section? Yes
to it if that's the case.
Yes. Issue link: Allow video objects (np array etc.) in apply_chat_template (not just paths or urls) #36560
documentation guidelines, and
here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Specifically mentioning @zucchini-nlp for review. Feel free to tag other members/contributors who may be interested to review this PR.