Skip to content

🧑‍🍳 Added Post training an VLM for reasoning with GRPO using TRL recipe#312

Merged
merveenoyan merged 13 commits intohuggingface:mainfrom
sergiopaniego:add-vlm-grpo
Aug 4, 2025
Merged

🧑‍🍳 Added Post training an VLM for reasoning with GRPO using TRL recipe#312
merveenoyan merged 13 commits intohuggingface:mainfrom
sergiopaniego:add-vlm-grpo

Conversation

@sergiopaniego
Copy link
Member

@sergiopaniego sergiopaniego commented Jul 10, 2025

What does this PR do?

Fixes #311

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@review-notebook-app
Copy link

review-notebook-app bot commented Jul 23, 2025

View / edit / reply to this conversation on ReviewNB

ariG23498 commented on 2025-07-23T06:15:39Z
----------------------------------------------------------------

I think the title should be "Post training a VLM for reasoning with GRPO using TRL" instead of "an VLM"

"In this recipe, we'll demonstrate how to post-train a" instead of "post-training".

We should have "Vision Language Model (VLM)" spelled out right at the beginning somewhere and then use "VLM"


@review-notebook-app
Copy link

review-notebook-app bot commented Jul 23, 2025

View / edit / reply to this conversation on ReviewNB

ariG23498 commented on 2025-07-23T06:15:40Z
----------------------------------------------------------------

Question: Do we still use the notebook_login method? I think login works here as well.


@review-notebook-app
Copy link

review-notebook-app bot commented Jul 23, 2025

View / edit / reply to this conversation on ReviewNB

ariG23498 commented on 2025-07-23T06:15:41Z
----------------------------------------------------------------

Line #4.    processor = AutoProcessor.from_pretrained(model_id, use_fast=True, padding_side="left")

Is there a particular reason to put padding_side to left?


sergiopaniego commented on 2025-07-28T16:00:38Z
----------------------------------------------------------------

Yes!

It's needed so the generations during training are concatenated directly to the input. Otherwise, we could have [PAD] gaps between the input and the generation. I have added a line explaining that since it's relevant :) Thanks for pointing it out!!

@review-notebook-app
Copy link

review-notebook-app bot commented Jul 23, 2025

View / edit / reply to this conversation on ReviewNB

ariG23498 commented on 2025-07-23T06:15:41Z
----------------------------------------------------------------

Line #19.                    {"type": "image"},

I think we should also add the image to this dictionary. Something like the following:

{"type": "image", "image": example["image"]}

@review-notebook-app
Copy link

review-notebook-app bot commented Jul 23, 2025

View / edit / reply to this conversation on ReviewNB

ariG23498 commented on 2025-07-23T06:15:42Z
----------------------------------------------------------------

Line #11.        # Parameters that control de data preprocessing

de -> the


@sergiopaniego sergiopaniego marked this pull request as ready for review July 28, 2025 15:33
Copy link
Member Author

Yes!

It's needed so the generations during training are concatenated directly to the input. Otherwise, we could have [PAD] gaps between the input and the generation. I have added a line explaining that since it's relevant :) Thanks for pointing it out!!


View entire conversation on ReviewNB

@sergiopaniego
Copy link
Member Author

Ready for review!
Already addressed @ariG23498's comments.

We can see that the reward goes up below:
Screenshot 2025-07-28 at 18 02 11

@@ -0,0 +1,3392 @@
{
Copy link
Collaborator

@merveenoyan merveenoyan Jul 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this sentence is a bit inverted:

For our particular case where we want the model to learn to reason using images, we use as input image and problem and as output solution columns.


Reply via ReviewNB

@@ -0,0 +1,3392 @@
{
Copy link
Collaborator

@merveenoyan merveenoyan Jul 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: trainig on last sentence


Reply via ReviewNB

@@ -0,0 +1,3392 @@
{
Copy link
Collaborator

@merveenoyan merveenoyan Jul 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

traininig* in last sentence

also more explanation on these params would be nice, i.e. what hardware limitations, for reasoning what could be the most important etc


Reply via ReviewNB

@@ -0,0 +1,3392 @@
{
Copy link
Collaborator

@merveenoyan merveenoyan Jul 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in here we only see train loss and not reward, if we can't enable reward we could mention slightly as loss is a bit odd


Reply via ReviewNB

@sergiopaniego
Copy link
Member Author

Thanks a lot for the comments, super valuable for improvement ❤️!

Recipe improved based on feedback 😄

Copy link

@ariG23498 ariG23498 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once @merveenoyan's comments are addressed, it is okay to be merged.

This is a very nice recipe. Kudos on the work!

@merveenoyan merveenoyan merged commit af061cd into huggingface:main Aug 4, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

🧑‍🍳 New Post training an VLM for reasoning with GRPO using TRL recipe

4 participants