🧑‍🍳 Added Post training an VLM for reasoning with GRPO using TRL recipe #312
merveenoyan merged 13 commits into huggingface:main
Check out this pull request on ReviewNB. See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
ariG23498 commented on 2025-07-23T06:15:39Z (via ReviewNB):
I think the title should be "Post training a VLM for reasoning with GRPO using TRL" instead of "an VLM".
"In this recipe, we'll demonstrate how to post-train a" instead of "post-training".
We should have "Vision Language Model (VLM)" spelled out right at the beginning somewhere and then use "VLM" afterwards.
ariG23498 commented on 2025-07-23T06:15:40Z (via ReviewNB):
Question: Do we still use the
ariG23498 commented on 2025-07-23T06:15:41Z (via ReviewNB):
Line #4. `processor = AutoProcessor.from_pretrained(model_id, use_fast=True, padding_side="left")`
Is there a particular reason to put `padding_side="left"`?

sergiopaniego commented on 2025-07-28T16:00:38Z:
Yes! It's needed so the generations during training are concatenated directly to the input. Otherwise, we could have [PAD] gaps between the input and the generation. I have added a line explaining that since it's relevant :) Thanks for pointing it out!!
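For context, a minimal sketch of the line in question; the checkpoint name below is an illustrative assumption, not necessarily the model used in the recipe:

```python
from transformers import AutoProcessor

# Illustrative checkpoint (assumption); the recipe's actual model_id may differ.
model_id = "Qwen/Qwen2.5-VL-3B-Instruct"

# padding_side="left" puts [PAD] tokens *before* the prompt in a batch, so the tokens
# generated during GRPO training are appended directly after the prompt, with no
# [PAD] gap between input and completion.
processor = AutoProcessor.from_pretrained(model_id, use_fast=True, padding_side="left")
```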
ariG23498 commented on 2025-07-23T06:15:41Z (via ReviewNB):
Line #19. `{"type": "image"},`
I think we should also add the image to this dictionary. Something like the following:
`{"type": "image", "image": example["image"]}`
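A sketch of what the suggested change could look like in the prompt-building helper; the helper name `make_conversation` and the `problem` column are assumptions based on this thread, not necessarily the recipe's exact code:

```python
def make_conversation(example):
    # Build a chat-style prompt where the image travels with the message content,
    # so the processor can pick it up directly instead of relying on a bare placeholder.
    return {
        "prompt": [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": example["image"]},  # attach the actual image
                    {"type": "text", "text": example["problem"]},  # the question to reason about
                ],
            },
        ],
    }
```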
ariG23498 commented on 2025-07-23T06:15:42Z (via ReviewNB):
Line #11. `# Parameters that control de data preprocessing`
Ready for review! We can see that the reward goes up below:
Review comment (via ReviewNB):
This sentence is a bit inverted: "For our particular case where we want the model to learn to reason using images, we use as input image and problem and as output solution columns."
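As a concrete reading of that sentence, the mapping is roughly the following (column names are taken from the sentence itself, and `make_conversation` refers to the sketch earlier in this thread):

```python
# "image" + "problem" become the model input (the prompt); "solution" is kept as the
# reference answer that the reward functions compare completions against.
dataset = dataset.map(make_conversation)
# The "solution" column is left untouched; the trainer passes extra dataset columns
# through to the reward functions as keyword arguments.
```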
Review comment (via ReviewNB):
Typo: "traininig" in the last sentence.
More explanation of these parameters would also be nice, e.g. which hardware limitations they address and which ones matter most for reasoning.
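To make that request concrete, here is a hedged sketch of the kind of GRPOConfig parameters the comment is about, with notes on which ones matter under tight GPU memory and which matter most for reasoning. The values are illustrative, not the recipe's actual settings:

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="vlm-grpo-reasoning",  # illustrative output path
    learning_rate=1e-5,
    # Memory-related knobs: lower the batch size first if you hit OOM, and compensate
    # with gradient accumulation so the effective batch size stays the same.
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,   # trades extra compute for much lower activation memory
    bf16=True,                     # half precision on recent GPUs, roughly halving memory vs. fp32
    # Reasoning-related knobs: more generations per prompt give a stronger relative-reward
    # signal but cost memory; completions must be long enough to hold the full reasoning trace.
    num_generations=4,             # the global batch size must be divisible by this value
    max_prompt_length=1024,
    max_completion_length=512,
    logging_steps=10,
)
```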
Review comment (via ReviewNB):
Here we only see the training loss and not the reward. If we can't plot the reward, we should briefly mention that, since the GRPO loss on its own looks a bit odd.
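On this point, TRL's GRPOTrainer also logs reward metrics alongside the loss, so the notebook can chart the reward curve instead of (or next to) the loss. Below is a hedged sketch of a simple format reward and how it is wired in; the `<think>`/`<answer>` layout and the regex are assumptions, not necessarily the recipe's actual reward functions:

```python
import re

from trl import GRPOTrainer

def format_reward(completions, **kwargs):
    """Reward 1.0 when a completion follows the <think>...</think><answer>...</answer> layout."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    contents = [completion[0]["content"] for completion in completions]
    return [1.0 if re.fullmatch(pattern, content.strip(), re.DOTALL) else 0.0 for content in contents]

trainer = GRPOTrainer(
    model=model,                   # the VLM loaded earlier in the recipe
    reward_funcs=[format_reward],  # typically combined with an accuracy reward against example["solution"]
    args=training_args,
    train_dataset=dataset,
)
# During training, the logs include "reward" and "reward_std" entries per step, which is
# what you would plot to show the reward going up rather than only the GRPO loss.
```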
Thanks a lot for the comments, super valuable for improvement ❤️! Recipe improved based on feedback 😄
ariG23498 left a comment:
Once @merveenoyan's comments are addressed, it is okay to be merged.
This is a very nice recipe. Kudos on the work!

What does this PR do?
Fixes #311