
Conversation

pramodith (Collaborator)

What does this PR do?

Introduces the idea of a ReplayBuffer to GRPO.

Implementation details:

  • A ReplayBuffer is implemented as a heap/priority queue. The buffer stores a score and a dict with the same keys as the output of _generate_and_score_completions.
  • The ReplayBuffer stores entire groups and all the keys associated with a group that'd be needed for computing the loss. Storing the old/ref_log_probs ensures that we don't run any extra forward passes through models.
  • Currently, a group's score is the sum of its absolute advantages multiplied by the group's standard deviation.
  • Every time _generate_and_score_completions is called we check whether (1) there are any groups with non-zero variance, which are candidates to be added to the ReplayBuffer, and (2) there are any groups with zero variance, which need to be substituted with entries from the replay buffer. A minimal sketch of this follows below.
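
Below is a minimal sketch of what such a buffer could look like, assuming a fixed-size min-heap that keeps the highest-scoring groups (heapq keeps the smallest score at the root, so the lowest-scoring group is evicted first). The class and method names here are illustrative, not necessarily the ones used in the PR:

```python
import heapq
from typing import Any

import torch


def group_score(advantages: torch.Tensor, std: torch.Tensor) -> float:
    # Score described above: the summed product of a group's absolute
    # advantages and its standard deviation.
    return (advantages.abs() * std).sum().item()


class ReplayBuffer:
    """Illustrative fixed-size buffer of (score, group) pairs.

    A min-heap of at most `max_size` entries keeps the highest-scoring
    groups. Each `group` is a dict holding everything needed to compute
    the loss (completions, advantages, old/ref log-probs, ...), so
    reusing a stored group requires no extra forward passes.
    """

    def __init__(self, max_size: int):
        self.max_size = max_size
        self._heap: list[tuple[float, int, dict[str, Any]]] = []
        self._counter = 0  # tie-breaker so heapq never compares the dicts

    def __len__(self) -> int:
        return len(self._heap)

    def add(self, score: float, group: dict[str, Any]) -> None:
        entry = (score, self._counter, group)
        self._counter += 1
        if len(self._heap) < self.max_size:
            heapq.heappush(self._heap, entry)
        elif score > self._heap[0][0]:
            # Evict the lowest-scoring stored group to make room.
            heapq.heapreplace(self._heap, entry)

    def sample(self) -> dict[str, Any]:
        # Return the highest-scoring stored group; the PR's actual
        # sampling strategy may differ (e.g. score-weighted random).
        return max(self._heap, key=lambda e: e[0])[2]
```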

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.


pramodith commented Sep 10, 2025

I'm still mulling over the multi-GPU scenario: I'm wondering whether we should have the same buffer shared across all GPUs/processes or whether it's okay for each GPU/process to have its own buffer. Happy to hear your views on this @qgallouedec

Also still need to add an e2e test for training with the ReplayBuffer.


pramodith commented Sep 11, 2025

I should probably break the update_with_replay_buffer function into two smaller functions, since it's too big right now (a rough sketch of the split is included below):

  1. add_to_buffer
  2. replace_from_buffer.

I also need to add new test cases to confirm that the code works when the sequence lengths in the buffer differ from those in the current batch.
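
A rough sketch of what that split could look like, reusing the illustrative `ReplayBuffer` and `group_score` from the earlier sketch; the helper signatures and the per-group dict layout are assumptions, not the PR's actual code:

```python
from typing import Any


def add_to_buffer(buffer: "ReplayBuffer", groups: list[dict[str, Any]]) -> None:
    # Groups whose advantages have non-zero variance carry a learning
    # signal, so they are candidates for storage.
    for group in groups:
        std = group["advantages"].std()
        if std > 0:
            buffer.add(group_score(group["advantages"], std), group)


def replace_from_buffer(
    buffer: "ReplayBuffer", groups: list[dict[str, Any]]
) -> list[dict[str, Any]]:
    # Groups with zero variance contribute nothing to the loss, so swap
    # them for stored groups when the buffer is non-empty. Stored and
    # current groups may have different sequence lengths, so padding may
    # need re-aligning here (the test case mentioned above).
    replaced = []
    for group in groups:
        if group["advantages"].std() == 0 and len(buffer) > 0:
            replaced.append(buffer.sample())
        else:
            replaced.append(group)
    return replaced


def update_with_replay_buffer(
    buffer: "ReplayBuffer", groups: list[dict[str, Any]]
) -> list[dict[str, Any]]:
    add_to_buffer(buffer, groups)
    return replace_from_buffer(buffer, groups)
```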

@qgallouedec (Member)

can you migrate this into trl.experimental? 🙏

@pramodith (Collaborator, Author)

can you migrate this into trl.experimental? 🙏

Yes, will do.

@pramodith marked this pull request as ready for review on September 18, 2025 at 21:43
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Review comment from a Member on the new config class:

    class GRPOWithReplayBufferConfig(GRPOConfig):
        """
        New Parameters:
            replay_buffer_size (`int`, *optional*, defaults to `0`):

Suggested change:

    -         replay_buffer_size (`int`, *optional*, defaults to `0`):
    +         replay_buffer_size (`int`, *optional*, defaults to `64`):
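
For context, a minimal usage sketch with the suggested default of 64. The import path is an assumption about where the experimental class ends up, and `output_dir` is just an illustrative training-arguments value:

```python
# Hypothetical import path; the class is being moved under trl.experimental
# per the review above, so the exact module name is an assumption.
from trl.experimental.grpo_with_replay_buffer import GRPOWithReplayBufferConfig

config = GRPOWithReplayBufferConfig(
    output_dir="grpo-replay-buffer-demo",
    # New parameter documented above; 64 is the default suggested in the review.
    replay_buffer_size=64,
)
```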

@qgallouedec (Member) left a comment:

lgtm!

@pramodith merged commit d1e24df into huggingface:main on Sep 24, 2025
1 of 10 checks passed