### Description
In `agent_ppo_train.py`, the metric is computed as
`rollout_probs_diff = calculate_log_prob_diff(actor_probs, rollout_probs, response_mask_bool)`,
while the mask is built as:
```
attention_mask = batch.batch["attention_mask"]
responses = batch.batch["responses"]
response_length = responses.size(1)
response_mask = attention_mask[:, -response_length:]
```
In agentic RL, the tool-call (tool-output) part of the response should be masked out. If the mask is just `attention_mask[:, -response_length:]`, those tokens are still counted and the metric is inaccurate.
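A minimal sketch of the fix being suggested, assuming a per-response-token tool mask is available that is 0 on tokens returned by a tool. The `tool_mask` name and shape are illustrative assumptions, not the actual rLLM/verl API; plain Python lists stand in for tensors:

```python
# Hedged sketch: AND the attention-derived response slice with a
# hypothetical tool mask so tool-output tokens are excluded from the
# rollout-vs-actor log-prob diff metric.

def combined_response_mask(attention_mask, tool_mask, response_length):
    """Element-wise AND of the attention response slice and a tool mask.

    attention_mask: 0/1 per token over the full sequence
    tool_mask:      0/1 per response token (0 = tool-returned token)
    response_length: number of response tokens at the end of the sequence
    """
    response_slice = attention_mask[-response_length:]
    return [a & t for a, t in zip(response_slice, tool_mask)]

# Example: 6-token sequence, 4 response tokens, the middle two of which
# are tool output and should not contribute to the metric.
mask = combined_response_mask(
    attention_mask=[1, 1, 1, 1, 1, 1],
    tool_mask=[1, 0, 0, 1],
    response_length=4,
)
```

With the plain attention slice the middle two tool tokens would be 1; the combined mask zeroes them, so `calculate_log_prob_diff` would only average over model-generated tokens.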
### Steps to Reproduce
See the mask construction in `agent_ppo_train.py` described above.
### Error Output / Traceback
### rLLM Version
0.2.1post
### Training Backend
verl
### Python Version
3.9
### GPU / CUDA Version
No response
### vLLM Version (if applicable)
No response
### Training Script / Config
### Additional Context
No response