
Model weights fail to update while MLPs update with group_map on MeltingPot/predator_prey__orchard #242

@QLs-Learning-Bot

Description


When applying the group_map settings recommended in #78 to the meltingpot/predator_prey__orchard environment, with IPPO and a CNN encoder, the model weights barely update during training (see observations below).
BenchMARL: 1.5.0
torchrl: 0.10.0
torch: 2.9.0+cu126
hardware: 96-core CPU; RTX 4090 GPU
system: Ubuntu

The config file I used:

defaults:
  - experiment: base_experiment
  - algorithm: ippo
  - task: meltingpot/predator_prey__orchard
  - model: layers/cnn
  - model@critic_model: layers/cnn
  - _self_

hydra:
  searchpath:
    - pkg://benchmarl/conf

seed: 199

task:
  max_steps: 500
  group_map:
    pred_policy: ["player_0", "player_1", "player_2", "player_3", "player_4"]
    prey_policy: ["player_5", "player_6", "player_7", "player_8", "player_9", "player_10", "player_11", "player_12"]

model:
  mlp_num_cells: [256, 256]
  cnn_num_cells: [16, 32, 256]
  cnn_kernel_sizes: [8, 4, 11]
  cnn_strides: [4, 2, 1]
  cnn_paddings: [2, 1, 5]
  cnn_activation_class: torch.nn.ReLU

critic_model:
  mlp_num_cells: [256, 256]
  cnn_num_cells: [16, 32, 256]
  cnn_kernel_sizes: [8, 4, 11]
  cnn_strides: [4, 2, 1]
  cnn_paddings: [2, 1, 5]
  cnn_activation_class: torch.nn.ReLU

algorithm:
  entropy_coef: 0.001
  use_tanh_normal: True

experiment:
  sampling_device: "cpu"
  train_device: "cuda:1"
  buffer_device: "cpu"

  share_policy_params: False
  prefer_continuous_actions: False
  collect_with_grad: False
  gamma: 0.99

  adam_eps: 0.000001
  lr: 0.00025
  clip_grad_norm: True
  clip_grad_val: 5

  max_n_iters: null
  max_n_frames: 10_000_000

  parallel_collection: True
  on_policy_collected_frames_per_batch: 2_000
  on_policy_n_envs_per_worker: 20
  on_policy_n_minibatch_iters: 32
  on_policy_minibatch_size: 100

  evaluation: true
  render: True
  evaluation_interval: 20_000
  evaluation_episodes: 5
  evaluation_deterministic_actions: False
  evaluation_static: False

  loggers: [wandb, csv]
  create_json: False

  save_folder: null
  restore_file: null
  restore_map_location: null

  checkpoint_interval: 200_000
  checkpoint_at_end: true
  keep_checkpoints_num: 10000
  exclude_buffer_from_checkpoint: True

Expected:
Grouped IPPO should show a positive learning trend on orchard (comparable to the ungrouped baseline within variance).
Actual observation:
Episode return moves up and down, likely due to sampling noise.
Weight-norm trajectories show very little drift.
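
For reference, this is roughly how the weight-norm curves were computed. It is a minimal PyTorch sketch; `policy_module` is a placeholder for the per-group actor network, not a real BenchMARL attribute.

```python
import torch

def weight_norms(module: torch.nn.Module) -> dict[str, float]:
    """Per-parameter L2 norms; logging these every iteration gives the drift curves."""
    return {name: p.detach().norm().item() for name, p in module.named_parameters()}

# `policy_module` is a placeholder for the grouped policy network; substitute
# however you access the per-group actor in your setup.
# norms_before = weight_norms(policy_module)
# ... run one training iteration ...
# norms_after = weight_norms(policy_module)
# drift = {k: abs(norms_after[k] - norms_before[k]) for k in norms_before}
```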


Diagnostics done

  1. Verified that the optimizer contains all parameters (checked by identity after .to(device)), including .cnn (see the sketch after this list).
  2. Setting share_policy_params to False leads to GPU memory overflow; I have not yet figured out how to deal with this.
  3. Using MASAC may produce NaNs in the sampled logits (worked around by manually clamping them to 0).
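
The identity check from item 1 looked roughly like this. It is a sketch under the assumption that `actor` and `optimizer` are the grouped policy module and its optimizer (placeholder names, not BenchMARL API); an empty result means every parameter, including the .cnn submodule, is covered.

```python
import torch

def params_missing_from_optimizer(
    optimizer: torch.optim.Optimizer, module: torch.nn.Module
) -> list[str]:
    """Names of module parameters that are NOT in any of the optimizer's param groups.

    The identity check is done after .to(device), because moving a module can
    replace its parameter tensors and silently orphan a pre-built optimizer.
    """
    opt_params = {id(p) for group in optimizer.param_groups for p in group["params"]}
    return [name for name, p in module.named_parameters() if id(p) not in opt_params]

# `actor` and `optimizer` are placeholders for the grouped policy and its optimizer.
# missing = params_missing_from_optimizer(optimizer, actor)
# assert not missing, missing
```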

Some other questions:

  1. Are there any configuration caveats for grouped vs. ungrouped training (e.g., share_policy_params, batch sizes)?
  2. In the IPPO training path, when collect_with_grad=False, are encoder features ever reused from the collector under no_grad (via the TensorDict) in a way that could cut autograd to the encoders for grouped policies? (A gradient-check sketch is included after this list.)
  3. Are there any known interactions between group_map and multi-policy training that can lead to tiny/zero encoder updates? The next step is to train Meltingpot's CNN + LSTM with MASAC and grouped critics, so any suggestions on parameter settings would be appreciated.
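
To make question 2 concrete, this is the kind of gradient check I have in mind: after a backward pass on the per-group loss, the CNN parameters should have non-zero gradients, and all-zero or missing gradients would point to autograd being cut at the encoder. `actor` and `loss` are placeholder names for the grouped policy module and the IPPO objective, not BenchMARL API.

```python
import torch

def grad_norms(module: torch.nn.Module) -> dict[str, float]:
    """Per-parameter gradient norms after loss.backward(); NaN marks params with no grad."""
    return {
        name: (p.grad.norm().item() if p.grad is not None else float("nan"))
        for name, p in module.named_parameters()
    }

# Hypothetical usage (placeholder names, not BenchMARL API):
# loss.backward(retain_graph=True)
# cnn_grads = {k: v for k, v in grad_norms(actor).items() if ".cnn" in k}
# print(cnn_grads)  # zeros/NaNs here would indicate the encoder is detached
```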
