
Model weights fail to update while MLPs update with group_map on MeltingPot/predator_prey__orchard #242

@QLs-Learning-Bot

Description


When applying the group_map settings recommended in #78 to the meltingpot/predator_prey__orchard environment, with IPPO and a CNN encoder, the model weights barely update during training (see observations below).
BenchMARL: 1.5.0
torchrl: 0.10.0
torch: 2.9.0+cu126
hardware: 96-core CPU; RTX 4090 GPU
system: Ubuntu

The config file I used:

defaults:
  - experiment: base_experiment
  - algorithm: ippo
  - task: meltingpot/predator_prey__orchard
  - model: layers/cnn
  - model@critic_model: layers/cnn
  - _self_

hydra:
  searchpath:
    - pkg://benchmarl/conf

seed: 199

task:
  max_steps: 500
  group_map:
    pred_policy: ["player_0", "player_1", "player_2", "player_3", "player_4"]
    prey_policy: ["player_5", "player_6", "player_7", "player_8", "player_9", "player_10", "player_11", "player_12"]

model:
  mlp_num_cells: [256, 256]
  cnn_num_cells: [16, 32, 256]
  cnn_kernel_sizes: [8, 4, 11]
  cnn_strides: [4, 2, 1]
  cnn_paddings: [2, 1, 5]
  cnn_activation_class: torch.nn.ReLU

critic_model:
  mlp_num_cells: [256, 256]
  cnn_num_cells: [16, 32, 256]
  cnn_kernel_sizes: [8, 4, 11]
  cnn_strides: [4, 2, 1]
  cnn_paddings: [2, 1, 5]
  cnn_activation_class: torch.nn.ReLU

algorithm:
  entropy_coef: 0.001
  use_tanh_normal: True

experiment:
  sampling_device: "cpu"
  train_device: "cuda:1"
  buffer_device: "cpu"

  share_policy_params: False
  prefer_continuous_actions: False
  collect_with_grad: False
  gamma: 0.99

  adam_eps: 0.000001
  lr: 0.00025
  clip_grad_norm: True
  clip_grad_val: 5

  max_n_iters: null
  max_n_frames: 10_000_000

  parallel_collection: True
  on_policy_collected_frames_per_batch: 2_000
  on_policy_n_envs_per_worker: 20
  on_policy_n_minibatch_iters: 32
  on_policy_minibatch_size: 100

  evaluation: true
  render: True
  evaluation_interval: 20_000
  evaluation_episodes: 5
  evaluation_deterministic_actions: False
  evaluation_static: False

  loggers: [wandb, csv]
  create_json: False

  save_folder: null
  restore_file: null
  restore_map_location: null

  checkpoint_interval: 200_000
  checkpoint_at_end: true
  keep_checkpoints_num: 10000
  exclude_buffer_from_checkpoint: True

Expected:
Grouped IPPO should show a positive learning trend on orchard (comparable to the ungrouped baseline within variance).
Actual observation:
Episode return moves up and down, likely due to sampling noise.
Weight-norm trajectories show very little drift.
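
For reference, this is roughly how the weight-norm curves were computed. It is a minimal PyTorch sketch; `policy_module` is a placeholder for the per-group actor network, not a real BenchMARL attribute.

```python
import torch

def weight_norms(module: torch.nn.Module) -> dict[str, float]:
    """Per-parameter L2 norms; logging these every iteration gives the drift curves."""
    return {name: p.detach().norm().item() for name, p in module.named_parameters()}

# `policy_module` is a placeholder for the grouped policy network; substitute
# however you access the per-group actor in your setup.
# norms_before = weight_norms(policy_module)
# ... run one training iteration ...
# norms_after = weight_norms(policy_module)
# drift = {k: abs(norms_after[k] - norms_before[k]) for k in norms_before}
```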


Diagnostics done

  1. Verified that the optimizer contains all parameters (checked by identity after .to(device)), including .cnn (see the sketch after this list).
  2. Setting share_policy_params to False leads to GPU memory overflow; I have not yet figured out how to deal with this.
  3. Using MASAC may produce NaNs in the sampled logits (worked around by manually clamping them to 0).
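
The identity check from item 1 looked roughly like this. It is a sketch under the assumption that `actor` and `optimizer` are the grouped policy module and its optimizer (placeholder names, not BenchMARL API); an empty result means every parameter, including the .cnn submodule, is covered.

```python
import torch

def params_missing_from_optimizer(
    optimizer: torch.optim.Optimizer, module: torch.nn.Module
) -> list[str]:
    """Names of module parameters that are NOT in any of the optimizer's param groups.

    The identity check is done after .to(device), because moving a module can
    replace its parameter tensors and silently orphan a pre-built optimizer.
    """
    opt_params = {id(p) for group in optimizer.param_groups for p in group["params"]}
    return [name for name, p in module.named_parameters() if id(p) not in opt_params]

# `actor` and `optimizer` are placeholders for the grouped policy and its optimizer.
# missing = params_missing_from_optimizer(optimizer, actor)
# assert not missing, missing
```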

Some other questions:

  1. Are there any configuration caveats for grouped vs. ungrouped training (e.g., share_policy_params, batch sizes)?
  2. In the IPPO training path, when collect_with_grad=False, are encoder features ever reused from the collector under no_grad (via the TensorDict) in a way that could cut autograd to the encoders for grouped policies? (A gradient-check sketch is included after this list.)
  3. Are there any known interactions between group_map and multi-policy training that can lead to tiny/zero encoder updates? The next step is to train Meltingpot's CNN + LSTM with MASAC and grouped critics, so any suggestions on parameter settings would be appreciated.
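
To make question 2 concrete, this is the kind of gradient check I have in mind: after a backward pass on the per-group loss, the CNN parameters should have non-zero gradients, and all-zero or missing gradients would point to autograd being cut at the encoder. `actor` and `loss` are placeholder names for the grouped policy module and the IPPO objective, not BenchMARL API.

```python
import torch

def grad_norms(module: torch.nn.Module) -> dict[str, float]:
    """Per-parameter gradient norms after loss.backward(); NaN marks params with no grad."""
    return {
        name: (p.grad.norm().item() if p.grad is not None else float("nan"))
        for name, p in module.named_parameters()
    }

# Hypothetical usage (placeholder names, not BenchMARL API):
# loss.backward(retain_graph=True)
# cnn_grads = {k: v for k, v in grad_norms(actor).items() if ".cnn" in k}
# print(cnn_grads)  # zeros/NaNs here would indicate the encoder is detached
```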
