When applying the group_map settings recommended in #78 to the meltingpot/predator_prey__orchard environment with IPPO and a CNN encoder, training does not behave as expected (details below).
BenchMARL: 1.5.0
torchrl: 0.10.0
torch: 2.9.0+cu126
hardware: 96-core CPU; RTX 4090 GPU
system: Ubuntu
The config file that I used:
```yaml
defaults:
  - experiment: base_experiment
  - algorithm: ippo
  - task: meltingpot/predator_prey__orchard
  - model: layers/cnn
  - model@critic_model: layers/cnn
  - _self_

hydra:
  searchpath:
    - pkg://benchmarl/conf

seed: 199

task:
  max_steps: 500
  group_map:
    pred_policy: ["player_0", "player_1", "player_2", "player_3", "player_4"]
    prey_policy: ["player_5", "player_6", "player_7", "player_8", "player_9", "player_10", "player_11", "player_12"]

model:
  mlp_num_cells: [256, 256]
  cnn_num_cells: [16, 32, 256]
  cnn_kernel_sizes: [8, 4, 11]
  cnn_strides: [4, 2, 1]
  cnn_paddings: [2, 1, 5]
  cnn_activation_class: torch.nn.ReLU

critic_model:
  mlp_num_cells: [256, 256]
  cnn_num_cells: [16, 32, 256]
  cnn_kernel_sizes: [8, 4, 11]
  cnn_strides: [4, 2, 1]
  cnn_paddings: [2, 1, 5]
  cnn_activation_class: torch.nn.ReLU

algorithm:
  entropy_coef: 0.001
  use_tanh_normal: True

experiment:
  sampling_device: "cpu"
  train_device: "cuda:1"
  buffer_device: "cpu"
  share_policy_params: False
  prefer_continuous_actions: False
  collect_with_grad: False
  gamma: 0.99
  adam_eps: 0.000001
  lr: 0.00025
  clip_grad_norm: True
  clip_grad_val: 5
  max_n_iters: null
  max_n_frames: 10_000_000
  parallel_collection: True
  on_policy_collected_frames_per_batch: 2_000
  on_policy_n_envs_per_worker: 20
  on_policy_n_minibatch_iters: 32
  on_policy_minibatch_size: 100
  evaluation: true
  render: True
  evaluation_interval: 20_000
  evaluation_episodes: 5
  evaluation_deterministic_actions: False
  evaluation_static: False
  loggers: [wandb, csv]
  create_json: False
  save_folder: null
  restore_file: null
  restore_map_location: null
  checkpoint_interval: 200_000
  checkpoint_at_end: true
  keep_checkpoints_num: 10000
  exclude_buffer_from_checkpoint: True
```
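As a standalone sanity check (plain Python, no BenchMARL imports; written here only for illustration), the group_map above can be verified to partition the 13 players exactly once:

```python
# Standalone sanity check: the group_map must cover every player exactly once,
# with no overlap between the predator and prey groups.
group_map = {
    "pred_policy": [f"player_{i}" for i in range(0, 5)],    # player_0 .. player_4
    "prey_policy": [f"player_{i}" for i in range(5, 13)],   # player_5 .. player_12
}

flat = [p for players in group_map.values() for p in players]
assert len(flat) == len(set(flat)), "a player is assigned to more than one group"
assert set(flat) == {f"player_{i}" for i in range(13)}, "group_map does not cover all 13 players"
print("group_map partitions all 13 players exactly once")
```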
Expected:
Grouped IPPO should show a positive learning trend on orchard (comparable to the ungrouped baseline within variance).
Actual observation:
- Episode return moves up and down with no clear trend, likely due to sampling noise.
- Weight-norm trajectories show very small drift (see the sketch below).
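For context, the weight norms were tracked per submodule roughly like this (minimal sketch; `policy_module` is a placeholder for the per-group policy network):

```python
import torch

def weight_norms(module: torch.nn.Module) -> dict:
    """L2 norm of the parameters of each direct submodule (e.g. cnn / mlp)."""
    norms = {}
    for name, sub in module.named_children():
        flat = [p.detach().reshape(-1) for p in sub.parameters()]
        if flat:
            norms[name] = torch.linalg.vector_norm(torch.cat(flat)).item()
    return norms

# Logged once per training iteration; a healthy encoder should show norms drifting
# over iterations rather than staying essentially constant.
# before = weight_norms(policy_module)   # policy_module: placeholder
# ... one training iteration ...
# after = weight_norms(policy_module)
# drift = {k: abs(after[k] - before[k]) for k in before}
```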
Diagnostics done:
- Verified that the optimizer contains all policy parameters (checked by identity after `.to(device)`), including `.cnn`; see the sketch after this list.
- Setting `share_policy_params` to `False` leads to GPU memory overflow; I have not yet figured out how to deal with this.
- Using MASAC may lead to NaN in the sampled logits (worked around by manually clamping to 0).
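The optimizer coverage check mentioned above was done by parameter identity, roughly like this (minimal sketch; `optimizer` and `policy_module` are placeholders for the per-group optimizer and policy):

```python
import torch

def params_missing_from_optimizer(optimizer: torch.optim.Optimizer,
                                  module: torch.nn.Module) -> list:
    """Names of module parameters not present in the optimizer's param_groups (by identity)."""
    opt_ids = {id(p) for group in optimizer.param_groups for p in group["params"]}
    return [name for name, p in module.named_parameters() if id(p) not in opt_ids]

# Usage (placeholders): an empty list means everything, including policy_module.cnn,
# is covered. nn.Module.to(device) moves parameters in place, so the identity check
# still holds after the move; rebuilding the module after creating the optimizer
# would break it and leave stale parameters in the optimizer.
# print(params_missing_from_optimizer(optimizer, policy_module))
```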
Some other questions:
- Are there any config caveats for grouped vs. ungrouped training (e.g., `share_policy_params`, batch sizes)?
- In the IPPO training path, when `collect_with_grad=False`, are encoder features ever reused from the collector under `no_grad` (via TensorDict) in a way that could cut autograd to the encoders for grouped policies? (See the gradient-flow sketch after this list.)
- Are there any known interactions between `group_map` and multi-policy training that can lead to tiny/zero encoder updates? Since the next step is to switch back to training Melting Pot's CNN + LSTM with MASAC + grouped critics, any suggestions on parameter settings would be appreciated.
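For the `collect_with_grad` question, this is the kind of gradient-flow check I can run right after one loss backward pass (sketch only; I have not traced BenchMARL's IPPO loss internals, and `policy_module` is a placeholder):

```python
import torch

def encoder_grad_norms(policy_module: torch.nn.Module, key: str = "cnn") -> dict:
    """Per-parameter gradient norms for the encoder; call right after loss.backward()."""
    report = {}
    for name, p in policy_module.named_parameters():
        if key in name:
            report[name] = None if p.grad is None else p.grad.norm().item()
    return report

# If every cnn entry is None (or exactly 0.0), the autograd graph is not reaching the
# encoder in the loss, e.g. because features were taken from detached collector outputs
# instead of being recomputed during the update.
```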