Description
```python
layer.output_adapters = BottleneckLayer("output_adapter", is_layer_hooked=True)
ln_2_get_fn = lambda: multigetattr(layer, model.adapter_interface.layer_ln_2, None)
layer_output_proj.register_forward_hook(partial(hook_fn, layer.output_adapters, ln_2_get_fn))
```
This code causes `layer.output_adapters` on `cuda:n` to always point to the `layer.output_adapters` on `cuda:0` during multi-GPU training with the default distributed settings of the Hugging Face Trainer, even though the model itself is properly distributed across the GPUs. I suspect the cause is `partial`: it binds `layer.output_adapters` at hook-registration time, so every replica's hook keeps referencing the original module on `cuda:0`. As a workaround, I tried saving variables like `layer.xxx` and `layer` in the hook's context so that it can run on multiple GPUs.
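For illustration, here is a minimal, self-contained sketch of the mechanism I suspect (the `Block`, `hook_fn`, and `adapter` names are mine, not from the library): a forward hook bound with `functools.partial` keeps a hard reference to the module that existed at registration time, so when `nn.DataParallel` replicates the model, the hook on the `cuda:1` replica still calls into the `cuda:0` copy.

```python
from functools import partial

import torch
import torch.nn as nn


def hook_fn(bound_module, module, args, output):
    # `bound_module` was frozen into the hook by partial at registration
    # time; `module` is the replica the hook actually fires on.
    bound_dev = next(bound_module.parameters()).device
    print(f"bound module on {bound_dev}, activation on {output.device}")


class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(8, 8)
        self.adapter = nn.Linear(8, 8)  # stand-in for output_adapters

    def forward(self, x):
        return self.proj(x)


model = Block().cuda()
# The partial pins model.adapter (living on cuda:0) into the hook closure.
model.proj.register_forward_hook(partial(hook_fn, model.adapter))

dp = nn.DataParallel(model, device_ids=[0, 1])
dp(torch.randn(4, 8, device="cuda:0"))
# Expected print from the cuda:1 replica:
# "bound module on cuda:0, activation on cuda:1"
```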
While debugging, variables such as `residual` and the hidden states are shown to be on `cuda:1`, but `layer` is shown to be on `cuda:0`. I also printed the address of the `layer` variable on both GPUs: the address on `cuda:1` is identical to the one on `cuda:0`, i.e. the hooks on both replicas reference the same Python object. Since my GPU cannot handle models like Qwen, and it is not easy to provide data for my own model, could you please test whether this problem occurs in multi-GPU training? Thank you! I followed the adapters-for-any-transformer process.
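For reference, this is the direction my workaround attempt took, shown only as a rough sketch: the adapter is registered as a child of the hooked projection and resolved from the `module` argument inside the hook, so each replica picks up its own copy. All names here are illustrative, and the plain `adapter(output)` call is a placeholder for the real `BottleneckLayer` forward, which takes more arguments (e.g. the residual and the layer norm).

```python
import torch.nn as nn


def replica_local_hook(module, args, output):
    # `module` is always the replica the hook fires on, so its
    # `output_adapters` child already lives on the correct device.
    adapter = module.output_adapters
    return adapter(output)  # placeholder for the real BottleneckLayer call


def install_adapter(layer_output_proj: nn.Module, adapter: nn.Module) -> None:
    # Registering the adapter as a submodule of the hooked projection lets
    # nn.DataParallel replicate and move it together with the projection,
    # instead of freezing a cuda:0 reference into a partial.
    layer_output_proj.add_module("output_adapters", adapter)
    layer_output_proj.register_forward_hook(replica_local_hook)
```

I am not sure whether this plays well with the `ln_2_get_fn` lookup in the real code, which is why I would appreciate a test on your side.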