DDP training and storing rank specific info in checkpoints #21097
-
I'm working on preserving state between start/stop of training runs in a manner that guarantees reproducible results. That is, I'd like to be able to stop my training at any given checkpoint, then restart the training from that checkpoint and run to completion, and have the results match (exactly) the results obtained from a single continuous run. I've been able to do this on single-node setups by storing the RNG states within the model checkpoint and restoring the corresponding states when training resumes.

I'm now trying to perform the same checkpoint save/load procedure on a multi-node setup with a DDP strategy. My attempt was to append the global-rank-specific RNG states to the checkpoint dictionary, which I had thought would then be saved appropriately. However, when I executed the code, the only RNG state preserved within the checkpoint dictionary is the rank-0 state. Can someone please advise on how to preserve the RNG states from the other ranks within the checkpoint in a DDP setup? As a higher-level question: if there is a better way to preserve these states between training runs than checkpoint storage and re-instantiation, that information would also be welcome.

The main Callback save routine I'm using is posted below. I've then been checking the contents of the saved checkpoint dictionary by manually loading the saved checkpoint file.

python version: 3.9.12
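A minimal sketch of the kind of per-rank RNG-saving callback described above, assuming the standard Lightning `Callback.on_save_checkpoint` hook and `trainer.global_rank`; the original routine may differ:

```python
import random

import numpy as np
import torch
from pytorch_lightning import Callback  # or: from lightning.pytorch import Callback


class RNGStateCheckpoint(Callback):
    """Append this rank's RNG states to the checkpoint dictionary."""

    def on_save_checkpoint(self, trainer, pl_module, checkpoint):
        rank = trainer.global_rank
        checkpoint.setdefault("rng_states", {})
        checkpoint["rng_states"][rank] = {
            "python": random.getstate(),
            "numpy": np.random.get_state(),
            "torch_cpu": torch.get_rng_state(),
            # Only the current device's CUDA state; adjust for multi-device processes.
            "torch_cuda": torch.cuda.get_rng_state() if torch.cuda.is_available() else None,
        }
```

As written, each rank mutates only its own copy of the checkpoint dictionary, which is exactly the behavior discussed in the replies below: only the rank-0 copy ever reaches disk.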
Replies: 2 comments
-
I believe I have at least narrowed down why my approach is not working: the checkpoint file is only written by the global-rank-0 process, so whatever the non-zero ranks add to their own copies of the checkpoint dictionary never reaches disk. This at least explains why I only see the rank-0 information in the saved checkpoint.

So my question can now be reduced to: is there any way to synchronize, or otherwise send, the checkpoint dictionaries from each rank to the global-rank-0 process? As a workaround I can do some pretty hacky temporary save-and-load routines in the callback hooks, but a supported way to gather the per-rank state would be preferable.
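One way to get every rank's state onto rank 0 before the checkpoint is written; this is a sketch, not an officially documented Lightning pattern, and it assumes the hook runs on all ranks (as it appears to in this scenario) and uses `torch.distributed.all_gather_object`:

```python
import torch
import torch.distributed as dist
from pytorch_lightning import Callback


class GatherRNGStates(Callback):
    def on_save_checkpoint(self, trainer, pl_module, checkpoint):
        # Add CUDA/numpy/python states as needed; objects must be picklable.
        local_state = {"torch_cpu": torch.get_rng_state()}

        if dist.is_available() and dist.is_initialized():
            gathered = [None] * dist.get_world_size()
            # Collective call: every rank must reach this hook, or it will hang.
            dist.all_gather_object(gathered, local_state)
        else:
            gathered = [local_state]

        # Only rank 0's checkpoint dict is written, but it now holds every rank.
        checkpoint["rng_states"] = {rank: s for rank, s in enumerate(gathered)}
```

Because `all_gather_object` is a collective, this only works if the checkpoint hook is invoked on every process; if it ever runs on a subset of ranks, the call will deadlock.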
-
After extensive doc-searching and even more extensive trial-and-error, I believe I have a good understanding of this issue. Unfortunately, unless there is a flag within pytorch, ddp, lightning, or CUDA governing GPU scheduling determinism that I don't know about, my main problem of forcing exact reproducibility seems impossible (or at least largely impractical) for reasons I'll summarize below. I am moving on from this problem by simply avoiding the model stop/restart process I mentioned above via other methods, but I'm posting the information I have discovered in case it helps anyone with a similar problem.

For starters, my second comment is more-or-less correct. Each rank (i.e., each process) gets its own instantiation of the entire training run (and thus all object instantiations, including trainers, dataloaders, callbacks, etc.), while under the hood DDP keeps the ranks in step by aggregating the gradients across processes.

The gradients are the first location where I encountered a discrepancy between a continuous training run and a run that involved a stop-checkpoint-restart at a given epoch (epoch 3 in my case). At this breakpoint, the two training runs had different gradients going into the next optimization step (the epoch-3 update step), where the discrepancy was O(10^-11). After a lot of additional model hacking to debug the local gradients on each rank, I determined this discrepancy came purely from the order in which the gradients are synchronized (floating-point addition is not perfectly associative, so the summation order matters). For example, on the continuous training run with 4 ranks (GPUs) the gradients were aggregated in the order [0, 1, 2, 3], while on the restarted training run they were aggregated in an order such as [1, 2, 3, 0] (or a similar permutation, but certainly not the same order). I verified this by manually combining all of the local gradients from the restarted training run in different orders until I was able to reproduce the aggregated gradient from both the continuous training run and the restarted training run. That is, I could get different aggregated gradient values purely from the order in which I performed the addition/averaging of the local gradients, with one ordering matching the restarted run and another matching the continuous run.

This indicated to me that the order in which the GPUs aggregate their gradients is not the same between my two training runs, though it is still deterministic (i.e., the restarted training run always gave the same discrepancy from the continuous training run). Short of controlling the exact scheduling of the GPU processes, and/or rewriting my model code to perform the aggregation/optimization steps manually, exact reproducibility between these two runs does not appear to be possible. If anyone does stumble across this and has more information or better ideas, I'd be happy to learn more here.

Debugging local gradients

Here is a little more information for anyone encountering similar problems, covering the steps I had to take within my own code to inspect the per-rank gradients. The synchronization of the gradients happens almost immediately in the training process and is handled using some form of hook/observer/subscriber pattern. I'm not sure of the specifics, but it is definitely opaque to the high-level pytorch-lightning user. This means that by the time one is inside a callback hook, the gradients one sees have already been synchronized across ranks.
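The reply doesn't spell out the exact hack used. One possible way to capture the pre-reduction (local) gradients is to register a DDP communication hook that records each gradient bucket before delegating to the standard all-reduce; a rough sketch, assuming direct access to the `DistributedDataParallel` wrapper and a recent PyTorch (for `GradBucket.buffer()`):

```python
import torch
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

local_grad_buckets = []  # pre-reduction gradients seen by this rank


def logging_allreduce_hook(process_group, bucket):
    # Clone this rank's local gradients before they are averaged across ranks.
    local_grad_buckets.append(bucket.buffer().detach().clone())
    # Delegate to the built-in all-reduce hook so training behaves as usual.
    return default_hooks.allreduce_hook(process_group, bucket)


# ddp_model is the torch.nn.parallel.DistributedDataParallel-wrapped module;
# with Lightning it can be reached once the strategy has wrapped the model.
# ddp_model.register_comm_hook(state=None, hook=logging_allreduce_hook)
```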
Once the local gradients are cloned, detached, and stored per rank, they can be written out and compared across runs and ranks, which is how I pinpointed that only the summation order of the aggregation differed between the continuous and restarted runs.
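A small, self-contained illustration of the effect described above, using hypothetical tensors standing in for the four ranks' local gradients; summing the same values in a different order generally produces a slightly different result:

```python
import torch

torch.manual_seed(0)
# Stand-ins for the local gradients of a 4-GPU DDP run (one tensor per rank).
grads = [torch.randn(100_000, dtype=torch.float32) for _ in range(4)]


def average(order):
    total = torch.zeros_like(grads[0])
    for rank in order:  # accumulate in the given rank order
        total = total + grads[rank]
    return total / len(grads)


avg_a = average([0, 1, 2, 3])
avg_b = average([1, 2, 3, 0])

# Non-associativity of floating-point addition: the two averages can differ
# slightly even though they sum exactly the same values.
print((avg_a - avg_b).abs().max())
```

This is the same mechanism the reply attributes the O(10^-11) discrepancy to: the values being reduced are identical between runs, only the reduction order differs.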