DDP training and storing rank specific info in checkpoints #21097
-
I'm working on preserving state between start/stop of training runs in a manner that guarantees reproducible results. That is, I'd like to be able to stop my training at any given checkpoint, then restart the training from that checkpoint and run to completion, and have the results match (exactly) the results obtained from a single continuous run. I've been able to do this on single-node setups by storing the RNG states within the model checkpoint and restoring the corresponding states when training resumes.

I'm now trying to perform the same checkpoint save/load procedure on a multi-node setup with a DDP strategy. My attempt was to append the global-rank-specific RNG states to the checkpoint dictionary, which I had thought would then be saved appropriately. However, when I executed the code, the only RNG state preserved within the checkpoint dictionary is the rank-0 state. Can someone please advise on how to preserve the RNG states from the other ranks within the checkpoint in a DDP setup? As a higher-level question: if there is a better way to preserve these states between training runs than checkpoint storage and re-instantiation, that information would also be welcome.

The main Callback save routine I'm using is posted below. I've then been checking the contents of the saved checkpoint dictionary by manually loading the saved checkpoint file.

python version: 3.9.12
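A minimal sketch of the kind of per-rank RNG-saving callback described above, assuming the standard Lightning `Callback.on_save_checkpoint` hook and `trainer.global_rank`; the original routine may differ:

```python
import random

import numpy as np
import torch
from pytorch_lightning import Callback  # or: from lightning.pytorch import Callback


class RNGStateCheckpoint(Callback):
    """Append this rank's RNG states to the checkpoint dictionary."""

    def on_save_checkpoint(self, trainer, pl_module, checkpoint):
        rank = trainer.global_rank
        checkpoint.setdefault("rng_states", {})
        checkpoint["rng_states"][rank] = {
            "python": random.getstate(),
            "numpy": np.random.get_state(),
            "torch_cpu": torch.get_rng_state(),
            # Only the current device's CUDA state; adjust for multi-device processes.
            "torch_cuda": torch.cuda.get_rng_state() if torch.cuda.is_available() else None,
        }
```

As written, each rank mutates only its own copy of the checkpoint dictionary, which is exactly the behavior discussed in the replies below: only the rank-0 copy ever reaches disk.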
Replies: 2 comments
-
I believe I have at least narrowed down why my approach is not working: the checkpoint file is only written by the global-rank-0 process, so whatever the non-zero ranks add to their own copies of the checkpoint dictionary never reaches disk. This at least explains why I only see the rank-0 information in the saved checkpoint.

So my question can now be reduced to: is there any way to synchronize, or otherwise send, the checkpoint dictionaries from each rank to the global-rank-0 process? As a workaround I can do some pretty hacky temporary save-and-load routines in the callback hooks, but a supported way to gather the per-rank state would be preferable.
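One way to get every rank's state onto rank 0 before the checkpoint is written; this is a sketch, not an officially documented Lightning pattern, and it assumes the hook runs on all ranks (as it appears to in this scenario) and uses `torch.distributed.all_gather_object`:

```python
import torch
import torch.distributed as dist
from pytorch_lightning import Callback


class GatherRNGStates(Callback):
    def on_save_checkpoint(self, trainer, pl_module, checkpoint):
        # Add CUDA/numpy/python states as needed; objects must be picklable.
        local_state = {"torch_cpu": torch.get_rng_state()}

        if dist.is_available() and dist.is_initialized():
            gathered = [None] * dist.get_world_size()
            # Collective call: every rank must reach this hook, or it will hang.
            dist.all_gather_object(gathered, local_state)
        else:
            gathered = [local_state]

        # Only rank 0's checkpoint dict is written, but it now holds every rank.
        checkpoint["rng_states"] = {rank: s for rank, s in enumerate(gathered)}
```

Because `all_gather_object` is a collective, this only works if the checkpoint hook is invoked on every process; if it ever runs on a subset of ranks, the call will deadlock.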
-
After extensive doc-searching and even more extensive trial-and-error, I believe I have a good understanding of this issue. Unfortunately, unless there is a flag within pytorch, ddp, lightning, or CUDA governing GPU scheduling determinism that I don't know about, my main problem of forcing exact reproducibility seems impossible (or at least largely impractical) for reasons I'll summarize below. I am moving on from this problem by simply avoiding the model stop/restart process I mentioned above via other methods, but I'm posting the information I have discovered in case it helps anyone with a similar problem.

For starters, my second comment is more-or-less correct. Each rank (i.e., each process) gets its own instantiation of the entire training run (and thus all object instantiations, including trainers, dataloaders, callbacks, etc.), while under the hood DDP keeps the ranks in step by aggregating the gradients across processes.

The gradients are the first location where I encountered a discrepancy between a continuous training run and a run that involved a stop-checkpoint-restart at a given epoch (epoch 3 in my case). At this breakpoint, the two training runs had different gradients going into the next optimization step (the epoch-3 update step), where the discrepancy was O(10^-11). After a lot of additional model hacking to debug the local gradients on each rank, I determined this discrepancy came purely from the order in which the gradients are synchronized (floating-point addition is not perfectly associative, so the summation order matters). For example, on the continuous training run with 4 ranks (GPUs) the gradients were aggregated in the order [0, 1, 2, 3], while on the restarted training run they were aggregated in an order such as [1, 2, 3, 0] (or a similar permutation, but certainly not the same order). I verified this by manually combining all of the local gradients from the restarted training run in different orders until I was able to reproduce the aggregated gradient from both the continuous training run and the restarted training run. That is, I could get different aggregated gradient values purely from the order in which I performed the addition/averaging of the local gradients, with one ordering matching the restarted run and another matching the continuous run.

This indicated to me that the order in which the GPUs aggregate their gradients is not the same between my two training runs, though it is still deterministic (i.e., the restarted training run always gave the same discrepancy from the continuous training run). Short of controlling the exact scheduling of the GPU processes, and/or rewriting my model code to perform the aggregation/optimization steps manually, exact reproducibility between these two runs does not appear to be possible. If anyone does stumble across this and has more information or better ideas, I'd be happy to learn more here.

Debugging local gradients

Here is a little more information for anyone encountering similar problems, covering the steps I had to take within my own code to inspect the per-rank gradients. The synchronization of the gradients happens almost immediately in the training process and is handled using some form of hook/observer/subscriber pattern. I'm not sure of the specifics, but it is definitely opaque to the high-level pytorch-lightning user. This means that by the time one is inside a callback hook, the gradients one sees have already been synchronized across ranks.
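The reply doesn't spell out the exact hack used. One possible way to capture the pre-reduction (local) gradients is to register a DDP communication hook that records each gradient bucket before delegating to the standard all-reduce; a rough sketch, assuming direct access to the `DistributedDataParallel` wrapper and a recent PyTorch (for `GradBucket.buffer()`):

```python
import torch
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

local_grad_buckets = []  # pre-reduction gradients seen by this rank


def logging_allreduce_hook(process_group, bucket):
    # Clone this rank's local gradients before they are averaged across ranks.
    local_grad_buckets.append(bucket.buffer().detach().clone())
    # Delegate to the built-in all-reduce hook so training behaves as usual.
    return default_hooks.allreduce_hook(process_group, bucket)


# ddp_model is the torch.nn.parallel.DistributedDataParallel-wrapped module;
# with Lightning it can be reached once the strategy has wrapped the model.
# ddp_model.register_comm_hook(state=None, hook=logging_allreduce_hook)
```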
Once the local gradients are cloned, detached, and stored per rank, they can be written out and compared across runs and ranks, which is how I pinpointed that only the summation order of the aggregation differed between the continuous and restarted runs.
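A small, self-contained illustration of the effect described above, using hypothetical tensors standing in for the four ranks' local gradients; summing the same values in a different order generally produces a slightly different result:

```python
import torch

torch.manual_seed(0)
# Stand-ins for the local gradients of a 4-GPU DDP run (one tensor per rank).
grads = [torch.randn(100_000, dtype=torch.float32) for _ in range(4)]


def average(order):
    total = torch.zeros_like(grads[0])
    for rank in order:  # accumulate in the given rank order
        total = total + grads[rank]
    return total / len(grads)


avg_a = average([0, 1, 2, 3])
avg_b = average([1, 2, 3, 0])

# Non-associativity of floating-point addition: the two averages can differ
# slightly even though they sum exactly the same values.
print((avg_a - avg_b).abs().max())
```

This is the same mechanism the reply attributes the O(10^-11) discrepancy to: the values being reduced are identical between runs, only the reduction order differs.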