@@ -1587,26 +1587,35 @@ def _get_most_recently_modified_file_matching_pattern(self, pattern):
class BackupAndRestore(Callback):
"""Callback to back up and restore the training state.

- `BackupAndRestore` callback is intended to recover from interruptions that
- happened in the middle of a model.fit execution by backing up the
- training states in a temporary checkpoint file (based on TF CheckpointManager)
- at the end of each epoch. If training restarted before completion, the
- training state and model are restored to the most recently saved state at the
- beginning of a new model.fit() run.
- Note that user is responsible to bring jobs back up.
+ `BackupAndRestore` callback is intended to recover training from an
+ interruption that has happened in the middle of a `Model.fit` execution, by
+ backing up the training states in a temporary checkpoint file (with the help
+ of a `tf.train.CheckpointManager`), at the end of each epoch. Each backup
+ overwrites the previously written checkpoint file, so at any given time there
+ is at most one such checkpoint file for backup/restoring purpose.
+
+ If training restarts before completion, the training state (which includes the
+ `Model` weights and epoch number) is restored to the most recently saved state
+ at the beginning of a new `Model.fit` run. At the completion of a `Model.fit`
+ run, the temporary checkpoint file is deleted.
+
+ Note that the user is responsible to bring jobs back after the interruption.
This callback is important for the backup and restore mechanism for fault
- tolerance purpose. And the model to be restored from an previous checkpoint is
+ tolerance purpose, and the model to be restored from an previous checkpoint is
expected to be the same as the one used to back up. If user changes arguments
passed to compile or fit, the checkpoint saved for fault tolerance can become
invalid.

Note:
- 1. This callback is not compatible with disabling eager execution.
- 2. A checkpoint is saved at the end of each epoch, when restoring we'll redo
- any partial work from an unfinished epoch in which the training got restarted
- (so the work done before a interruption doesn't affect the final model state).
- 3. This works for both single worker and multi-worker mode, only
- MirroredStrategy and MultiWorkerMirroredStrategy are supported for now.
+ 1. This callback is not compatible with eager execution disabled.
+ 2. A checkpoint is saved at the end of each epoch. After restoring,
+ `Model.fit` redoes any partial work during the unfinished epoch in which the
+ training got restarted (so the work done before the interruption doesn't
+ affect the final model state).
+ 3. This works for both single worker and multi-worker modes. When `Model.fit`
+ is used with `tf.distribute`, it supports `tf.distribute.MirroredStrategy`,
+ `tf.distribute.MultiWorkerMirroredStrategy`, `tf.distribute.TPUStrategy`, and
+ `tf.distribute.experimental.ParameterServerStrategy`.

Example:
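Reviewer note: the semantics the new docstring describes (a single backup overwritten at the end of each epoch, restore at the start of the next run, partial epoch work redone, backup deleted on completion) can be illustrated with a small toy loop. This is only a sketch under stated assumptions, not the Keras implementation: it uses a JSON file in place of a `tf.train.CheckpointManager` checkpoint, and the names `fit`, `state_path`, and `fail_at` are hypothetical and exist only for this illustration.

```python
import json
import os
import tempfile


def fit(state_path, epochs, steps_per_epoch, fail_at=None):
    """Toy training loop (hypothetical, not Keras).

    `fail_at` is an optional (epoch, step) pair at which a crash is simulated.
    """
    # Start of a run: restore from the single backup file if one was left behind.
    if os.path.exists(state_path):
        with open(state_path) as f:
            state = json.load(f)
    else:
        state = {"epoch": 0, "weight": 0}

    for epoch in range(state["epoch"], epochs):
        # Work on a local copy so a mid-epoch crash leaves the backup untouched.
        weight = state["weight"]
        for step in range(steps_per_epoch):
            if fail_at == (epoch, step):
                raise RuntimeError("simulated interruption")
            weight += 1  # stand-in for one training step

        # End of epoch: overwrite the one backup file (write, then rename).
        state = {"epoch": epoch + 1, "weight": weight}
        tmp_path = state_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, state_path)

    # Run completed: the temporary backup is no longer needed.
    if os.path.exists(state_path):
        os.remove(state_path)
    return state["weight"]


backup_dir = tempfile.mkdtemp()
state_path = os.path.join(backup_dir, "backup.json")

# First run is interrupted partway through the second epoch (epoch index 1).
try:
    fit(state_path, epochs=3, steps_per_epoch=4, fail_at=(1, 2))
except RuntimeError:
    pass

backup_survived = os.path.exists(state_path)

# The caller brings the job back up; the second run resumes from the backup.
final_weight = fit(state_path, epochs=3, steps_per_epoch=4)
backup_deleted = not os.path.exists(state_path)

# Reference: the same training without any interruption.
reference_weight = fit(os.path.join(backup_dir, "reference.json"),
                       epochs=3, steps_per_epoch=4)
```

In this toy, the two steps completed before the simulated crash are discarded and the whole interrupted epoch is rerun, which is why the resumed run ends in the same state as the uninterrupted one. As in the docstring, restarting the job is left to the caller.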