-
Notifications
You must be signed in to change notification settings - Fork 3.6k
Add save_on_exception
option to ModelCheckpoint
#20916
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
…ining part of callbacks in individual test for better overview
…lidation callback
for more information, see https://pre-commit.ci
…sly defined epoch length
6249794
to
f0502ec
Compare
…nterfere with current checkpoint behavior
…in ModelCheckpoint
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #20916 +/- ##
=========================================
- Coverage 87% 79% -8%
=========================================
Files 268 265 -3
Lines 23442 23399 -43
=========================================
- Hits 20394 18398 -1996
- Misses 3048 5001 +1953 |
save_on_exception
option to ModelCheckpoint
…eaning of unused empty variable in function signature
def test_model_checkpoint_on_exception_run_condition(tmp_path):
    """Test that NO exception checkpoint is saved under run conditions that should suppress it.

    Three cases are covered:
    - an exception raised during the sanity check,
    - an exception raised during a fast dev run,
    - an exception raised at a step for which a regular checkpoint was already saved.
    In all three, ``save_on_exception=True`` must not produce an ``exception-*.ckpt`` file.
    """

    # Don't save checkpoint if sanity check fails
    class TroubledModelSanityCheck(BoringModel):
        def on_validation_start(self) -> None:
            if self.trainer.sanity_checking:
                raise RuntimeError("Trouble!")

    model = TroubledModelSanityCheck()
    checkpoint_callback = ModelCheckpoint(dirpath=tmp_path, filename="sanity_check", save_on_exception=True)
    trainer = Trainer(
        default_root_dir=tmp_path,
        num_sanity_val_steps=4,
        limit_train_batches=2,
        callbacks=[checkpoint_callback],
        max_epochs=2,
        logger=False,
    )

    with pytest.raises(RuntimeError, match="Trouble!"):
        trainer.fit(model)
    assert not os.path.isfile(tmp_path / "exception-sanity_check.ckpt")

    # Don't save checkpoint if fast dev run fails
    class TroubledModelFastDevRun(BoringModel):
        def on_train_batch_start(self, batch, batch_idx) -> None:
            if self.trainer.fast_dev_run and batch_idx == 1:
                raise RuntimeError("Trouble!")

    model = TroubledModelFastDevRun()
    checkpoint_callback = ModelCheckpoint(dirpath=tmp_path, filename="fast_dev_run", save_on_exception=True)
    trainer = Trainer(
        default_root_dir=tmp_path,
        fast_dev_run=2,
        limit_train_batches=2,
        callbacks=[checkpoint_callback],
        max_epochs=2,
        logger=False,
    )

    with pytest.raises(RuntimeError, match="Trouble!"):
        trainer.fit(model)
    assert not os.path.isfile(tmp_path / "exception-fast_dev_run.ckpt")

    # Don't save checkpoint if already saved a checkpoint
    class TroubledModelAlreadySavedCheckpoint(BoringModel):
        def on_train_batch_start(self, batch, batch_idx) -> None:
            if self.trainer.global_step == 1:
                raise RuntimeError("Trouble!")

    model = TroubledModelAlreadySavedCheckpoint()
    checkpoint_callback = ModelCheckpoint(
        dirpath=tmp_path, filename="already_saved", save_on_exception=True, every_n_train_steps=1
    )
    trainer = Trainer(
        default_root_dir=tmp_path, limit_train_batches=2, callbacks=[checkpoint_callback], max_epochs=2, logger=False
    )

    with pytest.raises(RuntimeError, match="Trouble!"):
        trainer.fit(model)

    # The regular every-step checkpoint exists; the exception checkpoint must not.
    assert not os.path.isfile(tmp_path / "exception-already_saved.ckpt")
    assert os.path.isfile(tmp_path / "already_saved.ckpt")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def test_model_checkpoint_on_exception_run_condition(tmp_path):
    """Test that NO exception checkpoint is saved under run conditions that should suppress it.

    Three cases are covered:
    - an exception raised during the sanity check,
    - an exception raised during a fast dev run,
    - an exception raised at a step for which a regular checkpoint was already saved.
    In all three, ``save_on_exception=True`` must not produce an ``exception-*.ckpt`` file.
    """

    # Don't save checkpoint if sanity check fails
    class TroubledModelSanityCheck(BoringModel):
        def on_validation_start(self) -> None:
            if self.trainer.sanity_checking:
                raise RuntimeError("Trouble!")

    model = TroubledModelSanityCheck()
    checkpoint_callback = ModelCheckpoint(dirpath=tmp_path, filename="sanity_check", save_on_exception=True)
    trainer = Trainer(
        default_root_dir=tmp_path,
        num_sanity_val_steps=4,
        limit_train_batches=2,
        callbacks=[checkpoint_callback],
        max_epochs=2,
        logger=False,
    )

    with pytest.raises(RuntimeError, match="Trouble!"):
        trainer.fit(model)
    assert not os.path.isfile(tmp_path / "exception-sanity_check.ckpt")

    # Don't save checkpoint if fast dev run fails
    class TroubledModelFastDevRun(BoringModel):
        def on_train_batch_start(self, batch, batch_idx) -> None:
            if self.trainer.fast_dev_run and batch_idx == 1:
                raise RuntimeError("Trouble!")

    model = TroubledModelFastDevRun()
    checkpoint_callback = ModelCheckpoint(dirpath=tmp_path, filename="fast_dev_run", save_on_exception=True)
    trainer = Trainer(
        default_root_dir=tmp_path,
        fast_dev_run=2,
        limit_train_batches=2,
        callbacks=[checkpoint_callback],
        max_epochs=2,
        logger=False,
    )

    with pytest.raises(RuntimeError, match="Trouble!"):
        trainer.fit(model)
    assert not os.path.isfile(tmp_path / "exception-fast_dev_run.ckpt")

    # Don't save checkpoint if already saved a checkpoint
    class TroubledModelAlreadySavedCheckpoint(BoringModel):
        def on_train_batch_start(self, batch, batch_idx) -> None:
            if self.trainer.global_step == 1:
                raise RuntimeError("Trouble!")

    model = TroubledModelAlreadySavedCheckpoint()
    checkpoint_callback = ModelCheckpoint(
        dirpath=tmp_path, filename="already_saved", save_on_exception=True, every_n_train_steps=1
    )
    trainer = Trainer(
        default_root_dir=tmp_path, limit_train_batches=2, callbacks=[checkpoint_callback], max_epochs=2, logger=False
    )

    with pytest.raises(RuntimeError, match="Trouble!"):
        trainer.fit(model)

    # The regular every-step checkpoint exists; the exception checkpoint must not.
    assert not os.path.isfile(tmp_path / "exception-already_saved.ckpt")
    assert os.path.isfile(tmp_path / "already_saved.ckpt")
This seems to be the same as the parametrization below
What does this PR do?
This PR adds a
save_on_exception
option to the ModelCheckpoint callback. Some of this functionality is already implemented in the OnExceptionCheckpoint
checkpoint, but I believe that bundling all checkpoint options in the ModelCheckpoint is more intuitive. Additionally, this leads to the same naming conventions and directory paths used for the exception checkpoint as for all the others.When enabled, this option serves as a contingency in case of any disruption during training, allowing one to continue from the last step before the exception occurs without losing too much progress. By printing the exception type and message, this also alleviates issue #20187.
Fixes #19686
Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:
Reviewer checklist
📚 Documentation preview 📚: https://pytorch-lightning--20916.org.readthedocs.build/en/20916/