
Conversation

hummuscience (Contributor)

This PR addresses issue #282.

I think that eventually it should be possible to choose which checkpoint to use (I believe DLC also offers this as an option).

But that might be too specific a use case for now, so choosing the one marked "best" is the easiest.

hummuscience (Contributor, Author)

I just realized it also includes the glob.escape fix that I sent a PR for a while ago. Feel free to remove it, as it's something that probably only affects me in this case.

Otherwise, let me know if I should push a fix for this to keep the history clean.

    # Found the 'best' checkpoint
    return best_ckpt_file
else:
    # No 'best' checkpoint found
ksikka (Collaborator) · Apr 22, 2025

In what scenario would there be no "best" checkpoint? A model that was trained for too few epochs, or keeps getting worse with training?

Since this is an unusual scenario, maybe we should emit a warning:

import warnings
warnings.warn("No 'best' checkpoint found, falling back to latest checkpoint'.")

I think it's fine to say we fall back to the latest checkpoint, and then throw an error that this logic is unimplemented in the case that there are multiple ckpts (that expresses our intent, at least). Ideally we'd parse out the step= value and return the highest one, but that is work that will likely never actually get used.
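
For illustration, a minimal sketch of that step-parsing fallback (the function name is hypothetical, and latest_version_files is assumed to be a list of checkpoint paths; neither is confirmed by this diff):

import os
import re
import warnings

def latest_ckpt_by_step(latest_version_files):
    # Fall back to the checkpoint whose filename has the highest step= value,
    # e.g. "epoch=42-step=1300.ckpt".
    warnings.warn("No 'best' checkpoint found, falling back to latest checkpoint.")
    best_step, latest_ckpt = -1, None
    for f in latest_version_files:
        match = re.search(r"step=(\d+)", os.path.basename(f))
        if match and int(match.group(1)) > best_step:
            best_step, latest_ckpt = int(match.group(1)), f
    if latest_ckpt is None:
        raise ValueError("Could not parse a step= value from any checkpoint filename.")
    return latest_ckpt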

hummuscience (Contributor, Author)

Yeah. Technically that should "never" happen, right? Even if one interrupts the training. Does it fall back to the latest model being the "best"?

ksikka (Collaborator)

It might not exist, perhaps if the loss doesn't improve from the initialization weights? I'm not precisely sure under what conditions, but I think we've witnessed this before.

for f in latest_version_files:
    if "-best.ckpt" in os.path.basename(f):
        best_ckpt_file = f
        break  # Found the best file, stop searching
ksikka (Collaborator)

I think multiple best checkpoints are possible if someone adds the save_top_k parameter to the ModelCheckpoint callback. In this case we should still raise an error about not being able to select from multiple checkpoints, instead of returning the first one found.
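
A sketch of that stricter check (the function name is hypothetical; it mirrors the names in the diff but the surrounding function is assumed):

import os

def find_best_checkpoint(latest_version_files):
    # Collect every 'best' checkpoint instead of stopping at the first match.
    best_ckpt_files = [
        f for f in latest_version_files if "-best.ckpt" in os.path.basename(f)
    ]
    if len(best_ckpt_files) > 1:
        # save_top_k > 1 in the ModelCheckpoint callback can produce several.
        raise ValueError(
            f"Multiple 'best' checkpoints found: {best_ckpt_files}. "
            "Cannot select one automatically."
        )
    return best_ckpt_files[0] if best_ckpt_files else None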

hummuscience (Contributor, Author)

Should I implement this possibility or just output an error/warning for now?

ksikka (Collaborator)

I see you've implemented it.

ksikka (Collaborator) commented Apr 22, 2025

Thanks for the fix, I did not think about ckpt_every_n_epochs. I am glad you found and fixed the issue.

OK to leave the glob escape.
Nice use of comments btw, I found it very easy to understand, thanks.

ksikka self-assigned this on Apr 22, 2025
match = re.search(r"step=(\d+)", f)
if match:
    step = int(match.group(1))
    ckpt_step_counts[f] = step
ksikka (Collaborator)

This dict is unused and should be removed.


if latest_ckpt is not None:
    return latest_ckpt
elif parse_errors == len(latest_version_files):
ksikka (Collaborator) · Apr 24, 2025

I'd remove any logic based on parse errors; it's needlessly cautious, IMO. If we're in this scenario it needs manual investigation, i.e. it's fine to give up and throw an error.
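
A fail-fast sketch of that suggestion (the function name and signature are assumed; the real function body isn't shown here):

def select_checkpoint(latest_ckpt, latest_version_files):
    # Return the parsed latest checkpoint if one was found; otherwise fail fast
    # rather than guessing, since this state needs manual investigation.
    if latest_ckpt is not None:
        return latest_ckpt
    raise ValueError(
        f"Unable to determine which checkpoint to use among {latest_version_files}."
    )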

ksikka (Collaborator) commented Apr 24, 2025

A couple of pending comments. The test failure seems totally unrelated. I will have to investigate asynchronously.

Muad Abd El Hay and others added 4 commits August 26, 2025 15:01
- Fixed typo in warning message (removed extra apostrophe)
- Cleaned up step parsing logic (removed unused ckpt_step_counts dict)
- Removed per-file warning when step parsing fails to reduce noise
- Improved fallback behavior to always return a checkpoint when possible
- Better handling of edge cases in step parsing

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Check for all files containing '-best.ckpt' instead of stopping at first
- Raise ValueError when multiple best checkpoints exist (e.g., from save_top_k > 1)
- Updated docstring to reflect the actual behavior
- Improved error messages to be more descriptive

This prevents silent selection of an arbitrary checkpoint when multiple
best checkpoints exist, which could happen with the save_top_k parameter
in the ModelCheckpoint callback.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Removed parse_errors tracking and fallback logic
- Now raises clear ValueError when unable to determine which checkpoint to use
- Cleaner code without overly cautious fallback strategies

When automated selection fails, it's better to fail explicitly and
require manual investigation rather than making potentially incorrect
guesses.
hummuscience (Contributor, Author)

I finally had the mental capacity to go through the comments >.<
