Pick best checkpoint by default #283
Conversation
I just realized it also includes the … Otherwise, let me know if I should push a fix for this to keep the history clean.
    # Found the 'best' checkpoint
    return best_ckpt_file
else:
    # No 'best' checkpoint found
In what scenario would there be no "best" checkpoint? A model that was trained for too few epochs, or keeps getting worse with training?
Since this is an unusual scenario, maybe we should emit a warning
import warnings
warnings.warn("No 'best' checkpoint found, falling back to latest checkpoint'.")
I think it's fine to say we fall back to latest checkpoint, and then throw the error that this logic is unimplemented in the case that there are multiple ckpts (expresses our intent at least). Ideally we'd parse out the step= value and return the highest one, but that is work that will likely never actually get used.
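For reference, a minimal sketch of that step-parsing fallback, assuming checkpoint filenames embed `step=<n>` (the helper name `_latest_by_step` is invented here, not part of the PR):

```python
import os
import re

def _latest_by_step(ckpt_files):
    # Invented helper: return the checkpoint whose filename carries the largest step= value.
    def step_of(path):
        match = re.search(r"step=(\d+)", os.path.basename(path))
        return int(match.group(1)) if match else -1
    return max(ckpt_files, key=step_of)
```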
Yeah. Technically that should "never" happen, right? Even if one interrupts the training. Does it fall back to the latest model being the "best"?
It might not exist, perhaps if the loss doesn't improve from the initialization weights? Not precisely sure in what condition, but I think we've witnessed this before.
lightning_pose/utils/io.py
Outdated
for f in latest_version_files:
    if "-best.ckpt" in os.path.basename(f):
        best_ckpt_file = f
        break  # Found the best file, stop searching
I think multiple best checkpoints are possible if someone adds the save_top_k parameter to the ModelCheckpoint callback. In this case we should still raise an error about not being able to select from multiple checkpoints, instead of returning the first one found.
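Roughly, the check being described might look like this (a sketch only; `latest_version_files` is the list being scanned in the diff above):

```python
import os

best_ckpt_files = [f for f in latest_version_files if "-best.ckpt" in os.path.basename(f)]
if len(best_ckpt_files) > 1:
    # save_top_k > 1 in the ModelCheckpoint callback can produce several 'best' checkpoints.
    raise ValueError(
        f"Found {len(best_ckpt_files)} 'best' checkpoints and cannot choose between them: "
        f"{best_ckpt_files}"
    )
elif best_ckpt_files:
    best_ckpt_file = best_ckpt_files[0]
```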
Should I implement this possibility or just output an error/warning for now?
I see you've implemented it
Thanks for the fix, I did not think about that. OK to leave the glob escape.
lightning_pose/utils/io.py
Outdated
match = re.search(r"step=(\d+)", f)
if match:
    step = int(match.group(1))
    ckpt_step_counts[f] = step
This dict is unused and should be removed
lightning_pose/utils/io.py
Outdated
if latest_ckpt is not None:
    return latest_ckpt
elif parse_errors == len(latest_version_files):
I'd remove any logic based on parse errors, it's being needlessly cautious IMO. If we're in this scenario it needs manual investigation, i.e. fine to give up and throw an error.
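In other words, something like this for the tail of the function (continuing the snippet above; the error message is only illustrative):

```python
if latest_ckpt is not None:
    return latest_ckpt
# No checkpoint could be selected automatically; fail loudly and ask for manual investigation.
raise ValueError(
    "Unable to determine which checkpoint to use automatically; "
    "please inspect the model directory and pick a checkpoint manually."
)
```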
A couple of pending comments. The test failure seems totally unrelated. I will have to investigate asynchronously.
- Fixed typo in warning message (removed extra apostrophe)
- Cleaned up step parsing logic (removed unused ckpt_step_counts dict)
- Removed per-file warning when step parsing fails to reduce noise
- Improved fallback behavior to always return a checkpoint when possible
- Better handling of edge cases in step parsing

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
- Check for all files containing '-best.ckpt' instead of stopping at the first
- Raise ValueError when multiple best checkpoints exist (e.g., from save_top_k > 1)
- Updated docstring to reflect the actual behavior
- Improved error messages to be more descriptive

This prevents silent selection of an arbitrary checkpoint when multiple best checkpoints exist, which could happen with the save_top_k parameter in the ModelCheckpoint callback.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
- Removed parse_errors tracking and fallback logic
- Now raises a clear ValueError when unable to determine which checkpoint to use
- Cleaner code without overly cautious fallback strategies

When automated selection fails, it's better to fail explicitly and require manual investigation rather than making potentially incorrect guesses.
I finally had the mental capacity to go through the comments >.<
This PR addresses this issue: #282
I think that eventually it should be possible to choose the checkpoint to use (I think DLC also allows this as an option).
But it might be too specific a use case for now, so choosing the one marked with "best" is the easiest.
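Purely as a hypothetical sketch of what user-selectable checkpoints could look like later (the function and argument names here are invented, not part of this PR):

```python
import glob
import os
from typing import Optional

def choose_checkpoint(model_dir: str, ckpt_path: Optional[str] = None) -> str:
    # Hypothetical API: an explicitly requested checkpoint overrides automatic selection.
    if ckpt_path is not None:
        return ckpt_path
    # Otherwise prefer a checkpoint marked 'best', mirroring the default behavior in this PR.
    candidates = glob.glob(os.path.join(model_dir, "**", "*.ckpt"), recursive=True)
    best = [f for f in candidates if "-best.ckpt" in os.path.basename(f)]
    return best[0] if best else max(candidates, key=os.path.getmtime)
```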