
Conversation

hummuscience (Contributor)

This PR addresses issue #282.

I think that eventually it should be possible to choose which checkpoint to use (I believe DLC also offers this as an option).

But that might be too specific a use case for now, so choosing the one marked "best" is the easiest.

hummuscience (Contributor, Author)

I just realized it also includes the glob.escape fix that I sent a PR for a while ago. Feel free to remove it, as it's something that probably only affects me in this case.

Otherwise, let me know if I should push a fix for this to keep the history clean.

    # Found the 'best' checkpoint
    return best_ckpt_file
else:
    # No 'best' checkpoint found
ksikka (Collaborator) · Apr 22, 2025

In what scenario would there be no "best" checkpoint? A model that was trained for too few epochs, or keeps getting worse with training?

Since this is an unusual scenario, maybe we should emit a warning:

import warnings
warnings.warn("No 'best' checkpoint found, falling back to latest checkpoint'.")

I think it's fine to say we fall back to the latest checkpoint, and then throw an error that this logic is unimplemented in the case that there are multiple ckpts (that expresses our intent, at least). Ideally we'd parse out the step= value and return the highest one, but that is work that will likely never actually get used.
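
For illustration, a minimal sketch of that step-parsing fallback (the function name is hypothetical, and latest_version_files is assumed to be a list of checkpoint paths; neither is confirmed by this diff):

import os
import re
import warnings

def latest_ckpt_by_step(latest_version_files):
    # Fall back to the checkpoint whose filename has the highest step= value,
    # e.g. "epoch=42-step=1300.ckpt".
    warnings.warn("No 'best' checkpoint found, falling back to latest checkpoint.")
    best_step, latest_ckpt = -1, None
    for f in latest_version_files:
        match = re.search(r"step=(\d+)", os.path.basename(f))
        if match and int(match.group(1)) > best_step:
            best_step, latest_ckpt = int(match.group(1)), f
    if latest_ckpt is None:
        raise ValueError("Could not parse a step= value from any checkpoint filename.")
    return latest_ckpt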

hummuscience (Contributor, Author)

Yeah. Technically that should "never" happen, right? Even if one interrupts the training. Does it fall back to the latest model being the "best"?

ksikka (Collaborator)

It might not exist, perhaps if the loss doesn't improve from the initialization weights? I'm not precisely sure under what conditions, but I think we've witnessed this before.

for f in latest_version_files:
    if "-best.ckpt" in os.path.basename(f):
        best_ckpt_file = f
        break  # Found the best file, stop searching
ksikka (Collaborator)

I think multiple best checkpoints are possible if someone adds the save_top_k parameter to the ModelCheckpoint callback. In this case we should still raise an error about not being able to select from multiple checkpoints, instead of returning the first one found.
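
A sketch of that stricter check (the function name is hypothetical; it mirrors the names in the diff but the surrounding function is assumed):

import os

def find_best_checkpoint(latest_version_files):
    # Collect every 'best' checkpoint instead of stopping at the first match.
    best_ckpt_files = [
        f for f in latest_version_files if "-best.ckpt" in os.path.basename(f)
    ]
    if len(best_ckpt_files) > 1:
        # save_top_k > 1 in the ModelCheckpoint callback can produce several.
        raise ValueError(
            f"Multiple 'best' checkpoints found: {best_ckpt_files}. "
            "Cannot select one automatically."
        )
    return best_ckpt_files[0] if best_ckpt_files else None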

hummuscience (Contributor, Author)

Should I implement this possibility or just output an error/warning for now?

ksikka (Collaborator)

I see you've implemented it.

ksikka (Collaborator) commented Apr 22, 2025

Thanks for the fix, I did not think about ckpt_every_n_epochs. I am glad you found and fixed the issue.

OK to leave the glob escape.
Nice use of comments btw, I found it very easy to understand, thanks.

ksikka self-assigned this on Apr 22, 2025
match = re.search(r"step=(\d+)", f)
if match:
    step = int(match.group(1))
    ckpt_step_counts[f] = step
ksikka (Collaborator)

This dict is unused and should be removed.


if latest_ckpt is not None:
    return latest_ckpt
elif parse_errors == len(latest_version_files):
ksikka (Collaborator) · Apr 24, 2025

I'd remove any logic based on parse errors; it's needlessly cautious, IMO. If we're in this scenario it needs manual investigation, i.e. it's fine to give up and throw an error.
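
A fail-fast sketch of that suggestion (the function name and signature are assumed; the real function body isn't shown here):

def select_checkpoint(latest_ckpt, latest_version_files):
    # Return the parsed latest checkpoint if one was found; otherwise fail fast
    # rather than guessing, since this state needs manual investigation.
    if latest_ckpt is not None:
        return latest_ckpt
    raise ValueError(
        f"Unable to determine which checkpoint to use among {latest_version_files}."
    )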

ksikka (Collaborator) commented Apr 24, 2025

A couple of pending comments. The test failure seems totally unrelated. I will have to investigate asynchronously.

Muad Abd El Hay and others added 4 commits August 26, 2025 15:01
- Fixed typo in warning message (removed extra apostrophe)
- Cleaned up step parsing logic (removed unused ckpt_step_counts dict)
- Removed per-file warning when step parsing fails to reduce noise
- Improved fallback behavior to always return a checkpoint when possible
- Better handling of edge cases in step parsing

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Check for all files containing '-best.ckpt' instead of stopping at first
- Raise ValueError when multiple best checkpoints exist (e.g., from save_top_k > 1)
- Updated docstring to reflect the actual behavior
- Improved error messages to be more descriptive

This prevents silent selection of an arbitrary checkpoint when multiple
best checkpoints exist, which could happen with the save_top_k parameter
in the ModelCheckpoint callback.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Removed parse_errors tracking and fallback logic
- Now raises clear ValueError when unable to determine which checkpoint to use
- Cleaner code without overly cautious fallback strategies

When automated selection fails, it's better to fail explicitly and
require manual investigation rather than making potentially incorrect
guesses.
hummuscience (Contributor, Author)

I finally had the mental capacity to go through the comments >.<
