Conversation

makaveli10

This PR adds support for saving a checkpoint every N training steps and resuming training from any of the saved checkpoints.
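A usage sketch of the flags this PR introduces. The flag names (`--checkpoint-save-steps`, `--resume-from-checkpoint`, `--auto-resume`) come from the commit message below; the binary name, the remaining arguments, and the assumption that `--auto-resume` picks the most recent checkpoint are illustrative placeholders, not confirmed by this PR:

```shell
# Placeholder invocation; only the checkpoint flags are from this PR.
./llama-finetune ... --checkpoint-save-steps 200           # save every 200 steps
./llama-finetune ... --resume-from-checkpoint <ckpt.gguf>  # resume from a chosen checkpoint
./llama-finetune ... --auto-resume                         # assumed: resume from the latest
```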

makaveli10 and others added 19 commits August 19, 2025 10:07
This PR adds checkpointing for fine-tuning:
- Add checkpoint saving every N steps with --checkpoint-save-steps
- Save complete training state: model weights, optimizer state, metadata
- Implement two-phase optimizer state loading to avoid memory issues
- Add --resume-from-checkpoint and --auto-resume functionality
- Store optimizer momentum/variance tensors in GGUF format
- Add checkpoint validation for rank, alpha, and target modules
- Update README.md with checkpointing documentation
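The save-every-N-steps behavior in the list above can be sketched as follows. This is a minimal, self-contained illustration: the `save_checkpoint` and `train` helpers, the pickle file format, and the toy update are all hypothetical stand-ins — the actual PR stores the state (including optimizer momentum/variance tensors) in GGUF:

```python
import os
import pickle

def save_checkpoint(path, step, weights, opt_state):
    # Persist the complete training state (weights, optimizer state, metadata)
    # so a run can resume exactly where it stopped. Toy format: the real
    # implementation writes GGUF, not pickle.
    with open(path, "wb") as f:
        pickle.dump({"step": step, "weights": weights, "opt_state": opt_state}, f)

def train(total_steps, save_every, ckpt_dir):
    weights = [0.0]
    opt_state = {"iteration": 0, "grad_m": [0.0], "grad_v": [0.0]}
    saved = []
    for step in range(1, total_steps + 1):
        weights[0] += 0.1  # stand-in for a real optimizer update
        opt_state["iteration"] = step
        if step % save_every == 0:  # checkpoint every N steps
            path = os.path.join(ckpt_dir, f"checkpoint-{step}.bin")
            save_checkpoint(path, step, weights, opt_state)
            saved.append(path)
    return saved
```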

Optimizer state loading happens in two phases: the iteration count is loaded during
initialization, while the tensor data (grad_m, grad_v) is loaded after ggml_opt_alloc
creates the proper tensor structures.
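The two-phase loading described above can be sketched like this. Everything here is a hypothetical mock of the real flow: the functions, the pickle format, and the plain-list "tensors" stand in for the PR's actual GGUF reading and ggml buffers; only the split itself (metadata first, tensor payloads after allocation) mirrors the description:

```python
import pickle

def save_checkpoint(path, step, grad_m, grad_v):
    # Toy writer; the real code stores these tensors in GGUF.
    with open(path, "wb") as f:
        pickle.dump({"step": step, "grad_m": grad_m, "grad_v": grad_v}, f)

def load_iteration(path):
    # Phase 1: during initialization, read only the iteration count.
    # The optimizer's tensor structures do not exist yet, so no tensor
    # data can be copied at this point.
    with open(path, "rb") as f:
        return pickle.load(f)["step"]

def load_opt_tensors(path, grad_m_buf, grad_v_buf):
    # Phase 2: after allocation (ggml_opt_alloc in the real code) has
    # created the momentum/variance tensors, copy the saved data into
    # those already-allocated buffers.
    with open(path, "rb") as f:
        state = pickle.load(f)
    grad_m_buf[:] = state["grad_m"]
    grad_v_buf[:] = state["grad_v"]
```

Splitting the load this way avoids materializing a second full copy of the optimizer state before the destination tensors exist, which is the memory issue the commit message mentions.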
@makaveli10 makaveli10 closed this Sep 2, 2025
@makaveli10 makaveli10 reopened this Sep 2, 2025
@makaveli10 makaveli10 changed the title Save resume lora ckpt Draft: Save resume lora ckpt Sep 2, 2025