Conversation

makaveli10

This PR adds support for saving a checkpoint every N training steps and resuming training from any of the saved checkpoints.
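A usage sketch of the flags this PR introduces. The flag names (`--checkpoint-save-steps`, `--resume-from-checkpoint`, `--auto-resume`) come from the commit message below; the binary name, the remaining arguments, and the assumption that `--auto-resume` picks the most recent checkpoint are illustrative placeholders, not confirmed by this PR:

```shell
# Placeholder invocation; only the checkpoint flags are from this PR.
./llama-finetune ... --checkpoint-save-steps 200           # save every 200 steps
./llama-finetune ... --resume-from-checkpoint <ckpt.gguf>  # resume from a chosen checkpoint
./llama-finetune ... --auto-resume                         # assumed: resume from the latest
```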

makaveli10 and others added 19 commits August 19, 2025 10:07
This PR adds checkpointing for fine-tuning:
- Add checkpoint saving every N steps with --checkpoint-save-steps
- Save complete training state: model weights, optimizer state, metadata
- Implement two-phase optimizer state loading to avoid memory issues
- Add --resume-from-checkpoint and --auto-resume functionality
- Store optimizer momentum/variance tensors in GGUF format
- Add checkpoint validation for rank, alpha, and target modules
- Update README.md with checkpointing documentation
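The save-every-N-steps behavior in the list above can be sketched as follows. This is a minimal, self-contained illustration: the `save_checkpoint` and `train` helpers, the pickle file format, and the toy update are all hypothetical stand-ins — the actual PR stores the state (including optimizer momentum/variance tensors) in GGUF:

```python
import os
import pickle

def save_checkpoint(path, step, weights, opt_state):
    # Persist the complete training state (weights, optimizer state, metadata)
    # so a run can resume exactly where it stopped. Toy format: the real
    # implementation writes GGUF, not pickle.
    with open(path, "wb") as f:
        pickle.dump({"step": step, "weights": weights, "opt_state": opt_state}, f)

def train(total_steps, save_every, ckpt_dir):
    weights = [0.0]
    opt_state = {"iteration": 0, "grad_m": [0.0], "grad_v": [0.0]}
    saved = []
    for step in range(1, total_steps + 1):
        weights[0] += 0.1  # stand-in for a real optimizer update
        opt_state["iteration"] = step
        if step % save_every == 0:  # checkpoint every N steps
            path = os.path.join(ckpt_dir, f"checkpoint-{step}.bin")
            save_checkpoint(path, step, weights, opt_state)
            saved.append(path)
    return saved
```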

Optimizer state loading happens in two phases: the iteration count is loaded during
initialization, while the tensor data (grad_m, grad_v) is loaded after ggml_opt_alloc
creates the proper tensor structures.
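The two-phase loading described above can be sketched like this. Everything here is a hypothetical mock of the real flow: the functions, the pickle format, and the plain-list "tensors" stand in for the PR's actual GGUF reading and ggml buffers; only the split itself (metadata first, tensor payloads after allocation) mirrors the description:

```python
import pickle

def save_checkpoint(path, step, grad_m, grad_v):
    # Toy writer; the real code stores these tensors in GGUF.
    with open(path, "wb") as f:
        pickle.dump({"step": step, "grad_m": grad_m, "grad_v": grad_v}, f)

def load_iteration(path):
    # Phase 1: during initialization, read only the iteration count.
    # The optimizer's tensor structures do not exist yet, so no tensor
    # data can be copied at this point.
    with open(path, "rb") as f:
        return pickle.load(f)["step"]

def load_opt_tensors(path, grad_m_buf, grad_v_buf):
    # Phase 2: after allocation (ggml_opt_alloc in the real code) has
    # created the momentum/variance tensors, copy the saved data into
    # those already-allocated buffers.
    with open(path, "rb") as f:
        state = pickle.load(f)
    grad_m_buf[:] = state["grad_m"]
    grad_v_buf[:] = state["grad_v"]
```

Splitting the load this way avoids materializing a second full copy of the optimizer state before the destination tensors exist, which is the memory issue the commit message mentions.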
@makaveli10 makaveli10 closed this Sep 2, 2025
@makaveli10 makaveli10 reopened this Sep 2, 2025
@makaveli10 makaveli10 changed the title Save resume lora ckpt Draft: Save resume lora ckpt Sep 2, 2025