--resume error
#2572
Replies: 4 comments 3 replies
-
@atti0127 Are you sure it doesn't do that without --resume? There is an LR warmup by default, and if the LR gets too high your performance will drop.
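For reference, a minimal sketch (plain Python, not timm's scheduler code) of the usual linear-warmup-then-cosine-decay shape; the default values here are assumptions. If a run restarts at epoch 0, the schedule ramps the LR back up toward the base value instead of continuing at the decayed late-stage LR, which can hurt an already-trained model:

import math

def lr_at_epoch(epoch, base_lr=1e-3, warmup_lr=1e-6, warmup_epochs=5, total_epochs=400):
    # Linear warmup from warmup_lr up to base_lr over the first warmup_epochs.
    if epoch < warmup_epochs:
        return warmup_lr + (base_lr - warmup_lr) * epoch / warmup_epochs
    # Cosine decay from base_lr toward 0 over the remaining epochs.
    t = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t))

# A fresh start at epoch 0 uses the warmup LR, not the decayed LR the run had reached by epoch 225.
print(lr_at_epoch(0), lr_at_epoch(5), lr_at_epoch(225))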
-
It's not resuming from epoch 225 though, it's starting from 0, so you may have stomped over your last/best checkpoints by using the same output folder. Look for the highest-numbered checkpoint file. (A quick check of what epoch a checkpoint actually holds is sketched after the quoted message below.)
On Wed, Aug 20, 2025, 7:35 AM atti0127 wrote:
summary.csv <https://github.com/user-attachments/files/21899502/summary.csv>
These are the results up to epoch 225, using the training script below (the rest of the settings were exactly the same as DeiT, as mentioned in the CaiT paper):
./distributed_train.sh 2 --data-dir imagenet --model cait_xxs24_224 --batch-size 128 --epochs 400 --aug-repeats 3 --lr 0.001 --drop-path 0.1 --grad-accum-steps 4 --output output_xxs24
I tried to start training again using the last.pth file, only adding --resume, and accuracy started to drop (as mentioned above).
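A minimal sketch of checking what epoch a checkpoint actually holds before passing it to --resume; the path is a placeholder, and the dict keys are assumptions based on timm training checkpoints (if they differ, just print ckpt.keys()):

import torch

# Placeholder path: point this at the file you pass to --resume.
ckpt_path = "output_xxs24/last.pth.tar"

# weights_only=False because training checkpoints also carry optimizer state etc.;
# only load checkpoints you created yourself this way.
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)

print(sorted(ckpt.keys()))
print("stored epoch:", ckpt.get("epoch"))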
-
Sorry, I meant you should try finding the highest-numbered checkpoint file to resume from if you overwrote your last/best by mistake.
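A minimal sketch of doing that, assuming the checkpoint-N.pth.tar naming shown elsewhere in the thread; the run directory here is taken from the checkpoint list in the question and should be adjusted to your own --output path:

import glob
import os
import re

# Example run directory from the thread; adjust to your own output folder.
ckpt_dir = "output_xxs24/20250820-120753-cait_xxs24_224-224"

def ckpt_num(path):
    # Pull N out of "checkpoint-N.pth.tar"; -1 if the name doesn't match.
    m = re.search(r"checkpoint-(\d+)\.pth\.tar$", path)
    return int(m.group(1)) if m else -1

ckpts = glob.glob(os.path.join(ckpt_dir, "checkpoint-*.pth.tar"))
if ckpts:
    latest = max(ckpts, key=ckpt_num)
    print("resume from:", latest)  # pass this path to --resume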
-
I trained cait_xxs24 until epoch 225 and tried to resume training using --resume, but accuracy started to decrease as the epochs went on... Either something is wrong or I'm missing something when using resume. Below is the resume script I used.
Current checkpoints:
('output_xxs24/20250820-120753-cait_xxs24_224-224/checkpoint-1.pth.tar', 71.604)
('output_xxs24/20250820-120753-cait_xxs24_224-224/checkpoint-0.pth.tar', 70.6)
('output_xxs24/20250820-120753-cait_xxs24_224-224/checkpoint-2.pth.tar', 69.976)
./distributed_train.sh 2 --data-dir imagenet --model cait_xxs24_224 --batch-size 128 --epochs 400 --aug-repeats 3 --lr 0.001 --drop-path 0.1 --grad-accum-steps 4 --output output_xxs24 --resume output_xxs24/~~