--resume error
#2572
Replies: 4 comments 3 replies
-
@atti0127 Are you sure it doesn't do that without --resume? There is an LR warmup by default, and if the LR gets too high your performance will drop.
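For reference, a minimal sketch (plain Python, not timm's scheduler code) of the usual linear-warmup-then-cosine-decay shape; the default values here are assumptions. If a run restarts at epoch 0, the schedule ramps the LR back up toward the base value instead of continuing at the decayed late-stage LR, which can hurt an already-trained model:

import math

def lr_at_epoch(epoch, base_lr=1e-3, warmup_lr=1e-6, warmup_epochs=5, total_epochs=400):
    # Linear warmup from warmup_lr up to base_lr over the first warmup_epochs.
    if epoch < warmup_epochs:
        return warmup_lr + (base_lr - warmup_lr) * epoch / warmup_epochs
    # Cosine decay from base_lr toward 0 over the remaining epochs.
    t = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t))

# A fresh start at epoch 0 uses the warmup LR, not the decayed LR the run had reached by epoch 225.
print(lr_at_epoch(0), lr_at_epoch(5), lr_at_epoch(225))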
-
It's not resuming from epoch 225 though, it's starting from 0, so you may have stomped over your last/best checkpoints by using the same output folder. Look for the highest-numbered checkpoint file. (A quick check of what epoch a checkpoint actually holds is sketched after the quoted message below.)
On Wed, Aug 20, 2025, 7:35 AM atti0127 wrote:
summary.csv <https://github.com/user-attachments/files/21899502/summary.csv>
These are the results up to epoch 225, using the training script below (the rest of the settings were exactly the same as DeiT, as mentioned in the CaiT paper):
./distributed_train.sh 2 --data-dir imagenet --model cait_xxs24_224 --batch-size 128 --epochs 400 --aug-repeats 3 --lr 0.001 --drop-path 0.1 --grad-accum-steps 4 --output output_xxs24
I tried to start training again using the last.pth file, only adding --resume, and accuracy started to drop (as mentioned above).
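A minimal sketch of checking what epoch a checkpoint actually holds before passing it to --resume; the path is a placeholder, and the dict keys are assumptions based on timm training checkpoints (if they differ, just print ckpt.keys()):

import torch

# Placeholder path: point this at the file you pass to --resume.
ckpt_path = "output_xxs24/last.pth.tar"

# weights_only=False because training checkpoints also carry optimizer state etc.;
# only load checkpoints you created yourself this way.
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)

print(sorted(ckpt.keys()))
print("stored epoch:", ckpt.get("epoch"))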
-
Sorry, I meant you should try finding the highest-numbered checkpoint file to resume from if you overwrote your last/best by mistake.
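A minimal sketch of doing that, assuming the checkpoint-N.pth.tar naming shown elsewhere in the thread; the run directory here is taken from the checkpoint list in the question and should be adjusted to your own --output path:

import glob
import os
import re

# Example run directory from the thread; adjust to your own output folder.
ckpt_dir = "output_xxs24/20250820-120753-cait_xxs24_224-224"

def ckpt_num(path):
    # Pull N out of "checkpoint-N.pth.tar"; -1 if the name doesn't match.
    m = re.search(r"checkpoint-(\d+)\.pth\.tar$", path)
    return int(m.group(1)) if m else -1

ckpts = glob.glob(os.path.join(ckpt_dir, "checkpoint-*.pth.tar"))
if ckpts:
    latest = max(ckpts, key=ckpt_num)
    print("resume from:", latest)  # pass this path to --resume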
-
I trained cait_xxs24 until epoch 225 and tried to resume training using --resume, but accuracy started to decrease as the epochs went on... Either something is wrong or I'm missing something when using resume. Below is the resume script I used.
Current checkpoints:
('output_xxs24/20250820-120753-cait_xxs24_224-224/checkpoint-1.pth.tar', 71.604)
('output_xxs24/20250820-120753-cait_xxs24_224-224/checkpoint-0.pth.tar', 70.6)
('output_xxs24/20250820-120753-cait_xxs24_224-224/checkpoint-2.pth.tar', 69.976)
./distributed_train.sh 2 --data-dir imagenet --model cait_xxs24_224 --batch-size 128 --epochs 400 --aug-repeats 3 --lr 0.001 --drop-path 0.1 --grad-accum-steps 4 --output output_xxs24 --resume output_xxs24/~~