
[Cleanup] Miscellaneous Refactors #1607


Merged

merged 7 commits into main from validation_pp_fix on Aug 22, 2025

Conversation


@wesleytruong wesleytruong commented Aug 20, 2025

This PR makes several miscellaneous refactors to clean up torchtitan before release.

Changes:

  • Sets each of the model_parts to eval mode in the Validator class to support PP (Bug fix); see the sketch below
  • Refactor checkpoint.enable_checkpoint -> checkpoint.enable (Refactor)
  • Refactor validation.enabled -> validation.enable (Refactor)
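
For clarity, here is a minimal sketch of the first change (illustrative only; it assumes the validator receives the pipeline model chunks as model_parts and is not the exact torchtitan code):

    import torch.nn as nn

    def validate(model_parts: list[nn.Module]) -> None:
        # With pipeline parallelism a rank can own several model chunks, so
        # every part must be switched to eval mode, not just model_parts[0].
        for model in model_parts:
            model.eval()

        # ... run the validation loop over the validation dataloader ...

        # Restore training mode on every part before the next training step.
        for model in model_parts:
            model.train()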

@meta-cla meta-cla bot added the CLA Signed label Aug 20, 2025
@@ -82,8 +82,9 @@ def validate(
         step: int,
     ) -> None:
         # Set model to eval mode
+        for model in model_parts:
+            model.eval()
         model = model_parts[0]
Contributor

Why keep this? I think we only need it in the non-PP case.

@@ -174,7 +175,8 @@ def validate(
module.reshard()
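
For context, the code around this hunk roughly follows the pattern below (a sketch assuming FSDP2-style modules whose reshard() frees the unsharded parameters; not the exact torchtitan code):

    import torch.nn as nn
    from torch.distributed.fsdp import FSDPModule

    def reshard_after_validation(model_parts: list[nn.Module]) -> None:
        # Explicitly reshard FSDP-managed modules after validation so their
        # unsharded parameters are freed. The next training step unshards
        # them again, which is the extra overhead discussed below.
        for model in model_parts:
            for module in model.modules():
                if isinstance(module, FSDPModule):
                    module.reshard()
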
Contributor

After you switch the order of checkpoint and validate, I think this reshard part is not necessary and creates extra overhead -- we are doing validate (reshard) -> next train (unshard), which we could've avoided?

Could you check if it works well when checkpoint and validate happen on the same step? If so, we can remove this code.

Contributor Author

Sure, in my test it looks like removing this reshard doesn't affect memory usage or loss, so I think it should be fine to remove.

(Screenshots comparing memory and loss curves with reshard vs. without reshard, captured Aug 20, 2025)

@tianyu-l tianyu-l requested a review from ebsmothers August 20, 2025 20:29
-        model_args, job_config.model.hf_assets_path
+        model_args,
+        job_config.model.hf_assets_path
+        if job_config.checkpoint.enable_checkpoint
Contributor

@wwwjn wwwjn Aug 21, 2025

Why do we need this change here? hf_assets_path can be used when loading HF weights as well, even if it's not otherwise used, and we could leave the checkpointing-related logic in checkpoint.py.

Contributor Author

Sorry, you're right. I was trying to do some cleanup to suppress this warning when the user doesn't intend to save in HF format, since in that case we wouldn't need a model.safetensors.index.json, but this would interfere with loading from HF via hf_assets_path. One other idea to suppress it would be to move this error to the checkpointer, but then the warning would be difficult to override, as we may want to do in Flux: https://github.com/pytorch/torchtitan/blob/main/torchtitan/experiments/flux/model/state_dict_adapter.py#L53-L57

Contributor

I think it's OK to show the warning to the user. When would we want to suppress the warning? I can only think of the case where a model has only one safetensors file, and thus no model.safetensors.index.json.

If a model checkpoint has multiple safetensors files, then both saving and loading need to check that the model.safetensors.index.json file exists.

Contributor

@tianyu-l tianyu-l Aug 21, 2025

"When would we want to suppress the warning?"

E.g. when checkpointing is not enabled at all.

"If a model checkpoint has multiple safetensors files, then both saving and loading need to check that the model.safetensors.index.json file exists."

I believe only saving requires it. For loading it should be optional, but I'm not sure if it helps load faster. cc @ankitageorge to confirm.

Contributor

Ya only saving requires it

@tianyu-l tianyu-l added the release blocking label Aug 21, 2025
job_config.model.hf_assets_path
if job_config.checkpoint.enable
and job_config.checkpoint.last_save_in_hf
else None,
Contributor

hmmm even for load you need the path? https://github.com/pytorch/torchtitan/blob/main/torchtitan/components/checkpoint.py#L543

I think there are two ways:

  1. always leave the warning there as is
  2. only pass the hf_assets_path in when checkpoint.enable=True

I feel 2 is not clean enough (as in we don't differentiate save vs. load), so I'm OK with 1.
cc @wwwjn for your opinion.

Contributor

@wwwjn wwwjn Aug 21, 2025

I'm OK with 1 as well. We can leave the warning and let the user be aware of this.

@wesleytruong wesleytruong changed the title from "[Validation] fix setting all model_parts to eval mode" to "[Cleanup] Miscellaneous Refactors" on Aug 22, 2025
@@ -398,13 +398,13 @@ class Parallelism:

 @dataclass
 class Checkpoint:
-    enable_checkpoint: bool = False
+    enable: bool = False
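
In other words, anything that previously read checkpoint.enable_checkpoint (and likewise validation.enabled) now reads checkpoint.enable (and validation.enable). A toy illustration of the renamed field, not taken from the trainer:

    from dataclasses import dataclass

    @dataclass
    class Checkpoint:
        enable: bool = False  # previously named enable_checkpoint

    # Toy usage: call sites that read job_config.checkpoint.enable_checkpoint
    # now read job_config.checkpoint.enable.
    checkpoint_cfg = Checkpoint(enable=True)
    if checkpoint_cfg.enable:
        print("checkpointing is on")
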
Contributor

Nice, I was thinking of doing this as well.

model_args,
(
job_config.model.hf_assets_path
if job_config.checkpoint.enable
Contributor

Could you please move the use of checkpoint information out of the trainer, if possible? We deliberately hide job_config.checkpoint inside Checkpointer and let the trainer always call the checkpointer API. This usage looks okay, but it would be good to hide it.
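
One way the suggestion could look (purely illustrative; the class shape and property name are hypothetical, not torchtitan's actual Checkpointer API):

    from dataclasses import dataclass

    @dataclass
    class CheckpointConfig:
        enable: bool = False
        last_save_in_hf: bool = False

    class Checkpointer:
        def __init__(self, config: CheckpointConfig):
            self._config = config

        @property
        def exports_to_hf(self) -> bool:
            # Encapsulates checkpoint.enable and checkpoint.last_save_in_hf so
            # the trainer asks the checkpointer instead of reading the config.
            return self._config.enable and self._config.last_save_in_hf

    # Trainer side (sketch):
    #   hf_assets_path = job_config.model.hf_assets_path if checkpointer.exports_to_hf else None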

Contributor

@tianyu-l tianyu-l left a comment

LGTM, nice refactor!

@wesleytruong wesleytruong merged commit cd337db into main Aug 22, 2025
10 checks passed
@tianyu-l tianyu-l deleted the validation_pp_fix branch August 22, 2025 21:37
alfuyao1986 pushed a commit to AMD-AIG-AIMA/torchtitan-amd that referenced this pull request Aug 23, 2025
Labels
  • CLA Signed: This label is managed by the Meta Open Source bot.
  • release blocking: Issues that are blocking the milestone / release completion

5 participants