improve MoE bias update logic in optimizer #1593

rakkit · 2025-08-19T02:28:30Z

We put all experts' usage into a single buffer so that we need only one reduce, rather than one reduce per MoE layer.

Additionally, handle cases where tokens per expert are counted twice during full recompute.
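The single-reduce idea above can be sketched in plain Python. This is only an illustration, not the PR's code: `all_reduce_sum` is a hypothetical stand-in for a sum all-reduce across ranks (real code would presumably use tensors and a collective such as `torch.distributed.all_reduce`), and the nested lists stand in for per-layer `tokens_per_expert` tensors.

```python
# Hypothetical stand-in for a sum all-reduce: takes each rank's local buffer
# and returns the element-wise sum that every rank would see afterwards.
def all_reduce_sum(per_rank_buffers):
    return [sum(vals) for vals in zip(*per_rank_buffers)]


def reduce_per_layer(per_rank_counts):
    """Baseline: one collective per MoE layer."""
    num_layers = len(per_rank_counts[0])
    reduced, num_collectives = [], 0
    for layer in range(num_layers):
        reduced.append(all_reduce_sum([rank[layer] for rank in per_rank_counts]))
        num_collectives += 1
    return reduced, num_collectives


def reduce_batched(per_rank_counts):
    """Batched: flatten all layers into one buffer, reduce once, split back."""
    num_layers = len(per_rank_counts[0])
    # Assumes every MoE layer has the same number of experts.
    num_experts = len(per_rank_counts[0][0])
    flat = [[c for layer in rank for c in layer] for rank in per_rank_counts]
    summed = all_reduce_sum(flat)  # the single collective
    reduced = [
        summed[i * num_experts:(i + 1) * num_experts] for i in range(num_layers)
    ]
    return reduced, 1
```

Both functions return the same per-layer totals; the batched version just issues one collective instead of one per layer, which is why it relies on all layers having the same number of experts.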

tianyu-l

Thank you for the PR! I left some comments.

torchtitan/components/optimizer.py

tianyu-l

Thanks, had some more comments.

torchtitan/components/optimizer.py

torchtitan/models/moe.py

torchtitan/components/optimizer.py

rakkit · 2025-08-20T07:27:26Z

For MoE EP usage and/or bias, we need to do something like this here:

 expert_usage_metrics = {
     f"moe_ep_usage/L-{layer_id}_EP-{ep_idx}": usage / sum_tokens
     for ep_idx, usage in enumerate(tokens_per_expert)
 }
 
 model_part._metrics_to_log.update(expert_usage_metrics)

and once we finalize PR #1578, for MoE models we can have

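    def get_extra_metrics(self, model_parts: list[nn.Module], *args, **kwargs) -> None | dict[str, Any]:
        # model_parts is a list, so merge the per-part metric dicts
        return {k: v for part in model_parts for k, v in part._metrics_to_log.items()}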
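To make the usage metric above concrete, here is a small self-contained sketch. The function name `expert_usage_metrics` and the sample counts are made up for illustration, and the guard for an empty step is an added assumption; only the key format and the `usage / sum_tokens` ratio come from the snippet above.

```python
def expert_usage_metrics(layer_id: int, tokens_per_expert: list[int]) -> dict[str, float]:
    """Return each expert's share of the tokens routed in one MoE layer."""
    sum_tokens = sum(tokens_per_expert)
    if sum_tokens == 0:
        # Nothing was routed this step, so there is nothing to log.
        return {}
    return {
        f"moe_ep_usage/L-{layer_id}_EP-{ep_idx}": usage / sum_tokens
        for ep_idx, usage in enumerate(tokens_per_expert)
    }


# Example: 100 tokens routed across 4 experts in layer 0.
metrics_to_log: dict[str, float] = {}
metrics_to_log.update(expert_usage_metrics(0, [30, 10, 40, 20]))
```

A perfectly balanced layer would log `1 / num_experts` for every expert, so deviations from that value are what the metric makes visible.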

tianyu-l

Thanks! Had some final comments.

torchtitan/components/optimizer.py

rakkit · 2025-08-21T20:37:27Z

Removed the comment (for ep-usage) and added the early exit in the first loop

tianyu-l

LGTM, thank you! Please fix linting so we can merge.

We put all experts' usage into a single buffer so that we need only one reduce, rather than one reduce per MoE layer. Additionally, handle cases where tokens per expert are counted twice during full recompute. (This assumes all MoE layers have the same number of experts.)
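The text here does not show how the double count is handled. One simple possibility, assuming the forward pass of an MoE layer runs exactly twice per step under full recompute, is to halve the accumulated counts before the reduce. The function name and the `full_recompute` flag below are hypothetical.

```python
def correct_recompute_double_count(
    tokens_per_expert: list[int], full_recompute: bool
) -> list[int]:
    """Undo double counting when the forward pass ran twice (assumed: exactly twice)."""
    if not full_recompute:
        return list(tokens_per_expert)
    # Each token was counted once in the forward pass and once in the recompute.
    return [count // 2 for count in tokens_per_expert]
```

Because every expert's count is doubled uniformly, ratios between experts are unaffected; the correction matters for absolute token counts.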

rakkit · 2025-08-21T23:35:24Z

format fixed, thanks a lot for the discussion.

Note that this PR only improves the communication part: the reduce now happens only once.
In practice, the profiler shows that the second loop still launches many kernels, [num MoE layers] * [slice, mean, sign, mul, add, zeros], unless everything there is vectorized.
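For readers unfamiliar with the second loop, here is a plain-Python sketch of what the listed ops would do per layer. The update rule (`bias += coeff * sign(mean - count)`) is inferred from the op list and is an assumption, as are the names `update_biases` and `coeff`; the actual code may include further steps.

```python
def sign(x: float) -> int:
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def update_biases(biases, tokens_per_expert, coeff):
    """Update per-expert biases in place.

    biases, tokens_per_expert: nested lists shaped [num_layers][num_experts].
    """
    for layer_bias, layer_counts in zip(biases, tokens_per_expert):  # "slice"
        mean = sum(layer_counts) / len(layer_counts)                  # "mean"
        for e, count in enumerate(layer_counts):
            # Under-used experts (count < mean) get a larger bias, and vice versa.
            layer_bias[e] += coeff * sign(mean - count)                # "sign", "mul", "add"
        layer_counts[:] = [0] * len(layer_counts)                     # "zeros"
```

With tensors, the same steps could in principle run once over a `[num_layers, num_experts]` buffer (taking the mean along the expert dimension), which would replace the per-layer kernel launches with a handful of launches in total.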

rakkit requested review from tianyu-l, fegin, wwwjn and wconstab as code owners August 19, 2025 02:28

meta-cla bot added the CLA Signed label Aug 19, 2025

tianyu-l reviewed Aug 19, 2025

rakkit force-pushed the improve_moe_bias_update branch from cb4e41a to cb8c9b8 Compare August 19, 2025 21:50

rakkit commented Aug 19, 2025

rakkit requested a review from tianyu-l August 19, 2025 21:57

tianyu-l reviewed Aug 20, 2025

rakkit requested a review from tianyu-l August 20, 2025 07:43

tianyu-l reviewed Aug 20, 2025

tianyu-l added the release blocking label Aug 21, 2025

rakkit force-pushed the improve_moe_bias_update branch from f340bcb to 9c35bc1 Compare August 21, 2025 20:35

rakkit requested a review from tianyu-l August 21, 2025 20:37

tianyu-l approved these changes Aug 21, 2025

rakkit force-pushed the improve_moe_bias_update branch from 9c35bc1 to 9a623b8 Compare August 21, 2025 23:27

tianyu-l merged commit 2bfcdd8 into pytorch:main Aug 22, 2025
5 of 7 checks passed
