feat: Add Muon optimizer for FSDP training #23

Draft
nguyenhoangthuan99 wants to merge 1 commit into main from feat/muon-optimizer

@nguyenhoangthuan99
Collaborator

  • Add verl/optim/muon.py with Muon and MuonWithAdamW optimizer classes
  • Add verl/optim/__init__.py to export the optimizer classes
  • Add verl/trainer/config/optim/muon.yaml for Hydra config
  • Add examples/muon/config/ppo_trainer_muon.yaml example training config
  • Update verl/workers/config/optimizer.py to add MuonOptimizerConfig and handle Muon in build_optimizer()

Muon (MomentUm Orthogonalized by Newton-Schulz) is designed for 2D matrix parameters in neural networks. It should be combined with AdamW for non-matrix parameters (embeddings, biases, gains, etc.).

Reference: https://github.com/KellerJordan/Muon
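
The two ideas in the description above (orthogonalizing a 2D update with a Newton-Schulz iteration, and routing non-matrix parameters to AdamW) can be sketched roughly as follows. This is an illustration only, not the code in verl/optim/muon.py: it uses plain Python lists so it runs without dependencies, it uses the classical cubic Newton-Schulz iteration rather than whatever tuned variant the PR or the reference repository uses, and the names newton_schulz_orthogonalize, split_params, and the "embed" name check are all hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(g, steps=10, eps=1e-7):
    """Approximate the (semi-)orthogonal polar factor of g.

    Assumption: classical cubic iteration X <- 1.5*X - 0.5*(X X^T) X,
    chosen here for clarity; it is not taken from the PR.
    """
    # The Frobenius norm bounds the spectral norm, so after scaling every
    # singular value lies in (0, 1], where the iteration converges to 1.
    norm = math.sqrt(sum(v * v for row in g for v in row)) + eps
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [
            [1.5 * xv - 0.5 * yv for xv, yv in zip(x_row, y_row)]
            for x_row, y_row in zip(x, xxt_x)
        ]
    return x


def split_params(named_shapes):
    """Route 2D weight matrices to Muon and everything else to AdamW.

    named_shapes maps a parameter name to its shape tuple. Embeddings are
    2D but the description groups them with AdamW; matching them by the
    substring "embed" is an assumption made for this sketch.
    """
    muon, adamw = [], []
    for name, shape in named_shapes.items():
        if len(shape) == 2 and "embed" not in name:
            muon.append(name)
        else:
            adamw.append(name)
    return muon, adamw
```

In a real FSDP setup the same routing would be applied to the model's named parameters, with the first group handed to Muon and the second to AdamW; how the PR's MuonWithAdamW class does this is not shown in the description.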

What does this PR do?

Add concise overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review.

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results such as training curve plots and evaluation results.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- Add verl/optim/muon.py with Muon and MuonWithAdamW optimizer classes
- Add verl/optim/__init__.py to export the optimizer classes
- Add verl/trainer/config/optim/muon.yaml for Hydra config
- Add examples/muon/config/ppo_trainer_muon.yaml example training config
- Update verl/workers/config/optimizer.py to add MuonOptimizerConfig and
  handle Muon in build_optimizer()

Muon (MomentUm Orthogonalized by Newton-Schulz) is designed for 2D matrix
parameters in neural networks. It should be combined with AdamW for
non-matrix parameters (embeddings, biases, gains, etc.).

Reference: https://github.com/KellerJordan/Muon

Co-Authored-By: Claude <noreply@anthropic.com>