Add ZenFlow code for Stage 3 #7516
Hi @tohtana @sfc-gh-truwase @Antlera, when you have some time, could you please take a look at this PR? Thanks!
Force-pushed from db2dfac to 133290e
@JoshWoo2003 - could you please resolve merge conflicts?
Force-pushed from d550814 to 47b10d8
Sorry for the very late reply! I’ve resolved the merge conflicts and updated the affinity setting as suggested.
Hi @JoshWoo2003, the affinity part looks good to me. Thanks for the change! Can you also fix formatting? Thanks!
- Introduced a new file: zenflow/engine_stage3.py to implement ZenFlow-specific Stage 3 logic.
- Modified zero/stage3.py to ensure compatibility with ZenFlow's execution flow.
- Updated zero/parameter_offload.py to support the integration of ZenFlow with ZeRO-Stage 3.
Signed-off-by: Yusen Wu <[email protected]>
- Add ZenFlowSelectiveAdamW_stage3 to support ZeRO Stage 3
- Update unit tests for ZeRO-Stage 3 with ZenFlow
Signed-off-by: Yusen Wu <[email protected]>
Signed-off-by: Yusen Wu <[email protected]>
- Add default value (`zenflow=False`) in DeepSpeedZeROOffload.__init__
- Prevents TypeError when instantiating optimizer without zenflow
Signed-off-by: Yusen Wu <[email protected]>
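For illustration, a minimal sketch of why the default value in the commit above matters. The two classes are hypothetical stand-ins written for this example; they are not DeepSpeed's `DeepSpeedZeROOffload`, and the `module` parameter is an assumption. The sketch only shows the general Python behavior: a required parameter raises `TypeError` at call sites that omit it, while a parameter with a default keeps those call sites working.

```python
class OffloadWithoutDefault:
    """Hypothetical stand-in: `zenflow` is a required argument."""

    def __init__(self, module, zenflow):
        self.module = module
        self.zenflow = zenflow


class OffloadWithDefault:
    """Hypothetical stand-in: `zenflow` defaults to False."""

    def __init__(self, module, zenflow=False):
        self.module = module
        self.zenflow = zenflow


def try_construct(cls):
    """Build `cls` the way a caller unaware of `zenflow` would."""
    try:
        cls(module=object())
        return "ok"
    except TypeError as exc:
        # Raised when a required positional argument is missing.
        return f"TypeError: {exc}"


if __name__ == "__main__":
    for cls in (OffloadWithoutDefault, OffloadWithDefault):
        print(cls.__name__, "->", try_construct(cls))
```

With the default in place, callers that never pass `zenflow` get the non-ZenFlow behavior, and only ZenFlow-aware callers need to opt in explicitly.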
- Resolved merge conflicts with upstream changes
- Unified ZenFlow affinity behavior for Stage 3 with Stage 1 and Stage 2
Signed-off-by: Yusen Wu <[email protected]>
Co-authored-by: Ma, Guokai <[email protected]>
Force-pushed from 4f4e752 to 26cc5ec
Thanks for the review, @delock! The formatting issues were due to my branch being behind the base. I’ve rebased onto upstream/master and the latest push should fix them. Please take another look when you have a chance. Thanks! @loadams @sfc-gh-truwase @tohtana @Antlera
Signed-off-by: Yusen Wu <[email protected]>
Signed-off-by: Yusen Wu <[email protected]>
- Extracted common process setup logic into `zenflow_utils.py` for reuse across stages.
- Removed unused `process_pool` assignment.
- Added explanatory comments to clarify `adamw` call differences between offload and non-offload paths.
Signed-off-by: Yusen Wu <[email protected]>
Co-authored-by: Ma, Guokai <[email protected]>
Co-authored-by: Tingfeng Lan <[email protected]>
@JoshWoo2003 thanks for addressing the PR feedback. Please take a look at the CI failure.
- Added ZenFlowSelectiveAdamW_stage3 coverage in unit tests (offload & non-offload paths).
- Fixed a logic bug introduced during the refactoring.
Signed-off-by: Yusen Wu <[email protected]>
@sfc-gh-truwase Thanks for the reminder! I’ve fixed the CI failure and pushed the update.
This PR completes the ZenFlow integration for DeepSpeed ZeRO Stage 3.
Highlights:
Note: Integration with ZeRO Stages 1 and 2 was introduced in #7391.