[Fix] Ensure metadata sync across DP ranks in eager mode #2766
Removes the condition that skips metadata synchronization when `enforce_eager` is enabled. This change is necessary to correctly sync the `with_prefill` and `enable_dbo` flags across all data parallel ranks, which is not required in the base implementation. Forcing the sync operation prevents potential inconsistencies, albeit with a minor performance impact. Signed-off-by: Yizhou Liu <[email protected]>
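The sync described above can be illustrated with a minimal sketch that simulates the collective in plain Python instead of calling a real distributed backend. The names `RankMetadata`, `all_reduce_sum`, and `sync_metadata` are hypothetical, and the reduction rule (prefill if any rank prefills, DBO only if every rank enables it, `num_tokens` taken as the largest count) is an assumption for illustration, not something taken from the patch:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RankMetadata:
    """Per-rank forward metadata (hypothetical container for this sketch)."""
    num_tokens: int
    with_prefill: bool
    enable_dbo: bool


def all_reduce_sum(vectors: List[List[int]]) -> List[List[int]]:
    """Simulate all-reduce(SUM): every rank receives the element-wise sum."""
    total = [sum(column) for column in zip(*vectors)]
    return [list(total) for _ in vectors]


def sync_metadata(local: List[RankMetadata]) -> List[RankMetadata]:
    """Return the metadata each rank holds after one packed reduction."""
    dp_size = len(local)
    packed = []
    for rank, meta in enumerate(local):
        # One slot per rank for num_tokens, followed by two flag slots.
        slots = [0] * dp_size
        slots[rank] = meta.num_tokens
        # Store `not enable_dbo` so a nonzero sum means "some rank opted out".
        packed.append(slots + [int(meta.with_prefill), int(not meta.enable_dbo)])

    synced = []
    for reduced in all_reduce_sum(packed):
        synced.append(RankMetadata(
            num_tokens=max(reduced[:dp_size]),  # assumed: largest count across ranks
            with_prefill=reduced[-2] > 0,       # assumed: true if any rank prefills
            enable_dbo=reduced[-1] == 0,        # assumed: true only if all ranks agree
        ))
    return synced


ranks = [
    RankMetadata(num_tokens=8, with_prefill=False, enable_dbo=True),
    RankMetadata(num_tokens=32, with_prefill=True, enable_dbo=False),
]
synced = sync_metadata(ranks)
```

Because the flags travel in the same packed vector as `num_tokens`, skipping the reduction in eager mode would leave each rank with only its local flags, which is the inconsistency this PR guards against.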
Code Review
This pull request correctly addresses a synchronization issue in data parallel eager mode by ensuring critical metadata flags are synced across ranks. The change is a necessary fix for correctness on Ascend hardware. I've suggested a minor refinement to the code comment to improve clarity and precision for future maintainability. Additionally, as noted in the PR description, adding an E2E test case to validate this behavior would be highly beneficial.
# TODO: In vLLM, the only thing that needs to be synced is num_tokens, but in
# our case, we still need to sync the other two flags as well. So we need to
# include them in the all_reduce operation, and more over, we CANNOT skip it
# even if we are running in eager mode, which harms performance.
# FIXME: Restore the `or self.vllm_config.model_config.enforce_eager` here
# immediately once the other two flags are no longer needed.
The current comment is slightly imprecise and could be confusing. In upstream vLLM, metadata sync is skipped entirely in eager mode, whereas this change correctly enables it for Ascend to sync necessary flags. Suggest rephrasing for clarity and to more accurately reflect the context, which is important for future maintenance of this temporary fix.
- # TODO: In vLLM, the only thing that needs to be synced is num_tokens, but in
- # our case, we still need to sync the other two flags as well. So we need to
- # include them in the all_reduce operation, and more over, we CANNOT skip it
- # even if we are running in eager mode, which harms performance.
- # FIXME: Restore the `or self.vllm_config.model_config.enforce_eager` here
- # immediately once the other two flags are no longer needed.
+ # TODO: Unlike upstream vLLM which skips metadata sync in eager mode,
+ # we must sync `with_prefill` and `enable_dbo` flags across DP ranks
+ # for correctness on Ascend. This has a minor performance impact.
+ # FIXME: Restore the `or self.vllm_config.model_config.enforce_eager`
+ # check once these flags no longer need to be synced.
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Codecov Report
❌ Your patch status has failed because the patch coverage (0.00%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2766      +/-   ##
==========================================
- Coverage   72.99%   72.90%   -0.10%
==========================================
  Files         153      153
  Lines       21331    21398      +67
==========================================
+ Hits        15571    15600      +29
- Misses       5760     5798      +38
…t#2766)

### What this PR does / why we need it?
Removes the condition that skips metadata synchronization when `enforce_eager` is enabled. This change is necessary to correctly sync the `with_prefill` and `enable_dbo` flags across all data parallel ranks, which is not required in the base implementation. Forcing the sync operation prevents potential inconsistencies, albeit with a minor performance impact.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Add a E2E online test case?

- vLLM version: v0.10.1.1
- vLLM main: vllm-project/vllm@e599e2c

Signed-off-by: Yizhou Liu <[email protected]>
Signed-off-by: offline0806 <[email protected]>
What this PR does / why we need it?
Removes the condition that skips metadata synchronization when `enforce_eager` is enabled. This change is necessary to correctly sync the `with_prefill` and `enable_dbo` flags across all data parallel ranks, which is not required in the base implementation. Forcing the sync operation prevents potential inconsistencies, albeit with a minor performance impact.
Does this PR introduce any user-facing change?
None.
How was this patch tested?
Add an E2E online test case?
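As a possible starting point for such a test, the property worth checking is that every data parallel rank acts on the same flags, even in eager mode. The sketch below simulates ranks in plain Python; `resolve_flags` and `ranks_agree` are hypothetical helpers, and the reduction rule (any rank prefilling, all ranks enabling DBO) is an assumption rather than the project's confirmed behaviour. A real E2E test would launch multiple ranks instead.

```python
from typing import List, Tuple

Flags = Tuple[bool, bool]  # (with_prefill, enable_dbo)


def resolve_flags(local: List[Flags], sync: bool) -> List[Flags]:
    """Flags each simulated rank acts on, with or without a cross-rank sync."""
    if not sync:
        # Without the sync, every rank only sees its own local flags.
        return list(local)
    any_prefill = any(prefill for prefill, _ in local)
    all_dbo = all(dbo for _, dbo in local)
    return [(any_prefill, all_dbo) for _ in local]


def ranks_agree(flags: List[Flags]) -> bool:
    """True when all ranks hold identical flags."""
    return len(set(flags)) == 1


# Rank 0 is decoding only, rank 1 has a prefill in its batch.
local_flags = [(False, True), (True, True)]
skipped = resolve_flags(local_flags, sync=False)  # eager mode before the fix
synced = resolve_flags(local_flags, sync=True)    # eager mode after the fix
```

Here `ranks_agree(skipped)` is false and `ranks_agree(synced)` is true, which is the behaviour an E2E case would need to confirm on real hardware.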