[Fix] Ensure metadata sync across DP ranks in eager mode #2766

yiz-liu · 2025-09-05T01:53:07Z

What this PR does / why we need it?

Removes the condition that skips metadata synchronization when enforce_eager is enabled.

This change is necessary to correctly sync the with_prefill and enable_dbo flags across all data parallel ranks, which is not required in the base implementation. Forcing the sync operation prevents potential inconsistencies, albeit with a minor performance impact.

Does this PR introduce any user-facing change?

None.

How was this patch tested?

Add a E2E online test case?

vLLM version: v0.10.1.1
vLLM main: vllm-project/vllm@e599e2c

Removes the condition that skips metadata synchronization when `enforce_eager` is enabled. This change is necessary to correctly sync the `with_prefill` and `enable_dbo` flags across all data parallel ranks, which is not required in the base implementation. Forcing the sync operation prevents potential inconsistencies, albeit with a minor performance impact. Signed-off-by: Yizhou Liu <[email protected]>

gemini-code-assist

Code Review

This pull request correctly addresses a synchronization issue in data parallel eager mode by ensuring critical metadata flags are synced across ranks. The change is a necessary fix for correctness on Ascend hardware. I've suggested a minor refinement to the code comment to improve clarity and precision for future maintainability. Additionally, as noted in the PR description, adding an E2E test case to validate this behavior would be highly beneficial.

gemini-code-assist · 2025-09-05T01:54:46Z

vllm_ascend/worker/model_runner_v1.py

+        # TODO: In vLLM, the only thing that needs to be synced is num_tokens, but in
+        # our case, we still need to sync the other two flags as well. So we need to
+        # include them in the all_reduce operation, and more over, we CANNOT skip it
+        # even if we are running in eager mode, which harms performance.
+        # FIXME: Restore the `or self.vllm_config.model_config.enforce_eager` here
+        # immediately once the other two flags are no longer needed.


The current comment is slightly imprecise and could be confusing. In upstream vLLM, metadata sync is skipped entirely in eager mode, whereas this change correctly enables it for Ascend to sync necessary flags. Suggest rephrasing for clarity and to more accurately reflect the context, which is important for future maintenance of this temporary fix.

Suggested change

# TODO: In vLLM, the only thing that needs to be synced is num_tokens, but in

# our case, we still need to sync the other two flags as well. So we need to

# include them in the all_reduce operation, and more over, we CANNOT skip it

# even if we are running in eager mode, which harms performance.

# FIXME: Restore the `or self.vllm_config.model_config.enforce_eager` here

# immediately once the other two flags are no longer needed.

# TODO: Unlike upstream vLLM which skips metadata sync in eager mode,

# we must sync `with_prefill` and `enable_dbo` flags across DP ranks

# for correctness on Ascend. This has a minor performance impact.

# FIXME: Restore the `or self.vllm_config.model_config.enforce_eager`

# check once these flags no longer need to be synced.

github-actions · 2025-09-05T03:19:08Z

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:‌‌

A PR should do only one thing, smaller PRs enable faster reviews.
Every PR should include unit tests and end-to-end tests ‌to ensure it works and is not broken by other future PRs.
Write the commit message by fulfilling the PR description to help reviewer and future developers understand.

If CI fails, you can run linting and testing checks locally according Contributing and Testing.

codecov · 2025-09-05T10:23:25Z

Codecov Report

❌ Patch coverage is 0% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 72.90%. Comparing base (4c90fa7) to head (fadf917).
⚠️ Report is 12 commits behind head on main.

Files with missing lines	Patch %	Lines
vllm_ascend/worker/model_runner_v1.py	0.00%	1 Missing ⚠️

❌ Your patch status has failed because the patch coverage (0.00%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2766      +/-   ##
==========================================
- Coverage   72.99%   72.90%   -0.10%     
==========================================
  Files         153      153              
  Lines       21331    21398      +67     
==========================================
+ Hits        15571    15600      +29     
- Misses       5760     5798      +38

Flag	Coverage Δ
unittests	`72.90% <0.00%> (-0.10%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

…t#2766) ### What this PR does / why we need it? Removes the condition that skips metadata synchronization when `enforce_eager` is enabled. This change is necessary to correctly sync the `with_prefill` and `enable_dbo` flags across all data parallel ranks, which is not required in the base implementation. Forcing the sync operation prevents potential inconsistencies, albeit with a minor performance impact. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? Add a E2E online test case? - vLLM version: v0.10.1.1 - vLLM main: vllm-project/vllm@e599e2c Signed-off-by: Yizhou Liu <[email protected]>

…t#2766) ### What this PR does / why we need it? Removes the condition that skips metadata synchronization when `enforce_eager` is enabled. This change is necessary to correctly sync the `with_prefill` and `enable_dbo` flags across all data parallel ranks, which is not required in the base implementation. Forcing the sync operation prevents potential inconsistencies, albeit with a minor performance impact. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? Add a E2E online test case? - vLLM version: v0.10.1.1 - vLLM main: vllm-project/vllm@e599e2c Signed-off-by: Yizhou Liu <[email protected]> Signed-off-by: offline0806 <[email protected]>

gemini-code-assist bot reviewed Sep 5, 2025

View reviewed changes

wangxiyuan approved these changes Sep 8, 2025

View reviewed changes

wangxiyuan merged commit c735bb0 into vllm-project:main Sep 8, 2025
30 of 33 checks passed

yiz-liu deleted the fix-eager-dp branch September 8, 2025 01:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Fix] Ensure metadata sync across DP ranks in eager mode #2766

[Fix] Ensure metadata sync across DP ranks in eager mode #2766

yiz-liu commented Sep 5, 2025 •

edited by github-actions bot

Loading

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Sep 5, 2025

Uh oh!

github-actions bot commented Sep 5, 2025

Uh oh!

codecov bot commented Sep 5, 2025 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

[Fix] Ensure metadata sync across DP ranks in eager mode #2766

[Fix] Ensure metadata sync across DP ranks in eager mode #2766

Conversation

yiz-liu commented Sep 5, 2025 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this PR does / why we need it?

Does this PR introduce any user-facing change?

How was this patch tested?

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Sep 5, 2025

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Sep 5, 2025

Uh oh!

codecov bot commented Sep 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Uh oh!

yiz-liu commented Sep 5, 2025 •

edited by github-actions bot

Loading

codecov bot commented Sep 5, 2025 •

edited

Loading