[main] add pd transfer for ascend scheduler #2753
Conversation
Code Review

This pull request introduces a 'pd transfer' feature for the Ascend scheduler, aiming to optimize performance by separating the prefill and decode phases. The changes add an `enable_pd_transfer` configuration option, implement a two-phase state machine ('prefill', 'decode') in the `AscendScheduler`, and add unit tests to validate the new functionality. My review has identified a few areas for improvement. The phase transition logic currently only allows a one-way switch from 'prefill' to 'decode', which may be suboptimal for dynamic workloads. Additionally, the new unit test could be made more robust by avoiding manual manipulation of the scheduler's internal state. Finally, there is a minor point on configuration access that could be improved to make the code more robust against configuration errors.
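To make the two-phase flow concrete, here is a minimal, self-contained sketch of the state machine this summary describes. It reuses names from the diff (`phase`, `waiting`, `running`, `enable_pd_transfer`) but is a toy model, not the actual `AscendScheduler`:

```python
class PdTransferPhaseSketch:
    """Toy model of the one-way prefill -> decode switch in this PR."""

    def __init__(self, enable_pd_transfer: bool):
        # With pd transfer enabled, the scheduler starts in batch-prefill mode.
        self.phase = "prefill" if enable_pd_transfer else None
        self.waiting: list = []  # requests awaiting prefill
        self.running: list = []  # requests currently scheduled

    def maybe_switch_phase(self) -> None:
        # Mirrors the condition quoted later in this review: once both
        # queues drain, flip to decode -- and, as reviewed, never flip back.
        if self.phase == "prefill" and not self.waiting and not self.running:
            self.phase = "decode"
```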
```python
def test_scheduler_with_pd_transfer(self):
    scheduler = self.create_scheduler()
    scheduler.phase = "prefill"
```
Manually setting the internal state `scheduler.phase` makes this test brittle and less representative of real usage. If the initialization logic in `AscendScheduler.__init__` changes, this test would not catch the regression. A better approach is to initialize the scheduler with `enable_pd_transfer=True` in its configuration, which would correctly set the initial phase.

To achieve this, you could modify the `create_scheduler` helper method to accept configuration overrides. For example:
```python
from typing import Any, Dict, Optional

def create_scheduler(self, mock_compute_encoder_budget, scheduler_config_override: Optional[Dict[str, Any]] = None):
    # ... existing setup ...
    scheduler_config = SchedulerConfig(
        # ...
    )
    if scheduler_config_override:
        for key, value in scheduler_config_override.items():
            setattr(scheduler_config, key, value)
    # ... rest of the function ...
```
Then, the test can be updated to:
```python
scheduler = self.create_scheduler(scheduler_config_override={"enable_pd_transfer": True})
self.assertEqual(scheduler.phase, "prefill")
```
This change would make the test more robust and also serve to verify the scheduler's initialization logic.
`vllm_ascend/core/scheduler.py` (outdated)
```python
enable_pd_transfer = getattr(self.scheduler_config,
                             'enable_pd_transfer',
                             False)
```
Using `getattr` with a default value for `enable_pd_transfer` can mask configuration issues. The `AscendSchedulerConfig` is expected to always have the `enable_pd_transfer` attribute. If it is missing for some reason (e.g., a typo in the attribute name in the config class), `getattr` will silently fall back to `False`, disabling the feature without any warning. This can make debugging difficult.

It's more robust to access the attribute directly. This ensures that any configuration problem results in a clear `AttributeError`, which is a fail-fast approach.
```python
enable_pd_transfer = self.scheduler_config.enable_pd_transfer
```
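To see why this matters, here is a small, self-contained illustration (using a stand-in config class, not the real `AscendSchedulerConfig`) of how `getattr` with a default silently disables the feature when the attribute name is wrong, while direct access fails fast:

```python
class MisspelledConfig:
    """Stand-in config where the flag was accidentally misnamed."""
    enable_pd_transfor = True  # typo: should be enable_pd_transfer

cfg = MisspelledConfig()

# getattr with a default hides the typo: the feature is silently off.
assert getattr(cfg, "enable_pd_transfer", False) is False

# Direct access surfaces the problem immediately as an AttributeError.
try:
    _ = cfg.enable_pd_transfer
except AttributeError as err:
    print(f"fail-fast: {err}")
```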
```python
if not self.waiting and not self.running:
    self.phase = "decode"
```
The phase transition from 'prefill' to 'decode' is currently one-way. Once the scheduler enters the 'decode' phase, it never returns to 'prefill'. If new requests arrive while the system is in the 'decode' phase, they will be prefilled and then immediately start decoding, which might not be the most efficient approach for the Ascend hardware this feature targets, as it breaks the strict batching of prefill operations.

To improve performance for dynamic workloads, consider adding logic to allow the scheduler to switch back to the 'prefill' phase. For instance, you could add a check at the beginning of the `schedule` method:
if self.phase == "decode" and not self.running and self.waiting:
self.phase = "prefill"
This would ensure that if the decoding queue is empty and new requests are waiting, the scheduler can switch back to the more efficient batch prefill mode.
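Combined with the existing transition from the diff, the top of `schedule` would then hold both checks. The following is a sketch only, assuming `phase`, `waiting`, and `running` keep the semantics shown above:

```python
def schedule(self):
    # Suggested addition: fall back to batch prefill once decoding has
    # drained and new requests are queued.
    if self.phase == "decode" and not self.running and self.waiting:
        self.phase = "prefill"
    # ... existing scheduling logic ...
    # Existing one-way transition from the diff: when everything queued
    # has been prefilled, move on to the decode phase.
    if not self.waiting and not self.running:
        self.phase = "decode"
```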
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Enabling this feature will result in differences in the definition of …
Codecov Report

❌ Patch coverage is …

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #2753      +/-   ##
==========================================
+ Coverage   74.76%   74.90%   +0.13%
==========================================
  Files         150      152       +2
  Lines       20891    21008     +117
==========================================
+ Hits        15620    15736     +116
- Misses       5271     5272       +1
```

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
```python
self.min_p = self.min_p_device[:0]
...
MinPLogitsProcessor.__init__ = min_p_logits_processor_init_func
```
Do not patch vLLM; you should create an Ascend `build_logitsprocs` in `npu_input_batch` instead.
Changed from patching the `MinPLogitsProcessor` init func to redefining `build_logitsprocs`; code in 30571f1.
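For readers following the thread, the shape of that change is the usual alternative to monkey-patching: subclass (or re-implement) the processor and have an Ascend-local builder construct it, leaving upstream vLLM untouched. The names and import path below are illustrative guesses, not the exact code from 30571f1:

```python
# Hypothetical sketch; the real change lives in npu_input_batch.
from vllm.v1.sample.logits_processor import MinPLogitsProcessor


class AscendMinPLogitsProcessor(MinPLogitsProcessor):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Ascend-specific device handling goes here instead of in a
        # globally patched MinPLogitsProcessor.__init__.
        self.min_p = self.min_p_device[:0]


def build_logitsprocs(*args, **kwargs):
    # Ascend variant of the builder: construct the default processors,
    # then substitute the Ascend min-p implementation.
    ...
```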
### What this PR does / why we need it?

For offline scenarios, adjust the scheduling process to prioritize the prefill phase of all requests, then process the decode phase of all requests.

### Does this PR introduce any user-facing change?

### How was this patch tested?

```python
max_num_seqs=24,
additional_config={
    "ascend_scheduler_config": {
        "enabled": True,
        "enable_pd_transfer": True,
        "decode_max_num_seqs": 24,
        "enable_chunked_prefill": False
    }
},
```

| input | output | num prompts | max_num_seqs | dp | tp | scheduler | tps |
| ------ | ------ | ----------- | ------------ | -- | -- | --------- | --- |
| dapo-math-17K | 2K | 384 | 24 | 2 | 1 | v1 | 234.06 |
| dapo-math-17K | 2K | 384 | 24 | 2 | 1 | pd transfer | 239.59 (+2.4%) |
| dapo-math-17K | 2K | 384 | 24 | 4 | 1 | v1 | 222.85 |
| dapo-math-17K | 2K | 384 | 24 | 4 | 1 | pd transfer | 225.81 (+1.3%) |

- vLLM version: v0.10.1.1
- vLLM main: vllm-project/vllm@6fb2788

Signed-off-by: CaranLic <[email protected]>
Signed-off-by: offline0806 <[email protected]>