[Long Sequence Feat] Support chunked prefill #3734
base: main
Conversation
Signed-off-by: LookAround <[email protected]>
Signed-off-by: chenjie <[email protected]>
model runner support cp: input ids, position ids and slot mapping
Signed-off-by: chenjie <[email protected]>
Signed-off-by: LookAround <[email protected]>
model runner support cp: metadata, logits indices
Signed-off-by: LookAround <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: Delphine-Nic <[email protected]>
Signed-off-by: LookAround <[email protected]>
…_dev

# Conflicts:
#   vllm_ascend/attention/attention_v1.py
#   vllm_ascend/attention/mla_v1.py
#   vllm_ascend/distributed/parallel_state.py
#   vllm_ascend/envs.py
#   vllm_ascend/ops/fused_moe.py
#   vllm_ascend/platform.py
#   vllm_ascend/worker/model_runner_v1.py
Signed-off-by: Delphine-Nic <[email protected]>
…group initialization
Signed-off-by: Delphine-Nic <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: Delphine-Nic <[email protected]>
Signed-off-by: zhangsicheng5 <[email protected]>
support cp_kv_cache_interleave_size and pd disaggregate
Signed-off-by: LookAround <[email protected]>
…_dev

# Conflicts:
#   vllm_ascend/attention/attention_v1.py
#   vllm_ascend/attention/mla_v1.py
#   vllm_ascend/attention/utils.py
#   vllm_ascend/distributed/llmdatadist_c_mgr_connector.py
#   vllm_ascend/envs.py
#   vllm_ascend/patch/worker/patch_common/patch_distributed.py
#   vllm_ascend/platform.py
#   vllm_ascend/utils.py
#   vllm_ascend/worker/model_runner_v1.py
Signed-off-by: LookAround <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: zhangsicheng5 <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: Feng Liu <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: Apocalypse990923-qshi <[email protected]>
Signed-off-by: gaojc <[email protected]>
Signed-off-by: weiguihua2 <[email protected]>
Signed-off-by: w00896881 <[email protected]>
Signed-off-by: weiguihua2 <[email protected]>
Signed-off-by: weiguihua2 <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: gaojc <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: Delphine-Nic <[email protected]>
Signed-off-by: Delphine-Nic <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: Apocalypse990923-qshi <[email protected]>
Signed-off-by: weiguihua2 <[email protected]>
Signed-off-by: Apocalypse990923-qshi <[email protected]>
Signed-off-by: Apocalypse990923-qshi <[email protected]>
Signed-off-by: weiguihua2 <[email protected]>
Signed-off-by: weiguihua2 <[email protected]>
Code Review
This PR introduces support for chunked prefill for long sequences, a significant feature involving extensive changes to attention mechanisms and the model runner for distributed context parallelism on Ascend NPUs. While the overall implementation appears robust, I have identified a critical bug that could lead to a runtime crash, along with two high-severity performance bottlenecks stemming from inefficient tensor manipulations and unnecessary CPU-GPU synchronizations. Addressing these issues is crucial for ensuring the correctness and performance of the new feature.
```python
# Get starting rank for this chunk
if request_start_rank_dict is not None:
    start_rank, tokens_blank = request_start_rank_dict.get(req_id, 0)
```
There is a potential `TypeError` here. If `req_id` is not found in `request_start_rank_dict`, `request_start_rank_dict.get(req_id, 0)` will return the integer `0`, and the subsequent attempt to unpack it into `start_rank, tokens_blank` will crash.
While the current call sites might ensure `req_id` is always present, this code is fragile. To make it more robust, the default value should be a tuple `(0, 0)` to match the expected unpacking.
```diff
- start_rank, tokens_blank = request_start_rank_dict.get(req_id, 0)
+ start_rank, tokens_blank = request_start_rank_dict.get(req_id, (0, 0))
```
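For illustration, a minimal repro of the failure mode described above (the dictionary contents and request ids here are hypothetical):

```python
request_start_rank_dict = {"req-0": (2, 128)}

start_rank, tokens_blank = request_start_rank_dict.get("req-0", 0)       # ok -> (2, 128)
# start_rank, tokens_blank = request_start_rank_dict.get("req-1", 0)     # TypeError:
#     cannot unpack non-iterable int object
start_rank, tokens_blank = request_start_rank_dict.get("req-1", (0, 0))  # ok -> (0, 0)
```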
```python
k_nope, v = kv_nope.split([self.qk_nope_head_dim, self.v_head_dim], dim=-1)
k_pe = k_pe.expand((*k_nope.shape[:-1], -1))
```

```python
seq_len = torch.stack([seq_len1.cpu(), seq_len2.cpu()])
```
In `_compute_prefill_context`, `seq_len` is constructed by moving `seq_len1` and `seq_len2` to the CPU in every iteration of the loop. This device-to-host synchronization inside a loop can be a significant performance bottleneck, especially since it sits in the critical prefill path. It appears the `npu_ring_mla` kernel requires `seq_len` on the CPU.
To optimize this, consider moving `seq_len1.cpu()` out of the loop, since `seq_len1` is not modified within it. This would halve the number of device-to-host transfers in this hot loop.
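A minimal sketch of the suggested hoisting, assuming `seq_len1` is loop-invariant; the helper name and loop structure are hypothetical and not the PR's actual code:

```python
import torch

def ring_mla_seq_lens(seq_len1: torch.Tensor, chunk_seq_lens: list[torch.Tensor]):
    """Yield per-chunk seq_len tensors, copying seq_len1 to the host only once."""
    seq_len1_cpu = seq_len1.cpu()  # single device-to-host transfer, hoisted out of the loop
    for seq_len2 in chunk_seq_lens:
        # only the per-chunk lengths still need a host copy inside the loop
        yield torch.stack([seq_len1_cpu, seq_len2.cpu()])
```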
```python
cp_kv_recover_idx_for_chunk = torch.from_numpy(
    np.concatenate(self.cp_kv_recover_idx_for_chunk)).to(device=self.device)
cp_kv_recover_idx_for_chunk.copy_(torch.tensor(
    np.array(self.cp_kv_recover_idx_for_chunk).flatten().tolist()),
    non_blocking=True)
self.cp_kv_recover_idx_for_chunk = cp_kv_recover_idx_for_chunk.to(
    torch.float32).argsort().to(torch.int32)
```
The creation of `cp_kv_recover_idx_for_chunk` in `generate_kv_idx` involves multiple inefficient conversions between Python lists, NumPy arrays, and PyTorch tensors (e.g., `np.concatenate`, `np.array`, `.flatten().tolist()`, `torch.tensor`). This happens in `_prepare_inputs`, which is a critical path executed frequently. These expensive conversions can introduce a significant performance bottleneck.
Consider simplifying this logic to use PyTorch operations directly to avoid these conversions and improve performance.
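One possible simplification, sketched with torch ops only; the function name and input layout are assumptions rather than the PR's actual code, and it presumes a plain integer `argsort` preserves the intended recovery order:

```python
import torch

def build_cp_kv_recover_idx(chunk_indices: list[list[int]], device) -> torch.Tensor:
    """Concatenate per-chunk index lists and return the permutation that restores order."""
    idx = torch.cat([torch.as_tensor(c, dtype=torch.int64) for c in chunk_indices])
    # argsort the integer tensor directly; no numpy round trip or float32 cast is needed
    return idx.argsort().to(dtype=torch.int32, device=device, non_blocking=True)
```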
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
This pull request has conflicts; please resolve them before we can evaluate the pull request.
What this PR does / why we need it?
Does this PR introduce any user-facing change?
How was this patch tested?