
Conversation


@LookAround0301 LookAround0301 commented Oct 25, 2025

What this PR does / why we need it?

Does this PR introduce any user-facing change?

How was this patch tested?

LookAround0301 and others added 30 commits September 24, 2025 22:16
model runner support cp: input ids, position ids and slot mapping
model runner support cp: metadata, logits indices
Signed-off-by: LookAround <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: Delphine-Nic <[email protected]>
Signed-off-by: LookAround <[email protected]>
…_dev

# Conflicts:
#	vllm_ascend/attention/attention_v1.py
#	vllm_ascend/attention/mla_v1.py
#	vllm_ascend/distributed/parallel_state.py
#	vllm_ascend/envs.py
#	vllm_ascend/ops/fused_moe.py
#	vllm_ascend/platform.py
#	vllm_ascend/worker/model_runner_v1.py
Signed-off-by: Delphine-Nic <[email protected]>
support cp_kv_cache_interleave_size and pd disaggregate
Signed-off-by: LookAround <[email protected]>
…_dev

# Conflicts:
#	vllm_ascend/attention/attention_v1.py
#	vllm_ascend/attention/mla_v1.py
#	vllm_ascend/attention/utils.py
#	vllm_ascend/distributed/llmdatadist_c_mgr_connector.py
#	vllm_ascend/envs.py
#	vllm_ascend/patch/worker/patch_common/patch_distributed.py
#	vllm_ascend/platform.py
#	vllm_ascend/utils.py
#	vllm_ascend/worker/model_runner_v1.py
Signed-off-by: LookAround <[email protected]>
Signed-off-by: zhangsicheng5 <[email protected]>
Signed-off-by: LookAround <[email protected]>
LookAround0301 and others added 26 commits October 22, 2025 15:38
Signed-off-by: LookAround <[email protected]>
Signed-off-by: weiguihua2 <[email protected]>
Signed-off-by: w00896881 <[email protected]>
Signed-off-by: weiguihua2 <[email protected]>
Signed-off-by: weiguihua2 <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: Delphine-Nic <[email protected]>
Signed-off-by: Delphine-Nic <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: Apocalypse990923-qshi <[email protected]>
Signed-off-by: Apocalypse990923-qshi <[email protected]>
Signed-off-by: Apocalypse990923-qshi <[email protected]>

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This PR introduces support for chunked prefill for long sequences, a significant feature involving extensive changes to attention mechanisms and the model runner for distributed context parallelism on Ascend NPUs. While the overall implementation appears robust, I have identified a critical bug that could lead to a runtime crash, along with two high-severity performance bottlenecks stemming from inefficient tensor manipulations and unnecessary CPU-GPU synchronizations. Addressing these issues is crucial for ensuring the correctness and performance of the new feature.


# Get starting rank for this chunk
if request_start_rank_dict is not None:
    start_rank, tokens_blank = request_start_rank_dict.get(req_id, 0)


critical

There is a potential TypeError here. If req_id is not found in request_start_rank_dict, request_start_rank_dict.get(req_id, 0) will return the integer 0. The subsequent attempt to unpack this integer into start_rank, tokens_blank will cause a crash.

While the current call sites might ensure req_id is always present, this code is fragile. To make it more robust, the default value should be a tuple (0, 0) to match the expected unpacking.

Suggested change
- start_rank, tokens_blank = request_start_rank_dict.get(req_id, 0)
+ start_rank, tokens_blank = request_start_rank_dict.get(req_id, (0, 0))
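
For illustration only, a standalone sketch of the failure mode and the safer default (the request id value is hypothetical, not taken from the runner):

    request_start_rank_dict = {}  # simulate a request id that was never registered

    # Raises TypeError: cannot unpack non-iterable int object
    # start_rank, tokens_blank = request_start_rank_dict.get("req-1", 0)

    # Unpacks cleanly: the default matches the (start_rank, tokens_blank) shape
    start_rank, tokens_blank = request_start_rank_dict.get("req-1", (0, 0))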

k_nope, v = kv_nope.split([self.qk_nope_head_dim, self.v_head_dim], dim=-1)
k_pe = k_pe.expand((*k_nope.shape[:-1], -1))

seq_len = torch.stack([seq_len1.cpu(), seq_len2.cpu()])


high

In _compute_prefill_context, seq_len is constructed by moving seq_len1 and seq_len2 to the CPU in every iteration of the loop. This CPU-GPU synchronization inside a loop can be a significant performance bottleneck, especially since this is in the critical prefill path. It appears the npu_ring_mla kernel requires seqlen on the CPU.

To optimize this, consider moving seq_len1.cpu() out of the loop, as seq_len1 is not modified within it. This would reduce the number of GPU-to-CPU transfers by half within this hot loop.
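
A minimal sketch of the suggested hoisting, assuming a helper shape like the one below (the function name, the per-chunk list, and the loop structure are illustrative; only the torch.stack line comes from the diff):

    import torch

    def stack_chunk_seq_lens(seq_len1, seq_len2_per_chunk):
        # seq_len1 is constant across chunks, so copy it to the CPU once,
        # outside the loop; only the per-chunk seq_len2 still needs a transfer.
        seq_len1_cpu = seq_len1.cpu()
        return [
            torch.stack([seq_len1_cpu, seq_len2.cpu()])
            for seq_len2 in seq_len2_per_chunk
        ]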

Comment on lines +1336 to +1342
cp_kv_recover_idx_for_chunk = torch.from_numpy(np.concatenate(self.cp_kv_recover_idx_for_chunk)
                                               ).to(device=self.device)
cp_kv_recover_idx_for_chunk.copy_(torch.tensor(
    np.array(self.cp_kv_recover_idx_for_chunk).flatten().tolist()),
    non_blocking=True)
self.cp_kv_recover_idx_for_chunk = cp_kv_recover_idx_for_chunk.to(
    torch.float32).argsort().to(torch.int32)


high

The creation of cp_kv_recover_idx_for_chunk in generate_kv_idx involves multiple inefficient conversions between Python lists, NumPy arrays, and PyTorch tensors (e.g., np.concatenate, np.array, .flatten().tolist(), torch.tensor). This happens in _prepare_inputs, which is a critical path executed frequently. These expensive conversions can introduce a significant performance bottleneck.

Consider simplifying this logic to use PyTorch operations directly to avoid these conversions and improve performance.
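
A hedged, torch-only sketch of what this could look like, assuming self.cp_kv_recover_idx_for_chunk holds a list of 1-D per-request index arrays (the helper name and dtypes are illustrative, not the runner's actual API):

    import torch

    def build_cp_kv_recover_idx(per_request_indices, device):
        # Concatenate the per-request index arrays once, move the result to the
        # target device, and argsort to obtain the recovery order, avoiding the
        # numpy / Python-list round-trips on the hot path.
        flat = torch.cat(
            [torch.as_tensor(idx, dtype=torch.int64) for idx in per_request_indices])
        flat = flat.to(device, non_blocking=True)
        return flat.argsort().to(torch.int32)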

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.
