[Long Sequence Feat] Support chunked prefill #3734
base: main
Conversation
Signed-off-by: LookAround <[email protected]>
Signed-off-by: chenjie <[email protected]>
model runner support cp: input ids, position ids and slot mapping
Signed-off-by: chenjie <[email protected]>
Signed-off-by: LookAround <[email protected]>
model runner support cp: metadata, logits indices
Signed-off-by: LookAround <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: Delphine-Nic <[email protected]>
Signed-off-by: LookAround <[email protected]>
…_dev

# Conflicts:
#   vllm_ascend/attention/attention_v1.py
#   vllm_ascend/attention/mla_v1.py
#   vllm_ascend/distributed/parallel_state.py
#   vllm_ascend/envs.py
#   vllm_ascend/ops/fused_moe.py
#   vllm_ascend/platform.py
#   vllm_ascend/worker/model_runner_v1.py
Signed-off-by: Delphine-Nic <[email protected]>
…group initialization
Signed-off-by: Delphine-Nic <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: Delphine-Nic <[email protected]>
Signed-off-by: zhangsicheng5 <[email protected]>
support cp_kv_cache_interleave_size and pd disaggregate
Signed-off-by: LookAround <[email protected]>
…_dev

# Conflicts:
#   vllm_ascend/attention/attention_v1.py
#   vllm_ascend/attention/mla_v1.py
#   vllm_ascend/attention/utils.py
#   vllm_ascend/distributed/llmdatadist_c_mgr_connector.py
#   vllm_ascend/envs.py
#   vllm_ascend/patch/worker/patch_common/patch_distributed.py
#   vllm_ascend/platform.py
#   vllm_ascend/utils.py
#   vllm_ascend/worker/model_runner_v1.py
Signed-off-by: LookAround <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: zhangsicheng5 <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: Feng Liu <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: Apocalypse990923-qshi <[email protected]>
Signed-off-by: gaojc <[email protected]>
Signed-off-by: weiguihua2 <[email protected]>
Signed-off-by: w00896881 <[email protected]>
Signed-off-by: weiguihua2 <[email protected]>
Signed-off-by: weiguihua2 <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: gaojc <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: Delphine-Nic <[email protected]>
Signed-off-by: Delphine-Nic <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: Apocalypse990923-qshi <[email protected]>
Signed-off-by: weiguihua2 <[email protected]>
Signed-off-by: Apocalypse990923-qshi <[email protected]>
Signed-off-by: Apocalypse990923-qshi <[email protected]>
Signed-off-by: weiguihua2 <[email protected]>
Signed-off-by: weiguihua2 <[email protected]>
Code Review
This PR introduces support for chunked prefill for long sequences, a significant feature involving extensive changes to attention mechanisms and the model runner for distributed context parallelism on Ascend NPUs. While the overall implementation appears robust, I have identified a critical bug that could lead to a runtime crash, along with two high-severity performance bottlenecks stemming from inefficient tensor manipulations and unnecessary CPU-GPU synchronizations. Addressing these issues is crucial for ensuring the correctness and performance of the new feature.
```python
# Get starting rank for this chunk
if request_start_rank_dict is not None:
    start_rank, tokens_blank = request_start_rank_dict.get(req_id, 0)
```
There is a potential `TypeError` here. If `req_id` is not found in `request_start_rank_dict`, `request_start_rank_dict.get(req_id, 0)` will return the integer `0`, and the subsequent attempt to unpack it into `start_rank, tokens_blank` will crash.
While the current call sites might ensure `req_id` is always present, this code is fragile. To make it more robust, the default value should be a tuple `(0, 0)` to match the expected unpacking.
```diff
- start_rank, tokens_blank = request_start_rank_dict.get(req_id, 0)
+ start_rank, tokens_blank = request_start_rank_dict.get(req_id, (0, 0))
```
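For illustration, a minimal repro of the failure mode described above (the dictionary contents and request ids here are hypothetical):

```python
request_start_rank_dict = {"req-0": (2, 128)}

start_rank, tokens_blank = request_start_rank_dict.get("req-0", 0)       # ok -> (2, 128)
# start_rank, tokens_blank = request_start_rank_dict.get("req-1", 0)     # TypeError:
#     cannot unpack non-iterable int object
start_rank, tokens_blank = request_start_rank_dict.get("req-1", (0, 0))  # ok -> (0, 0)
```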
```python
k_nope, v = kv_nope.split([self.qk_nope_head_dim, self.v_head_dim], dim=-1)
k_pe = k_pe.expand((*k_nope.shape[:-1], -1))
```

```python
seq_len = torch.stack([seq_len1.cpu(), seq_len2.cpu()])
```
In `_compute_prefill_context`, `seq_len` is constructed by moving `seq_len1` and `seq_len2` to the CPU in every iteration of the loop. This device-to-host synchronization inside a loop can be a significant performance bottleneck, especially since it sits in the critical prefill path. It appears the `npu_ring_mla` kernel requires `seq_len` on the CPU.
To optimize this, consider moving `seq_len1.cpu()` out of the loop, since `seq_len1` is not modified within it. This would halve the number of device-to-host transfers in this hot loop.
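A minimal sketch of the suggested hoisting, assuming `seq_len1` is loop-invariant; the helper name and loop structure are hypothetical and not the PR's actual code:

```python
import torch

def ring_mla_seq_lens(seq_len1: torch.Tensor, chunk_seq_lens: list[torch.Tensor]):
    """Yield per-chunk seq_len tensors, copying seq_len1 to the host only once."""
    seq_len1_cpu = seq_len1.cpu()  # single device-to-host transfer, hoisted out of the loop
    for seq_len2 in chunk_seq_lens:
        # only the per-chunk lengths still need a host copy inside the loop
        yield torch.stack([seq_len1_cpu, seq_len2.cpu()])
```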
```python
cp_kv_recover_idx_for_chunk = torch.from_numpy(
    np.concatenate(self.cp_kv_recover_idx_for_chunk)).to(device=self.device)
cp_kv_recover_idx_for_chunk.copy_(torch.tensor(
    np.array(self.cp_kv_recover_idx_for_chunk).flatten().tolist()),
    non_blocking=True)
self.cp_kv_recover_idx_for_chunk = cp_kv_recover_idx_for_chunk.to(
    torch.float32).argsort().to(torch.int32)
```
The creation of `cp_kv_recover_idx_for_chunk` in `generate_kv_idx` involves multiple inefficient conversions between Python lists, NumPy arrays, and PyTorch tensors (e.g., `np.concatenate`, `np.array`, `.flatten().tolist()`, `torch.tensor`). This happens in `_prepare_inputs`, which is a critical path executed frequently. These expensive conversions can introduce a significant performance bottleneck.
Consider simplifying this logic to use PyTorch operations directly to avoid these conversions and improve performance.
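One possible simplification, sketched with torch ops only; the function name and input layout are assumptions rather than the PR's actual code, and it presumes a plain integer `argsort` preserves the intended recovery order:

```python
import torch

def build_cp_kv_recover_idx(chunk_indices: list[list[int]], device) -> torch.Tensor:
    """Concatenate per-chunk index lists and return the permutation that restores order."""
    idx = torch.cat([torch.as_tensor(c, dtype=torch.int64) for c in chunk_indices])
    # argsort the integer tensor directly; no numpy round trip or float32 cast is needed
    return idx.argsort().to(dtype=torch.int32, device=device, non_blocking=True)
```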
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
This pull request has conflicts; please resolve them before we can evaluate the pull request.
What this PR does / why we need it?
Does this PR introduce any user-facing change?
How was this patch tested?