77 commits
f5862ac
[mla backend] support dcp&cp prefill
LookAround0301 Sep 24, 2025
d1ad588
model runner support cp: input ids, position ids and slot mapping
HiC4Sh1e Sep 24, 2025
c0e0f51
Merge pull request #28 from HiC4Sh1e/long_seq_dev
LookAround0301 Sep 24, 2025
b301659
model runner support cp: metadata, logits indices
HiC4Sh1e Sep 25, 2025
2f36197
[mla backend] add num_computed_tokens_of_dcp_sp
LookAround0301 Sep 25, 2025
30e8076
Merge pull request #29 from HiC4Sh1e/long_seq_dev
LookAround0301 Sep 25, 2025
f887deb
[bug] fix config & block_table bug
LookAround0301 Sep 25, 2025
1bc86bc
[optim] support not enable cp and add env
LookAround0301 Sep 25, 2025
b69f45a
[bug] fix prefill bug
LookAround0301 Sep 26, 2025
8b333b9
[bug] fix decode bug (single batch)
LookAround0301 Sep 26, 2025
2470894
[bug] fix dcp bug
LookAround0301 Sep 26, 2025
8442fb8
[bug] fix block size bug
LookAround0301 Sep 28, 2025
9022138
[optim] clean code
LookAround0301 Sep 29, 2025
8dda1ba
GQA support pcp and dcp
Sep 30, 2025
3fb037b
[bug fix] add cp env
LookAround0301 Oct 9, 2025
9b14a09
Merge remote-tracking branch 'refs/remotes/origin/main' into long_seq…
LookAround0301 Oct 10, 2025
af8ed6f
bugfix: qwen3 support pcp&dcp
Oct 11, 2025
68b5a2b
bugfix:support customized and separated hccl_buffer_size for process …
Oct 11, 2025
faa564e
[Feature] support mla multi-requests
LookAround0301 Oct 15, 2025
ade3726
[refactor] mla and model_runner refactor
LookAround0301 Oct 15, 2025
c7866ad
[Feature] support GQA multi-requests
Oct 16, 2025
230ee9f
support kv_cache interleave_size and pd disaggregate
zhangsicheng5 Oct 16, 2025
0a69230
Merge pull request #32 from zhangsicheng5/long_seq_dev
LookAround0301 Oct 16, 2025
772dbbe
clean code
LookAround0301 Oct 16, 2025
a5a636f
Merge remote-tracking branch 'refs/remotes/origin/main' into long_seq…
LookAround0301 Oct 18, 2025
30b45f2
[BugFix] mla bug fix
LookAround0301 Oct 18, 2025
83678a2
[Feature] support chunk prefill
LookAround0301 Oct 19, 2025
ff87f0b
pd bugfix
zhangsicheng5 Oct 19, 2025
7e741bc
rename some variable
LookAround0301 Oct 20, 2025
652799e
[BugFix] Resolve error when disabling PCP
Oct 21, 2025
a9dbd44
Refactor original and cp/sp variables
gjc0824 Oct 21, 2025
991adf0
Merge remote-tracking branch 'refs/remotes/origin/main' into long_seq…
LookAround0301 Oct 21, 2025
072962f
fix qwen fullgraph
weiguihua2 Oct 21, 2025
9a6af94
Merge pull request #42 from weiguihua2/long_seq_dev
LookAround0301 Oct 21, 2025
674b941
rename env
LookAround0301 Oct 21, 2025
4e4163e
Merge remote-tracking branch 'refs/remotes/origin/main' into long_seq…
LookAround0301 Oct 21, 2025
c43ce2d
refactor: clean code for attention_v1 and model_runner_v1
Oct 21, 2025
0b144df
Merge pull request #44 from ader47/refactor/clean-code
LookAround0301 Oct 21, 2025
be413ba
Fix BSND-to-TND conversion
Oct 22, 2025
f384f27
Merge pull request #45 from ZhangMingWei716/long_seq_dev
LookAround0301 Oct 22, 2025
662664b
[clean code]
Oct 22, 2025
dae6e25
Bugfix matmul and add pcpmetadata to mla
gjc0824 Oct 22, 2025
2a82691
clean code
weiguihua2 Oct 22, 2025
0ffb638
[clean code] clean attention_v1 and model_runner_v1
LookAround0301 Oct 22, 2025
02b6b6c
[Clean Code] clean mla_v1
gjc0824 Oct 22, 2025
242b249
[clean code]
Oct 22, 2025
1113672
[clean code] mla_v1 pcp_allgather_restore_idx
gjc0824 Oct 22, 2025
d45fa11
Merge remote-tracking branch 'refs/remotes/origin/main' into long_seq…
LookAround0301 Oct 22, 2025
4560840
[clean code]
Oct 22, 2025
e7f2668
Merge remote-tracking branch 'refs/remotes/origin/main' into long_seq…
LookAround0301 Oct 22, 2025
466e130
clean code
weiguihua2 Oct 22, 2025
d49c156
[clean code]
LookAround0301 Oct 22, 2025
8004bdc
remove unnecessary variables && add cp_kv_recover_idx_for_chunk
Apocalypse990923-qshi Oct 22, 2025
7e66ca9
resolve P/D UT issues
gjc0824 Oct 22, 2025
83d0f1b
add attention ut
weiguihua2 Oct 22, 2025
db5c63e
[add attention ut]
Oct 22, 2025
0999eaf
clean code
weiguihua2 Oct 22, 2025
9ad67f3
clean code
weiguihua2 Oct 22, 2025
413a8a1
[clean code]
LookAround0301 Oct 22, 2025
de0ffca
[clean code] fix mla UT
LookAround0301 Oct 22, 2025
f28af0c
[clean code] fix mla UT bug
LookAround0301 Oct 22, 2025
0d1b954
[Bugfix] fix qwen multi-batch issues
gjc0824 Oct 22, 2025
0131005
[clean code] fix attention_lint
LookAround0301 Oct 22, 2025
42c162e
Merge branch 'vllm-project:main' into long_seq_dev
LookAround0301 Oct 23, 2025
d4209c2
[clean code] fix st bug
LookAround0301 Oct 23, 2025
5dea32c
[clean code] fix st bug
LookAround0301 Oct 23, 2025
d57d545
bugfix: fix ci pipeline
Oct 23, 2025
e97de70
bugfix: fix ci pipeline
Oct 23, 2025
0ffb88e
[clean code] clean code
LookAround0301 Oct 23, 2025
5faded9
chunkprefill support multi-req
Apocalypse990923-qshi Oct 24, 2025
f489276
Merge remote-tracking branch 'refs/remotes/origin/main' into long_seq…
LookAround0301 Oct 24, 2025
28e579f
support aclgraph when pcp and dcp
weiguihua2 Oct 24, 2025
6f7812b
rebase long_seq_dev
Apocalypse990923-qshi Oct 24, 2025
33e2a55
[bugfix] aclgraph
Apocalypse990923-qshi Oct 24, 2025
643d88c
support aclgraph when pcp and dcp
weiguihua2 Oct 24, 2025
b179903
support aclgraph when pcp and dcp
weiguihua2 Oct 24, 2025
6237d14
Merge remote-tracking branch 'origin/long_seq_dev' into chunk_prefill
Apocalypse990923-qshi Oct 24, 2025
78 changes: 58 additions & 20 deletions vllm_ascend/attention/attention_v1.py
@@ -871,26 +871,64 @@ def _forward_decode_pcp_dcp(self, query: torch.Tensor,
         num_heads = self.num_heads

         # 1. Compute out&lse by "npu_fused_infer_attention_score"
-        attn_out, attn_lse = torch.ops.npu.npu_fused_infer_attention_score(
-            query.view(query.shape[0], 1, query.shape[1], query.shape[2]),
-            # [b,num_heads,head_size] -> [b,1,num_heads,head_size]
-            self.key_cache.view(self.key_cache.shape[0],
-                                self.key_cache.shape[1], -1),
-            self.value_cache.view(self.key_cache.shape[0],
-                                  self.key_cache.shape[1], -1),
-            num_heads=num_heads,
-            num_key_value_heads=self.num_kv_heads,
-            input_layout="BSND",
-            atten_mask=None,
-            scale=self.scale,
-            antiquant_mode=0,
-            antiquant_scale=None,
-            softmax_lse_flag=True,
-            block_table=attn_metadata.block_tables,
-            block_size=self.key_cache.shape[1],
-            actual_seq_lengths_kv=attn_metadata.decode_meta.
-            num_computed_tokens_of_pcp_dcp[:, self.pcp_rank, self.dcp_rank],
-        )
+        q_nope = query.view(query.shape[0], 1, query.shape[1], query.shape[2])
+        # [b,num_heads,head_size] -> [b,1,num_heads,head_size]
+        k_nope = self.key_cache.view(self.key_cache.shape[0],
+                                     self.key_cache.shape[1], -1)
+        value = self.value_cache.view(self.key_cache.shape[0],
+                                      self.key_cache.shape[1], -1)
+        common_kwargs = {
+            'num_heads': num_heads,
+            'num_key_value_heads': self.num_kv_heads,
+            'input_layout': "BSND",
+            'atten_mask': None,
+            'scale': self.scale,
+            'antiquant_mode': 0,
+            'antiquant_scale': None,
+            'softmax_lse_flag': True,
+            'block_table': attn_metadata.block_tables,
+            'block_size': self.key_cache.shape[1],
+            'actual_seq_lengths_kv':
+            attn_metadata.decode.num_computed_tokens_of_cp_dcp[:, self.cp_rank,
+                                                               self.dcp_rank],
+        }
+        graph_params = get_graph_params()
+        forward_context: ForwardContext = get_forward_context()
+        num_tokens = query.shape[0]
+        if forward_context.capturing:
+            stream = torch_npu.npu.current_stream()
+
+            event = torch.npu.ExternalEvent()
+            event.wait(stream)
+            event.reset(stream)
+            graph_params.events[num_tokens].append(event)
+
+            workspace = graph_params.workspaces.get(num_tokens)
+            if workspace is None:
+                workspace = torch_npu._npu_fused_infer_attention_score_get_max_workspace(
+                    q_nope, k_nope, value, **common_kwargs)
+                graph_params.workspaces[num_tokens] = workspace
+            attn_out = torch.empty_like(q_nope)
+            attn_lse = torch.empty((num_tokens, num_heads, 1, 1),
+                                   dtype=torch.float,
+                                   device=q_nope.device)
+
+            graph_params.attn_params[num_tokens].append(
+                (q_nope, k_nope, value, self.num_heads, self.num_kv_heads,
+                 self.scale, attn_metadata.block_tables,
+                 self.key_cache.shape[1],
+                 attn_metadata.decode.num_computed_tokens_of_cp_dcp[:, self.cp_rank, self.dcp_rank],
+                 workspace, attn_out, attn_lse, self.cp_rank, self.dcp_rank,
+                 self.dcp_size))
+            torch.npu.graph_task_group_begin(stream)
+            torch_npu.npu_fused_infer_attention_score.out(
+                q_nope,
+                k_nope,
+                value,
+                **common_kwargs,
+                workspace=workspace,
+                out=[attn_out, attn_lse])
+            handle = torch.npu.graph_task_group_end(stream)
+            graph_params.handles[num_tokens].append(handle)
+        else:
+            attn_out, attn_lse = torch_npu.npu_fused_infer_attention_score(
+                q_nope, k_nope, value, **common_kwargs)

         attn_out = attn_out.view(attn_out.shape[0], attn_out.shape[2],
                                  attn_out.shape[3])
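In the capture branch of the diff, `graph_params.workspaces` caches one scratch workspace per `num_tokens`, so graph replays at the same batch size reuse the buffer instead of re-querying the maximum workspace size each step. A stand-alone sketch of that caching pattern (names are illustrative; `alloc_fn` stands in for `_npu_fused_infer_attention_score_get_max_workspace`):

```python
class WorkspaceCache:
    """Per-batch-size scratch-buffer cache, mirroring the
    graph_params.workspaces dict keyed by num_tokens above."""

    def __init__(self, alloc_fn):
        self._alloc = alloc_fn   # invoked once per new num_tokens
        self._cache = {}
        self.allocations = 0     # counter, exposed for inspection

    def get(self, num_tokens):
        ws = self._cache.get(num_tokens)
        if ws is None:
            ws = self._alloc(num_tokens)
            self._cache[num_tokens] = ws
            self.allocations += 1
        return ws
```

Because captured graphs replay with fixed shapes, keying the cache on the token count is sufficient: every replay at that size sees the identical buffer address recorded during capture.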