
Conversation

underfituu
Contributor

@underfituu underfituu commented Jul 29, 2025

What this PR does / why we need it?

This PR addresses a critical issue where Node D (Decode) failures cause Node P (Prefill) to hang because it cannot release its KV cache.

Trigger Scenarios:

  1. Node D fails mid-inference (e.g., network disconnection)
  2. Node D rejects requests at a certain stage (e.g., via API server)
  3. Load-test script termination causes Node P or D to abort queued requests

Root Cause Analysis:

  1. Currently, Node D sends a "KV cache pull complete, release approved" message to Node P
  2. This message is transmitted via the worker connector. If the P-D connection breaks or requests are rejected upstream, Node D cannot send the message
  3. Node P never releases the KV cache without receiving this message (a sketch of this flow follows below)
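As a rough illustration of this flow, here is a minimal sketch of the producer-side bookkeeping it implies. All names here (pending_pull, on_request_finished, on_pull_notification, free_blocks) are illustrative assumptions, not the actual vllm-ascend connector API:

```python
# Sketch only: the prefill/producer node parks finished requests until the
# decode node confirms that the KV cache pull completed.

pending_pull: dict[str, list[int]] = {}  # request_id -> KV block ids held on P

def on_request_finished(request_id: str, block_ids: list[int]) -> None:
    # P has produced the KV cache; keep the blocks until D confirms the pull.
    pending_pull[request_id] = block_ids

def on_pull_notification(request_id: str, free_blocks) -> None:
    # D says "KV cache pull complete, release approved" -> P may free the blocks.
    blocks = pending_pull.pop(request_id, None)
    if blocks is not None:
        free_blocks(blocks)
    # If this message never arrives (link broken, request rejected upstream),
    # the entry stays in pending_pull forever and the KV cache leaks.
```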

Solution:
Following the vLLM community's approach (the NIXL connector timeout mechanism), we are implementing:

  • A timeout mechanism with comprehensive warnings
  • Updated README documentation
  • Reference: vLLM's optimization PR #20139

Note: The full disaster-recovery solution is still being designed. This PR lands a simple fix on the v0.9.1-dev branch; the complete solution will evolve in main (PR #2174). A minimal sketch of the timeout idea follows.
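The sketch below shows the general shape of such a timeout-based fallback on the producer node. The timeout value, helper names, and call site are assumptions for illustration, not the configuration or code introduced by this PR:

```python
import logging
import time

logger = logging.getLogger(__name__)

# Illustrative default; the real timeout is whatever the PR/README defines.
ABORT_PULL_TIMEOUT_S = 120.0

# request_id -> (finish timestamp, KV block ids still held on the producer)
pending_pull: dict[str, tuple[float, list[int]]] = {}

def expire_unpulled_requests(free_blocks) -> None:
    """Called periodically on the producer node (e.g. once per scheduler step).

    Any request whose pull notification has not arrived within the timeout is
    assumed lost (D failed, link broken, or request aborted upstream), and its
    KV blocks are released so the producer does not stall on exhausted cache.
    """
    now = time.monotonic()
    for request_id in list(pending_pull):
        finished_at, blocks = pending_pull[request_id]
        if now - finished_at > ABORT_PULL_TIMEOUT_S:
            logger.warning(
                "Releasing KV cache for request %s: no pull notification "
                "within %.0fs; the decode node may have failed.",
                request_id, ABORT_PULL_TIMEOUT_S)
            free_blocks(blocks)
            del pending_pull[request_id]
```

The key design point is that the producer no longer relies solely on the connector message: a periodic scan with a generous timeout acts as a safety net, with a warning logged so operators can tell a normal release from a timed-out one.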

Does this PR introduce any user-facing change?

How was this patch tested?

Signed-off-by: underfituu <[email protected]>
@jianzs
Collaborator

jianzs commented Jul 29, 2025

When is this feature needed?

@underfituu underfituu closed this Aug 2, 2025
@underfituu underfituu reopened this Aug 4, 2025
@underfituu
Contributor Author

When is this feature needed?

  • Thank you for your attention. This feature is needed whenever a Node D (Decode) failure would otherwise leave Node P (Prefill) hanging with unreleased KV cache; the trigger scenarios, root-cause analysis, and proposed solution are described in the PR description above.

  • We sincerely welcome your valuable feedback on this approach.

@ganyi1996ppo ganyi1996ppo merged commit 2b97c69 into vllm-project:v0.9.1-dev Aug 5, 2025
17 checks passed
liyu119 added a commit to rjg-lyh/vllm-ascend that referenced this pull request Aug 11, 2025
…nto qwen30-dev

* 'qwen30-dev' of https://github.com/rjg-lyh/vllm-ascend:
  [V0.9.1] Replace FA ops with FA_V2 to optimize perf
  [0.9.1]remove chunked_prefill_for_mla (vllm-project#2177)
  move with_prefill allreduce from cpu to npu (vllm-project#2230)
  [v0.9.1] Add release note for v0.9.1rc2 (vllm-project#2233)
  [Docs] Sync main doc to v0.9.1-dev (vllm-project#2227)
  [0.9.1] Enable external distributed dp deployments in vllm ascend(0.9.1 only) (vllm-project#2109)
  [V0.9.1][BugFix] Fix the bug in decoraotor patch (vllm-project#2199)
  [v0.9.1][Bugfix][PD] Auto-clear producer KV cache if no pull notification (vllm-project#2085)
  [BUGFIX][0.9.1] FIX ring_mla input ‘query_lens’ to cpu (vllm-project#2170)
  [0.9.1][Prefill Perf] add D2H & initRoutingQuantV2 (vllm-project#2038)
  [bugfix] add with_prefill cpu allreduce to handle D-node recomputatio… (vllm-project#2129)