
Conversation

@lilinsiman
Contributor

What this PR does / why we need it?

Add a new e2e test case for aclgraph capture memory usage to v0.11.0.

Does this PR introduce any user-facing change?

no

How was this patch tested?

Covered by the new e2e test case added in this PR (run in CI).

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new end-to-end test to monitor the memory usage of aclgraph capturing. The test is well-structured, patching NPUModelRunner._capture_model to measure memory consumption and asserting it against baseline values for different models. My main feedback is to improve test isolation by using pytest's monkeypatch fixture for environment variable manipulation, which is more robust than manual del and assignment. This change will prevent potential side effects on other tests in the suite.
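For readers unfamiliar with the fixture, here is a minimal standalone sketch of the monkeypatch pattern referenced above; the environment variable name is hypothetical and unrelated to this PR:

import os

import pytest


def test_env_is_isolated(monkeypatch: pytest.MonkeyPatch) -> None:
    # Remove the variable for this test only; raising=False avoids a
    # KeyError when the variable is not set in the first place.
    monkeypatch.delenv("SOME_HYPOTHETICAL_VAR", raising=False)
    assert "SOME_HYPOTHETICAL_VAR" not in os.environ
    # No manual restore is needed: pytest undoes every monkeypatch
    # change at teardown, even if an assertion above fails.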

Comment on lines 37 to 100
def test_aclgraph_mem_use(model: str, max_tokens: int) -> None:
    del os.environ["VLLM_WORKER_MULTIPROC_METHOD"]
    capture_called = multiprocessing.Value("i", 0)  # int, 0 or 1
    capture_mem_before = multiprocessing.Value("q", -1)  # long long (64-bit)
    capture_mem_after = multiprocessing.Value("q", -1)  # long long

    def capture_model_wrapper(original_method):

        def wrapped(self):
            mem_before = torch.npu.mem_get_info()[0]  # free memory
            result = original_method(self)
            mem_after = torch.npu.mem_get_info()[0]
            with capture_called.get_lock():
                capture_called.value = 1
                capture_mem_before.value = mem_before
                capture_mem_after.value = mem_after
            return result

        return wrapped

    original_capture = NPUModelRunner._capture_model

    with patch.object(NPUModelRunner,
                      '_capture_model',
                      new=capture_model_wrapper(original_capture)):
        prompts = [
            "Hello, my name is", "The president of the United States is",
            "The capital of France is", "The future of AI is"
        ]
        sampling_params = SamplingParams(max_tokens=max_tokens,
                                         temperature=0.0)
        if model == "vllm-ascend/DeepSeek-V2-Lite-W8A8":
            vllm_model = LLM(snapshot_download(model),
                             max_model_len=1024,
                             quantization="ascend")
        else:
            vllm_model = LLM(snapshot_download(model))
        _ = vllm_model.generate(prompts, sampling_params)

    assert capture_called.value == 1, "_capture_model was not called during test"
    assert capture_mem_before.value != -1, "capture_mem_before not set"
    assert capture_mem_after.value != -1, "capture_mem_after not set"

    print("capture_mem_before =", capture_mem_before.value)
    print("capture_mem_after =", capture_mem_after.value)

    mem_used_by_capture = capture_mem_before.value - capture_mem_after.value
    # Empirical observation: capturing ACL graphs for Qwen3-0.6B uses ~0.20 GiB of NPU memory.
    # DeepSeek-V2-Lite-W8A8 uses ~0.64 GiB of NPU memory
    # a 1.3x tolerance is applied to account for runtime variance.
    if model == "vllm-ascend/DeepSeek-V2-Lite-W8A8":
        baseline_capture_mem = 0.64
        capture_mem_tolerance = 1.3
    else:
        baseline_capture_mem = 0.20
        capture_mem_tolerance = 1.3
    max_capture_mem_gib = baseline_capture_mem * capture_mem_tolerance
    max_mem_expected = max_capture_mem_gib * (1024**3)
    assert mem_used_by_capture < max_mem_expected, (
        f"_capture_model used more memory than expected. "
        f"Used: {mem_used_by_capture / (1024**3):.2f} GiB, "
        f"Expected: < {max_capture_mem_gib:.2f} GiB")
    os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = 'spawn'
Contributor


Severity: high

The test modifies os.environ by deleting VLLM_WORKER_MULTIPROC_METHOD and then restoring it at the end. This approach is not robust. If an assertion fails before the end of the test, the environment variable will not be restored, which can lead to test isolation issues and cause other tests to fail unpredictably.

A better approach is to use pytest's monkeypatch fixture, which automatically handles setup and teardown of environment modifications, ensuring the environment is always restored to its original state.

def test_aclgraph_mem_use(model: str, max_tokens: int, monkeypatch) -> None:
    monkeypatch.delenv("VLLM_WORKER_MULTIPROC_METHOD", raising=False)
    capture_called = multiprocessing.Value("i", 0)  # int, 0 or 1
    capture_mem_before = multiprocessing.Value("q", -1)  # long long (64-bit)
    capture_mem_after = multiprocessing.Value("q", -1)  # long long

    def capture_model_wrapper(original_method):

        def wrapped(self):
            mem_before = torch.npu.mem_get_info()[0]  # free memory
            result = original_method(self)
            mem_after = torch.npu.mem_get_info()[0]
            with capture_called.get_lock():
                capture_called.value = 1
                capture_mem_before.value = mem_before
                capture_mem_after.value = mem_after
            return result

        return wrapped

    original_capture = NPUModelRunner._capture_model

    with patch.object(NPUModelRunner,
                      '_capture_model',
                      new=capture_model_wrapper(original_capture)):
        prompts = [
            "Hello, my name is", "The president of the United States is",
            "The capital of France is", "The future of AI is"
        ]
        sampling_params = SamplingParams(max_tokens=max_tokens,
                                         temperature=0.0)
        if model == "vllm-ascend/DeepSeek-V2-Lite-W8A8":
            vllm_model = LLM(snapshot_download(model),
                             max_model_len=1024,
                             quantization="ascend")
        else:
            vllm_model = LLM(snapshot_download(model))
        _ = vllm_model.generate(prompts, sampling_params)

    assert capture_called.value == 1, "_capture_model was not called during test"
    assert capture_mem_before.value != -1, "capture_mem_before not set"
    assert capture_mem_after.value != -1, "capture_mem_after not set"

    print("capture_mem_before =", capture_mem_before.value)
    print("capture_mem_after =", capture_mem_after.value)

    mem_used_by_capture = capture_mem_before.value - capture_mem_after.value
    # Empirical observation: capturing ACL graphs for Qwen3-0.6B uses ~0.20 GiB of NPU memory.
    # DeepSeek-V2-Lite-W8A8 uses ~0.64 GiB of NPU memory
    # a 1.3x tolerance is applied to account for runtime variance.
    if model == "vllm-ascend/DeepSeek-V2-Lite-W8A8":
        baseline_capture_mem = 0.64
        capture_mem_tolerance = 1.3
    else:
        baseline_capture_mem = 0.20
        capture_mem_tolerance = 1.3
    max_capture_mem_gib = baseline_capture_mem * capture_mem_tolerance
    max_mem_expected = max_capture_mem_gib * (1024**3)
    assert mem_used_by_capture < max_mem_expected, (
        f"_capture_model used more memory than expected. "
        f"Used: {mem_used_by_capture / (1024**3):.2f} GiB, "
        f"Expected: < {max_capture_mem_gib:.2f} GiB")
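As a quick sanity check of the baselines and the 1.3x tolerance used above (illustrative only, not part of the suggested change), the per-model limits work out as follows:

# Allowed capture-memory budget = empirical baseline * tolerance.
for name, baseline_gib in [("DeepSeek-V2-Lite-W8A8", 0.64), ("Qwen3-0.6B", 0.20)]:
    limit_gib = baseline_gib * 1.3
    print(f"{name}: capture must stay below {limit_gib:.2f} GiB "
          f"({limit_gib * 1024**3:.0f} bytes)")

So the DeepSeek-V2-Lite-W8A8 run may use up to roughly 0.83 GiB during graph capture, and the default (Qwen3-0.6B) run up to roughly 0.26 GiB.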

@lilinsiman force-pushed the mem_v0.11.0 branch 2 times, most recently from 8c5a080 to 1e4b511 on October 30, 2025 10:16
@yiz-liu merged commit 387ce1c into vllm-project:v0.11.0-dev on Oct 31, 2025
13 checks passed