Add new e2e test case for aclgraph memory to v0.11.0 #3880
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request introduces a new end-to-end test to monitor the memory usage of aclgraph capturing. The test is well-structured, patching NPUModelRunner._capture_model to measure memory consumption and asserting it against baseline values for different models. My main feedback is to improve test isolation by using pytest's monkeypatch fixture for environment variable manipulation, which is more robust than manual del and assignment. This change will prevent potential side effects on other tests in the suite.
def test_aclgraph_mem_use(model: str, max_tokens: int) -> None:
    del os.environ["VLLM_WORKER_MULTIPROC_METHOD"]
    capture_called = multiprocessing.Value("i", 0)  # int, 0 or 1
    capture_mem_before = multiprocessing.Value("q", -1)  # long long (64-bit)
    capture_mem_after = multiprocessing.Value("q", -1)  # long long

    def capture_model_wrapper(original_method):

        def wrapped(self):
            mem_before = torch.npu.mem_get_info()[0]  # free memory
            result = original_method(self)
            mem_after = torch.npu.mem_get_info()[0]
            with capture_called.get_lock():
                capture_called.value = 1
                capture_mem_before.value = mem_before
                capture_mem_after.value = mem_after
            return result

        return wrapped

    original_capture = NPUModelRunner._capture_model

    with patch.object(NPUModelRunner,
                      '_capture_model',
                      new=capture_model_wrapper(original_capture)):
        prompts = [
            "Hello, my name is", "The president of the United States is",
            "The capital of France is", "The future of AI is"
        ]
        sampling_params = SamplingParams(max_tokens=max_tokens,
                                         temperature=0.0)
        if model == "vllm-ascend/DeepSeek-V2-Lite-W8A8":
            vllm_model = LLM(snapshot_download(model),
                             max_model_len=1024,
                             quantization="ascend")
        else:
            vllm_model = LLM(snapshot_download(model))
        _ = vllm_model.generate(prompts, sampling_params)

    assert capture_called.value == 1, "_capture_model was not called during test"
    assert capture_mem_before.value != -1, "capture_mem_before not set"
    assert capture_mem_after.value != -1, "capture_mem_after not set"

    print("capture_mem_before =", capture_mem_before.value)
    print("capture_mem_after =", capture_mem_after.value)

    mem_used_by_capture = capture_mem_before.value - capture_mem_after.value
    # Empirical observation: capturing ACL graphs for Qwen3-0.6B uses ~0.20 GiB of NPU memory.
    # DeepSeek-V2-Lite-W8A8 uses ~0.64 GiB of NPU memory.
    # A 1.3x tolerance is applied to account for runtime variance.
    if model == "vllm-ascend/DeepSeek-V2-Lite-W8A8":
        baseline_capture_mem = 0.64
        capture_mem_tolerance = 1.3
    else:
        baseline_capture_mem = 0.20
        capture_mem_tolerance = 1.3
    max_capture_mem_gib = baseline_capture_mem * capture_mem_tolerance
    max_mem_expected = max_capture_mem_gib * (1024**3)
    assert mem_used_by_capture < max_mem_expected, (
        f"_capture_model used more memory than expected. "
        f"Used: {mem_used_by_capture / (1024**3):.2f} GiB, "
        f"Expected: < {max_capture_mem_gib:.2f} GiB")
    os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = 'spawn'
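For readers unfamiliar with the primitive the wrapper relies on: torch.npu.mem_get_info() (the Ascend counterpart of torch.cuda.mem_get_info()) returns a (free_bytes, total_bytes) tuple reported by the device, so the delta of the free value across the call brackets the memory consumed by graph capture. The snippet below is a minimal standalone sketch of that measurement pattern, assuming torch_npu is installed and an NPU device is visible; it is not part of the test.

import torch
import torch_npu  # noqa: F401  (importing torch_npu registers the torch.npu backend)

free_before, total = torch.npu.mem_get_info()  # bytes
# Allocating a fresh buffer forces the caching allocator to reserve device memory,
# which shows up as a drop in the device-reported free memory.
buf = torch.empty(64 * 1024 * 1024, dtype=torch.uint8, device="npu")  # ~64 MiB
free_after, _ = torch.npu.mem_get_info()
print(f"reserved ~{(free_before - free_after) / (1024 ** 2):.0f} MiB "
      f"of {total / (1024 ** 3):.1f} GiB total")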
The test modifies os.environ by deleting VLLM_WORKER_MULTIPROC_METHOD and then restoring it at the end. This approach is not robust. If an assertion fails before the end of the test, the environment variable will not be restored, which can lead to test isolation issues and cause other tests to fail unpredictably.
A better approach is to use pytest's monkeypatch fixture, which automatically handles setup and teardown of environment modifications, ensuring the environment is always restored to its original state.
def test_aclgraph_mem_use(model: str, max_tokens: int, monkeypatch) -> None:
    monkeypatch.delenv("VLLM_WORKER_MULTIPROC_METHOD", raising=False)
    capture_called = multiprocessing.Value("i", 0)  # int, 0 or 1
    capture_mem_before = multiprocessing.Value("q", -1)  # long long (64-bit)
    capture_mem_after = multiprocessing.Value("q", -1)  # long long

    def capture_model_wrapper(original_method):

        def wrapped(self):
            mem_before = torch.npu.mem_get_info()[0]  # free memory
            result = original_method(self)
            mem_after = torch.npu.mem_get_info()[0]
            with capture_called.get_lock():
                capture_called.value = 1
                capture_mem_before.value = mem_before
                capture_mem_after.value = mem_after
            return result

        return wrapped

    original_capture = NPUModelRunner._capture_model

    with patch.object(NPUModelRunner,
                      '_capture_model',
                      new=capture_model_wrapper(original_capture)):
        prompts = [
            "Hello, my name is", "The president of the United States is",
            "The capital of France is", "The future of AI is"
        ]
        sampling_params = SamplingParams(max_tokens=max_tokens,
                                         temperature=0.0)
        if model == "vllm-ascend/DeepSeek-V2-Lite-W8A8":
            vllm_model = LLM(snapshot_download(model),
                             max_model_len=1024,
                             quantization="ascend")
        else:
            vllm_model = LLM(snapshot_download(model))
        _ = vllm_model.generate(prompts, sampling_params)

    assert capture_called.value == 1, "_capture_model was not called during test"
    assert capture_mem_before.value != -1, "capture_mem_before not set"
    assert capture_mem_after.value != -1, "capture_mem_after not set"

    print("capture_mem_before =", capture_mem_before.value)
    print("capture_mem_after =", capture_mem_after.value)

    mem_used_by_capture = capture_mem_before.value - capture_mem_after.value
    # Empirical observation: capturing ACL graphs for Qwen3-0.6B uses ~0.20 GiB of NPU memory.
    # DeepSeek-V2-Lite-W8A8 uses ~0.64 GiB of NPU memory.
    # A 1.3x tolerance is applied to account for runtime variance.
    if model == "vllm-ascend/DeepSeek-V2-Lite-W8A8":
        baseline_capture_mem = 0.64
        capture_mem_tolerance = 1.3
    else:
        baseline_capture_mem = 0.20
        capture_mem_tolerance = 1.3
    max_capture_mem_gib = baseline_capture_mem * capture_mem_tolerance
    max_mem_expected = max_capture_mem_gib * (1024**3)
    assert mem_used_by_capture < max_mem_expected, (
        f"_capture_model used more memory than expected. "
        f"Used: {mem_used_by_capture / (1024**3):.2f} GiB, "
        f"Expected: < {max_capture_mem_gib:.2f} GiB")
What this PR does / why we need it?
Add a new e2e test case for aclgraph memory to v0.11.0.
Does this PR introduce any user-facing change?
no
How was this patch tested?
ut