
[Bug]: vLLM 0.15.0 startup on H200 failed at deep_gemm #36718

@pymhq


Your current environment

vLLM 0.15.0 and 0.15.1 both fail to start on an H200 instance with a DeepGEMM assertion error.
Hi Team,

We're using v0.15.0, and it fails to start on an H200 instance with this DeepGEMM assertion error: RuntimeError: Assertion error (csrc/apis/../jit_kernels/impls/../../jit/kernel_runtime.hpp:45): exit_code == 0

DeepGEMM's JIT kernel crashes when vLLM runs FP8 GEMM operations (fp8_gemm_nt) during startup profiling.

Could you help take a look?
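To check whether the failure is isolated to the DeepGEMM path, one option is to relaunch with DeepGEMM switched off. The sketch below is a minimal illustration, not a confirmed fix: it assumes vLLM honors the `VLLM_USE_DEEP_GEMM` environment variable in the installed version, and the helper name `env_without_deep_gemm` and the `<model>` placeholder are hypothetical.

```python
import os


def env_without_deep_gemm(base=None):
    """Return a copy of the environment with the DeepGEMM path disabled.

    Assumption: vLLM reads VLLM_USE_DEEP_GEMM and treats "0" as "do not
    use DeepGEMM kernels". Verify this against the installed vLLM version.
    """
    env = dict(os.environ if base is None else base)
    env["VLLM_USE_DEEP_GEMM"] = "0"
    return env


# Example launch (not executed here; "<model>" is a placeholder):
#   import subprocess
#   subprocess.run(
#       ["vllm", "serve", "<model>", "--tensor-parallel-size", "8"],
#       env=env_without_deep_gemm(),
#       check=True,
#   )
```

If startup succeeds with this setting, that would point at the DeepGEMM JIT step rather than at the model or the TP=8 configuration.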

🐛 Describe the bug


Log from 0.15.1:
4:57:02 PM [algo-1-1772841836] [vllm.server] Traceback (most recent call last):
4:57:02 PM [algo-1-1772841836] [vllm.server] RuntimeError: Worker failed with error 'Assertion error (csrc/apis/../jit_kernels/impls/../../jit/kernel_runtime.hpp:45): exit_code == 0', please check the stack trace above for the root cause
4:57:03 PM [algo-1-1772841836] [vllm.server] Traceback (most recent call last):
4:57:03 PM [algo-1-1772841836] [vllm.server] RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
4:57:10 PM [algo-1-1772841836] [VLLM_STATUS] ✗ Configuration failed: VLLM: DP=1, TP=8, KV=0.7, eager
4:57:10 PM [algo-1-1772841836] [VLLM_STATUS] ✗ All vLLM configurations failed
4:57:10 PM [algo-1-1772841836] /opt/wrapper/libfarm/lib/python3.11/site-packages/watchtower/__init__.py:464: WatchtowerWarning: Received message after logging system shutdown
  warnings.warn("Received message after logging system shutdown", WatchtowerWarning)
4:57:10 PM [algo-1-1772841836] [VLLM_STATUS] Stopping server server (PID: 68574)
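The assertion path in the log is easier to read once its `..` segments are collapsed. As a small sketch (the helper name `parse_assertion` and the regular expression are my own, written only against the message format shown in the log above), the location and failed expression can be pulled out like this:

```python
import posixpath
import re

# Matches messages of the form seen in the log:
#   Assertion error (<path>:<line>): <expression>
ASSERTION_RE = re.compile(r"Assertion error \(([^():]+):(\d+)\): ([^',]+)")


def parse_assertion(log_line):
    """Return (normalized_path, line_number, expression), or None if absent."""
    match = ASSERTION_RE.search(log_line)
    if match is None:
        return None
    raw_path, line_no, expr = match.groups()
    # Collapse "a/../b" segments so the source file is easy to locate.
    return posixpath.normpath(raw_path), int(line_no), expr.strip()


line = (
    "RuntimeError: Worker failed with error 'Assertion error "
    "(csrc/apis/../jit_kernels/impls/../../jit/kernel_runtime.hpp:45): "
    "exit_code == 0', please check the stack trace above for the root cause"
)
print(parse_assertion(line))
# ('csrc/jit/kernel_runtime.hpp', 45, 'exit_code == 0')
```

The normalized path shows the assertion sits in the JIT kernel runtime source, and the failed check is on an exit code.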


Labels: bug
