vLLM inference fails on a qwen3 model quantized with modelslim #2290

@KeLomax

Description

Your current environment

The model is qwen3-0.6b; the exact quantization parameters are as follows:

python /home/ma-user/work/msit/msmodelslim/quant_qwen.py \
    --model_path /home/ma-user/work/checkpoint-906 \
    --save_directory /home/ma-user/work/Quant/output_IFD_906 \
    --device_type npu \
    --model_type qwen3 \
    --calib_file /home/ma-user/work/Quant/IFD_cali.jsonl \
    --w_bit 8 \
    --a_bit 8 \
    --group_size 256 \
    --trust_remote_code True

On a 910A, I quantized the qwen model with modelslim, and inference with vLLM then fails. I checked the model files produced by the quantization script and none are missing. Why does the error still complain about a missing quant_config?
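
A minimal diagnostic sketch, on the assumption that vLLM resolves the quantization config from the checkpoint's config.json (the Hugging Face "quantization_config" convention) rather than from the weight files themselves; paths are taken from this issue:

import json
import os

# Hedged check: the weights can all be present and the quant-config lookup
# can still fail if config.json has no "quantization_config" section.
# Note: the quantization command above saved to .../output_IFD_906, while
# the inference script below loads .../output_IFD.
model_dir = "/home/ma-user/work/Quant/output_IFD"
print(sorted(os.listdir(model_dir)))

with open(os.path.join(model_dir, "config.json")) as f:
    config = json.load(f)
print("quantization_config" in config)  # False would match the TypeError below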

The inference code is as follows:

import torch
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)

llm = LLM(model="/home/ma-user/work/Quant/output_IFD",
          max_model_len=2048,
          enforce_eager=True,
          trust_remote_code=True,
          # Enable quantization by specifying quantization="ascend"
          quantization="ascend")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

The error output is as follows:
(python-3.9.10) [ma-user work]$python /home/ma-user/work/Quant/competition_model_GPTQ.py
INFO 08-07 16:25:29 [__init__.py:30] Available plugins for group vllm.platform_plugins:
INFO 08-07 16:25:29 [__init__.py:32] name=ascend, value=vllm_ascend:register
INFO 08-07 16:25:29 [__init__.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 08-07 16:25:29 [__init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 08-07 16:25:29 [__init__.py:44] plugin ascend loaded.
INFO 08-07 16:25:29 [__init__.py:230] Platform plugin ascend is activated
WARNING:root:Warning: Failed to register custom ops, all custom ops will be disabled
INFO 08-07 16:25:31 [__init__.py:30] Available plugins for group vllm.general_plugins:
INFO 08-07 16:25:31 [__init__.py:32] name=ascend_enhanced_model, value=vllm_ascend:register_model
INFO 08-07 16:25:31 [__init__.py:34] all available plugins for group vllm.general_plugins will be loaded.
INFO 08-07 16:25:31 [__init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 08-07 16:25:31 [__init__.py:44] plugin ascend_enhanced_model loaded.
INFO 08-07 16:25:31 [patch_tritonplaceholder.py:33] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 08-07 16:25:31 [patch_tritonplaceholder.py:46] Triton is not installed. Using dummy decorators. Install it via pip install triton to enable kernel compilation.
INFO 08-07 16:25:31 [patch_tritonplaceholder.py:71] Triton module has been replaced with a placeholder.
WARNING 08-07 16:25:31 [_custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
WARNING 08-07 16:25:33 [registry.py:380] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
WARNING 08-07 16:25:33 [registry.py:380] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:CustomQwen2VLForConditionalGeneration.
WARNING 08-07 16:25:33 [registry.py:380] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 08-07 16:25:33 [registry.py:380] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
INFO 08-07 16:25:47 [config.py:689] This model supports multiple tasks: {'embed', 'generate', 'classify', 'score', 'reward'}. Defaulting to 'generate'.
WARNING 08-07 16:25:47 [config.py:768] ascend quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 08-07 16:25:47 [arg_utils.py:1742] npu is experimental on VLLM_USE_V1=1. Falling back to V0 Engine.
INFO 08-07 16:25:47 [config.py:1747] Disabled the custom all-reduce kernel because it is not supported on current platform.
Traceback (most recent call last):
File "/home/ma-user/work/Quant/competition_model_GPTQ.py", line 242, in
comp = Competition()
File "/home/ma-user/work/Quant/competition_model_GPTQ.py", line 35, in init
self.llm = LLM(
File "/home/ma-user/work/requirements/vllm/vllm/utils.py", line 1099, in inner
return fn(*args, **kwargs)
File "/home/ma-user/work/requirements/vllm/vllm/entrypoints/llm.py", line 248, in init
self.llm_engine = LLMEngine.from_engine_args(
File "/home/ma-user/work/requirements/vllm/vllm/engine/llm_engine.py", line 515, in from_engine_args
vllm_config = engine_args.create_engine_config(usage_context)
File "/home/ma-user/work/requirements/vllm/vllm/engine/arg_utils.py", line 1335, in create_engine_config
config = VllmConfig(
File "", line 19, in init
File "/home/ma-user/work/requirements/vllm/vllm/config.py", line 3709, in post_init
self.quant_config = VllmConfig._get_quantization_config(
File "/home/ma-user/work/requirements/vllm/vllm/config.py", line 3651, in _get_quantization_config
quant_config = get_quant_config(model_config, load_config)
File "/home/ma-user/work/requirements/vllm/vllm/model_executor/model_loader/weight_utils.py", line 195, in get_quant_config
return quant_cls()
TypeError: init() missing 1 required positional argument: 'quant_config'
[ERROR] 2025-08-07-16:25:48 (PID:258515, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception
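
For context, the failing frame corresponds roughly to the following logic in vllm/model_executor/model_loader/weight_utils.py. This is a simplified paraphrase based on the traceback and the vLLM source of that era, not a verbatim quote; consult the installed vLLM for the exact code:

from vllm.model_executor.layers.quantization import get_quantization_config

# Simplified paraphrase of vLLM's get_quant_config (weight_utils.py).
def get_quant_config(model_config, load_config):
    quant_cls = get_quantization_config(model_config.quantization)
    # 1. Prefer the "quantization_config" section of the checkpoint's config.json.
    hf_quant_config = getattr(model_config.hf_config, "quantization_config", None)
    if hf_quant_config is not None:
        return quant_cls.from_config(hf_quant_config)
    # 2. Otherwise look for the method's own config files in the model directory.
    if not quant_cls.get_config_filenames():
        # 3. Fall back to a no-argument constructor -- the traceback's line 195.
        #    AscendQuantConfig.__init__ requires a quant_config dict, hence:
        #    TypeError: __init__() missing 1 required positional argument.
        return quant_cls()
    ...  # file-based lookup omitted

If that reading is right, the presence of the weight files is beside the point: vLLM never opens them, because it finds no quantization_config section in config.json and the ascend method advertises no config filenames, so it falls back to the no-argument constructor and fails.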

How would you like to use vllm on ascend

I want to run inference of a [specific model](put link here). I don't know how to integrate it with vllm.
