[WIP] Enable Ascend NPU Backend with Custom Ops Integration for NF4 Support #1695

SlightwindSec · 2025-07-02T02:33:02Z

What does this PR do?

This PR ports the Ascend NPU backend changes from the multi-backend-refactor branch and integrates them with custom ops. It enables the Ascend build and translates kernels and ops to Ascend-compatible operators. Because the high-performance AscendC-based NF4 implementation is still in progress, a temporary PyTorch implementation is used for now. From the user's standpoint, the build steps are unchanged.
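For readers unfamiliar with what the temporary NF4 path has to compute, here is a minimal plain-Python reference sketch of blockwise NF4 quantization and dequantization. It is not the code from this PR: the function names, the default block size of 64, and the "first element in the high nibble" packing order are assumptions made for illustration, and the codebook values are the 16 NF4 levels as I understand bitsandbytes to define them.

```python
# 16-level NF4 codebook (assumed to match the bitsandbytes NF4 data type).
NF4_LEVELS = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]


def quantize_nf4(values, blocksize=64):
    """Return (packed bytes, per-block absmax) for a flat list of floats."""
    absmax, codes = [], []
    for start in range(0, len(values), blocksize):
        block = values[start:start + blocksize]
        # Scale each block by its largest magnitude; use 1.0 for an all-zero block.
        scale = max(abs(v) for v in block) or 1.0
        absmax.append(scale)
        for v in block:
            x = v / scale
            # Pick the index of the nearest codebook level.
            codes.append(min(range(16), key=lambda i: abs(NF4_LEVELS[i] - x)))
    if len(codes) % 2:
        codes.append(0)  # pad so the codes fill whole bytes; never read back
    # Assumed packing: even-indexed element in the high nibble.
    packed = bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))
    return packed, absmax


def dequantize_nf4(packed, absmax, n, blocksize=64):
    """Recover n floats: codebook lookup times the scale of the owning block."""
    out = []
    for i in range(n):
        byte = packed[i // 2]
        code = (byte >> 4) if i % 2 == 0 else (byte & 0x0F)
        out.append(NF4_LEVELS[code] * absmax[i // blocksize])
    return out
```

Note that every read is indexed by `i // 2` (packed data) or `i // blocksize` (scales), so a kernel implementing the same arithmetic needs at least `ceil(n / 2)` packed bytes and `ceil(n / blocksize)` scale entries.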

Collaborators

@ji-huazhong @Ginray @Runningwater23
cc @Titus-von-Koeller @matthewdouglas @amathews-amd @sunway513

Signed-off-by: SlightwindSec <[email protected]>

unlizi · 2025-08-07T11:00:37Z

Error Summary:

A vector core execution failure occurred on an Ascend 910B3 NPU while running inference with the NF4-quantized Qwen-image model. The NPU reported out-of-range DDR address errors (error code 0x800000) on 14 vector cores during execution of the dequantize_blockwise_fp32_nf4_1 kernel. Because NPU kernels are launched asynchronously, the failure surfaced as an ACL stream synchronization error (code 507035) at the next synchronization point, a tensor device transfer (pos_freqs.to(device)).

Technical Breakdown:

- Hardware level: cores 5-15 and 20-22 raised MTE (Memory Transfer Engine) faults, indicating an out-of-range DDR address access
- Software stack:
  - Framework: PyTorch 2.6.1 + torch-npu 2.6.1
  - CANN: 8.1.RC1
  - Failing operation: NF4 dequantization kernel
- Error progression: out-of-range memory access → vector core exceptions → stream synchronization failure
- The error suggests a possible memory alignment issue, or an incompatibility between the NPU microarchitecture and the quantization pattern
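One inexpensive way to test the alignment hypothesis is to compare the buffer sizes handed to the kernel against what a blockwise NF4 layout requires. The sketch below is a hypothetical helper, not part of bitsandbytes or this PR; it assumes two 4-bit codes per byte and one absmax scale per block, and it flags a partial trailing block only as something worth inspecting, since tail handling is a common source of out-of-range accesses.

```python
def expected_nf4_buffers(n_elements, blocksize=64):
    """Sizes a blockwise NF4 tensor needs: (packed bytes, absmax entries)."""
    packed_bytes = (n_elements + 1) // 2                  # two 4-bit codes per byte
    n_blocks = (n_elements + blocksize - 1) // blocksize  # one scale per block
    return packed_bytes, n_blocks


def check_nf4_buffers(n_elements, packed_len, absmax_len, blocksize=64):
    """Return a list of findings; an empty list means the sizes are consistent."""
    want_packed, want_blocks = expected_nf4_buffers(n_elements, blocksize)
    findings = []
    if packed_len < want_packed:
        findings.append(f"packed buffer too small: {packed_len} < {want_packed} bytes")
    if absmax_len < want_blocks:
        findings.append(f"absmax too small: {absmax_len} < {want_blocks} entries")
    if n_elements % blocksize:
        tail = n_elements % blocksize
        findings.append(f"last block is partial ({tail} of {blocksize} elements)")
    return findings
```

Calling `check_nf4_buffers(weight_numel, packed.numel(), absmax.numel())` for each quantized layer (the argument names are placeholders) would show whether any layer of the model has a shape that does not divide evenly into blocks.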

(MindSpore) [ma-user qwen-image]$python nf4_inference.py
/home/ma-user/anaconda3/envs/MindSpore/lib/python3.10/site-packages/torch_npu/utils/collect_env.py:58: UserWarning: Warning: The /usr/local/Ascend/ascend-toolkit/latest owner does not match the current owner.
warnings.warn(f"Warning: The {path} owner does not match the current owner.")
/home/ma-user/anaconda3/envs/MindSpore/lib/python3.10/site-packages/torch_npu/utils/collect_env.py:58: UserWarning: Warning: The /usr/local/Ascend/ascend-toolkit/8.1.RC1/aarch64-linux/ascend_toolkit_install.info owner does not match the current owner.
warnings.warn(f"Warning: The {path} owner does not match the current owner.")
The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable.
Loading pipeline components...: 0%| | 0/5 [00:00<?, ?it/s]/home/ma-user/anaconda3/envs/MindSpore/lib/python3.10/site-packages/torch_npu/utils/storage.py:38: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
if self.device.type != 'cpu':
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:04<00:00, 2.30s/it]
Loading pipeline components...: 40%|████████████████████████████████████████▍ | 2/5 [00:20<00:25, 8.42s/it]The config attributes {'pooled_projection_dim': 768} were passed to QwenImageTransformer2DModel, but are not expected and will be ignored. Please verify your config.json configuration file.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:40<00:00, 13.49s/it]
Loading pipeline components...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [01:03<00:00, 12.79s/it]
0%| | 0/50 [00:00<?, ?it/s][W807 18:53:28.145560476 compiler_depend.ts:57] Warning: EZ9999: Inner Error!
EZ9999: [PID: 624026] 2025-08-07-18:53:28.091.327 The error from device(chipId:6, dieId:0), serial number is 232, there is an aivec error exception, core id is 5, error code = 0x800000, dump info: pc start: 0x124837587968, current: 0x124837588284, vec error info: 0x5c00003ff6, mte error info: 0x200600006b, ifu error info: 0x2000111ee1ac0, ccu error info: 0xe150ca367a800075, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124100433000.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1434]
TraceBack (most recent call last):
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x600006b, fixp_error1 info: 0x20, fsmId:0, tslot:4, thread:0, ctxid:0, blk:26, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1446]
The error from device(chipId:6, dieId:0), serial number is 232, there is an aivec error exception, core id is 10, error code = 0x800000, dump info: pc start: 0x124837587968, current: 0x124837588284, vec error info: 0x6a00000075, mte error info: 0x200600006b, ifu error info: 0x2000111e36a00, ccu error info: 0x7836503b7a800075, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124100433000.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1434]
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x600006b, fixp_error1 info: 0x20, fsmId:0, tslot:4, thread:0, ctxid:0, blk:31, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1446]
The error from device(chipId:6, dieId:0), serial number is 232, there is an aivec error exception, core id is 11, error code = 0x800000, dump info: pc start: 0x124837587968, current: 0x124837588284, vec error info: 0x5c00003fc9, mte error info: 0x200600006b, ifu error info: 0x2000111efbc00, ccu error info: 0x7836503b7a800075, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124100433000.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1434]
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x600006b, fixp_error1 info: 0x20, fsmId:0, tslot:4, thread:0, ctxid:0, blk:32, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1446]
The error from device(chipId:6, dieId:0), serial number is 232, there is an aivec error exception, core id is 12, error code = 0x800000, dump info: pc start: 0x124837587968, current: 0x124837588284, vec error info: 0x7300000064, mte error info: 0x200600006b, ifu error info: 0x2000111e75500, ccu error info: 0xc27201af7a800075, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124100433000.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1434]
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x600006b, fixp_error1 info: 0x20, fsmId:0, tslot:4, thread:0, ctxid:0, blk:33, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1446]
The error from device(chipId:6, dieId:0), serial number is 232, there is an aivec error exception, core id is 15, error code = 0x800000, dump info: pc start: 0x124837587968, current: 0x124837588284, vec error info: 0x680000016f, mte error info: 0x200600006b, ifu error info: 0x2000111ee30c0, ccu error info: 0x8a6d50d07a800075, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124100433000.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1434]
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x600006b, fixp_error1 info: 0x20, fsmId:0, tslot:4, thread:0, ctxid:0, blk:36, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1446]
The error from device(chipId:6, dieId:0), serial number is 232, there is an aivec error exception, core id is 20, error code = 0x800000, dump info: pc start: 0x124837587968, current: 0x124837588284, vec error info: 0x680000016f, mte error info: 0x200600006b, ifu error info: 0x2000111edab40, ccu error info: 0x7629795c7a800075, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124100433000.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1434]
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x600006b, fixp_error1 info: 0x20, fsmId:0, tslot:4, thread:0, ctxid:0, blk:37, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1446]
The error from device(chipId:6, dieId:0), serial number is 232, there is an aivec error exception, core id is 21, error code = 0x800000, dump info: pc start: 0x124837587968, current: 0x124837588284, vec error info: 0x5c00003fc9, mte error info: 0x200600006b, ifu error info: 0x2000111ed4ac0, ccu error info: 0x847cc0897a800075, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124100433000.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1434]
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x600006b, fixp_error1 info: 0x20, fsmId:0, tslot:4, thread:0, ctxid:0, blk:38, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1446]
The error from device(chipId:6, dieId:0), serial number is 232, there is an aivec error exception, core id is 22, error code = 0x800000, dump info: pc start: 0x124837587968, current: 0x124837588284, vec error info: 0x730000007e, mte error info: 0x200600006b, ifu error info: 0x2000111d982c0, ccu error info: 0x7836503b1a8000de, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124100433000.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1434]
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x600006b, fixp_error1 info: 0x20, fsmId:0, tslot:4, thread:0, ctxid:0, blk:39, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1446]
The error from device(chipId:6, dieId:0), serial number is 232, there is an aivec error exception, core id is 6, error code = 0x800000, dump info: pc start: 0x124837587968, current: 0x124837588284, vec error info: 0x5c00003fc9, mte error info: 0x200600006b, ifu error info: 0x2000111e40d40, ccu error info: 0x127070cc2b0000b1, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124100433000.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1434]
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x600006b, fixp_error1 info: 0x20, fsmId:1, tslot:4, thread:0, ctxid:0, blk:27, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1446]
The error from device(chipId:6, dieId:0), serial number is 232, there is an aivec error exception, core id is 7, error code = 0x800000, dump info: pc start: 0x124837587968, current: 0x124837588284, vec error info: 0x5c00003ff6, mte error info: 0x200600006b, ifu error info: 0x2000111eb3600, ccu error info: 0xd7a244127a800075, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124100433000.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1434]
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x600006b, fixp_error1 info: 0x20, fsmId:1, tslot:4, thread:0, ctxid:0, blk:28, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1446]
The error from device(chipId:6, dieId:0), serial number is 232, there is an aivec error exception, core id is 8, error code = 0x800000, dump info: pc start: 0x124837587968, current: 0x124837588284, vec error info: 0x6a00000075, mte error info: 0x200600006b, ifu error info: 0x2000111eb8e00, ccu error info: 0x6cf8f06c7a800075, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124100433000.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1434]
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x600006b, fixp_error1 info: 0x20, fsmId:1, tslot:4, thread:0, ctxid:0, blk:29, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1446]
The error from device(chipId:6, dieId:0), serial number is 232, there is an aivec error exception, core id is 9, error code = 0x800000, dump info: pc start: 0x124837587968, current: 0x124837588284, vec error info: 0x6a00000075, mte error info: 0x200600006b, ifu error info: 0x2000111eaef80, ccu error info: 0x7629795c7a800075, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124100433000.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1434]
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x600006b, fixp_error1 info: 0x20, fsmId:1, tslot:4, thread:0, ctxid:0, blk:30, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1446]
The error from device(chipId:6, dieId:0), serial number is 232, there is an aivec error exception, core id is 13, error code = 0x800000, dump info: pc start: 0x124837587968, current: 0x124837588284, vec error info: 0x5c00003ff6, mte error info: 0x200600006b, ifu error info: 0x2000111eabb80, ccu error info: 0x6c13551b7a800075, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124100433000.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1434]
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x600006b, fixp_error1 info: 0x20, fsmId:1, tslot:4, thread:0, ctxid:0, blk:34, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1446]
The error from device(chipId:6, dieId:0), serial number is 232, there is an aivec error exception, core id is 14, error code = 0x800000, dump info: pc start: 0x124837587968, current: 0x124837588284, vec error info: 0x5c00003fc9, mte error info: 0x200600006b, ifu error info: 0x2000111ed8900, ccu error info: 0x244388b7a800075, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124100433000.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1434]
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x600006b, fixp_error1 info: 0x20, fsmId:1, tslot:4, thread:0, ctxid:0, blk:35, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1446]
Kernel task happen error, retCode=0x31, [vector core exception].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1366]
AIV Kernel happen error, retCode=0x31.[FUNC:GetError][FILE:stream.cc][LINE:1119]
Aicore kernel execute failed, device_id=0, stream_id=2, report_stream_id=2, task_id=6348, flip_num=0, fault kernel_name=dequantize_blockwise_fp32_nf4_1, fault kernel info ext=none, program id=0, hash=12067931037022988496.[FUNC:GetError][FILE:stream.cc][LINE:1119]
[AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1119]
rtStreamSynchronize execute failed, reason=[vector core exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
synchronize stream failed, runtime result = 507035[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
(function copy_between_host_and_device_opapi)
0%| | 0/50 [00:01<?, ?it/s]
Traceback (most recent call last):
File "/home/ma-user/work/qwen-image/nf4_inference.py", line 45, in <module>
image = pipe(
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.10/site-packages/diffusers/pipelines/qwenimage/pipeline_qwenimage.py", line 655, in __call__
noise_pred = self.transformer(
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.10/site-packages/diffusers/models/transformers/transformer_qwenimage.py", line 594, in forward
image_rotary_emb = self.pos_embed(img_shapes, txt_seq_lens, device=hidden_states.device)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.10/site-packages/diffusers/models/transformers/transformer_qwenimage.py", line 202, in forward
self.pos_freqs = self.pos_freqs.to(device)
RuntimeError: ACL stream synchronize failed, error code:507035
[W807 18:53:28.156241503 compiler_depend.ts:526] Warning: NPU warning, error code is 507035[Error]:
[Error]: The vector core execution is abnormal.
Rectify the fault based on the error information in the ascend log.
EH9999: Inner Error!
rtDeviceSynchronizeWithTimeout execute failed, reason=[vector core exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
EH9999: [PID: 624026] 2025-08-07-18:53:28.111.927 wait for compute device to finish failed, runtime result = 507035.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
TraceBack (most recent call last):
(function npuSynchronizeUsedDevices)
[W807 18:53:28.157702343 compiler_depend.ts:508] Warning: NPU warning, error code is 507035[Error]:
[Error]: The vector core execution is abnormal.
Rectify the fault based on the error information in the ascend log.
EH9999: Inner Error!
rtDeviceSynchronizeWithTimeout execute failed, reason=[vector core exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
EH9999: [PID: 624026] 2025-08-07-18:53:28.113.595 wait for compute device to finish failed, runtime result = 507035.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
TraceBack (most recent call last):
(function npuSynchronizeDevice)
[W807 18:53:28.158988663 compiler_depend.ts:151] Warning: NPU warning, error code is 507035[Error]:
[Error]: The vector core execution is abnormal.
Rectify the fault based on the error information in the ascend log.
EH9999: Inner Error!
rtDeviceSynchronizeWithTimeout execute failed, reason=[vector core exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
EH9999: [PID: 624026] 2025-08-07-18:53:28.114.891 wait for compute device to finish failed, runtime result = 507035.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
TraceBack (most recent call last):
(function empty_cache)
[ERROR] 2025-08-07-18:53:33 (PID:624026, Device:0, RankID:-1) ERR99999 UNKNOWN applicaiton exception
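Since the log repeats the same fault for every core, a short script can condense it. The following is a sketch that assumes the message wording seen in the log above ("aivec error exception, core id is N" and "fault kernel_name=..."); the function name and the returned dictionary keys are illustrative choices, and other CANN versions may phrase these lines differently.

```python
import re

# Patterns are based solely on the wording in the log above.
CORE_RE = re.compile(
    r"aivec error exception, core id is (\d+), error code = (0x[0-9a-fA-F]+)"
)
KERNEL_RE = re.compile(r"fault kernel_name=([^,\s]+)")


def summarize_ascend_log(text):
    """Collect distinct faulting core ids, error codes, and kernel names."""
    matches = list(CORE_RE.finditer(text))
    return {
        "cores": sorted({int(m.group(1)) for m in matches}),
        "error_codes": sorted({m.group(2) for m in matches}),
        "kernels": sorted(set(KERNEL_RE.findall(text))),
    }
```

Applied to the log above, this should report 14 distinct core ids (5-15 and 20-22), a single error code 0x800000, and a single faulting kernel, dequantize_blockwise_fp32_nf4_1.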

SlightwindSec and others added 5 commits July 1, 2025 11:12

Add NPU-compatible NF4 quantization and dequantization support

bacf273

Signed-off-by: SlightwindSec <[email protected]>

Fix ascend npu op compile error.

3096895

Signed-off-by: SlightwindSec <[email protected]>

Update NPU installation guide.

c0cb673

Merge branch 'main' into upstream_main_npu_enabled

fe43b40

fix(cmake): correct error message from "HIP" to "NPU" for macOS unsupported notice

d9e152f

SlightwindSec changed the title ~~Enable Ascend NPU Backend with Custom Ops Integration for NF4 Support~~ [WIP] Enable Ascend NPU Backend with Custom Ops Integration for NF4 Support Jul 4, 2025

matthewdouglas self-requested a review July 8, 2025 16:41

matthewdouglas self-assigned this Jul 8, 2025

matthewdouglas added the Ascend NPU Related to Ascend NPU backend label Jul 8, 2025
