
Conversation

@ranzhejiang (Contributor) commented Sep 1, 2025

@czhu15 @yangulei @Wei-Lin-Intel please help to review, thanks a lot.
1. Set the env var PT_HPU_MOE_CHUNK for chunked MoE; it defines the sequence of candidate chunk sizes.
2. Add the env var PT_HPU_MOE_TOKEN_BOUNDARY for chunked MoE; it determines which chunk size is selected for a given number of tokens.

For example:

  1. Assume PT_HPU_MOE_TOKEN_BOUNDARY is [64,128,1536,1736,2048,3072,4096], PT_HPU_MOE_CHUNK is [64,128,512,1024,1536,2048,4096]
  2. When the token count is 1025, we first find the matching interval in PT_HPU_MOE_TOKEN_BOUNDARY: since 128 < 1025 ≤ 1536, the interval (128, 1536] is at index 2. We then take the value at index 2 in PT_HPU_MOE_CHUNK, which is 512, so the chunk size is set to 512 (see the sketch below).
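Below is a minimal sketch of the selection logic described above. The function and helper names (select_chunk_size, _parse_int_list) and the env-var parsing are illustrative assumptions, not the actual implementation in this PR:

```python
import os
import bisect

def _parse_int_list(env_name, default):
    """Parse a comma-separated env var such as "64,128,512" into a list of ints."""
    value = os.environ.get(env_name)
    if value is None:
        return default
    return [int(x) for x in value.split(",")]

def select_chunk_size(num_tokens):
    """Pick a MoE chunk size based on which boundary interval num_tokens falls into."""
    boundaries = _parse_int_list("PT_HPU_MOE_TOKEN_BOUNDARY",
                                 [64, 128, 1536, 1736, 2048, 3072, 4096])
    chunks = _parse_int_list("PT_HPU_MOE_CHUNK",
                             [64, 128, 512, 1024, 1536, 2048, 4096])
    # bisect_left returns the first index i with boundaries[i] >= num_tokens,
    # i.e. the right-inclusive interval (boundaries[i-1], boundaries[i]] that
    # contains num_tokens.
    idx = bisect.bisect_left(boundaries, num_tokens)
    idx = min(idx, len(chunks) - 1)  # clamp for token counts above the last boundary
    return chunks[idx]

# Example from the PR description: 128 < 1025 <= 1536 -> index 2 -> chunk size 512.
assert select_chunk_size(1025) == 512
```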

Wei-Lin-Intel and others added 7 commits August 18, 2025 09:38
* enable chunk moe

* add fix

* fix wrong name
Summary
This PR fixes the scaling issue for models like Hunyuan:

- w2_scale_fp8 is provided as a scalar, but should be expanded to match the per-channel size.
- w13_scale_fp8 is given in a combined form (two values for W1/W3) and needs to be reshaped and repeated to the correct size for per-channel quantization.
- Ensures that w2_input_scale is stored as a list (one per expert) instead of a single tensor (see the sketch below).
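For illustration, a hedged sketch of the kind of scale normalization described above; the function name, argument names, and shape conventions are assumptions and not the PR's actual code:

```python
import torch

def normalize_moe_scales(w2_scale_fp8, w13_scale_fp8, w2_input_scale,
                         num_experts, intermediate_size, hidden_size):
    """Illustrative normalization of FP8 MoE scales (all names here are assumptions)."""
    if w2_scale_fp8.numel() == 1:
        # Broadcast the scalar W2 scale to one value per output channel.
        w2_scale_fp8 = w2_scale_fp8.expand(hidden_size).contiguous()

    if w13_scale_fp8.numel() == 2:
        # Two combined values (one for W1, one for W3): repeat each across its
        # intermediate_size channels so the result has 2 * intermediate_size entries.
        w13_scale_fp8 = (w13_scale_fp8.reshape(2, 1)
                         .repeat(1, intermediate_size)
                         .reshape(-1))

    if isinstance(w2_input_scale, torch.Tensor) and w2_input_scale.dim() == 0:
        # Keep one input scale per expert as a list rather than a single tensor.
        w2_input_scale = [w2_input_scale.clone() for _ in range(num_experts)]

    return w2_scale_fp8, w13_scale_fp8, w2_input_scale
```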
* add calibration and conversion for GLM-4.5 fp8 models

* set VLLM_DISABLE_MARK_SCALES_AS_CONST=true for scale_format=const

* add conversion scripts for GLM-4.5 fp8 models

* use torch.finfo for fp8 max
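Regarding the last commit above, a small hedged snippet of reading the fp8 max via torch.finfo rather than hard-coding it; the specific fp8 dtype (float8_e4m3fn) is an assumption:

```python
import torch

# float8_e4m3fn is assumed here; its finfo max is 448.0.
FP8_MAX = torch.finfo(torch.float8_e4m3fn).max
```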
@ranzhejiang ranzhejiang changed the title [aice/v.1.22] refactor chunk size code [WIP] [aice/v.1.22] refactor chunk size code Sep 1, 2025
@ranzhejiang ranzhejiang changed the title [WIP] [aice/v.1.22] refactor chunk size code [aice/v.1.22] refactor chunk size code Sep 1, 2025
@czhu15 commented Sep 1, 2025

What is the relationship between
"PT_HPU_MOE_CHUNK", "64,128,512,1024,1536,2048,4096"
and
"PT_HPU_MOE_TOKEN_BOUNDARY", "64,64,1536,1536,2048,2048,4096"?
It would be good if you could give an example explanation.

@ranzhejiang (Contributor, Author)

> What is the relationship between "PT_HPU_MOE_CHUNK", "64,128,512,1024,1536,2048,4096" and "PT_HPU_MOE_TOKEN_BOUNDARY", "64,64,1536,1536,2048,2048,4096"? It would be good if you could give an example explanation.

Updated in the PR description.

@czhu15 commented Sep 18, 2025

@ranzhejiang, please let me know if this PR is still valid. If not, please close it, or resolve the conflicts. Thanks!
