[aice/v.1.22] refactor chunk size code #354
base: aice/v1.22.0
Conversation
* enable chunk moe
* add fix
* fix wrong name
Summary

This PR fixes the scaling issue for models like Hunyuan:

* `w2_scale_fp8` is provided as a scalar, but should be expanded to match the per-channel size.
* `w13_scale_fp8` is given in a combined form (two values, one for W1 and one for W3) and needs to be reshaped and repeated to the correct size for per-channel quantization.
* Ensures that `w2_input_scale` is stored as a list (one entry per expert) instead of a single tensor.
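The three scale fixes above can be sketched as follows. This is an illustrative snippet, not code from the PR: the shapes, variable names, and NumPy (standing in for torch tensors) are assumptions chosen to show the broadcasting and repetition pattern.

```python
import numpy as np

# Hypothetical sizes for illustration only (not from the PR).
num_experts, hidden = 4, 8

# Fix 1: w2_scale_fp8 arrives as a scalar, but per-channel
# quantization needs one scale per output channel -> expand it.
w2_scale_scalar = np.float32(0.5)
w2_scale = np.full(hidden, w2_scale_scalar, dtype=np.float32)  # (hidden,)

# Fix 2: w13_scale_fp8 arrives combined as two values (W1, W3).
# Repeat each value across its half of the fused W13 channels.
w13_scale_combined = np.array([0.25, 0.75], dtype=np.float32)
w13_scale = np.repeat(w13_scale_combined, hidden)              # (2*hidden,)

# Fix 3: w2_input_scale should be a list with one entry per
# expert, not a single shared tensor.
w2_input_scale = [np.float32(1.0) for _ in range(num_experts)]
```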
* add calibration and conversion for GLM-4.5 fp8 models
* set VLLM_DISABLE_MARK_SCALES_AS_CONST=true for scale_format=const
* add conversion scripts for GLM-4.5 fp8 models
* use torch.finfo for fp8 max
what's the relationship between
Updated in PR description
@ranzhejiang, please let me know whether this PR is still valid. If not, please close it; otherwise, resolve the conflict. Thanks!
@czhu15 @yangulei @Wei-Lin-Intel please help to review, thanks a lot.
1. Set env `PT_HPU_MOE_CHUNK` for chunk moe, which determines the chunk size sequence.
2. Add env `PT_HPU_MOE_TOKEN_BOUNDARY` for chunk moe, which helps select the chunk size for different token counts.

For example: `PT_HPU_MOE_TOKEN_BOUNDARY` is [64, 128, 1536, 1736, 2048, 3072, 4096] and `PT_HPU_MOE_CHUNK` is [64, 128, 512, 1024, 1536, 2048, 4096]. If `token_number` is 1025, we first look for the interval in `PT_HPU_MOE_TOKEN_BOUNDARY`. We find that 128 < 1025 ≤ 1536, so we select the index of the interval (128, 1536], which is 2. Next, we take the value at that index in `PT_HPU_MOE_CHUNK`, which is 512. Therefore, the chunk size is set to 512.
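The lookup described above can be sketched in a few lines. This is a minimal illustration of the interval-to-chunk mapping, not the PR's actual implementation; the function name and use of `bisect` are assumptions.

```python
import bisect

def select_chunk_size(token_number, boundaries, chunks):
    """Pick the chunk size whose boundary interval contains token_number.

    boundaries define half-open intervals (prev, b]; bisect_left finds
    the first boundary >= token_number, i.e. the interval's index.
    """
    idx = bisect.bisect_left(boundaries, token_number)
    return chunks[idx]

boundaries = [64, 128, 1536, 1736, 2048, 3072, 4096]
chunks = [64, 128, 512, 1024, 1536, 2048, 4096]

# 128 < 1025 <= 1536 -> interval index 2 -> chunk size 512
print(select_chunk_size(1025, boundaries, chunks))  # -> 512
```

Note that an exact boundary hit (e.g. `token_number == 128`) falls into the lower interval (64, 128], matching the inclusive upper bound in the description.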