
support GLM-4.5 MoE models #15026


Closed

wants to merge 19 commits into from

Conversation

ddh0
Contributor

@ddh0 ddh0 commented Aug 2, 2025

GLM-4.5 is a pair of Mixture-of-Experts LLMs released by Zhipu / Z.ai. They are highly interesting for local inference given their size and the performance they have shown so far. If successful, this PR would close #14921. For additional context, see #14939.

GLM-4.5 model info (GLM4_MOE)

common info

  • head dim: 128
  • hidden activation function: SiLU
  • partial rotary factor: 0.5 (see the sketch after this list)
  • RMS norm ε: 1e-5
  • embeddings are not tied
  • RoPE θ: 1000000.0 (1M)
  • no funny RoPE scaling ❌
  • n_shared_experts (always active): 1
  • num_experts_per_tok: 8
  • uses top-k prob normalization (for expert selection) ✅
  • num_attention_heads: 96
  • num_key_value_heads: 8
  • GQA factor: 12x
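
To make the partial-rotary and GQA numbers concrete, here is a minimal NumPy sketch (my own illustration, not code from this PR, NeoX-style pairing assumed): a partial rotary factor of 0.5 means RoPE rotates only the first 64 of the 128 head dimensions (n_rot = 64 in llama.cpp terms), and the 96 query heads share the 8 KV heads in groups of 12.

```python
import numpy as np

head_dim = 128
partial_rotary_factor = 0.5
rot_dim = int(head_dim * partial_rotary_factor)  # 64 dims get RoPE, the rest pass through

n_head, n_kv_head = 96, 8
assert n_head // n_kv_head == 12                 # each KV head serves 12 query heads

def partial_rope(x: np.ndarray, pos: int, theta: float = 1e6) -> np.ndarray:
    """Rotate only the first rot_dim dims of one head vector; leave the rest as-is."""
    out = x.copy()
    half = rot_dim // 2
    angles = pos * theta ** (-np.arange(half) / half)   # per-pair rotation angles
    x1, x2 = x[:half], x[half:rot_dim]
    out[:half]        = x1 * np.cos(angles) - x2 * np.sin(angles)
    out[half:rot_dim] = x1 * np.sin(angles) + x2 * np.cos(angles)
    return out                                          # dims 64..127 are untouched

q_head = partial_rope(np.random.randn(head_dim), pos=42)
```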

GLM-4.5-Air

  • model card
  • config.json
  • model size: 106B-A12B
  • hidden size: 4096
  • intermediate size (FFN): 10944
  • max_position_embeddings (ctx length): 131072
  • moe_intermediate_size (expert size): 1408
  • n_routed_experts (conditional experts): 128
  • routed_scaling_factor: 1.0
  • first_k_dense_replace (number of dense layers at the start of the model): 1
  • num_hidden_layers (total number of hidden layers including dense and MoE): 46
  • no QK normalization ❌

GLM-4.5

  • model card
  • config.json
  • model size: 355B-A32B
  • hidden size: 5120
  • intermediate size (FFN): 12288
  • max_position_embeddings (ctx length): 131072
  • moe_intermediate_size (expert size): 1536
  • n_routed_experts (conditional experts): 160
  • routed_scaling_factor: 2.5
  • first_k_dense_replace (number of dense layers at the start of the model): 3
  • num_hidden_layers (total number of hidden layers including dense and MoE): 92
  • uses QK normalization ✅

in 🤗 transformers

  • implementation
  • the following components of the Glm4Moe model can be implemented as subclasses of DeepseekV3 components:
    • Glm4MoeModel
    • Glm4MoeMLP
    • Glm4MoeTopkRouter
    • Glm4MoeRMSNorm
    • Glm4MoeDecoderLayer
    • Glm4MoePreTrainedModel
    • Glm4MoeForCausalLM <-- this is what's in the config.json on HF
  • Glm4MoeAttention can be implemented as a subclass of CohereAttention and nn.Module (looks pretty standard; see the sketch below)
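
As a rough illustration of the attention difference between the two models, the QK normalization that full GLM-4.5 uses (and Air skips) amounts to an RMSNorm applied per head to Q and K before RoPE. A hedged PyTorch sketch follows (my own, not the transformers source; projection biases and other details are elided):

```python
import torch
import torch.nn as nn

class QKNormAttentionSketch(nn.Module):
    """Per-head RMSNorm on Q and K before RoPE (present in GLM-4.5, absent in Air)."""
    def __init__(self, hidden=5120, n_head=96, n_kv_head=8, head_dim=128):
        super().__init__()
        self.n_head, self.n_kv_head, self.head_dim = n_head, n_kv_head, head_dim
        self.q_proj = nn.Linear(hidden, n_head * head_dim, bias=False)
        self.k_proj = nn.Linear(hidden, n_kv_head * head_dim, bias=False)
        self.q_norm = nn.RMSNorm(head_dim, eps=1e-5)  # maps to attn_q_norm in the GGUF
        self.k_norm = nn.RMSNorm(head_dim, eps=1e-5)  # maps to attn_k_norm in the GGUF

    def forward(self, x: torch.Tensor):
        b, s, _ = x.shape
        q = self.q_norm(self.q_proj(x).view(b, s, self.n_head, self.head_dim))
        k = self.k_norm(self.k_proj(x).view(b, s, self.n_kv_head, self.head_dim))
        return q, k  # partial RoPE (factor 0.5) and attention would follow
```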

ⓘ misc. notes

  • this PR will NOT attempt to implement MTP (multi-token prediction). the relevant tensors will be excluded from the GGUFs.
  • the MoE router uses group-based top-k selection, even though all conditional experts are in one group (so the group mask reduces to a no-op)
  • the MoE router must take the expert score correction biases from the model weights into account, so we need to keep that tensor (see the sketch after this list)
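
To spell out those last two notes, here is a hedged PyTorch sketch of the DeepSeek-V3-style selection the router inherits (my own reading of the transformers code, not this PR's implementation; the group-masking step is omitted since with a single group it is a no-op). The key point: the correction bias shifts which experts win the top-k, but the gate weights come from the uncorrected scores.

```python
import torch

def glm4_moe_route(logits: torch.Tensor, correction_bias: torch.Tensor,
                   top_k: int = 8, routed_scaling_factor: float = 1.0,
                   norm_topk_prob: bool = True):
    scores = torch.sigmoid(logits)                        # [n_tokens, n_routed_experts]
    # bias-corrected scores decide WHICH experts are selected...
    _, topk_idx = torch.topk(scores + correction_bias, top_k, dim=-1)
    # ...but the gate weights use the ORIGINAL scores of the winners
    topk_weights = torch.gather(scores, -1, topk_idx)
    if norm_topk_prob:                                    # top-k prob normalization
        topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
    return topk_idx, topk_weights * routed_scaling_factor
```

Per the configs above, full GLM-4.5 would use routed_scaling_factor=2.5 and Air 1.0.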

TODOs:

  • add GGUF constants
  • add basic C++ code
  • llama-model.cpp
    • add case for load_hparams
    • add case for load_tensors
    • write llm_build_glm4_moe
  • implement HF model conversion
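
For the last TODO, the conversion side would presumably follow the existing decorator/base-class pattern in convert_hf_to_gguf.py. A hypothetical fragment meant to live inside that script (the pattern is taken from existing model classes, but the class name, the GGUF arch constant, and the MTP-tensor filter condition are all assumptions):

```python
@ModelBase.register("Glm4MoeForCausalLM")
class Glm4MoeModel(TextModel):
    model_arch = gguf.MODEL_ARCH.GLM4_MOE  # assumed constant from the "add GGUF constants" TODO

    def modify_tensors(self, data_torch, name, bid):
        # skip the MTP (multi-token prediction) tensors, which this PR excludes from the GGUFs;
        # assumption: the MTP block sits at layer index >= num_hidden_layers
        if bid is not None and bid >= self.hparams["num_hidden_layers"]:
            return []
        return [(self.map_tensor_name(name), data_torch)]
```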

@github-actions github-actions bot added the `python` (python script changes) label Aug 2, 2025
@MikeLP

MikeLP commented Aug 2, 2025

As I understand it, things didn't go well with the previous PR.

Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Aug 2, 2025
  • initial PR commit
  • add GGUF constants
  • initial GLM-4.5 integration
  • fix typo `LLM_ATCH_GLM4_MOE` --> `LLM_ARCH_GLM4_MOE`
  • add glm4_moe tensor mapping
  • add `attn_k_norm` and `attn_q_norm` tensors for GLM-4.5
  • more consistent organization
  • more consistent organization (cont.)
  • Merge branch 'ggml-org:master' into glm45
  • Merge branch 'ggml-org:master' into glm45
@ddh0
Contributor Author

ddh0 commented Aug 3, 2025

Alright, I think I've got most of the actual implementation done:

  • C++ and Python boilerplate
  • load_hparams case
  • load_tensors case
  • inference graph

Next I need to implement the HF --> GGUF conversion and do some testing with the model before I'm ready for a full review. It would be helpful to get another pair of eyes on this, but I'm not sure who to ping.

cc @CISC, @Noeda

@CISC
Collaborator

CISC commented Aug 3, 2025

Just briefly, I can note that you've copied the `ffn_norm` mistake from the other PR here. :)

@ddh0
Contributor Author

ddh0 commented Aug 3, 2025

D'oh!

@Noeda
Contributor

Noeda commented Aug 3, 2025

I'm laser-focused on just the correctness of the other PR, with some of my own changes and @CISC's changes, seeing if I can confirm parity using the MLX-LM implementation. If I'm successful, the result of that work can be used in this PR or the other one, or by whoever wants to get the implementation ready, but otherwise I probably won't do reviewing work.

- remove `ffn_norm` per CISC
- re-organize some small things
@alkavan

alkavan commented Aug 4, 2025

Hey!

I tried converting zai-org/GLM-4.5-Air to bf16 using convert_hf_to_gguf.py, found out that it's not supported yet, and stumbled upon this issue.

I tried checking out both the glm45-support branch and @ddh0's glm45 branch, and I get the same error when trying to convert. Is the conversion still missing, or is it not part of this PR? And if the support is already there, is there a quick patch I can apply to make the conversion work?

```
python convert_hf_to_gguf.py /home/ubuntu/.cache/huggingface/hub/models--zai-org--GLM-4.5-Air/snapshots/e7fdb9e0a52d2e0aefea94f5867c924a32a78d17/ --outtype bf16 --outfile ~/.cache/lm-studio/models/glm-4.5-air-bf16.gguf
INFO:hf-to-gguf:Loading model: e7fdb9e0a52d2e0aefea94f5867c924a32a78d17
INFO:hf-to-gguf:Model architecture: Glm4MoeForCausalLM
ERROR:hf-to-gguf:Model Glm4MoeForCausalLM is not supported
```

@theo77186

For now, conversion isn't implemented; see the original comment. PR #14939 has conversion implemented and there is more progress there. Not sure which one will get merged.

@ddh0
Contributor Author

ddh0 commented Aug 4, 2025

Closed in favor of #14939.

@ddh0 ddh0 closed this Aug 4, 2025