
support GLM-4.5 MoE models #15026


Closed

wants to merge 19 commits into from

Conversation

ddh0
Contributor

@ddh0 ddh0 commented Aug 2, 2025

GLM-4.5 is a pair of Mixture-of-Experts LLMs released by Zhipu / Z.ai. They are highly interesting for local inference given their size and the performance they have shown so far. If successful, this PR would close #14921. For additional context, see #14939.

GLM-4.5 model info (GLM4_MOE)

common info

  • head dim: 128
  • hidden activation function: SiLU
  • partial rotary factor: 0.5 (see the sketch after this list)
  • RMS norm ε: 1e-5
  • embeddings are not tied
  • RoPE θ: 1000000.0 (1M)
  • no funny RoPE scaling ❌
  • n_shared_experts (always active): 1
  • num_experts_per_tok: 8
  • uses top-k prob normalization (for expert selection) ✅
  • num_attention_heads: 96
  • num_key_value_heads: 8
  • GQA factor: 12x
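
To make the partial-rotary and GQA numbers concrete, here is a minimal NumPy sketch (my own illustration, not code from this PR, NeoX-style pairing assumed): a partial rotary factor of 0.5 means RoPE rotates only the first 64 of the 128 head dimensions (n_rot = 64 in llama.cpp terms), and the 96 query heads share the 8 KV heads in groups of 12.

```python
import numpy as np

head_dim = 128
partial_rotary_factor = 0.5
rot_dim = int(head_dim * partial_rotary_factor)  # 64 dims get RoPE, the rest pass through

n_head, n_kv_head = 96, 8
assert n_head // n_kv_head == 12                 # each KV head serves 12 query heads

def partial_rope(x: np.ndarray, pos: int, theta: float = 1e6) -> np.ndarray:
    """Rotate only the first rot_dim dims of one head vector; leave the rest as-is."""
    out = x.copy()
    half = rot_dim // 2
    angles = pos * theta ** (-np.arange(half) / half)   # per-pair rotation angles
    x1, x2 = x[:half], x[half:rot_dim]
    out[:half]        = x1 * np.cos(angles) - x2 * np.sin(angles)
    out[half:rot_dim] = x1 * np.sin(angles) + x2 * np.cos(angles)
    return out                                          # dims 64..127 are untouched

q_head = partial_rope(np.random.randn(head_dim), pos=42)
```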

GLM-4.5-Air

  • model card
  • config.json
  • model size: 106B-A12B
  • hidden size: 4096
  • intermediate size (FFN): 10944
  • max_position_embeddings (ctx length): 131072
  • moe_intermediate_size (expert size): 1408
  • n_routed_experts (conditional experts): 128
  • routed_scaling_factor: 1.0
  • first_k_dense_replace (number of dense layers at the start of the model): 1
  • num_hidden_layers (total number of hidden layers including dense and MoE): 46
  • no QK normalization ❌

GLM-4.5

  • model card
  • config.json
  • model size: 355B-A32B
  • hidden size: 5120
  • intermediate size (FFN): 12288
  • max_position_embeddings (ctx length): 131072
  • moe_intermediate_size (expert size): 1536
  • n_routed_experts (conditional experts): 160
  • routed_scaling_factor: 2.5
  • first_k_dense_replace (number of dense layers at the start of the model): 3
  • num_hidden_layers (total number of hidden layers including dense and MoE): 92
  • uses QK normalization ✅

in 🤗 transformers

  • implementation
  • the following components of the Glm4Moe model can be implemented as subclasses of DeepseekV3 components:
    • Glm4MoeModel
    • Glm4MoeMLP
    • Glm4MoeTopkRouter
    • Glm4MoeRMSNorm
    • Glm4MoeDecoderLayer
    • Glm4MoePreTrainedModel
    • Glm4MoeForCausalLM <-- this is what's in the config.json on HF
  • Glm4MoeAttention can be implemented as a subclass of CohereAttention and nn.Module (looks pretty standard; see the sketch below)
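
As a rough illustration of the attention difference between the two models, the QK normalization that full GLM-4.5 uses (and Air skips) amounts to an RMSNorm applied per head to Q and K before RoPE. A hedged PyTorch sketch follows (my own, not the transformers source; projection biases and other details are elided):

```python
import torch
import torch.nn as nn

class QKNormAttentionSketch(nn.Module):
    """Per-head RMSNorm on Q and K before RoPE (present in GLM-4.5, absent in Air)."""
    def __init__(self, hidden=5120, n_head=96, n_kv_head=8, head_dim=128):
        super().__init__()
        self.n_head, self.n_kv_head, self.head_dim = n_head, n_kv_head, head_dim
        self.q_proj = nn.Linear(hidden, n_head * head_dim, bias=False)
        self.k_proj = nn.Linear(hidden, n_kv_head * head_dim, bias=False)
        self.q_norm = nn.RMSNorm(head_dim, eps=1e-5)  # maps to attn_q_norm in the GGUF
        self.k_norm = nn.RMSNorm(head_dim, eps=1e-5)  # maps to attn_k_norm in the GGUF

    def forward(self, x: torch.Tensor):
        b, s, _ = x.shape
        q = self.q_norm(self.q_proj(x).view(b, s, self.n_head, self.head_dim))
        k = self.k_norm(self.k_proj(x).view(b, s, self.n_kv_head, self.head_dim))
        return q, k  # partial RoPE (factor 0.5) and attention would follow
```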

ⓘ misc. notes

  • this PR will NOT attempt to implement MTP (multi-token prediction). the relevant tensors will be excluded from the GGUFs.
  • the MoE router uses group-based top-k selection, even though all conditional experts are in one group (so the group mask reduces to a no-op)
  • the MoE router must take the expert score correction biases from the model weights into account, so we need to keep that tensor (see the sketch after this list)
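
To spell out those last two notes, here is a hedged PyTorch sketch of the DeepSeek-V3-style selection the router inherits (my own reading of the transformers code, not this PR's implementation; the group-masking step is omitted since with a single group it is a no-op). The key point: the correction bias shifts which experts win the top-k, but the gate weights come from the uncorrected scores.

```python
import torch

def glm4_moe_route(logits: torch.Tensor, correction_bias: torch.Tensor,
                   top_k: int = 8, routed_scaling_factor: float = 1.0,
                   norm_topk_prob: bool = True):
    scores = torch.sigmoid(logits)                        # [n_tokens, n_routed_experts]
    # bias-corrected scores decide WHICH experts are selected...
    _, topk_idx = torch.topk(scores + correction_bias, top_k, dim=-1)
    # ...but the gate weights use the ORIGINAL scores of the winners
    topk_weights = torch.gather(scores, -1, topk_idx)
    if norm_topk_prob:                                    # top-k prob normalization
        topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
    return topk_idx, topk_weights * routed_scaling_factor
```

Per the configs above, full GLM-4.5 would use routed_scaling_factor=2.5 and Air 1.0.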

TODOs:

  • add GGUF constants
  • add basic C++ code
  • llama-model.cpp
    • add case for load_hparams
    • add case for load_tensors
    • write llm_build_glm4_moe
  • implement HF model conversion
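
For the last TODO, the conversion side would presumably follow the existing decorator/base-class pattern in convert_hf_to_gguf.py. A hypothetical fragment meant to live inside that script (the pattern is taken from existing model classes, but the class name, the GGUF arch constant, and the MTP-tensor filter condition are all assumptions):

```python
@ModelBase.register("Glm4MoeForCausalLM")
class Glm4MoeModel(TextModel):
    model_arch = gguf.MODEL_ARCH.GLM4_MOE  # assumed constant from the "add GGUF constants" TODO

    def modify_tensors(self, data_torch, name, bid):
        # skip the MTP (multi-token prediction) tensors, which this PR excludes from the GGUFs;
        # assumption: the MTP block sits at layer index >= num_hidden_layers
        if bid is not None and bid >= self.hparams["num_hidden_layers"]:
            return []
        return [(self.map_tensor_name(name), data_torch)]
```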

@github-actions github-actions bot added the `python` (python script changes) label Aug 2, 2025
@MikeLP

MikeLP commented Aug 2, 2025

As I understand it, things didn't go well with the previous PR.

Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Aug 2, 2025
  • initial PR commit
  • add GGUF constants
  • initial GLM-4.5 integration
  • fix typo `LLM_ATCH_GLM4_MOE` --> `LLM_ARCH_GLM4_MOE`
  • add glm4_moe tensor mapping
  • add `attn_k_norm` and `attn_q_norm` tensors for GLM-4.5
  • more consistent organization
  • more consistent organization (cont.)
  • Merge branch 'ggml-org:master' into glm45
  • Merge branch 'ggml-org:master' into glm45
@ddh0
Contributor Author

ddh0 commented Aug 3, 2025

Alright, I think I've got most of the actual implementation done:

  • C++ and Python boilerplate
  • load_hparams case
  • load_tensors case
  • inference graph

Next I need to implement the HF --> GGUF conversion and do some testing with the model before I'm ready for a full review. It would be helpful to get another pair of eyes on this, but I'm not sure who to ping.

cc @CISC, @Noeda

@CISC
Collaborator

CISC commented Aug 3, 2025

Just briefly, I can note that you've copied the `ffn_norm` mistake from the other PR here. :)

@ddh0
Contributor Author

ddh0 commented Aug 3, 2025

D'oh!

@Noeda
Contributor

Noeda commented Aug 3, 2025

I'm laser-focused on just the correctness of the other PR, with some of my own changes and @CISC's changes, seeing if I can confirm parity using the MLX-LM implementation. If I'm successful, the result of that work can be used in this PR or the other one, or by whoever wants to get the implementation ready, but otherwise I probably won't do reviewing work.

- remove `ffn_norm` per CISC
- re-organize some small things
@alkavan

alkavan commented Aug 4, 2025

Hey!

I tried converting zai-org/GLM-4.5-Air to bf16 using convert_hf_to_gguf.py, found out that it's not supported yet, and stumbled upon this issue.

I tried checking out both the glm45-support branch and @ddh0's glm45 branch, and I get the same error when trying to convert. Is the conversion still missing, or is it not part of this PR? And if the support is already there, is there a quick patch I can apply to make the conversion work?

```
python convert_hf_to_gguf.py /home/ubuntu/.cache/huggingface/hub/models--zai-org--GLM-4.5-Air/snapshots/e7fdb9e0a52d2e0aefea94f5867c924a32a78d17/ --outtype bf16 --outfile ~/.cache/lm-studio/models/glm-4.5-air-bf16.gguf
INFO:hf-to-gguf:Loading model: e7fdb9e0a52d2e0aefea94f5867c924a32a78d17
INFO:hf-to-gguf:Model architecture: Glm4MoeForCausalLM
ERROR:hf-to-gguf:Model Glm4MoeForCausalLM is not supported
```

@theo77186

For now, conversion isn't implemented; see the original comment. PR #14939 has conversion implemented and there is more progress there. Not sure which one will get merged.

@ddh0
Contributor Author

ddh0 commented Aug 4, 2025

Closed in favor of #14939.

@ddh0 ddh0 closed this Aug 4, 2025