Add support for GLM-4.5 models #668
Conversation
Just two quick observations, and I'm trying to better understand how you ported the mainline PR, as it is a bit different (did you make some changes, or was this a port from a couple of days ago, etc.?).
Anyway, thanks for opening a clean PR to make it easier to look at the differences!
```diff
 // Not sure if it is really a matter of insufficient precision, or I have made a mistake in the fattn-vec-f16 kernel.
 if (use_f32_precision || model.arch == LLM_ARCH_PHI2 || model.arch == LLM_ARCH_PHI3 || model.arch == LLM_ARCH_GPTNEOX ||
-    (model.arch == LLM_ARCH_DEEPSEEK2 && q->ne[1] <= 8) || model.arch == LLM_ARCH_COHERE2 || model.arch == LLM_ARCH_GLM4) {
+    (model.arch == LLM_ARCH_DEEPSEEK2 && q->ne[1] <= 8) || model.arch == LLM_ARCH_COHERE2 || model.arch == LLM_ARCH_GLM4 || model.arch == LLM_ARCH_GLM4_MOE) {
```
Did you test perplexity without this to see if it runs clean? Usually we only add a model here if perplexity is throwing NaNs without it.
At that time the model was still not working (or I had not tried llama-server), so I was trying my best to find any possible change that could make the model work, without putting much thought into it. I had noticed you made those changes in one of your old PRs, so I followed your steps. It is entirely possible that this change is not required after all.
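For reference, a NaN check along those lines could look like the sketch below; the model path and test file are placeholders, and the invocation mirrors the imatrix run later in this thread:

```bash
# Hypothetical sanity check: run perplexity with flash attention enabled
# and watch the per-chunk values for nan (paths are illustrative)
./build/bin/llama-perplexity \
  -m /path/to/GLM-4.5-Q8_0.gguf \
  -f wiki.test.raw \
  -fa \
  --ctx-size 512
```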
src/llama.cpp (outdated):
```cpp
}

// crop output on last layer
if (il == n_layer - 1) {
```
This `if` is slightly different from mainline: https://github.com/ggml-org/llama.cpp/pull/14939/files#diff-36e262e316ec1404e29880eb8b8ce4660ac584f0d0434710efc48a66497bdb59R13557

On mainline it is:

```cpp
if (il == n_layer - 1 && inp_out_ids) {
```

I haven't followed the mainline PR closely enough to see if this changed recently, nor looked in depth yet at what inp_out_ids is even supposed to be over there.
Yes, I've noticed this diff; it was originally there: https://github.com/ggml-org/llama.cpp/pull/14939/files/fae4df8ee05c57bf2e2d5e3f0b094010f34dc86f#diff-36e262e316ec1404e29880eb8b8ce4660ac584f0d0434710efc48a66497bdb59R13554

This is indeed a port from a few days ago - to the best of my knowledge it was https://github.com/ggml-org/llama.cpp/pull/14939/files/fae4df8ee05c57bf2e2d5e3f0b094010f34dc86f, with slight variations, as I was trying to best match existing implementations such as DeepSeek to make the port compatible with ik_llama.
@ubergarm - /nothink seems to be working just fine. I'm not seeing the issues they are reporting on the llama.cpp implementation.
My other thought is about chat template stuff: after running gguf dump I see:

This looks right compared with the official chat template here: https://huggingface.co/zai-org/GLM-4.5?chat_template=default. When starting the model with llama-server it says:

So, I'll upload your imatrix soon after it finishes, here: https://huggingface.co/ubergarm/GLM-4.5-GGUF. Once again though, I'm hesitant to release quants just yet until things are more tested, but I'm excited to see this coming along! Thanks for your efforts!
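For anyone wanting to repeat that chat-template check, a sketch using the gguf-py dump script; the script path and the --no-tensors flag are assumed from mainline's gguf-py, so adjust to your checkout:

```bash
# Assumed helper: dump metadata only and look for the chat template key
python gguf-py/scripts/gguf_dump.py --no-tensors \
  /path/to/GLM-4.5-00001-of-XXXXX.gguf | grep -i chat_template
```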
I already have the GLM-4.5-Air bf16 safetensors downloaded, so I will try taking this PR through the whole process and see how it comes out for testing.
I was about to say that this is only tested on GLM-4.5, not the others. Good thinking!
Also cc @AesSedai
Your imatrix is uploaded here: https://huggingface.co/ubergarm/GLM-4.5-GGUF/blob/main/imatrix-GLM-4.5-BF16.dat

Here are the logs of the run, including the imatrix command:

```
#!/usr/bin/env bash
ulimit -n 9999
# echo 0 | sudo tee /proc/sys/kernel/numa_balancing
# sudo sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
model=/mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/GLM-4.5-THIREUS-BF16-SPECIAL_TENSOR-00001-of-01762.gguf
numactl -N 1 -m 1 \
./build/bin/llama-imatrix \
-m "$model" \
-fa \
-f ubergarm-imatrix-calibration-corpus-v02.txt \
-o /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat \
--verbosity 1 \
--layer-similarity \
--seed 1337 \
--ctx-size 512 \
-ub 4096 -b 4096 \
--numa numactl \
--threads 128 \
--threads-batch 192 \
--no-mmap
llama_model_loader: loaded meta data with 44 key-value pairs and 1761 tensors from /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/GLM-4.5-THIREUS-BF16-SPECIAL_TENSOR-00001-of-01762.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = glm4moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = GLM 4.5
llama_model_loader: - kv 3: general.version str = 4.5
llama_model_loader: - kv 4: general.basename str = GLM
llama_model_loader: - kv 5: general.size_label str = 160x21B
llama_model_loader: - kv 6: general.license str = mit
llama_model_loader: - kv 7: general.tags arr[str,1] = ["text-generation"]
llama_model_loader: - kv 8: general.languages arr[str,2] = ["en", "zh"]
llama_model_loader: - kv 9: glm4moe.block_count u32 = 93
llama_model_loader: - kv 10: glm4moe.context_length u32 = 131072
llama_model_loader: - kv 11: glm4moe.embedding_length u32 = 5120
llama_model_loader: - kv 12: glm4moe.feed_forward_length u32 = 12288
llama_model_loader: - kv 13: glm4moe.attention.head_count u32 = 96
llama_model_loader: - kv 14: glm4moe.attention.head_count_kv u32 = 8
llama_model_loader: - kv 15: glm4moe.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 16: glm4moe.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 17: glm4moe.expert_used_count u32 = 8
llama_model_loader: - kv 18: glm4moe.attention.key_length u32 = 128
llama_model_loader: - kv 19: glm4moe.attention.value_length u32 = 128
llama_model_loader: - kv 20: general.file_type u32 = 32
llama_model_loader: - kv 21: glm4moe.rope.dimension_count u32 = 64
llama_model_loader: - kv 22: glm4moe.expert_count u32 = 160
llama_model_loader: - kv 23: glm4moe.expert_feed_forward_length u32 = 1536
llama_model_loader: - kv 24: glm4moe.expert_shared_count u32 = 1
llama_model_loader: - kv 25: glm4moe.leading_dense_block_count u32 = 3
llama_model_loader: - kv 26: glm4moe.expert_gating_func u32 = 2
llama_model_loader: - kv 27: glm4moe.expert_weights_scale f32 = 2.500000
llama_model_loader: - kv 28: glm4moe.expert_weights_norm bool = true
llama_model_loader: - kv 29: general.quantization_version u32 = 2
llama_model_loader: - kv 30: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 31: tokenizer.ggml.pre str = glm4
llama_model_loader: - kv 32: tokenizer.ggml.tokens arr[str,151552] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 33: tokenizer.ggml.token_type arr[i32,151552] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 34: tokenizer.ggml.merges arr[str,318088] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 35: tokenizer.ggml.eos_token_id u32 = 151329
llama_model_loader: - kv 36: tokenizer.ggml.padding_token_id u32 = 151329
llama_model_loader: - kv 37: tokenizer.ggml.eot_token_id u32 = 151336
llama_model_loader: - kv 38: tokenizer.ggml.unknown_token_id u32 = 151329
llama_model_loader: - kv 39: tokenizer.ggml.bos_token_id u32 = 151329
llama_model_loader: - kv 40: tokenizer.chat_template str = [gMASK]<sop>\n{%- if tools -%}\n<|syste...
llama_model_loader: - kv 41: split.no u16 = 0
llama_model_loader: - kv 42: split.count u16 = 1762
llama_model_loader: - kv 43: split.tensors.count i32 = 1761
llama_model_loader: - type f32: 838 tensors
llama_model_loader: - type bf16: 923 tensors
llm_load_vocab: special tokens cache size = 36
llm_load_vocab: token to piece cache size = 0.9713 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = glm4moe
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 151552
llm_load_print_meta: n_merges = 318088
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_layer = 93
llm_load_print_meta: n_head = 96
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_swa_pattern = 1
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 12
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 12288
llm_load_print_meta: n_expert = 160
llm_load_print_meta: n_expert_used = 8
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 355B.A32B
llm_load_print_meta: model ftype = BF16
llm_load_print_meta: model params = 358.338 B
llm_load_print_meta: model size = 670.586 GiB (16.075 BPW)
llm_load_print_meta: repeating layers = 667.696 GiB (16.075 BPW, 356.786 B parameters)
llm_load_print_meta: general.name = GLM 4.5
llm_load_print_meta: BOS token = 151329 '<|endoftext|>'
llm_load_print_meta: EOS token = 151329 '<|endoftext|>'
llm_load_print_meta: UNK token = 151329 '<|endoftext|>'
llm_load_print_meta: PAD token = 151329 '<|endoftext|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 151336 '<|user|>'
llm_load_print_meta: max token length = 1024
llm_load_tensors: ggml ctx size = 0.72 MiB
llm_load_tensors: CPU buffer size = 686680.19 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe = 0
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 186.00 MiB
llama_new_context_with_model: KV self size = 186.00 MiB, K (f16): 93.00 MiB, V (f16): 93.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.58 MiB
llama_new_context_with_model: CPU compute buffer size = 306.00 MiB
llama_new_context_with_model: graph nodes = 4779
llama_new_context_with_model: graph splits = 1
system_info: n_threads = 128 (n_threads_batch = 192) / 768 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
compute_imatrix: tokenizing the input ..
compute_imatrix: tokenization took 901.12 ms
compute_imatrix: computing over 814 chunks with batch_size 512
compute_imatrix: 10.08 seconds per pass - ETA 2 hours 16.72 minutes
[1]16.8092,[2]6.7403,[3]4.3630,[4]3.1866,[5]2.5865,[6]2.2122,[7]1.9898,[8]1.8430,[9]1.8314,
save_imatrix: entry ' blk.92.ffn_gate_exps.weight' has partial data (98.75%) 2 out of 160 experts are missing data Storing **but be aware**
save_imatrix: entry ' blk.51.ffn_down_exps.weight' has partial data (99.38%) 1 out of 160 experts are missing data Storing **but be aware**
save_imatrix: entry ' blk.48.ffn_gate_exps.weight' has partial data (99.38%) 1 out of 160 experts are missing data Storing **but be aware**
save_imatrix: entry ' blk.51.ffn_up_exps.weight' has partial data (99.38%) 1 out of 160 experts are missing data Storing **but be aware**
save_imatrix: entry ' blk.30.ffn_up_exps.weight' has partial data (99.38%) 1 out of 160 experts are missing data Storing **but be aware**
save_imatrix: entry ' blk.29.ffn_down_exps.weight' has partial data (99.38%) 1 out of 160 experts are missing data Storing **but be aware**
save_imatrix: entry ' blk.29.ffn_gate_exps.weight' has partial data (99.38%) 1 out of 160 experts are missing data Storing **but be aware**
save_imatrix: entry ' blk.29.ffn_up_exps.weight' has partial data (99.38%) 1 out of 160 experts are missing data Storing **but be aware**
save_imatrix: entry ' blk.48.ffn_down_exps.weight' has partial data (99.38%) 1 out of 160 experts are missing data Storing **but be aware**
save_imatrix: entry ' blk.92.ffn_up_exps.weight' has partial data (98.75%) 2 out of 160 experts are missing data Storing **but be aware**
save_imatrix: entry ' blk.30.ffn_gate_exps.weight' has partial data (99.38%) 1 out of 160 experts are missing data Storing **but be aware**
save_imatrix: entry ' blk.30.ffn_down_exps.weight' has partial data (99.38%) 1 out of 160 experts are missing data Storing **but be aware**
save_imatrix: entry ' blk.51.ffn_gate_exps.weight' has partial data (99.38%) 1 out of 160 experts are missing data Storing **but be aware**
save_imatrix: entry ' blk.92.ffn_down_exps.weight' has partial data (98.75%) 2 out of 160 experts are missing data Storing **but be aware**
save_imatrix: entry ' blk.48.ffn_up_exps.weight' has partial data (99.38%) 1 out of 160 experts are missing data Storing **but be aware**
save_imatrix: stored collected data after 10 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[10]1.7538,[11]1.8628,[12]1.9497,[13]2.0128,[14]2.0758,[15]1.9792,[16]1.9022,[17]1.8481,[18]1.7962,[19]1.7420,
save_imatrix: stored collected data after 20 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[20]1.7069,[21]1.6655,[22]1.6390,[23]1.6079,[24]1.5804,[25]1.5523,[26]1.6324,[27]1.7262,[28]1.8429,[29]1.8173,
save_imatrix: stored collected data after 30 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[30]1.8001,[31]1.8139,[32]1.8058,[33]1.8766,[34]1.8526,[35]1.8476,[36]1.8334,[37]1.8276,[38]1.8529,[39]1.8663,
save_imatrix: stored collected data after 40 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[40]1.8546,[41]1.8837,[42]1.8914,[43]1.9060,[44]1.9156,[45]1.9178,[46]1.9054,[47]1.9153,[48]1.9143,[49]1.9182,
save_imatrix: stored collected data after 50 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[50]1.9079,[51]1.9235,[52]1.9356,[53]1.9249,[54]1.9343,[55]1.9362,[56]1.9417,[57]1.9357,[58]1.9853,[59]2.0340,
save_imatrix: stored collected data after 60 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[60]2.0820,[61]2.0979,[62]2.1400,[63]2.1680,[64]2.1632,[65]2.1640,[66]2.1671,[67]2.1524,[68]2.1657,[69]2.2064,
save_imatrix: stored collected data after 70 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[70]2.2582,[71]2.2854,[72]2.3223,[73]2.3531,[74]2.3716,[75]2.3998,[76]2.4148,[77]2.4411,[78]2.4389,[79]2.4221,
save_imatrix: stored collected data after 80 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[80]2.4199,[81]2.4225,[82]2.4477,[83]2.4895,[84]2.5059,[85]2.5095,[86]2.5137,[87]2.5037,[88]2.5068,[89]2.4945,
save_imatrix: stored collected data after 90 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[90]2.4849,[91]2.4808,[92]2.4657,[93]2.4469,[94]2.4741,[95]2.5218,[96]2.5412,[97]2.5434,[98]2.5515,[99]2.5706,
save_imatrix: stored collected data after 100 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[100]2.5892,[101]2.5974,[102]2.5991,[103]2.6306,[104]2.6540,[105]2.6476,[106]2.6845,[107]2.7273,[108]2.7577,[109]2.7980,
save_imatrix: stored collected data after 110 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[110]2.8276,[111]2.8644,[112]2.8963,[113]2.8919,[114]2.9083,[115]2.9220,[116]2.9303,[117]2.9364,[118]2.9665,[119]3.0045,
save_imatrix: stored collected data after 120 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[120]3.0421,[121]3.0367,[122]3.0118,[123]2.9970,[124]3.0177,[125]3.0060,[126]2.9822,[127]2.9828,[128]2.9806,[129]2.9875,
save_imatrix: stored collected data after 130 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[130]2.9961,[131]3.0125,[132]3.0298,[133]3.0353,[134]3.0729,[135]3.0906,[136]3.0662,[137]3.0415,[138]3.0195,[139]2.9963,
save_imatrix: stored collected data after 140 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[140]3.0054,[141]3.0137,[142]3.0481,[143]3.0755,[144]3.0806,[145]3.0999,[146]3.1236,[147]3.1461,[148]3.1797,[149]3.2059,
save_imatrix: stored collected data after 150 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[150]3.2329,[151]3.2507,[152]3.2723,[153]3.2842,[154]3.2903,[155]3.2860,[156]3.3026,[157]3.3125,[158]3.3244,[159]3.3376,
save_imatrix: stored collected data after 160 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[160]3.3521,[161]3.3566,[162]3.3596,[163]3.3730,[164]3.3788,[165]3.3865,[166]3.3999,[167]3.4016,[168]3.4032,[169]3.4102,
save_imatrix: stored collected data after 170 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[170]3.4193,[171]3.4246,[172]3.4296,[173]3.4358,[174]3.4532,[175]3.4650,[176]3.4702,[177]3.4760,[178]3.4935,[179]3.4819,
save_imatrix: stored collected data after 180 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[180]3.4896,[181]3.5034,[182]3.5243,[183]3.5399,[184]3.5470,[185]3.5495,[186]3.5481,[187]3.5473,[188]3.5489,[189]3.5499,
save_imatrix: stored collected data after 190 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[190]3.5511,[191]3.5486,[192]3.5701,[193]3.5991,[194]3.6236,[195]3.6501,[196]3.6697,[197]3.7037,[198]3.7149,[199]3.7318,
save_imatrix: stored collected data after 200 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[200]3.7226,[201]3.7382,[202]3.7311,[203]3.7091,[204]3.6876,[205]3.7082,[206]3.7229,[207]3.7318,[208]3.7417,[209]3.7609,
save_imatrix: stored collected data after 210 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[210]3.7771,[211]3.7935,[212]3.8121,[213]3.8290,[214]3.8321,[215]3.8110,[216]3.7881,[217]3.7656,[218]3.7432,[219]3.7212,
save_imatrix: stored collected data after 220 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[220]3.7055,[221]3.7036,[222]3.6939,[223]3.6892,[224]3.6760,[225]3.6579,[226]3.6574,[227]3.6620,[228]3.6838,[229]3.7077,
save_imatrix: stored collected data after 230 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[230]3.7183,[231]3.7402,[232]3.7358,[233]3.7603,[234]3.7894,[235]3.8015,[236]3.8160,[237]3.8175,[238]3.8411,[239]3.8694,
save_imatrix: stored collected data after 240 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[240]3.8659,[241]3.8761,[242]3.8903,[243]3.9100,[244]3.9294,[245]3.9438,[246]3.9562,[247]3.9656,[248]3.9554,[249]3.9816,
save_imatrix: stored collected data after 250 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[250]3.9954,[251]4.0137,[252]4.0246,[253]4.0289,[254]4.0353,[255]4.0385,[256]4.0511,[257]4.0559,[258]4.0678,[259]4.0822,
save_imatrix: stored collected data after 260 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[260]4.0923,[261]4.1029,[262]4.1146,[263]4.1297,[264]4.1417,[265]4.1577,[266]4.1437,[267]4.1496,[268]4.1544,[269]4.1681,
save_imatrix: stored collected data after 270 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[270]4.1888,[271]4.2028,[272]4.2240,[273]4.2244,[274]4.2225,[275]4.2310,[276]4.2363,[277]4.2530,[278]4.2672,[279]4.2808,
save_imatrix: stored collected data after 280 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[280]4.2902,[281]4.2913,[282]4.3053,[283]4.3167,[284]4.3194,[285]4.3358,[286]4.3368,[287]4.3410,[288]4.3498,[289]4.3473,
save_imatrix: stored collected data after 290 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[290]4.3583,[291]4.3637,[292]4.3700,[293]4.3864,[294]4.3991,[295]4.4124,[296]4.4292,[297]4.4341,[298]4.4530,[299]4.4657,
save_imatrix: stored collected data after 300 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[300]4.4822,[301]4.4951,[302]4.5095,[303]4.5144,[304]4.5336,[305]4.5411,[306]4.5451,[307]4.5536,[308]4.5710,[309]4.5813,
save_imatrix: stored collected data after 310 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[310]4.5858,[311]4.5939,[312]4.6025,[313]4.6165,[314]4.6244,[315]4.6338,[316]4.6466,[317]4.6587,[318]4.6726,[319]4.6779,
save_imatrix: stored collected data after 320 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[320]4.6821,[321]4.6755,[322]4.6849,[323]4.6688,[324]4.6853,[325]4.6888,[326]4.6671,[327]4.6807,[328]4.6914,[329]4.6975,
save_imatrix: stored collected data after 330 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[330]4.7043,[331]4.7033,[332]4.7074,[333]4.7257,[334]4.7232,[335]4.7341,[336]4.7505,[337]4.7598,[338]4.7639,[339]4.7518,
save_imatrix: stored collected data after 340 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[340]4.7632,[341]4.7790,[342]4.7941,[343]4.8108,[344]4.8320,[345]4.8604,[346]4.8630,[347]4.8644,[348]4.8671,[349]4.8744,
save_imatrix: stored collected data after 350 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[350]4.8872,[351]4.9059,[352]4.9061,[353]4.9034,[354]4.9135,[355]4.9099,[356]4.9097,[357]4.9080,[358]4.9038,[359]4.9090,
save_imatrix: stored collected data after 360 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[360]4.9194,[361]4.9156,[362]4.9129,[363]4.8963,[364]4.8796,[365]4.8635,[366]4.8503,[367]4.8318,[368]4.8157,[369]4.7996,
save_imatrix: stored collected data after 370 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[370]4.7866,[371]4.7714,[372]4.7556,[373]4.7429,[374]4.7300,[375]4.7123,[376]4.6990,[377]4.6849,[378]4.6693,[379]4.6561,
save_imatrix: stored collected data after 380 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[380]4.6537,[381]4.6400,[382]4.6338,[383]4.6367,[384]4.6232,[385]4.6175,[386]4.6067,[387]4.5890,[388]4.5728,[389]4.5652,
save_imatrix: stored collected data after 390 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[390]4.5555,[391]4.5414,[392]4.5239,[393]4.5066,[394]4.5040,[395]4.5026,[396]4.4983,[397]4.4885,[398]4.4904,[399]4.4896,
save_imatrix: stored collected data after 400 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[400]4.4737,[401]4.4599,[402]4.4532,[403]4.4396,[404]4.4289,[405]4.4190,[406]4.4097,[407]4.3939,[408]4.3788,[409]4.3650,
save_imatrix: stored collected data after 410 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[410]4.3531,[411]4.3425,[412]4.3370,[413]4.3287,[414]4.3245,[415]4.3204,[416]4.3186,[417]4.3135,[418]4.3088,[419]4.2949,
save_imatrix: stored collected data after 420 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[420]4.2807,[421]4.2662,[422]4.2536,[423]4.2399,[424]4.2282,[425]4.2152,[426]4.2014,[427]4.1906,[428]4.1766,[429]4.1644,
save_imatrix: stored collected data after 430 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[430]4.1522,[431]4.1426,[432]4.1317,[433]4.1220,[434]4.1199,[435]4.1181,[436]4.1119,[437]4.1020,[438]4.0960,[439]4.0828,
save_imatrix: stored collected data after 440 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[440]4.0702,[441]4.0585,[442]4.0473,[443]4.0362,[444]4.0323,[445]4.0235,[446]4.0202,[447]4.0146,[448]4.0049,[449]4.0016,
save_imatrix: stored collected data after 450 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[450]3.9946,[451]3.9874,[452]3.9767,[453]3.9699,[454]3.9632,[455]3.9550,[456]3.9433,[457]3.9322,[458]3.9206,[459]3.9096,
save_imatrix: stored collected data after 460 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[460]3.8989,[461]3.8899,[462]3.8810,[463]3.8748,[464]3.8676,[465]3.8636,[466]3.8585,[467]3.8536,[468]3.8486,[469]3.8434,
save_imatrix: stored collected data after 470 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[470]3.8384,[471]3.8334,[472]3.8284,[473]3.8242,[474]3.8191,[475]3.8139,[476]3.8096,[477]3.8047,[478]3.7999,[479]3.7961,
save_imatrix: stored collected data after 480 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[480]3.7857,[481]3.7764,[482]3.7724,[483]3.7656,[484]3.7584,[485]3.7486,[486]3.7395,[487]3.7308,[488]3.7220,[489]3.7158,
save_imatrix: stored collected data after 490 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[490]3.7084,[491]3.7021,[492]3.6979,[493]3.6921,[494]3.6859,[495]3.6784,[496]3.6778,[497]3.6749,[498]3.6698,[499]3.6680,
save_imatrix: stored collected data after 500 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[500]3.6647,[501]3.6636,[502]3.6639,[503]3.6667,[504]3.6647,[505]3.6596,[506]3.6523,[507]3.6562,[508]3.6667,[509]3.6749,
save_imatrix: stored collected data after 510 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[510]3.6826,[511]3.6895,[512]3.6970,[513]3.7014,[514]3.7053,[515]3.7070,[516]3.7142,[517]3.7171,[518]3.7236,[519]3.7323,
save_imatrix: stored collected data after 520 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[520]3.7451,[521]3.7605,[522]3.7736,[523]3.7724,[524]3.7795,[525]3.7835,[526]3.7893,[527]3.7909,[528]3.7931,[529]3.8026,
save_imatrix: stored collected data after 530 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[530]3.8079,[531]3.8093,[532]3.8157,[533]3.8216,[534]3.8286,[535]3.8288,[536]3.8280,[537]3.8288,[538]3.8326,[539]3.8370,
save_imatrix: stored collected data after 540 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[540]3.8417,[541]3.8458,[542]3.8479,[543]3.8506,[544]3.8552,[545]3.8600,[546]3.8687,[547]3.8770,[548]3.8832,[549]3.8919,
save_imatrix: stored collected data after 550 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[550]3.8987,[551]3.9068,[552]3.9131,[553]3.9191,[554]3.9255,[555]3.9314,[556]3.9288,[557]3.9261,[558]3.9228,[559]3.9276,
save_imatrix: stored collected data after 560 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[560]3.9331,[561]3.9378,[562]3.9433,[563]3.9446,[564]3.9491,[565]3.9497,[566]3.9542,[567]3.9548,[568]3.9547,[569]3.9539,
save_imatrix: stored collected data after 570 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[570]3.9546,[571]3.9576,[572]3.9542,[573]3.9509,[574]3.9462,[575]3.9421,[576]3.9350,[577]3.9290,[578]3.9228,[579]3.9166,
save_imatrix: stored collected data after 580 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[580]3.9139,[581]3.9150,[582]3.9130,[583]3.9138,[584]3.9120,[585]3.9120,[586]3.9116,[587]3.9092,[588]3.9037,[589]3.9042,
save_imatrix: stored collected data after 590 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[590]3.9009,[591]3.8938,[592]3.8872,[593]3.8797,[594]3.8743,[595]3.8708,[596]3.8695,[597]3.8677,[598]3.8667,[599]3.8645,
save_imatrix: stored collected data after 600 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[600]3.8599,[601]3.8535,[602]3.8530,[603]3.8530,[604]3.8531,[605]3.8489,[606]3.8471,[607]3.8441,[608]3.8471,[609]3.8458,
save_imatrix: stored collected data after 610 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[610]3.8437,[611]3.8437,[612]3.8425,[613]3.8375,[614]3.8309,[615]3.8233,[616]3.8164,[617]3.8088,[618]3.8017,[619]3.7944,
save_imatrix: stored collected data after 620 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[620]3.7875,[621]3.7793,[622]3.7714,[623]3.7645,[624]3.7581,[625]3.7505,[626]3.7442,[627]3.7367,[628]3.7299,[629]3.7234,
save_imatrix: stored collected data after 630 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[630]3.7168,[631]3.7100,[632]3.7058,[633]3.6985,[634]3.6944,[635]3.6929,[636]3.6877,[637]3.6811,[638]3.6755,[639]3.6690,
save_imatrix: stored collected data after 640 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[640]3.6621,[641]3.6558,[642]3.6496,[643]3.6435,[644]3.6372,[645]3.6309,[646]3.6249,[647]3.6194,[648]3.6184,[649]3.6123,
save_imatrix: stored collected data after 650 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[650]3.6055,[651]3.5992,[652]3.5929,[653]3.5864,[654]3.5796,[655]3.5730,[656]3.5668,[657]3.5611,[658]3.5547,[659]3.5570,
save_imatrix: stored collected data after 660 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[660]3.5573,[661]3.5604,[662]3.5587,[663]3.5528,[664]3.5491,[665]3.5438,[666]3.5377,[667]3.5324,[668]3.5275,[669]3.5226,
save_imatrix: stored collected data after 670 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[670]3.5177,[671]3.5125,[672]3.5068,[673]3.5011,[674]3.4970,[675]3.4922,[676]3.4869,[677]3.4818,[678]3.4765,[679]3.4708,
save_imatrix: stored collected data after 680 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[680]3.4667,[681]3.4611,[682]3.4562,[683]3.4514,[684]3.4459,[685]3.4411,[686]3.4390,[687]3.4378,[688]3.4343,[689]3.4294,
save_imatrix: stored collected data after 690 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[690]3.4236,[691]3.4176,[692]3.4125,[693]3.4072,[694]3.4033,[695]3.4014,[696]3.3996,[697]3.3970,[698]3.3955,[699]3.3934,
save_imatrix: stored collected data after 700 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[700]3.3914,[701]3.3899,[702]3.3879,[703]3.3860,[704]3.3840,[705]3.3822,[706]3.3807,[707]3.3782,[708]3.3768,[709]3.3745,
save_imatrix: stored collected data after 710 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[710]3.3728,[711]3.3708,[712]3.3714,[713]3.3711,[714]3.3711,[715]3.3720,[716]3.3730,[717]3.3740,[718]3.3748,[719]3.3762,
save_imatrix: stored collected data after 720 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[720]3.3783,[721]3.3785,[722]3.3791,[723]3.3798,[724]3.3814,[725]3.3823,[726]3.3842,[727]3.3854,[728]3.3873,[729]3.3872,
save_imatrix: stored collected data after 730 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[730]3.3873,[731]3.3885,[732]3.3909,[733]3.3917,[734]3.3921,[735]3.3923,[736]3.3934,[737]3.3951,[738]3.3954,[739]3.3978,
save_imatrix: stored collected data after 740 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[740]3.3994,[741]3.4013,[742]3.4027,[743]3.4031,[744]3.4030,[745]3.4037,[746]3.4056,[747]3.4068,[748]3.4084,[749]3.4093,
save_imatrix: stored collected data after 750 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[750]3.4106,[751]3.4113,[752]3.4133,[753]3.4155,[754]3.4160,[755]3.4166,[756]3.4181,[757]3.4199,[758]3.4207,[759]3.4219,
save_imatrix: stored collected data after 760 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[760]3.4226,[761]3.4234,[762]3.4251,[763]3.4255,[764]3.4273,[765]3.4286,[766]3.4300,[767]3.4306,[768]3.4316,[769]3.4319,
save_imatrix: stored collected data after 770 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[770]3.4328,[771]3.4347,[772]3.4354,[773]3.4356,[774]3.4361,[775]3.4380,[776]3.4390,[777]3.4410,[778]3.4409,[779]3.4423,
save_imatrix: stored collected data after 780 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[780]3.4438,[781]3.4454,[782]3.4471,[783]3.4492,[784]3.4495,[785]3.4502,[786]3.4508,[787]3.4524,[788]3.4528,[789]3.4550,
save_imatrix: stored collected data after 790 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[790]3.4564,[791]3.4576,[792]3.4577,[793]3.4588,[794]3.4610,[795]3.4623,[796]3.4624,[797]3.4635,[798]3.4646,[799]3.4681,
save_imatrix: stored collected data after 800 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[800]3.4686,[801]3.4682,[802]3.4700,[803]3.4718,[804]3.4725,[805]3.4733,[806]3.4738,[807]3.4747,[808]3.4753,[809]3.4761,
save_imatrix: stored collected data after 810 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[810]3.4778,[811]3.4796,[812]3.4808,[813]3.4815,[814]3.4820,
save_imatrix: stored collected data after 814 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
Final estimate: PPL = 3.4820 +/- 0.01710
llama_print_timings: load time = 252304.21 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 7778409.55 ms / 416768 tokens ( 18.66 ms per token, 53.58 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 8047122.96 ms / 416769 tokens
======================== sorted layer importances
0: Layer 0, <cos_sim> = 0.427685
1: Layer 2, <cos_sim> = 0.754953
2: Layer 1, <cos_sim> = 0.761973
3: Layer 3, <cos_sim> = 0.859269
4: Layer 91, <cos_sim> = 0.884954
5: Layer 4, <cos_sim> = 0.903777
6: Layer 32, <cos_sim> = 0.909284
7: Layer 6, <cos_sim> = 0.912099
8: Layer 39, <cos_sim> = 0.916231
9: Layer 37, <cos_sim> = 0.916806
10: Layer 23, <cos_sim> = 0.917485
11: Layer 31, <cos_sim> = 0.917567
12: Layer 41, <cos_sim> = 0.919794
13: Layer 40, <cos_sim> = 0.921433
14: Layer 33, <cos_sim> = 0.921658
15: Layer 29, <cos_sim> = 0.921735
16: Layer 30, <cos_sim> = 0.921975
17: Layer 22, <cos_sim> = 0.923691
18: Layer 14, <cos_sim> = 0.923877
19: Layer 28, <cos_sim> = 0.923969
20: Layer 38, <cos_sim> = 0.923987
21: Layer 24, <cos_sim> = 0.924143
22: Layer 34, <cos_sim> = 0.925237
23: Layer 26, <cos_sim> = 0.925394
24: Layer 7, <cos_sim> = 0.925584
25: Layer 36, <cos_sim> = 0.927248
26: Layer 13, <cos_sim> = 0.9273
27: Layer 25, <cos_sim> = 0.928089
28: Layer 21, <cos_sim> = 0.92832
29: Layer 85, <cos_sim> = 0.928453
30: Layer 84, <cos_sim> = 0.929674
31: Layer 27, <cos_sim> = 0.930059
32: Layer 35, <cos_sim> = 0.930223
33: Layer 10, <cos_sim> = 0.930323
34: Layer 8, <cos_sim> = 0.931374
35: Layer 9, <cos_sim> = 0.931623
36: Layer 11, <cos_sim> = 0.932746
37: Layer 92, <cos_sim> = 0.93573
38: Layer 42, <cos_sim> = 0.93659
39: Layer 12, <cos_sim> = 0.938836
40: Layer 5, <cos_sim> = 0.941925
41: Layer 43, <cos_sim> = 0.944567
42: Layer 86, <cos_sim> = 0.944959
43: Layer 15, <cos_sim> = 0.947795
44: Layer 20, <cos_sim> = 0.94999
45: Layer 18, <cos_sim> = 0.952245
46: Layer 83, <cos_sim> = 0.953255
47: Layer 19, <cos_sim> = 0.953365
48: Layer 44, <cos_sim> = 0.953818
49: Layer 45, <cos_sim> = 0.953992
50: Layer 17, <cos_sim> = 0.95693
51: Layer 16, <cos_sim> = 0.957523
52: Layer 46, <cos_sim> = 0.958581
53: Layer 87, <cos_sim> = 0.958638
54: Layer 80, <cos_sim> = 0.958765
55: Layer 90, <cos_sim> = 0.959166
56: Layer 81, <cos_sim> = 0.959806
57: Layer 82, <cos_sim> = 0.961588
58: Layer 47, <cos_sim> = 0.961593
59: Layer 89, <cos_sim> = 0.961599
60: Layer 88, <cos_sim> = 0.962237
61: Layer 79, <cos_sim> = 0.9637
62: Layer 48, <cos_sim> = 0.964249
63: Layer 50, <cos_sim> = 0.965041
64: Layer 49, <cos_sim> = 0.965927
65: Layer 51, <cos_sim> = 0.966417
66: Layer 54, <cos_sim> = 0.968401
67: Layer 52, <cos_sim> = 0.968414
68: Layer 76, <cos_sim> = 0.970006
69: Layer 53, <cos_sim> = 0.970636
70: Layer 55, <cos_sim> = 0.971746
71: Layer 78, <cos_sim> = 0.971977
72: Layer 75, <cos_sim> = 0.972703
73: Layer 77, <cos_sim> = 0.976101
74: Layer 58, <cos_sim> = 0.978261
75: Layer 56, <cos_sim> = 0.978325
76: Layer 57, <cos_sim> = 0.979194
77: Layer 59, <cos_sim> = 0.979934
78: Layer 73, <cos_sim> = 0.980678
79: Layer 67, <cos_sim> = 0.981173
80: Layer 66, <cos_sim> = 0.981491
81: Layer 72, <cos_sim> = 0.981982
82: Layer 68, <cos_sim> = 0.982146
83: Layer 65, <cos_sim> = 0.982195
84: Layer 61, <cos_sim> = 0.982205
85: Layer 74, <cos_sim> = 0.982215
86: Layer 60, <cos_sim> = 0.982433
87: Layer 71, <cos_sim> = 0.982878
88: Layer 63, <cos_sim> = 0.983266
89: Layer 70, <cos_sim> = 0.98376
90: Layer 64, <cos_sim> = 0.984047
91: Layer 69, <cos_sim> = 0.984223
92: Layer 62, <cos_sim> = 0.984883
======================== sorted attention importances
0: Layer 0, <cos_sim> = 0.344938
1: Layer 1, <cos_sim> = 0.591517
2: Layer 2, <cos_sim> = 0.654333
3: Layer 3, <cos_sim> = 0.823345
4: Layer 7, <cos_sim> = 0.838734
5: Layer 13, <cos_sim> = 0.848823
6: Layer 4, <cos_sim> = 0.853067
7: Layer 6, <cos_sim> = 0.85565
8: Layer 9, <cos_sim> = 0.859293
9: Layer 8, <cos_sim> = 0.865704
10: Layer 15, <cos_sim> = 0.874006
11: Layer 12, <cos_sim> = 0.876242
12: Layer 5, <cos_sim> = 0.879342
13: Layer 10, <cos_sim> = 0.882239
14: Layer 11, <cos_sim> = 0.882394
15: Layer 17, <cos_sim> = 0.892329
16: Layer 16, <cos_sim> = 0.893135
17: Layer 21, <cos_sim> = 0.897359
18: Layer 14, <cos_sim> = 0.900934
19: Layer 19, <cos_sim> = 0.90725
20: Layer 20, <cos_sim> = 0.913042
21: Layer 18, <cos_sim> = 0.915475
22: Layer 23, <cos_sim> = 0.921054
23: Layer 22, <cos_sim> = 0.928428
24: Layer 24, <cos_sim> = 0.932289
25: Layer 25, <cos_sim> = 0.937948
26: Layer 32, <cos_sim> = 0.941675
27: Layer 28, <cos_sim> = 0.942764
28: Layer 27, <cos_sim> = 0.943415
29: Layer 26, <cos_sim> = 0.944893
30: Layer 33, <cos_sim> = 0.94565
31: Layer 31, <cos_sim> = 0.947089
32: Layer 30, <cos_sim> = 0.947855
33: Layer 37, <cos_sim> = 0.948046
34: Layer 38, <cos_sim> = 0.950758
35: Layer 39, <cos_sim> = 0.951663
36: Layer 35, <cos_sim> = 0.952977
37: Layer 41, <cos_sim> = 0.953607
38: Layer 29, <cos_sim> = 0.954196
39: Layer 40, <cos_sim> = 0.954582
40: Layer 34, <cos_sim> = 0.954729
41: Layer 42, <cos_sim> = 0.954918
42: Layer 36, <cos_sim> = 0.959325
43: Layer 85, <cos_sim> = 0.962455
44: Layer 43, <cos_sim> = 0.964714
45: Layer 44, <cos_sim> = 0.965964
46: Layer 45, <cos_sim> = 0.967585
47: Layer 46, <cos_sim> = 0.970301
48: Layer 86, <cos_sim> = 0.970694
49: Layer 84, <cos_sim> = 0.971816
50: Layer 51, <cos_sim> = 0.974334
51: Layer 52, <cos_sim> = 0.974995
52: Layer 83, <cos_sim> = 0.975456
53: Layer 92, <cos_sim> = 0.977125
54: Layer 50, <cos_sim> = 0.978028
55: Layer 47, <cos_sim> = 0.978144
56: Layer 81, <cos_sim> = 0.978666
57: Layer 48, <cos_sim> = 0.978749
58: Layer 49, <cos_sim> = 0.978877
59: Layer 82, <cos_sim> = 0.978997
60: Layer 53, <cos_sim> = 0.979392
61: Layer 80, <cos_sim> = 0.979488
62: Layer 91, <cos_sim> = 0.980023
63: Layer 87, <cos_sim> = 0.981837
64: Layer 58, <cos_sim> = 0.981972
65: Layer 54, <cos_sim> = 0.982405
66: Layer 79, <cos_sim> = 0.98302
67: Layer 88, <cos_sim> = 0.983417
68: Layer 78, <cos_sim> = 0.984292
69: Layer 89, <cos_sim> = 0.986245
70: Layer 90, <cos_sim> = 0.986536
71: Layer 61, <cos_sim> = 0.986631
72: Layer 68, <cos_sim> = 0.986747
73: Layer 59, <cos_sim> = 0.986949
74: Layer 71, <cos_sim> = 0.987292
75: Layer 73, <cos_sim> = 0.987331
76: Layer 55, <cos_sim> = 0.987493
77: Layer 72, <cos_sim> = 0.987677
78: Layer 76, <cos_sim> = 0.98837
79: Layer 57, <cos_sim> = 0.988642
80: Layer 56, <cos_sim> = 0.988706
81: Layer 77, <cos_sim> = 0.988904
82: Layer 67, <cos_sim> = 0.989027
83: Layer 65, <cos_sim> = 0.989138
84: Layer 70, <cos_sim> = 0.989192
85: Layer 63, <cos_sim> = 0.989244
86: Layer 74, <cos_sim> = 0.989437
87: Layer 64, <cos_sim> = 0.989463
88: Layer 66, <cos_sim> = 0.989895
89: Layer 60, <cos_sim> = 0.989994
90: Layer 69, <cos_sim> = 0.991278
91: Layer 75, <cos_sim> = 0.991565
92: Layer 62, <cos_sim> = 0.991979
======================== sorted ffn importances
0: Layer 0, <cos_sim> = 0.624763
1: Layer 1, <cos_sim> = 0.625883
2: Layer 2, <cos_sim> = 0.758364
3: Layer 6, <cos_sim> = 0.823684
4: Layer 3, <cos_sim> = 0.82519
5: Layer 14, <cos_sim> = 0.855503
6: Layer 8, <cos_sim> = 0.857968
7: Layer 11, <cos_sim> = 0.858688
8: Layer 4, <cos_sim> = 0.860767
9: Layer 12, <cos_sim> = 0.861034
10: Layer 7, <cos_sim> = 0.866609
11: Layer 5, <cos_sim> = 0.867507
12: Layer 16, <cos_sim> = 0.880483
13: Layer 9, <cos_sim> = 0.884367
14: Layer 15, <cos_sim> = 0.884556
15: Layer 10, <cos_sim> = 0.888465
16: Layer 20, <cos_sim> = 0.889985
17: Layer 91, <cos_sim> = 0.895348
18: Layer 18, <cos_sim> = 0.897993
19: Layer 19, <cos_sim> = 0.900205
20: Layer 13, <cos_sim> = 0.900349
21: Layer 24, <cos_sim> = 0.916468
22: Layer 17, <cos_sim> = 0.916497
23: Layer 22, <cos_sim> = 0.916572
24: Layer 25, <cos_sim> = 0.919954
25: Layer 26, <cos_sim> = 0.920522
26: Layer 23, <cos_sim> = 0.921078
27: Layer 27, <cos_sim> = 0.925478
28: Layer 29, <cos_sim> = 0.930093
29: Layer 21, <cos_sim> = 0.930119
30: Layer 32, <cos_sim> = 0.933381
31: Layer 31, <cos_sim> = 0.934207
32: Layer 28, <cos_sim> = 0.935718
33: Layer 30, <cos_sim> = 0.936237
34: Layer 34, <cos_sim> = 0.938
35: Layer 37, <cos_sim> = 0.940027
36: Layer 36, <cos_sim> = 0.941134
37: Layer 33, <cos_sim> = 0.943274
38: Layer 39, <cos_sim> = 0.943999
39: Layer 40, <cos_sim> = 0.94547
40: Layer 35, <cos_sim> = 0.945489
41: Layer 41, <cos_sim> = 0.94562
42: Layer 38, <cos_sim> = 0.948466
43: Layer 43, <cos_sim> = 0.952106
44: Layer 42, <cos_sim> = 0.953932
45: Layer 44, <cos_sim> = 0.954758
46: Layer 45, <cos_sim> = 0.955495
47: Layer 84, <cos_sim> = 0.957791
48: Layer 46, <cos_sim> = 0.962356
49: Layer 47, <cos_sim> = 0.963516
50: Layer 85, <cos_sim> = 0.964269
51: Layer 50, <cos_sim> = 0.964335
52: Layer 90, <cos_sim> = 0.964703
53: Layer 48, <cos_sim> = 0.96498
54: Layer 51, <cos_sim> = 0.965182
55: Layer 49, <cos_sim> = 0.965669
56: Layer 52, <cos_sim> = 0.968839
57: Layer 86, <cos_sim> = 0.968873
58: Layer 89, <cos_sim> = 0.968923
59: Layer 79, <cos_sim> = 0.971219
60: Layer 83, <cos_sim> = 0.971772
61: Layer 80, <cos_sim> = 0.972194
62: Layer 87, <cos_sim> = 0.972445
63: Layer 53, <cos_sim> = 0.972528
64: Layer 81, <cos_sim> = 0.972955
65: Layer 78, <cos_sim> = 0.973467
66: Layer 57, <cos_sim> = 0.974059
67: Layer 77, <cos_sim> = 0.974158
68: Layer 82, <cos_sim> = 0.974329
69: Layer 55, <cos_sim> = 0.974836
70: Layer 88, <cos_sim> = 0.974857
71: Layer 92, <cos_sim> = 0.975135
72: Layer 54, <cos_sim> = 0.975202
73: Layer 75, <cos_sim> = 0.975318
74: Layer 76, <cos_sim> = 0.975601
75: Layer 56, <cos_sim> = 0.976377
76: Layer 58, <cos_sim> = 0.976683
77: Layer 60, <cos_sim> = 0.977063
78: Layer 73, <cos_sim> = 0.977821
79: Layer 67, <cos_sim> = 0.97792
80: Layer 72, <cos_sim> = 0.978052
81: Layer 70, <cos_sim> = 0.978202
82: Layer 59, <cos_sim> = 0.978237
83: Layer 65, <cos_sim> = 0.978362
84: Layer 71, <cos_sim> = 0.978628
85: Layer 69, <cos_sim> = 0.978685
86: Layer 66, <cos_sim> = 0.978833
87: Layer 63, <cos_sim> = 0.979037
88: Layer 64, <cos_sim> = 0.979078
89: Layer 62, <cos_sim> = 0.979626
90: Layer 74, <cos_sim> = 0.979673
91: Layer 68, <cos_sim> = 0.979996
92: Layer 61, <cos_sim> = 0.980596
```
Thank you so much @ubergarm!
I apologize if this is off topic, but would it be possible to test a GLM-4.5-Air 4-bit GGUF quant as well? It is the much friendlier option for resource-limited individuals (such as myself). Thank you all for your amazing and quick work!
Having an issue converting the original bf16 safetensors using this PR to get my bf16 GGUF. I applied this change first:

```diff
@@ -3951,8 +3951,8 @@ class Dots1Model(Qwen2MoeModel):
return [(self.map_tensor_name(name), data_torch)]
return super().modify_tensors(data_torch, name, bid)
-@ModelBase.register("Glm4MoeForCausalLM")
-class Glm4MoeModel(TextModel):
+@Model.register("Glm4MoeForCausalLM")
+class Glm4MoeModel(Model):
model_arch = gguf.MODEL_ARCH.GLM4_MOE
def __init__(self, *args, **kwargs):
@@ -3960,7 +3960,7 @@ class Glm4MoeModel(TextModel):
# GLM4_MOE has num_hidden_layers + 1 actual layers (including NextN layer)
self.block_count = self.hparams["num_hidden_layers"] + 1
        self.tensor_map = gguf.get_tensor_name_map(self.model_arch, self.block_count)
```

But then it bombs out with this:

```
python \
convert_hf_to_gguf.py \
--outtype bf16 \
--split-max-size 50G \
--outfile /mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/ \
/mnt/raid/models/zai-org/GLM-4.5-Air/
INFO:hf-to-gguf:gguf: experts used count = 8
INFO:hf-to-gguf:gguf: file type = 24
INFO:hf-to-gguf:Set model tokenizer
WARNING:hf-to-gguf:
WARNING:hf-to-gguf:**************************************************************************************
WARNING:hf-to-gguf:** WARNING: The BPE pre-tokenizer was not recognized!
WARNING:hf-to-gguf:** There are 2 possible reasons for this:
WARNING:hf-to-gguf:** - the model has not been added to convert_hf_to_gguf_update.py yet
WARNING:hf-to-gguf:** - the pre-tokenization config has changed upstream
WARNING:hf-to-gguf:** Check your model files and convert_hf_to_gguf_update.py and update them accordingly.
WARNING:hf-to-gguf:** ref: https://github.com/ggerganov/llama.cpp/pull/6920
WARNING:hf-to-gguf:**
WARNING:hf-to-gguf:** chkhsh: a1336059768a55c99a734006ffb02203cd450fed003e9a71886c88acf24fdbc2
WARNING:hf-to-gguf:**************************************************************************************
WARNING:hf-to-gguf:
Traceback (most recent call last):
File "/home/w/projects/ik_llama.cpp/convert_hf_to_gguf.py", line 4578, in <module>
main()
File "/home/w/projects/ik_llama.cpp/convert_hf_to_gguf.py", line 4572, in main
model_instance.write()
File "/home/w/projects/ik_llama.cpp/convert_hf_to_gguf.py", line 431, in write
self.prepare_metadata(vocab_only=False)
File "/home/w/projects/ik_llama.cpp/convert_hf_to_gguf.py", line 417, in prepare_metadata
self.set_vocab()
File "/home/w/projects/ik_llama.cpp/convert_hf_to_gguf.py", line 3971, in set_vocab
tokens, toktypes, tokpre = self.get_vocab_base()
^^^^^^^^^^^^^^^^^^^^^
File "/home/w/projects/ik_llama.cpp/convert_hf_to_gguf.py", line 512, in get_vocab_base
tokpre = self.get_vocab_base_pre(tokenizer)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/w/projects/ik_llama.cpp/convert_hf_to_gguf.py", line 662, in get_vocab_base_pre
raise NotImplementedError("BPE pre-tokenizer was not recognized - update get_vocab_base_pre()")
NotImplementedError: BPE pre-tokenizer was not recognized - update get_vocab_base_pre()
```

Probably just a few name changes to go still? My other option is to use mainline lcpp's convert Python script and then try to pick back up here. Thoughts? Oh, I'll check for a missing commit here:
I recommend using the llama.cpp script for this, which is what I did - so I'm not too sure whether I ported all the right changes. There are no changes to the _update.py version, and I don't know what that version of the script is used for.
Got it, yes, I am now using the mainline script. For this PR we will likely want to either:
Just some thoughts. I'll keep going; the other good news is that this seems to be working now: GLM-4.5-IQ4_KSS.gguf, 178.396 GiB (4.276 BPW).
Sorry for spamming you, hah.. I think this patch is enough to get Air to convert here with this script. Interestingly it needed a checksum from an older GLM model, hah... You should be able to save it as a patch file:

```diff
diff --git a/convert_hf_to_gguf.py b/convert_hf_to_gguf.py
index c2748822..b1ca9e1c 100644
--- a/convert_hf_to_gguf.py
+++ b/convert_hf_to_gguf.py
@@ -618,6 +618,9 @@ class Model:
if chkhsh == "b6e8e1518dc4305be2fe39c313ed643381c4da5db34a98f6a04c093f8afbe99b":
# ref: https://huggingface.co/THUDM/glm-4-9b-chat
res = "chatglm-bpe"
+ if chkhsh == "a1336059768a55c99a734006ffb02203cd450fed003e9a71886c88acf24fdbc2":
+ # ref: https://huggingface.co/THUDM/glm-4-9b-hf
+ res = "glm4"
if chkhsh == "9ca2dd618e8afaf09731a7cf6e2105b373ba6a1821559f258b272fe83e6eb902":
# ref: https://huggingface.co/zai-org/GLM-4.5-Air, https://huggingface.co/zai-org/GLM-4.5
res = "gpt-2"
@@ -3951,8 +3954,8 @@ class Dots1Model(Qwen2MoeModel):
return [(self.map_tensor_name(name), data_torch)]
return super().modify_tensors(data_torch, name, bid)
-@ModelBase.register("Glm4MoeForCausalLM")
-class Glm4MoeModel(TextModel):
+@Model.register("Glm4MoeForCausalLM")
+class Glm4MoeModel(Model):
model_arch = gguf.MODEL_ARCH.GLM4_MOE
    def __init__(self, *args, **kwargs):
```

Keep in mind I believe some things have changed in that draft PR, and there could be a few other things. I'm gonna keep going with the Air convert from mainline for now, as it is about 70% done.
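If you want to try that route, a minimal way to apply the diff above (the filename is just an example):

```bash
# Save the diff above to a file, then check and apply it from the repo root
git apply --check glm45-convert.patch
git apply glm45-convert.patch
```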
not sure if that "res" needs to be changed from "gpt-2" to "glm4"
I used mainline convert on Air and quantized a Q8_0 but got this trying to start up ik_llama.cpp llama-server:
Unfortunately I'm out of time to play with this today. I may try using this PR's convert script and look closer at the tensors when I have some more time.
src/llama.cpp (outdated):
```cpp
model.output = create_tensor(ctx_output, tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab}, llama_model_loader::TENSOR_DUPLICATED);
}

// --- NextN / MTP tensors (preserved but unused), on the final layer ---
```
Is MTP on the roadmap for ik_llama?
Okay, having some luck with GLM-4.5-Air now. Cooking the imatrix off the bf16 and testing the Q8_0 on llama-server - so far so good!

```
compute_imatrix: tokenizing the input ..
compute_imatrix: tokenization took 899.288 ms
compute_imatrix: computing over 814 chunks with batch_size 512
compute_imatrix: 7.09 seconds per pass - ETA 1 hours 36.25 minutes
[1]18.4611,[2]7.6556,[3]4.9280,[4]3.5268,[5]2.8022,[6]2.3688,[7]2.1118,[8]1.9437,[9]1.9381,
save_imatrix: entry ' blk.46.ffn_down_exps.weight' has partial data (96.88%) 4 out of 128 experts are missing data Storing **but be aware**
save_imatrix: entry ' blk.46.ffn_up_exps.weight' has partial data (96.88%) 4 out of 128 experts are missing data Storing **but be aware**
save_imatrix: entry ' blk.46.ffn_gate_exps.weight' has partial data (96.88%) 4 out of 128 experts are missing data Storing **but be aware**
save_imatrix: stored collected data after 10 chunks in /mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/imatrix-GLM-4.5-Air-BF16.dat
[10]1.8529,[11]1.9772,[12]2.1353,[13]2.2048,[14]2.2779,
```

I was thinking about what @TheLegendOfKitty said and I was like "psure MTP is not on the roadmap", so I just removed those MTP tensors and it seems to be happy now. The other clue was that on startup, before the next patch, it was complaining with:

So anyway, I used this PR plus this patch on top:

```diff
diff --git a/convert_hf_to_gguf.py b/convert_hf_to_gguf.py
index c2748822..5da48e24 100644
--- a/convert_hf_to_gguf.py
+++ b/convert_hf_to_gguf.py
@@ -618,6 +618,9 @@ class Model:
if chkhsh == "b6e8e1518dc4305be2fe39c313ed643381c4da5db34a98f6a04c093f8afbe99b":
# ref: https://huggingface.co/THUDM/glm-4-9b-chat
res = "chatglm-bpe"
+ if chkhsh == "a1336059768a55c99a734006ffb02203cd450fed003e9a71886c88acf24fdbc2":
+ # ref: https://huggingface.co/THUDM/glm-4-9b-hf
+ res = "glm4"
if chkhsh == "9ca2dd618e8afaf09731a7cf6e2105b373ba6a1821559f258b272fe83e6eb902":
# ref: https://huggingface.co/zai-org/GLM-4.5-Air, https://huggingface.co/zai-org/GLM-4.5
res = "gpt-2"
@@ -3951,8 +3954,8 @@ class Dots1Model(Qwen2MoeModel):
return [(self.map_tensor_name(name), data_torch)]
return super().modify_tensors(data_torch, name, bid)
-@ModelBase.register("Glm4MoeForCausalLM")
-class Glm4MoeModel(TextModel):
+@Model.register("Glm4MoeForCausalLM")
+class Glm4MoeModel(Model):
model_arch = gguf.MODEL_ARCH.GLM4_MOE
def __init__(self, *args, **kwargs):
@@ -4093,7 +4096,7 @@ class Glm4MoeModel(TextModel):
# Handle expert gating input (routing gate)
if ".mlp.gate.e_score_correction_bias" in name:
new_name = name.replace("model.layers.", "blk.").replace(
- ".mlp.gate.e_score_correction_bias", ".ffn_gate_inp.bias"
+ ".mlp.gate.e_score_correction_bias", ".ffn_gate_inp.bias" # *NOTE* this is ".exp_probs_b" in mainline PR
)
return [(new_name, data_torch)]
elif ".mlp.gate.weight" in name:
@@ -4132,6 +4135,7 @@ class Glm4MoeModel(TextModel):
return [(self.map_tensor_name(new_name), data_torch)]
# Handle special NextN tensors - preserve for future MTP support
+ # What are these MTP tensors? If we preserve them do they need to be "noop" or whatever?
if (
".embed_tokens." in name
or ".shared_head." in name
@@ -4140,7 +4144,9 @@ class Glm4MoeModel(TextModel):
or ".hnorm." in name
):
new_name = name.replace("model.layers.", "blk.").replace("model.", "").replace(".weight", "")
- return [(new_name, data_torch)]
+ logger.debug(f"Skipping MTP tensor: {new_name}")
+ return []
+ #return [(new_name, data_torch)]
# GLM tensor mapping - handle directly without map_tensor_name
if ".input_layernorm." in name: If this imatrix cooks I'll upload that at least and see about a ~4BPW quant or not depending on if the tensors are nice sizes etc. |
@ubergarm - Don't we want to preserve the MTP tensors for future MTP support? I see you've set the convert script to skip them.
@ubergarm - More info about MTP can be found here: ggml-org/llama.cpp#13236. Also discussed here: ggml-org/llama.cpp#14939 (comment)
Oh wow, I definitely need to check this out! Thanks for the tip. I'll provide feedback after I do. @usrlocalben - Kimi is a beast! Sadly quite slow and very big, but it's on my list of options too!
Which IQ2?
Added the Intel AutoRound Q2 above as well. @Thireus K2 is much faster than R1: fewer active params and no-think. Much worse VRAM rent, though.
I can confirm llama.cpp is more than 2x faster than ik_llama.cpp on the Blackwell card. :| I might switch back to llama.cpp then. 😂 See details: https://huggingface.co/ubergarm/GLM-4.5-Air-GGUF/discussions/2#689916c972f9ea17ff768e4b
Thanks for clarifying and adding the extra data point; that looks like an impressive result for the AutoRound compared to the ubergarm IQ2_KL.
I'm not @Thireus, but I haven't tried K2, even though it sounds tempting for that reason: my R1/V3 quants leave me no spare room at ~4.5 BPW, so either I go to a lower BPW or use RPC, and neither sounded compelling enough so far.
Very interesting... GLM-4.5-Air was able to solve the Dipiloblop prompt with mainline llama.cpp using the following recipe:
Model answer:
I'm positive. Same model, different results: ik_llama.cpp is not able to solve it, while llama.cpp solves it. Something is up with our implementation.
Unless you are using a zero-temperature setting, to make a definitive statement you need to run the task several times and count how many times it was solved. But given the performance results, I wouldn't be surprised if the implementation in ik_llama.cpp has an issue.
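A quick way to run such a repeated test; this is only a sketch, with the model path, prompt file, and success string as placeholders, and it assumes the llama-cli binary:

```bash
# Run the same task 10 times with different seeds and count successes;
# "correct answer" stands in for whatever marks a solved run.
pass=0
for seed in $(seq 1 10); do
  out=$(./build/bin/llama-cli -m /path/to/model.gguf -f prompt.txt \
        --seed "$seed" --temp 0.6 -n 2048 2>/dev/null)
  echo "$out" | grep -q "correct answer" && pass=$((pass+1))
done
echo "solved $pass/10"
```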
@ikawrakow, you're right - we need a more scientific testing methodology. I have decided to set the temperature and related settings the same between ik_llama.cpp and llama.cpp and do 10 rounds for each. Here are the results:
So, it looks like ik_llama.cpp is somehow unable to pull the correct XP for level 101 most of the time. The same model recipe was used for both:
One thing to note: for long context, -b 4096 and -ub 4096 break the model. Same with mainline. Maybe you should use -b 2048 and -ub 2048.
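For example (a hedged server invocation; only the batch flags matter here, the rest is illustrative):

```bash
# Suggested workaround: cap the logical and physical batch sizes at 2048
./build/bin/llama-server -m /path/to/GLM-4.5-Air.gguf -fa \
  -b 2048 -ub 2048 -c 32768
```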
Ah! I'll give it a go, thanks. Interestingly, I have not noticed any issue with llama.cpp. Edit: Nope, that's not it - same issue.
Why is -p being used here?
Lack of sleep. I ran everything again and updated my results in this post: #668 (comment). Same outcome as before; I believe this parameter didn't have any effect.
I thought it set a global system_prompt which would prefix the context, which could affect results.
You were right to point this out - I think that is indeed what it does, and it probably didn't make the previous results any better. The updated results are without this -p option. It would be good if someone could replicate this test, not necessarily with my parameters but with their own, to validate whether something is up with our implementation. Could be RoPE, could be something else... could also be extreme good luck I had with 10 good seeds on llama.cpp... but I doubt it.
So, if we count this as correct, then ... But if there is a real difference, the fact that ...
Another update from mainline: @ajunca narrowed down some of the large-context issues to a CUDA graph issue. It could be interesting to run your test with a version compiled with GGML_CUDA_USE_GRAPHS=OFF.
CUDA graphs are turned off for MoE models in ik_llama.cpp. Oh, and the huge PPL observed for GPT-OSS is there with or without CUDA graphs - here and in mainline.
No, in the end it was not a problem of CUDA graphs... After more testing, the problem was still there. I think, though, that it is a problem of CUDA device synchronization, and that there are two related problems:

If you go to ggml-org/llama.cpp#15112 you will see the things I tried. I will probably open a new issue with all the details.
@ajunca The problems you are describing are about mainline llama.cpp, correct?
Apologies if I brought in a different issue. I can report that setting -b 4096 and -ub 4096 affects ik_llama the same way @ajunca has described for mainline. I thought it could be good to reference their investigation in case they are related; maybe they aren't.
Yes indeed, they are related to llama.cpp. I also tried ik_llama.cpp at some point; I remember I had problems, but I don't remember which ones. Does ik_llama.cpp have a Vulkan backend? In llama.cpp, the Vulkan backend is working (but much slower).
The much slower part is no longer true.
I can confirm the GGGGGGG problem occurred for me as well when using the IQ3_K quant for some reason - https://huggingface.co/Thireus/Qwen3-Coder-480B-A35B-Instruct-THIREUS-IQ3_K-SPECIAL_SPLIT/discussions/1#6898de3d70c4fed919a7e7e9 - I thought it had something to do with the quantization of the model for that qtype being wrong... but it seems that may not be it. I've avoided using this quant and haven't been able to reproduce the issue with other quants since.
I mean that running with the Vulkan backend does not produce the mentioned problem, so it looks like a CUDA problem. Well, I just tried today: I get about 60% of the speed for generation (not bad at all), but only about 15% for prompt processing, so yeah, quite a bit slower in that regard.
I tried many different quants, even MXFP4, and at some point, with enough testing, all of them seem to produce it.
15% of what? 15% of mainline with CUDA?
Yeah, sorry, I was comparing Vulkan vs CUDA in llama.cpp. I'm saying that with Vulkan it seems to work correctly; the only problem is that it is considerably slower than the CUDA backend (the percentages compare tokens/s against CUDA).
Port from llama.cpp - See ggml-org/llama.cpp#14939
Ported from https://github.com/sammcj/llama.cpp/tree/dcbbd2cb057a6c6e907e0195395a74201ef19e1b
Old conversation - #662
Still some cleanup work to do, such as the GGML_ASSERT which needs to be restored. Contributors are welcome.
Want to test it out?
Instructions to convert HF to GGUF:
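A minimal sketch, assuming this PR's convert_hf_to_gguf.py and placeholder paths (mirroring the invocation used earlier in this thread):

```bash
# Sketch: convert the HF safetensors to a bf16 GGUF with this PR's script
python convert_hf_to_gguf.py \
  --outtype bf16 \
  --split-max-size 50G \
  --outfile /path/to/output/GLM-4.5-GGUF/ \
  /path/to/zai-org/GLM-4.5/
```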
→ Alternatively you can download GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT - must be used with ulimit -n 9999.
Basic instructions to quantize from BF16 to q4_K quant:
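A hedged sketch using llama-quantize, with placeholder paths; the imatrix argument is optional:

```bash
# Sketch: quantize the bf16 GGUF down to q4_K
./build/bin/llama-quantize \
  --imatrix /path/to/imatrix-GLM-4.5-BF16.dat \
  /path/to/GLM-4.5-BF16.gguf \
  /path/to/GLM-4.5-Q4_K.gguf \
  q4_K
```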
→ Alternatively you can download GLM-4.5-THIREUS-Q4_K-SPECIAL_SPLIT - must be used with ulimit -n 9999 and ik_llama built with -DGGML_MAX_CONTEXTS=2048.
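For that build flag, something along these lines (the CUDA flag is just an example backend choice):

```bash
# Sketch: build ik_llama.cpp with a larger context-registry limit
cmake -B build -DGGML_CUDA=ON -DGGML_MAX_CONTEXTS=2048
cmake --build build --config Release -j
```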
IMPORTANT: This is not how these SPECIAL_SPLIT quants are supposed to be used, as they are meant to be used in a recipe (read more at https://gguf.thireus.com), so expect quite bad perplexity. They are still usable this way for testing purposes, however.
Basic instructions to launch llama-server:
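A minimal hedged example; the flags mirror the runs earlier in this thread, and the paths and sizes are placeholders to adjust to your hardware:

```bash
# Sketch: serve the quantized model with flash attention enabled
ulimit -n 9999
./build/bin/llama-server \
  -m /path/to/GLM-4.5-Q4_K-00001-of-XXXXX.gguf \
  -fa \
  -c 32768 \
  --threads 32
```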
Quant mix recipe for the adventurous:
Windows builds available at: https://github.com/Thireus/ik_llama.cpp/releases