Avoid rebuild of GGML graph for each token #98

agray3 · 2024-10-19T19:18:49Z

Introduces caching of the GGML graph to avoid an unnecessary full rebuild for each token. KV cache parameters, which change with each token, are updated directly in the cached GGML graph. Caching can be disabled with the GGML_DISABLE_GRAPH_CACHING environment variable.
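The idea can be illustrated with a minimal, self-contained sketch. This is not the code from this PR and does not use the real ggml API: `Graph`, `GraphCache`, `count_builds`, and the node count are hypothetical stand-ins, and treating any set value of GGML_DISABLE_GRAPH_CACHING as "disabled" is an assumption. It only shows the pattern of building the graph once and then patching the per-token KV cache parameters in place.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical stand-in for a built compute graph; NOT the real ggml_cgraph.
struct Graph {
    int n_nodes;  // topology, fixed once the graph is built
    int kv_head;  // KV cache write position, changes with every token
    int n_kv;     // number of KV cells attended to, changes with every token
};

// The variable name comes from the PR description. Assumption: caching is
// disabled whenever the variable is set, regardless of its value.
inline bool graph_caching_enabled() {
    return std::getenv("GGML_DISABLE_GRAPH_CACHING") == nullptr;
}

struct GraphCache {
    Graph cached;
    bool  has_cached;
    int   n_builds;  // how many full rebuilds happened

    GraphCache() : cached{0, 0, 0}, has_cached(false), n_builds(0) {}

    // Expensive path: stands in for a full graph rebuild.
    Graph build(int kv_head, int n_kv) {
        ++n_builds;
        return Graph{1024 /* hypothetical node count */, kv_head, n_kv};
    }

    // Returns a graph that is valid for the current token.
    const Graph & get(int kv_head, int n_kv, bool use_cache) {
        assert(kv_head >= 0 && n_kv > 0);
        if (use_cache && has_cached) {
            // Cheap path: keep the topology, patch only the KV parameters.
            cached.kv_head = kv_head;
            cached.n_kv    = n_kv;
        } else {
            cached     = build(kv_head, n_kv);
            has_cached = true;
        }
        return cached;
    }
};

// Simulates generating n_tokens one at a time; returns the number of rebuilds.
inline int count_builds(int n_tokens, bool use_cache) {
    GraphCache cache;
    for (int t = 0; t < n_tokens; ++t) {
        cache.get(/*kv_head=*/t, /*n_kv=*/t + 1, use_cache);
    }
    return cache.n_builds;
}
```

In this sketch, `count_builds(n, graph_caching_enabled())` performs a single build when caching is on and `n` builds when it is off. A real implementation would also have to decide when the cached graph is no longer valid (for example when the batch shape changes), which is not modelled here.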

I have read the contributing guidelines
Self-reported review complexity:
- Low
- Medium
- High

agray3 · 2024-10-19T19:19:21Z

See #94

This reverts commit f2d315b. As far as I can tell, the commit breaks Metal TG.

* Adapting iq2_bn to work without separate scale tensors

  Why? It is becoming burdensome to maintain the special Bitnet conversion in convert_hf_to_gguf.py, so I think it is better to make iq1_bn and iq2_bn just work with the mainline conversion script (which does not generate scales).

* Adapting iq1_bn to work without separate scale tensors

* Adapting iq2_bn: CUDA dequantize

* Adapting iq2_bn: CUDA works

* Adapting iq1_bn: CUDA works

* Adapting iq1_bn, iq2_bn: NEON

* Adapting iq1_bn, iq2_bn: Metal

  Dequantize works, but there is still something wrong with the dot products.

* WIP

  Absolutely don't see what is wrong with the iq1_bn and iq2_bn vector dot product kernels.

* Remove iq1_tn and iq2_tn - Part 1

  Now that iq1_bn and iq2_bn have per row scales, there is no reason to also have iq1_tn and iq2_tn.

* Remove iq1_tn and iq2_tn - Part 2

* Bitnet: use the standard llm_build_kv to build self attention

  My main motivation was to enable FA. But FA does not work anyway because head size is 100 for the Bitnet ternary models (and I had forgotten this little detail).

* Revert "Avoid rebuild of GGML graph for each token (#98)"

  This reverts commit f2d315b. As far as I can tell, the commit breaks Metal TG.

---------

Co-authored-by: Iwan Kawrakow <[email protected]>

agray3 mentioned this pull request Oct 19, 2024

Adding @agray3's graph caching approach #94

Closed

ikawrakow approved these changes Oct 20, 2024

ikawrakow merged commit f2d315b into ikawrakow:main Oct 20, 2024

ikawrakow pushed a commit that referenced this pull request Oct 25, 2024

Revert "Avoid rebuild of GGML graph for each token (#98)"

af4255d

This reverts commit f2d315b. As far as I can tell, the commit breaks Metal TG.

ikawrakow mentioned this pull request Oct 25, 2024

Bitnet changes #106

Merged
