
Conversation

noemotiovon (Collaborator)

What does this PR do?

Implement an LRU cache for ACL graphs in the CANN backend.

  • Introduce ggml_cann_graph_lru_cache to store multiple ggml_cann_graph objects (a minimal sketch of the structure follows the list below).
  • Graphs are loaded on demand and evicted with an LRU policy when capacity is exceeded.
  • Update the push, move_to_front, and clear methods to manage cached graphs efficiently.
  • Ensure graphs are reused, reducing graph reconstruction overhead in the CANN backend.
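
For context, the shape of such a cache is a capacity-bounded list kept in most-recently-used order. The code below is only an illustrative sketch of that idea, not the code merged in this PR; the ggml_cann_graph placeholder stands in for the backend's real graph type.

#include <list>
#include <memory>
#include <utility>

struct ggml_cann_graph { /* placeholder: holds the captured ACL graph and its inputs */ };

struct graph_lru_cache_sketch {
    size_t capacity;
    std::list<std::shared_ptr<ggml_cann_graph>> cache_list;   // front = most recently used

    explicit graph_lru_cache_sketch(size_t cap) : capacity(cap) {}

    // Insert a graph as most recently used; evict the least recently used
    // graph (the back of the list) when capacity is exceeded.
    void push(std::shared_ptr<ggml_cann_graph> node) {
        cache_list.push_front(std::move(node));
        if (cache_list.size() > capacity) {
            cache_list.pop_back();   // the shared_ptr frees the evicted graph
        }
    }

    // Mark an existing graph as most recently used.
    void move_to_front(std::shared_ptr<ggml_cann_graph> node) {
        cache_list.remove(node);     // O(n): walks the list (see the review discussion below)
        cache_list.push_front(std::move(node));
    }

    void clear() { cache_list.clear(); }
};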

@github-actions bot added the ggml and Ascend NPU labels on Sep 5, 2025
@github-actions bot added the documentation label on Sep 8, 2025
@hipudding (Collaborator) left a comment

Thanks for this awesome feature. It does improve the performance.

 * @param node Shared pointer to the ggml_cann_graph to move.
 */
void move_to_front(std::shared_ptr<ggml_cann_graph> node) {
    cache_list.remove(node);
hipudding (Collaborator):

Deleting an element from the list walks through all of its elements. It would be better to use a priority queue.

noemotiovon (Collaborator, Author):

The current implementation has a time complexity of O(n), but even if I switch to a priority queue, it would still require a full traversal. I plan to add a map member variable to reduce the time complexity to O(1).
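
For reference, the usual way to get O(1) move-to-front is to pair the list with an unordered_map from a graph key to that entry's list iterator and relink with std::list::splice. The sketch below only illustrates that idea and is not the exact code in this PR; the uint64_t key is a hypothetical stand-in for however graphs end up being identified.

#include <cstdint>
#include <list>
#include <memory>
#include <unordered_map>
#include <utility>

struct ggml_cann_graph { /* placeholder for the captured ACL graph state */ };

struct graph_lru_cache_o1_sketch {
    using entry_t = std::pair<uint64_t, std::shared_ptr<ggml_cann_graph>>;

    size_t capacity;
    std::list<entry_t> cache_list;                                     // front = most recently used
    std::unordered_map<uint64_t, std::list<entry_t>::iterator> index;  // key -> position in cache_list

    explicit graph_lru_cache_o1_sketch(size_t cap) : capacity(cap) {}

    // O(1): splice relinks the existing node to the front without traversal or copies.
    void move_to_front(uint64_t key) {
        auto it = index.find(key);
        if (it != index.end()) {
            cache_list.splice(cache_list.begin(), cache_list, it->second);
        }
    }

    // Insert a new graph as most recently used and evict from the back when full.
    void push(uint64_t key, std::shared_ptr<ggml_cann_graph> graph) {
        cache_list.emplace_front(key, std::move(graph));
        index[key] = cache_list.begin();
        if (cache_list.size() > capacity) {
            index.erase(cache_list.back().first);
            cache_list.pop_back();
        }
    }
};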

noemotiovon (Collaborator, Author) commented on Sep 9, 2025

Test 1: Compiled with ACL graph

cmake .. -DCMAKE_BUILD_TYPE=release -DGGML_CANN=on -DUSE_ACL_GRAPH=on && make -j32

With ACL graph = on

# Script:
./bin/llama-parallel -m /home/lichenguang25/.ollama/models/blobs/sha256-6f96e01a3f550ca08aea1e5725bb8d5a7eccc6f281c30417e9d380b8c46467bd -np 8 -ns 128 --top-k 1 -pps --junk 10 -c 16384 -ngl 99
# Log
Total prompt tokens:  17075, speed: 488.01 t/s
Total gen tokens:     13278, speed: 379.49 t/s
Total speed (AVG):           speed: 867.50 t/s
Cache misses:             0

llama_perf_context_print:        load time =    1684.96 ms
llama_perf_context_print: prompt eval time =   12888.43 ms / 30601 tokens (    0.42 ms per token,  2374.30 tokens per second)
llama_perf_context_print:        eval time =     131.64 ms /    25 runs   (    5.27 ms per token,   189.91 tokens per second)
llama_perf_context_print:       total time =   34992.69 ms / 30626 tokens
llama_perf_context_print:    graphs reused =       1440

With ACL graph = off

# Script:
GGML_CANN_ACL_GRAPH=off ./bin/llama-parallel -m /home/lichenguang25/.ollama/models/blobs/sha256-6f96e01a3f550ca08aea1e5725bb8d5a7eccc6f281c30417e9d380b8c46467bd -np 8 -ns 128 --top-k 1 -pps --junk 10 -c 16384 -ngl 99
# Log
Total prompt tokens:  17075, speed: 378.83 t/s
Total gen tokens:     13278, speed: 294.59 t/s
Total speed (AVG):           speed: 673.41 t/s
Cache misses:             0

llama_perf_context_print:        load time =    7599.01 ms
llama_perf_context_print: prompt eval time =   27366.52 ms / 30601 tokens (    0.89 ms per token,  1118.19 tokens per second)
llama_perf_context_print:        eval time =     352.01 ms /    25 runs   (   14.08 ms per token,    71.02 tokens per second)
llama_perf_context_print:       total time =   45077.53 ms / 30626 tokens
llama_perf_context_print:    graphs reused =       1440

noemotiovon (Collaborator, Author)

Test 2: Compiled without ACL graph

cmake .. -DCMAKE_BUILD_TYPE=release -DGGML_CANN=on -DUSE_ACL_GRAPH=off && make -j32
# Script:
./bin/llama-parallel -m /home/lichenguang25/.ollama/models/blobs/sha256-6f96e01a3f550ca08aea1e5725bb8d5a7eccc6f281c30417e9d380b8c46467bd -np 8 -ns 128 --top-k 1 -pps --junk 10 -c 16384 -ngl 99

# Log
Total prompt tokens:  17075, speed: 364.97 t/s
Total gen tokens:     13278, speed: 283.81 t/s
Total speed (AVG):           speed: 648.79 t/s
Cache misses:             0

llama_perf_context_print:        load time =    7621.12 ms
llama_perf_context_print: prompt eval time =   28171.94 ms / 30601 tokens (    0.92 ms per token,  1086.22 tokens per second)
llama_perf_context_print:        eval time =     333.99 ms /    25 runs   (   13.36 ms per token,    74.85 tokens per second)
llama_perf_context_print:       total time =   46788.79 ms / 30626 tokens
llama_perf_context_print:    graphs reused =       1440

@hipudding (Collaborator) left a comment

Just a little more needs to be modified; it's very close to perfect.

noemotiovon (Collaborator, Author) commented on Sep 10, 2025

Test 3: Number of graph captures

GGML_CANN_GRAPH_CACHE_CAPACITY=1 falls back to the old single-graph behavior.

# Script:
GGML_CANN_GRAPH_CACHE_CAPACITY=1 ./bin/llama-parallel -m /home/lichenguang25/.ollama/models/blobs/sha256-6f96e01a3f550ca08aea1e5725bb8d5a7eccc6f281c30417e9d380b8c46467bd -np 8 -ns 128 --top-k 1 -pps --junk 10 -c 16384 -ngl 99

# Log
[DEBUG] acl graph capture times = 701

main: n_parallel = 8, n_sequences = 128, cont_batching = 1, system tokens = 273
External prompt file: used built-in defaults
Model and path used:  /home/lichenguang25/.ollama/models/blobs/sha256-6f96e01a3f550ca08aea1e5725bb8d5a7eccc6f281c30417e9d380b8c46467bd

Total prompt tokens:  17075, speed: 402.39 t/s
Total gen tokens:     13278, speed: 312.91 t/s
Total speed (AVG):           speed: 715.31 t/s
Cache misses:             0

llama_perf_context_print:        load time =    1666.10 ms
llama_perf_context_print: prompt eval time =   20167.03 ms / 30601 tokens (    0.66 ms per token,  1517.38 tokens per second)
llama_perf_context_print:        eval time =     129.78 ms /    25 runs   (    5.19 ms per token,   192.63 tokens per second)
llama_perf_context_print:       total time =   42437.01 ms / 30626 tokens
llama_perf_context_print:    graphs reused =       1440

GGML_CANN_GRAPH_CACHE_CAPACITY=32 (the default is 12) uses the new LRU cache; make sure the configured value is greater than the parallel size.

# Script:
GGML_CANN_GRAPH_CACHE_CAPACITY=32 ./bin/llama-parallel -m /home/lichenguang25/.ollama/models/blobs/sha256-6f96e01a3f550ca08aea1e5725bb8d5a7eccc6f281c30417e9d380b8c46467bd -np 8 -ns 128 --top-k 1 -pps --junk 10 -c 16384 -ngl 99

# Log
[DEBUG] acl graph capture times = 208

main: n_parallel = 8, n_sequences = 128, cont_batching = 1, system tokens = 273
External prompt file: used built-in defaults
Model and path used:  /home/lichenguang25/.ollama/models/blobs/sha256-6f96e01a3f550ca08aea1e5725bb8d5a7eccc6f281c30417e9d380b8c46467bd

Total prompt tokens:  17075, speed: 521.33 t/s
Total gen tokens:     13278, speed: 405.40 t/s
Total speed (AVG):           speed: 926.74 t/s
Cache misses:             0

llama_perf_context_print:        load time =    1604.67 ms
llama_perf_context_print: prompt eval time =   12819.83 ms / 30601 tokens (    0.42 ms per token,  2387.01 tokens per second)
llama_perf_context_print:        eval time =     110.46 ms /    25 runs   (    4.42 ms per token,   226.33 tokens per second)
llama_perf_context_print:       total time =   32756.24 ms / 30626 tokens
llama_perf_context_print:    graphs reused =       1440
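
As the runs above show, the cache capacity is taken from the GGML_CANN_GRAPH_CACHE_CAPACITY environment variable, with 12 graphs as the default. A minimal sketch of how such a value can be read at cache construction follows; the helper name and parsing details are illustrative, not the PR's actual code.

#include <cstdlib>
#include <string>

// Illustrative only: read the LRU capacity from the environment, falling back
// to the default of 12 graphs when the variable is unset or not a positive number.
static size_t graph_cache_capacity_from_env() {
    const size_t default_capacity = 12;
    const char * val = std::getenv("GGML_CANN_GRAPH_CACHE_CAPACITY");
    if (val == nullptr) {
        return default_capacity;
    }
    try {
        const long parsed = std::stol(val);
        return parsed > 0 ? static_cast<size_t>(parsed) : default_capacity;
    } catch (...) {
        return default_capacity;
    }
}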

@hipudding merged commit 28b5f19 into ggml-org:master on Sep 10, 2025
49 checks passed
njsyw1997 pushed a commit to aizip/llama.cpp that referenced this pull request Sep 10, 2025
* CANN: implement LRU cache for ACL graphs in CANN backend

- Introduce ggml_cann_graph_lru_cache to store multiple ggml_cann_graph objects.
- Graphs are loaded on demand and evicted using LRU policy when capacity is exceeded.
- Updated push, move_to_front, and clear methods to manage cached graphs efficiently.
- Ensures reuse of graphs, reducing graph reconstruction overhead in CANN backend.

* fix typo

* The LRU cache capacity can be configured via an env variable

Signed-off-by: noemotiovon <[email protected]>

* refactor acl graph

* refactor && fix review comments

Signed-off-by: noemotiovon <[email protected]>

---------

Signed-off-by: noemotiovon <[email protected]>