
Conversation

noemotiovon (Collaborator)

What does this PR do?

Implement an LRU cache for ACL graphs in the CANN backend.

  • Introduce ggml_cann_graph_lru_cache to store multiple ggml_cann_graph objects (a minimal sketch of the structure follows the list below).
  • Graphs are loaded on demand and evicted with an LRU policy when capacity is exceeded.
  • Update the push, move_to_front, and clear methods to manage cached graphs efficiently.
  • Ensure graphs are reused, reducing graph reconstruction overhead in the CANN backend.
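
For context, the shape of such a cache is a capacity-bounded list kept in most-recently-used order. The code below is only an illustrative sketch of that idea, not the code merged in this PR; the ggml_cann_graph placeholder stands in for the backend's real graph type.

#include <list>
#include <memory>
#include <utility>

struct ggml_cann_graph { /* placeholder: holds the captured ACL graph and its inputs */ };

struct graph_lru_cache_sketch {
    size_t capacity;
    std::list<std::shared_ptr<ggml_cann_graph>> cache_list;   // front = most recently used

    explicit graph_lru_cache_sketch(size_t cap) : capacity(cap) {}

    // Insert a graph as most recently used; evict the least recently used
    // graph (the back of the list) when capacity is exceeded.
    void push(std::shared_ptr<ggml_cann_graph> node) {
        cache_list.push_front(std::move(node));
        if (cache_list.size() > capacity) {
            cache_list.pop_back();   // the shared_ptr frees the evicted graph
        }
    }

    // Mark an existing graph as most recently used.
    void move_to_front(std::shared_ptr<ggml_cann_graph> node) {
        cache_list.remove(node);     // O(n): walks the list (see the review discussion below)
        cache_list.push_front(std::move(node));
    }

    void clear() { cache_list.clear(); }
};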

@github-actions bot added the ggml and Ascend NPU labels on Sep 5, 2025
@github-actions bot added the documentation label on Sep 8, 2025
@hipudding (Collaborator) left a comment

Thanks for this awesome feature. It does improve the performance.

 * @param node Shared pointer to the ggml_cann_graph to move.
 */
void move_to_front(std::shared_ptr<ggml_cann_graph> node) {
    cache_list.remove(node);
hipudding (Collaborator):

Deleting an element from the list walks through all of its elements. It would be better to use a priority queue.

noemotiovon (Collaborator, Author):

The current implementation has a time complexity of O(n), but even if I switch to a priority queue, it would still require a full traversal. I plan to add a map member variable to reduce the time complexity to O(1).
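
For reference, the usual way to get O(1) move-to-front is to pair the list with an unordered_map from a graph key to that entry's list iterator and relink with std::list::splice. The sketch below only illustrates that idea and is not the exact code in this PR; the uint64_t key is a hypothetical stand-in for however graphs end up being identified.

#include <cstdint>
#include <list>
#include <memory>
#include <unordered_map>
#include <utility>

struct ggml_cann_graph { /* placeholder for the captured ACL graph state */ };

struct graph_lru_cache_o1_sketch {
    using entry_t = std::pair<uint64_t, std::shared_ptr<ggml_cann_graph>>;

    size_t capacity;
    std::list<entry_t> cache_list;                                     // front = most recently used
    std::unordered_map<uint64_t, std::list<entry_t>::iterator> index;  // key -> position in cache_list

    explicit graph_lru_cache_o1_sketch(size_t cap) : capacity(cap) {}

    // O(1): splice relinks the existing node to the front without traversal or copies.
    void move_to_front(uint64_t key) {
        auto it = index.find(key);
        if (it != index.end()) {
            cache_list.splice(cache_list.begin(), cache_list, it->second);
        }
    }

    // Insert a new graph as most recently used and evict from the back when full.
    void push(uint64_t key, std::shared_ptr<ggml_cann_graph> graph) {
        cache_list.emplace_front(key, std::move(graph));
        index[key] = cache_list.begin();
        if (cache_list.size() > capacity) {
            index.erase(cache_list.back().first);
            cache_list.pop_back();
        }
    }
};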

noemotiovon (Collaborator, Author) commented on Sep 9, 2025

Test 1: Compiled with ACL graph

cmake .. -DCMAKE_BUILD_TYPE=release -DGGML_CANN=on -DUSE_ACL_GRAPH=on && make -j32

With ACL graph = on

# Script:
./bin/llama-parallel -m /home/lichenguang25/.ollama/models/blobs/sha256-6f96e01a3f550ca08aea1e5725bb8d5a7eccc6f281c30417e9d380b8c46467bd -np 8 -ns 128 --top-k 1 -pps --junk 10 -c 16384 -ngl 99
# Log
Total prompt tokens:  17075, speed: 488.01 t/s
Total gen tokens:     13278, speed: 379.49 t/s
Total speed (AVG):           speed: 867.50 t/s
Cache misses:             0

llama_perf_context_print:        load time =    1684.96 ms
llama_perf_context_print: prompt eval time =   12888.43 ms / 30601 tokens (    0.42 ms per token,  2374.30 tokens per second)
llama_perf_context_print:        eval time =     131.64 ms /    25 runs   (    5.27 ms per token,   189.91 tokens per second)
llama_perf_context_print:       total time =   34992.69 ms / 30626 tokens
llama_perf_context_print:    graphs reused =       1440

With ACL graph = off

# Script:
GGML_CANN_ACL_GRAPH=off ./bin/llama-parallel -m /home/lichenguang25/.ollama/models/blobs/sha256-6f96e01a3f550ca08aea1e5725bb8d5a7eccc6f281c30417e9d380b8c46467bd -np 8 -ns 128 --top-k 1 -pps --junk 10 -c 16384 -ngl 99
# Log
Total prompt tokens:  17075, speed: 378.83 t/s
Total gen tokens:     13278, speed: 294.59 t/s
Total speed (AVG):           speed: 673.41 t/s
Cache misses:             0

llama_perf_context_print:        load time =    7599.01 ms
llama_perf_context_print: prompt eval time =   27366.52 ms / 30601 tokens (    0.89 ms per token,  1118.19 tokens per second)
llama_perf_context_print:        eval time =     352.01 ms /    25 runs   (   14.08 ms per token,    71.02 tokens per second)
llama_perf_context_print:       total time =   45077.53 ms / 30626 tokens
llama_perf_context_print:    graphs reused =       1440

noemotiovon (Collaborator, Author)

Test 2: Compiled without ACL graph

cmake .. -DCMAKE_BUILD_TYPE=release -DGGML_CANN=on -DUSE_ACL_GRAPH=off && make -j32
# Script:
./bin/llama-parallel -m /home/lichenguang25/.ollama/models/blobs/sha256-6f96e01a3f550ca08aea1e5725bb8d5a7eccc6f281c30417e9d380b8c46467bd -np 8 -ns 128 --top-k 1 -pps --junk 10 -c 16384 -ngl 99

# Log
Total prompt tokens:  17075, speed: 364.97 t/s
Total gen tokens:     13278, speed: 283.81 t/s
Total speed (AVG):           speed: 648.79 t/s
Cache misses:             0

llama_perf_context_print:        load time =    7621.12 ms
llama_perf_context_print: prompt eval time =   28171.94 ms / 30601 tokens (    0.92 ms per token,  1086.22 tokens per second)
llama_perf_context_print:        eval time =     333.99 ms /    25 runs   (   13.36 ms per token,    74.85 tokens per second)
llama_perf_context_print:       total time =   46788.79 ms / 30626 tokens
llama_perf_context_print:    graphs reused =       1440

@hipudding (Collaborator) left a comment

Just a little more needs to be modified; it's very close to perfect.

noemotiovon (Collaborator, Author) commented on Sep 10, 2025

Test 3: Number of graph captures

GGML_CANN_GRAPH_CACHE_CAPACITY=1 falls back to the old single-graph behavior.

# Script:
GGML_CANN_GRAPH_CACHE_CAPACITY=1 ./bin/llama-parallel -m /home/lichenguang25/.ollama/models/blobs/sha256-6f96e01a3f550ca08aea1e5725bb8d5a7eccc6f281c30417e9d380b8c46467bd -np 8 -ns 128 --top-k 1 -pps --junk 10 -c 16384 -ngl 99

# Log
[DEBUG] acl graph capture times = 701

main: n_parallel = 8, n_sequences = 128, cont_batching = 1, system tokens = 273
External prompt file: used built-in defaults
Model and path used:  /home/lichenguang25/.ollama/models/blobs/sha256-6f96e01a3f550ca08aea1e5725bb8d5a7eccc6f281c30417e9d380b8c46467bd

Total prompt tokens:  17075, speed: 402.39 t/s
Total gen tokens:     13278, speed: 312.91 t/s
Total speed (AVG):           speed: 715.31 t/s
Cache misses:             0

llama_perf_context_print:        load time =    1666.10 ms
llama_perf_context_print: prompt eval time =   20167.03 ms / 30601 tokens (    0.66 ms per token,  1517.38 tokens per second)
llama_perf_context_print:        eval time =     129.78 ms /    25 runs   (    5.19 ms per token,   192.63 tokens per second)
llama_perf_context_print:       total time =   42437.01 ms / 30626 tokens
llama_perf_context_print:    graphs reused =       1440

GGML_CANN_GRAPH_CACHE_CAPACITY=32 (the default is 12) uses the new LRU cache; make sure the configured value is greater than the parallel size.

# Script:
GGML_CANN_GRAPH_CACHE_CAPACITY=32 ./bin/llama-parallel -m /home/lichenguang25/.ollama/models/blobs/sha256-6f96e01a3f550ca08aea1e5725bb8d5a7eccc6f281c30417e9d380b8c46467bd -np 8 -ns 128 --top-k 1 -pps --junk 10 -c 16384 -ngl 99

# Log
[DEBUG] acl graph capture times = 208

main: n_parallel = 8, n_sequences = 128, cont_batching = 1, system tokens = 273
External prompt file: used built-in defaults
Model and path used:  /home/lichenguang25/.ollama/models/blobs/sha256-6f96e01a3f550ca08aea1e5725bb8d5a7eccc6f281c30417e9d380b8c46467bd

Total prompt tokens:  17075, speed: 521.33 t/s
Total gen tokens:     13278, speed: 405.40 t/s
Total speed (AVG):           speed: 926.74 t/s
Cache misses:             0

llama_perf_context_print:        load time =    1604.67 ms
llama_perf_context_print: prompt eval time =   12819.83 ms / 30601 tokens (    0.42 ms per token,  2387.01 tokens per second)
llama_perf_context_print:        eval time =     110.46 ms /    25 runs   (    4.42 ms per token,   226.33 tokens per second)
llama_perf_context_print:       total time =   32756.24 ms / 30626 tokens
llama_perf_context_print:    graphs reused =       1440
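
As the runs above show, the cache capacity is taken from the GGML_CANN_GRAPH_CACHE_CAPACITY environment variable, with 12 graphs as the default. A minimal sketch of how such a value can be read at cache construction follows; the helper name and parsing details are illustrative, not the PR's actual code.

#include <cstdlib>
#include <string>

// Illustrative only: read the LRU capacity from the environment, falling back
// to the default of 12 graphs when the variable is unset or not a positive number.
static size_t graph_cache_capacity_from_env() {
    const size_t default_capacity = 12;
    const char * val = std::getenv("GGML_CANN_GRAPH_CACHE_CAPACITY");
    if (val == nullptr) {
        return default_capacity;
    }
    try {
        const long parsed = std::stol(val);
        return parsed > 0 ? static_cast<size_t>(parsed) : default_capacity;
    } catch (...) {
        return default_capacity;
    }
}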

@hipudding merged commit 28b5f19 into ggml-org:master on Sep 10, 2025
49 checks passed
njsyw1997 pushed a commit to aizip/llama.cpp that referenced this pull request Sep 10, 2025
* CANN: implement LRU cache for ACL graphs in CANN backend

- Introduce ggml_cann_graph_lru_cache to store multiple ggml_cann_graph objects.
- Graphs are loaded on demand and evicted using LRU policy when capacity is exceeded.
- Updated push, move_to_front, and clear methods to manage cached graphs efficiently.
- Ensures reuse of graphs, reducing graph reconstruction overhead in CANN backend.

* fix typo

* The LRU cache capacity can be configured via an env variable

Signed-off-by: noemotiovon <[email protected]>

* refactor acl graph

* refactor && fix review comments

Signed-off-by: noemotiovon <[email protected]>

---------

Signed-off-by: noemotiovon <[email protected]>