You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Introduction of CUDA Graphs to LLama.cpp (ggml-org#6766)
* DRAFT: Introduction of CUDA Graphs to LLama.cpp
* FIx issues raised in comments
* Tidied to now only use CUDA runtime (not mixed with driver calls)
* disable for multi-gpu and batch size > 1
* Disable CUDA graphs for old GPU arch and with env var
* added missing CUDA_CHECKs
* Addressed comments
* further addressed comments
* limit to GGML_ALLOW_CUDA_GRAPHS defined in llama.cpp cmake
* Added more comprehensive graph node checking
* With mechanism to fall back if graph capture fails
* Revert "With mechanism to fall back if graph capture fails"
This reverts commit eb9f15f.
* Fall back if graph capture fails and address other comments
* - renamed GGML_ALLOW_CUDA_GRAPHS to GGML_CUDA_USE_GRAPHS
- rename env variable to disable CUDA graphs to GGML_CUDA_DISABLE_GRAPHS
- updated Makefile build to enable CUDA graphs
- removed graph capture failure checking in ggml_cuda_error
using a global variable to track this is not thread safe, but I am also not safistied with checking an error by string
if this is necessary to workaround some issues with graph capture with eg. cuBLAS, we can pass the ggml_backend_cuda_context to the error checking macro and store the result in the context
- fixed several resource leaks
- fixed issue with zero node graphs
- changed fixed size arrays to vectors
- removed the count of number of evaluations before start capturing, and instead changed the capture mode to relaxed
- removed the check for multiple devices so that it is still possible to use a single device, instead checks for split buffers to disable cuda graphs with -sm row
- changed the op for checking batch size to GGML_OP_ADD, should be more reliable than GGML_OP_SOFT_MAX
- code style fixes
- things to look into
- VRAM usage of the cudaGraphExec_t, if it is significant we may need to make it optional
- possibility of using cudaStreamBeginCaptureToGraph to keep track of which ggml graph nodes correspond to which cuda graph nodes
* fix build without cuda graphs
* remove outdated comment
* replace minimum cc value with a constant
---------
Co-authored-by: slaren <[email protected]>
0 commit comments