metal : make the backend async #15832
Conversation
Making […]
Thanks, that's good to know. I think I got confused by seeing […]
You can see a […]
Got it - thanks. The main motivation is to allow to overlap the […]
Force-pushed from 2b8074d to 84ae836.
@slaren Should I expect the […]
I don't think there are any situations currently in llama.cpp in which […]
Force-pushed from 84ae836 to c4fe8b0.
ggml/src/ggml-metal/ggml-metal.m (outdated)
static bool ggml_backend_metal_buffer_type_is_host(ggml_backend_buffer_type_t buft) {
    return true;

    // TODO: not sure why, but without setting this to `false`, op offloading does not work correctly
    // to reproduce, do the following:
    //
    // build with: cmake -DGGML_BLAS=OFF -DGGML_METAL=ON
    // run: ./bin/llama-cli -m ggml-model-mxfp4.gguf -p "$(printf 'hello %.0s' {1..100})" --n-cpu-moe 10
    //
    return false;
I'm not completely sure I understand why we cannot return `true` here. If set to `true`, it seems to result in a data race at some point.
@slaren Is setting `false` here a good solution, or is there a better approach?
That's probably a bug in the way the scheduler handles this. By returning true, the CPU backend will use the results from the Metal backend without copies, but there is no synchronization preventing a race when the backend is async.
Having a full solution for this will be tricky, I think it will require tracking the source of every tensor used in a split.
One immediate solution would be to stop using `newBufferWithBytesNoCopy` in Metal, create device-only buffers, and accept the cost of copying to the CPU backend. This will also likely improve the performance of Metal with discrete GPUs.
If that's not acceptable and there is a significant cost to these copies, this temporary patch should allow it to work without negatively impacting other cases:
diff --git a/ggml/src/ggml-backend.cpp b/ggml/src/ggml-backend.cpp
index 7646f3f13..e88974121 100644
--- a/ggml/src/ggml-backend.cpp
+++ b/ggml/src/ggml-backend.cpp
@@ -1426,6 +1426,21 @@ static enum ggml_status ggml_backend_sched_compute_splits(ggml_backend_sched_t s
int split_backend_id = split->backend_id;
ggml_backend_t split_backend = sched->backends[split_backend_id];
+ // HACK: if the split backend does not support async, synchronize the previous backend to avoid data races
+ // a full solution would require keeping track of the dependencies between splits, and synchronizing only when needed
+ // normally this is not a problem because backends have incompatible buffer types, and thus cannot share tensors
+ // however, in some cases such as Metal, the CPU backend can use the same buffer type as the GPU backend (i.e. host memory)
+ if (split_id > 0) {
+ ggml_backend_dev_props props;
+ auto * dev = ggml_backend_get_device(split_backend);
+ ggml_backend_dev_get_props(dev, &props);
+ if (!props.caps.async) {
+ int prev_backend_id = splits[split_id - 1].backend_id;
+ ggml_backend_t prev_backend = sched->backends[prev_backend_id];
+ ggml_backend_synchronize(prev_backend);
+ }
+ }
+
// copy the input tensors to the split backend
for (int input_id = 0; input_id < split->n_inputs; input_id++) {
ggml_backend_t input_backend = ggml_backend_sched_get_tensor_backend(sched, split->inputs[input_id]);
In 85aaf52 I removed the `newBufferWithBytesNoCopy` usage and now buffers are created with the private Metal memory type. Setting `is_host == true` now works correctly.
I will be testing the performance now to see if there is any significant impact from this change.
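For reference, a minimal sketch (with illustrative helper names, not the exact change in 85aaf52) contrasting the previous host-mapped allocation with a device-only private buffer:

```objc
#import <Metal/Metal.h>

// hypothetical helper: wrap existing page-aligned host memory so that the CPU
// and GPU share the same storage (the previous approach)
static id<MTLBuffer> example_buffer_from_host_ptr(id<MTLDevice> device, void * ptr, size_t size) {
    return [device newBufferWithBytesNoCopy:ptr
                                     length:size
                                    options:MTLResourceStorageModeShared
                                deallocator:nil];
}

// hypothetical helper: allocate device-only memory with no host pointer; the
// CPU can only reach the contents through explicit blit copies
static id<MTLBuffer> example_buffer_private(id<MTLDevice> device, size_t size) {
    return [device newBufferWithLength:size
                               options:MTLResourceStorageModePrivate];
}
```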
Force-pushed from f369bdb to 8d4835d.
ggml/src/ggml-metal/ggml-metal.m (outdated)
#if 1
    // TODO: tmp hack
    static void * p_base = (void *) 0x000000400ULL;

    ctx->all_data = p_base;

    p_base = (void *) ((uintptr_t) p_base + size_aligned);
#else
    ctx->all_data = ggml_metal_host_malloc(size_aligned);
#endif
What would be the proper way to handle this?
Previously, we used the allocated buffer in host memory as the base address. What should I use now that we no longer allocate host memory?
If Metal does not give a pointer to the buffers that's fine, just use a non-zero aligned pointer as the base address.
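A minimal sketch of that suggestion (the constant is illustrative): since a private buffer exposes no host pointer, any non-zero address that satisfies the buffer type's alignment can serve as the base, used purely for offset bookkeeping and never dereferenced:

```objc
#include <stdint.h>

// hypothetical: a dummy base address for a device-only buffer; it only needs to
// be non-zero and aligned, because tensor "addresses" are computed as base +
// offset and are never dereferenced on the CPU side
static void * example_dummy_base(uintptr_t alignment) {
    return (void *) alignment; // e.g. 0x1000 for a 4 KiB alignment requirement
}
```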
Force-pushed from 8d4835d to 85aaf52.
ggml/src/ggml-metal/ggml-metal.m (outdated)
static struct ggml_backend_buffer_type ggml_backend_buffer_from_ptr_type_metal = {
    /* .iface = */ {
        /* .get_name      = */ ggml_backend_metal_buffer_from_ptr_type_get_name,
        /* .alloc_buffer  = */ ggml_backend_metal_buffer_type_alloc_buffer,
        /* .alloc_buffer  = */ NULL,
        /* .get_alignment = */ ggml_backend_metal_buffer_type_get_alignment,
The `buffer_from_ptr_type` should not need to implement `.alloc_buffer`, correct?
Yes, it should be ok to leave the `alloc_buffer` function unimplemented.
Without this function, one of the server tests fails during `llama-server` loading like this:

bin/llama-server --host 127.0.0.1 --port 8080 --temp 0.0 --seed 42 --hf-repo ggml-org/stories15M_MOE --hf-file stories15M_MOE-F16.gguf --batch-size 1024 --no-slots --alias stories15m-moe --ctx-size 2048 --parallel 1 --n-predict 64 --lora ./tmp/moe_shakespeare15M.gguf
Process 8535 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
frame #0: 0x0000000000000000
error: memory read failed for 0x0
Target 0: (llama-server) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
* frame #0: 0x0000000000000000
frame #1: 0x0000000101ba28fc libggml-base.dylib`ggml_backend_buft_alloc_buffer(buft=0x00000001015f0a70, size=15634432) at ggml-backend.cpp:45:12
frame #2: 0x0000000101ba14b0 libggml-base.dylib`alloc_tensor_range(ctx=0x00006000009cc900, first=0x0000000101ff4020, last=0x0000000000000000, buft=0x00000001015f0a70, size=15634432, buffers=0x000000016fdf81e0, n_buffers=0x000000016fdf81d8) at ggml-alloc.c:937:36
frame #3: 0x0000000101ba13dc libggml-base.dylib`ggml_backend_alloc_ctx_tensors_from_buft(ctx=0x00006000009cc900, buft=0x00000001015f0a70) at ggml-alloc.c:1004:14
frame #4: 0x0000000102186f50 libllama.dylib`llama_adapter_lora_init_impl(model=0x0000000143053600, path_lora="./tmp/moe_shakespeare15M.gguf", adapter=0x00006000036cc140) at llama-adapter.cpp:385:43
frame #5: 0x0000000102185160 libllama.dylib`llama_adapter_lora_init(model=0x0000000143053600, path_lora="./tmp/moe_shakespeare15M.gguf") at llama-adapter.cpp:421:9
frame #6: 0x000000010039f5a0 llama-server`common_init_from_params(params=0x000000016fdfbf48) at common.cpp:985:20
frame #7: 0x0000000100007694 llama-server`server_context::load_model(this=0x000000016fdfbf48, params=0x000000016fdfd600) at server.cpp:2144:22
frame #8: 0x000000010000598c llama-server`main(argc=26, argv=0x000000016fdfef70) at server.cpp:5319:21
frame #9: 0x000000019ac9ab98 dyld`start + 6076
Bringing back the function fixes this.
Um yeah, this is because the LoRA tensors use the same buffer type as the base tensors - the adapter loading code takes the buffer type from the tensor and allocates a new buffer of the same type. I didn't think of this case, so it is better to leave it implemented then.
ggml/src/ggml-metal/ggml-metal.m (outdated)
static struct ggml_backend_metal_device_context {
    id<MTLDevice> mtl_device;
    int           mtl_device_ref_count;
    id<MTLLibrary> mtl_library;

    id<MTLDevice>       mtl_device;
    int                 mtl_device_ref_count;
    id<MTLCommandQueue> mtl_queue;
    id<MTLLibrary>      mtl_library;
At some point during the refactoring, I decided to move the MTLCommandQueue from the backend context to the device context, i.e. a single queue is now shared by all backend contexts.
Not sure if this is better. Maybe I will move it back to the backend context and create separate queues for the synchronous `get_tensor`, `set_tensor` and `memset_tensor` operations.
The intention is that the `ggml_backend` object represents a queue or stream. In practice, it probably doesn't matter too much if it is shared for Apple Mx iGPUs, but it would prevent some use cases such as running a computation in one backend and a memory transfer in another simultaneously.
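A rough sketch of that layout (illustrative names, assuming non-ARC as in ggml-metal.m): each `ggml_backend` instance owns its own command queue, so compute on one backend and a transfer on another could overlap:

```objc
#import <Metal/Metal.h>

// illustrative backend context that owns its own queue (one queue per
// ggml_backend, i.e. one "stream"), instead of sharing a single queue through
// the device context
struct example_backend_context {
    id<MTLDevice>       device;
    id<MTLCommandQueue> queue;
};

static struct example_backend_context example_backend_init(id<MTLDevice> device) {
    struct example_backend_context ctx;
    ctx.device = device;
    ctx.queue  = [device newCommandQueue]; // independent queue for this backend
    return ctx;
}
```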
id<MTLCommandQueue>       queue   = ctx->queue;
id<MTLCommandBuffer>      cmd_buf = [queue commandBuffer];
id<MTLBlitCommandEncoder> encoder = [cmd_buf blitCommandEncoder];

[cmd_buf enqueue];
As far as I can tell from the docs, there is no reason to call `enqueue` unless you plan to commit multiple command buffers and want to "reserve a place in the queue". `commit` automatically calls `enqueue` if necessary:

> If your app calls commit for a command buffer that isn't enqueued, the method effectively calls enqueue for you.
Yes, these `enqueue` calls are not needed - removing.
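For reference, a minimal sketch (illustrative names) of an async blit copy where `commit` alone is enough, since it implicitly enqueues the command buffer:

```objc
#import <Metal/Metal.h>

// copy between two Metal buffers asynchronously; no explicit [cmd_buf enqueue]
// is needed because commit enqueues the command buffer if it isn't already
static void example_async_copy(id<MTLCommandQueue> queue,
                               id<MTLBuffer> src, id<MTLBuffer> dst, size_t size) {
    id<MTLCommandBuffer>      cmd_buf = [queue commandBuffer];
    id<MTLBlitCommandEncoder> encoder = [cmd_buf blitCommandEncoder];

    [encoder copyFromBuffer:src sourceOffset:0 toBuffer:dst destinationOffset:0 size:size];
    [encoder endEncoding];

    [cmd_buf commit]; // the caller synchronizes later, e.g. via waitUntilCompleted
}
```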
// here we guarantee the full previous graph has finished computing
// but note that we have already enqueued the first part of the new graph so it can start processing,
// while we continue to encode the rest of the graph
if (ctx->cmd_buf_last) {
    [ctx->cmd_buf_last waitUntilCompleted];

    ctx->cmd_buf_last = nil;
}
I am not sure why this is necessary, or why there are both a `cmd_buf_last` and a `cmd_buf_ext_last`. It seems that a single reference to the last queued buffer should be enough.
> I am not sure why this is necessary

Technically, instead of waiting here for the previous command buffers to finish, we could create new command buffers and proceed directly with enqueuing the new graph. But the limitation is that for each `cmd_buf` we have an associated `ggml_metal_mem_pool` used for allocating fleeting buffers that are needed for some of the operators. So creating a new command buffer would require creating a new memory pool, which is not desired, since the idea of the pools is to reuse previously allocated memory.

> why there are both a cmd_buf_last and a cmd_buf_ext_last

A single reference would work, but I thought this is slightly more optimal. Consider the following scenario:
# encoding a graph using 2 command buffers:
# first cb is a small cb with the first 128 nodes
# second cb is the rest of the graph
enqueue compute graph 0 (cb0)
enqueue compute graph 0 (cb1) <--- cmd_buf_last points to this CB
wait until scheduled cb1
async set tensor A
async set tensor B
...
async set tensor X <--- cmd_buf_ext_last points to this CB
wait until complete cb0
enqueue compute graph 1 (cb0)
wait until complete cb1 <--- here no need to wait for the async copies
enqueue compute graph 1 (cb1)
The last wait only has to wait for the previous graph to be finished and does not need to wait for any async copies that might be ongoing.
> But the limitation is that for each `cmd_buf` we have an associated `ggml_metal_mem_pool` used for allocating fleeting buffers that are needed for some of the operators.
Wouldn't it be possible to use the same pool for all the command buffers? Since they are executed in sequence, not in parallel, I don't see why each one needs a different one.
On a side note, I realized that it would actually be easy to allocate these buffers in the compute buffer and remove the need for the pools completely. This could be done by allocating the necessary memory at the end of the tensor data - by changing `get_alloc_size` to consider the op of the tensor and add any extra memory necessary to the size of the tensor. Maybe we could add some kind of API to ggml-backend to make this nicer to use, but essentially all the infrastructure necessary to do this is already there.
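A rough sketch of that idea (hypothetical function and sizes, not an actual implementation): the buffer type's `get_alloc_size` callback could reserve per-op scratch space right after the tensor data:

```objc
#include "ggml.h"
#include "ggml-backend-impl.h"

// hypothetical get_alloc_size for a Metal buffer type: reserve extra workspace
// for ops that currently take fleeting buffers from the memory pool
static size_t example_metal_get_alloc_size(ggml_backend_buffer_type_t buft, const struct ggml_tensor * t) {
    size_t size = ggml_nbytes(t);

    switch (t->op) {
        case GGML_OP_MUL_MAT_ID:
            // assumed: this op needs temporary storage roughly proportional to the tensor
            size += ggml_nbytes(t);
            break;
        default:
            break;
    }

    GGML_UNUSED(buft);
    return size;
}
```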
There is logic for freeing unused heaps (MTLHeap) from the pool when they haven't been used for a while:

ggml/src/ggml-metal/ggml-metal.m, lines 664 to 696 at 4f63cd7:
static void ggml_metal_mem_pool_reset(struct ggml_metal_mem_pool * mem_pool) {
    for (NSUInteger i = 0; i < [mem_pool->heaps count]; i++) {
        ggml_metal_heap_ptr * ptr = [mem_pool->heaps objectAtIndex:i];

        struct ggml_metal_heap * heap = ptr.data;
        ggml_metal_heap_reset(heap);

        // if the heap hasn't been used for a while, remove it
        if (heap->n_unused >= 128) {
            [mem_pool->heaps_to_remove addObject:@(i)];
        }
    }

    if (mem_pool->heaps_to_remove.count > 0) {
        // remove in reverse order
        for (NSUInteger i = [mem_pool->heaps_to_remove count] - 1; ; --i) {
            NSUInteger index = [[mem_pool->heaps_to_remove objectAtIndex:i] intValue];
            ggml_metal_heap_ptr * ptr = [mem_pool->heaps objectAtIndex:index];

            struct ggml_metal_heap * heap = ptr.data;
            ggml_metal_heap_free(heap);

            [mem_pool->heaps removeObjectAtIndex:index];
            [ptr release];

            if (i == 0) {
                break;
            }
        }

        [mem_pool->heaps_to_remove removeAllObjects];
    }
}
Not sure what would happen if, while enqueuing the new graph, we decide that a heap should be released while the previous graph is still using a buffer allocated from that heap.
> On a side note, I realized that it would actually be easy to allocate these buffers in the compute buffer
Nice. Would be great to remove the memory pools. Without the pools, I think there would be no reason to wait here.
[encoder endEncoding];

[cmd_buf commit];
[cmd_buf waitUntilScheduled];
It is not obvious to me why `waitUntilScheduled` is necessary here - from what I can tell, this is something that only needs to be done on iOS when the app is about to be moved to the background.
I was worried that exiting the `autoreleasepool` section before the work is fully scheduled might be a problem, but my guess is that its destructor likely has this call in it, so we can probably remove it.
Superseded by #15906.
Implement async support for the Metal backend. This benefits very large contexts, since we can now prepare the KQ mask while the previous batch is being computed on the GPU.
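Conceptually (illustrative names, assuming non-ARC as in ggml-metal.m, not the PR's exact code), the async pattern looks like this: `graph_compute` only encodes and commits command buffers and returns, while `synchronize` blocks on the last committed buffer, so the CPU can prepare the next batch while the GPU is still busy:

```objc
#import <Metal/Metal.h>

// illustrative backend context
struct example_ctx {
    id<MTLCommandQueue>  queue;
    id<MTLCommandBuffer> cmd_buf_last; // last committed command buffer
};

// encode + commit, but do not wait - the call returns while the GPU is working
static void example_graph_compute(struct example_ctx * ctx) {
    id<MTLCommandBuffer> cmd_buf = [ctx->queue commandBuffer];

    // ... encode the graph with compute command encoders here ...

    [cmd_buf commit];

    // keep a strong reference so we can wait on it later (manual retain/release)
    [ctx->cmd_buf_last release];
    ctx->cmd_buf_last = [cmd_buf retain];
}

// explicit synchronization point, called by ggml-backend when results are needed
static void example_synchronize(struct example_ctx * ctx) {
    if (ctx->cmd_buf_last) {
        [ctx->cmd_buf_last waitUntilCompleted];
        [ctx->cmd_buf_last release];
        ctx->cmd_buf_last = nil;
    }
}
```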