
Conversation

@Acly Acly (Collaborator) commented Sep 5, 2025

This PR makes ggml_gallocr distribute allocations across multiple backend buffers, depending on the maximum allocation size reported by the backend. This allows e.g. the Vulkan backend to process graphs that require more than 4 GB of memory.

I tried to avoid risk and minimize changes/complexity:

  • No API changes
  • No change in existing behavior (buffer layout / tensor offsets stay exactly the same as on master)

Implementation:

  • ggml_gallocr: almost no changes here; it continues to operate with contiguous offsets in [0, SIZE_MAX). Instead of using ggml_backend_buffer directly, it now uses vbuffer.
  • vbuffer: a small local abstraction which distributes a virtual memory range across one or more backend buffers ("chunks"); see the sketch after this list.
  • ggml_dyn_tallocr: now aware of the backend's maximum buffer size, to ensure tensors are not allocated across multiple chunks. This is done by setting the size of the last free_block to the maximum buffer size and introducing a new block at the end of the range when additional memory is required.
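
Roughly, the idea behind the chunked virtual buffer looks like this (simplified sketch only; the names and the equal-sized-chunk layout are assumptions for the example, not the exact code in this PR):

```c
// Sketch only: a contiguous virtual range is backed by several backend
// buffers ("chunks"), each no larger than the backend's maximum allocation
// size. Field and function names are illustrative.
#include <stddef.h>
#include "ggml-backend.h"

#define VBUFFER_MAX_CHUNKS 16

struct vbuffer {
    ggml_backend_buffer_t chunks[VBUFFER_MAX_CHUNKS]; // one backend buffer per chunk
    size_t max_chunk_size;                            // max allocation size reported by the backend
    int    n_chunks;
};

// Map a virtual offset to a (chunk, local offset) pair, assuming equally
// sized chunks; the actual implementation may differ in how chunks are laid out.
static void vbuffer_map(const struct vbuffer * buf, size_t offset,
                        int * chunk, size_t * local_offset) {
    *chunk        = (int)(offset / buf->max_chunk_size);
    *local_offset = offset % buf->max_chunk_size;
}
```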

Vulkan: modified to report the actual maximum allocation size. This will change how weights are allocated. I'm not sure how important it is to keep the previous behavior there; happy to discuss alternatives.
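
For reference, the hard limit in question is the device's maxMemoryAllocationSize. A hedged sketch of querying it (not necessarily how the backend's actual code does it) could look like this:

```c
// Sketch: query the true maximum single-allocation size of a Vulkan device.
// This is the value the buffer type would report as its max size, instead of
// the smaller "suballocation" size previously used for batching weights.
#include <vulkan/vulkan.h>

static VkDeviceSize query_max_allocation_size(VkPhysicalDevice device) {
    VkPhysicalDeviceMaintenance3Properties maint3 = {
        .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MAINTENANCE_3_PROPERTIES,
    };
    VkPhysicalDeviceProperties2 props2 = {
        .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2,
        .pNext = &maint3,
    };
    vkGetPhysicalDeviceProperties2(device, &props2);
    return maint3.maxMemoryAllocationSize; // often around 4 GB, depending on the driver
}
```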

* if the graph requires more memory than can fit into a single allocation, split it into multiple backend buffers
* vulkan: report the actual max allocation size in buffer type interface
@Acly Acly requested review from ggerganov and slaren September 5, 2025 09:42
@Acly Acly requested a review from 0cc4m as a code owner September 5, 2025 09:42
@0cc4m 0cc4m (Collaborator) commented Sep 5, 2025

I don't understand the change yet; what you describe is how it was already working, at least as I understood it. The graph allocator merges as many tensors into one allocation as possible, as long as it stays below the backend's max allocation size.

We use the suballocation size in the Vulkan backend to reduce allocation sizes for performance reasons, where possible. If a single tensor requires more than the actual max allocation size, it will currently try to allocate that anyway, and usually the driver will respond with an exception. I don't think you are addressing this issue (and I don't think that's really possible from the GGML side).
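
Roughly, the existing batching for weights looks like this (simplified sketch with made-up helpers, not the actual ggml code):

```c
// Simplified sketch: tensors are accumulated into a pending batch until the
// next one would push it past max_size, then one backend buffer is allocated
// for the batch. A single tensor larger than max_size still gets its own
// allocation attempt, which the driver may reject.
#include <stddef.h>
#include <stdio.h>

static void plan_weight_buffers(const size_t * tensor_sizes, int n, size_t max_size) {
    size_t cur = 0;
    for (int i = 0; i < n; ++i) {
        if (cur > 0 && cur + tensor_sizes[i] > max_size) {
            printf("allocate backend buffer of %zu bytes\n", cur); // stand-in for the real allocation
            cur = 0;
        }
        cur += tensor_sizes[i];
    }
    if (cur > 0) {
        printf("allocate backend buffer of %zu bytes\n", cur);
    }
}
```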

@github-actions github-actions bot added labels: testing (Everything test related), Vulkan (Issues specific to the Vulkan backend), ggml (changes relating to the ggml tensor library for machine learning) on Sep 5, 2025
@Acly Acly (Collaborator, Author) commented Sep 5, 2025

@0cc4m It was working like that for allocating weights. But the allocator for the compute buffers always ignored the backend max size: it batched all tensors into one large buffer and tried to allocate it, which fails for Vulkan if it is >4 GB. See e.g. #11332.

I think it's uncommon to hit that limitation with LLMs: they have huge weights but relatively small computation. For images (and video) it becomes an issue as soon as you increase the resolution a little.

A single tensor beyond the maximum allocation size is still not possible; no change there.

The reason "suballocation size" gets in the way here is that all allocations to be done are mapped out first, before trying to do the actual backend allocation. The algorithm needs to know the actual maximum here, not a "soft" maximum. I'd also argue that in this case you don't want to artificially reduce batching, as it will increase total memory required due to increased fragmentation (harder to reuse memory of previous computations).

I'm sure we can find a way to reintroduce the soft max for weight allocation though; I just wasn't sure why exactly it was there and how big of a difference it makes.

@0cc4m 0cc4m (Collaborator) commented Sep 5, 2025

I understand, thank you for the explanation. But we do need to keep that suballocation limit recommendation in some way, IMO.

@Acly Acly (Collaborator, Author) commented Sep 5, 2025

> But we do need to keep that suballocation limit recommendation in some way, IMO.

Okay, I read some of #11520, #12434, and related issues... in summary, smaller buffers help with host-visible memory and driver issues. I see two options:

  1. Add a separate backend function to return a recommended max batch size for buffer allocations and use that for weight allocation in ggml_backend_alloc_ctx_tensors_from_buft.
  2. Track sizes of individual buffers in ggml_dyn_tallocr. That would enable it to work with a smaller max size and use similar logic to weight allocation (batch tensors up to max size, but still support larger allocations if a single tensor requires it)

Option 2 is nicer, I guess, since it can also avoid those allocation problems for the compute buffers. It increases the complexity of ggml_dyn_tallocr a bit. Also, the maximum number of buffers is currently 8; we'd probably need to raise that if they're only ~1 GB each (or make it dynamic). A rough sketch of what option 2 could look like follows.
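
As a rough illustration of option 2 (all names hypothetical, not a concrete proposal):

```c
// Rough sketch for option 2: the dynamic allocator tracks a size per chunk,
// so chunks are normally capped at the soft maximum, while a single oversized
// tensor can still get its own larger chunk up to the hard device limit.
#include <stddef.h>

#define MAX_CHUNKS 64   // would need to be larger than the current limit of 8

struct dyn_tallocr_chunks {
    size_t chunk_size[MAX_CHUNKS]; // planned size of each backend buffer
    int    n_chunks;
    size_t soft_max;               // recommended suballocation size (e.g. ~1 GB)
    size_t hard_max;               // true device limit (e.g. ~4 GB on Vulkan)
};

// Decide how large a new chunk should be for a tensor of the given size.
static size_t new_chunk_size(const struct dyn_tallocr_chunks * a, size_t tensor_size) {
    if (tensor_size > a->soft_max) {
        // oversized tensor: dedicate a chunk to it, up to the hard limit
        return tensor_size <= a->hard_max ? tensor_size : a->hard_max;
    }
    return a->soft_max;
}
```

That would keep the batching logic for weights and compute buffers similar while still respecting the driver's recommendation for typical allocations.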

I'll wait a bit to see if there are more opinions before implementing something.
