CUDA: fix overflow in FA, tune performance #14840
Conversation
ggml/src/ggml-cuda/fattn-vec-f32.cuh
Outdated
const int32_t ne00, const int32_t ne01, const int32_t ne02, const int32_t ne03,
const int32_t nb01, const int32_t nb02, const int32_t nb03,
const int32_t ne10, const int32_t ne11, const int32_t ne12, const int32_t ne13,
const int32_t nb11, const int32_t nb12, const int64_t nb13,
const int32_t nb21, const int32_t nb22, const int64_t nb23,
const int32_t ne31, const int32_t ne32, const int32_t ne33,
const int32_t nb31, const int32_t nb32, const int32_t nb33) {
What is the logic for choosing between `int32_t` and `int64_t` here? For example, why is it `int64_t nb23`, but `int32_t nb33`?
The mask is being broadcast across all attention heads, so it's simply smaller than K/V. I suppose you could also use 64 bit for `nb33`; it should still be fine in terms of register pressure.
More generally, the only offsets that are going to be really large are those that scale with the number of tokens, so the offsets between sequences.
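To put rough numbers on that, here is a back-of-the-envelope sketch in plain C++ (the head size, head count, KV length, and mask row padding below are made-up illustrative values, not the kernel's actual parameters): the per-sequence K stride scales with both tokens and heads and overflows `INT32_MAX` at a few million tokens, while a mask stride that is broadcast across heads stays comfortably inside 32-bit range.

```cpp
// Illustrative sketch only, not kernel code: hypothetical FlashAttention
// shapes showing which per-sequence byte strides can overflow int32_t.
#include <cstdint>
#include <cstdio>

int main() {
    // Hypothetical shapes: 128-dim heads, 8 KV heads, FP16 data,
    // 4M KV tokens per sequence, mask rows padded to 64.
    const int64_t head_size  = 128;
    const int64_t n_head_kv  = 8;
    const int64_t n_kv       = 4*1024*1024;
    const int64_t n_rows_pad = 64;

    // Per-sequence K stride (nb13-like): scales with tokens *and* heads.
    const int64_t k_seq_stride    = head_size*n_head_kv*n_kv*sizeof(uint16_t);
    // Per-sequence mask stride (nb33-like): broadcast across heads, so no head dimension.
    const int64_t mask_seq_stride = n_kv*n_rows_pad*sizeof(uint16_t);

    printf("K    per-sequence stride: %lld bytes (fits in int32: %s)\n",
        (long long) k_seq_stride,    k_seq_stride    <= INT32_MAX ? "yes" : "no");
    printf("mask per-sequence stride: %lld bytes (fits in int32: %s)\n",
        (long long) mask_seq_stride, mask_seq_stride <= INT32_MAX ? "yes" : "no");
    return 0;
}
```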
ec05b08 to d4209ee (compare)
* origin/master:
  docs : update HOWTO-add-model.md for ModelBase and new model classes (ggml-org#14874)
  ggml : remove invalid portPos specifiers from dot files (ggml-org#14838)
  context : restore preemptive sched reset when LLAMA_SET_ROWS=0 (ggml-org#14870)
  mtmd : fix 32-bit narrowing issue in export-lora and mtmd clip (ggml-org#14503)
  rpc : check for null buffers in get/set/copy tensor endpoints (ggml-org#14868)
  sched : fix multiple evaluations of the same graph with pipeline parallelism (ggml-org#14855)
  musa: upgrade musa sdk to rc4.2.0 (ggml-org#14498)
  sync : ggml
  cmake : fix usage issues (ggml/1257)
  ggml-cpu : remove stdlib include from repack.cpp (ggml/1276)
  context : perform output reorder lazily upon access after sync (ggml-org#14853)
  chat : fix kimi-k2 chat template (ggml-org#14852)
  sycl: fixed semantics of block offset calculation (ggml-org#14814)
  llama : fix MiniCPM inference after Granite Four changes (ggml-org#14850)
  docs: add libcurl-dev install hint for Linux distros (ggml-org#14801)
  metal : fix fusion across different encoders (ggml-org#14849)
  sycl: fix undefined variable in work group size check (ggml-org#14843)
  convert : text-only support for GLM-4.1V-9B-Thinking (ggml-org#14823)
  CUDA: fix overflow in FA, tune performance (ggml-org#14840)
  CUDA: fix compilation with GGML_CUDA_F16 (ggml-org#14837)
Hi, not sure if this is expected, but I'm seeing a regression in PP performance with FA enabled on my RX 6800:
vs. previous commit:
I also found another unrelated regression, so keep in mind that it may be influencing the benchmark results at the same time (see #14624).
In my testing the new build is consistently faster:
Huh, interesting. I wonder if the ROCm version could explain the difference; I haven't updated in a while (the install directory says 6.0.0, though I'm not sure if that's the specific release or just 6.x.x). I'll try to update and see if it changes anything. Great job on the other regression, btw!
After upgrading to ROCm 6.4.2, the PP speed with FA is now comparable between both commits. It's still lower than ROCm 6.0.0 + b5972 (~620 t/s vs. ~820), but I suppose it means the regression isn't directly related to this PR, but rather to some version-specific "something" that just happened to be triggered by this PR in the previous version.
Due to numerical overflows, the CUDA FlashAttention code on master does not work correctly for very long contexts (on the order of several million tokens across all sequences). This PR uses 64-bit math for the parts of the code susceptible to such problems: the K/V offsets between sequences and the calculation of K/V offsets within a sequence.

For the vector kernel, simply casting the offsets to 64 bit caused a performance regression on Pascal, so instead I'm adding a 32-bit offset after each iteration (which turns out to be faster for Pascal/AMD anyway). I am not seeing any performance differences for the other kernels, so for those I'm just casting the offsets to 64 bit.

While working on this I noticed that at some point the tile FA kernels seem to have gotten faster than the vector kernels on my RX 6800, so I'm enabling them for AMD.
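To illustrate the incremental-offset idea for the vector kernel, here is a minimal sketch (not the actual fattn-vec code; `K_base`, `nb11`, `k_steps`, and both helper functions are hypothetical names) contrasting the two addressing schemes: recomputing an absolute 64-bit offset every iteration versus folding the large 64-bit offsets into the base pointer once and then advancing by a 32-bit step per iteration.

```cuda
// Hedged sketch, not the real kernel: contrasts the two addressing schemes
// discussed above. K_base is assumed to already point at the right sequence.
#include <cstdint>
#include <cuda_fp16.h>

// Scheme A: recompute an absolute 64-bit byte offset every iteration.
__device__ float sum_abs64(const half * K_base, const int64_t nb11, const int k_steps) {
    float acc = 0.0f;
    for (int k = 0; k < k_steps; ++k) {
        const half * K = (const half *) ((const char *) K_base + (int64_t) k*nb11);
        acc += __half2float(K[threadIdx.x]);
    }
    return acc;
}

// Scheme B: the large 64-bit offsets are folded into the base pointer once,
// then the pointer advances by a small 32-bit increment per iteration.
__device__ float sum_inc32(const half * K_base, const int32_t nb11, const int k_steps) {
    float acc = 0.0f;
    const char * K_ptr = (const char *) K_base;
    for (int k = 0; k < k_steps; ++k) {
        acc += __half2float(((const half *) K_ptr)[threadIdx.x]);
        K_ptr += nb11; // pointer arithmetic stays 64-bit, the added step stays 32-bit
    }
    return acc;
}
```

The second form keeps the per-iteration integer arithmetic at 32 bits, while correctness for very long contexts comes from the 64-bit offsets applied once to the base pointer.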
Performance changes