
Conversation

muhammad-tanvir-1211

This PR separates the output type from the accumulator type for Flash Attention Prefill Cached. The supported combinations are:

  • bf16 inputs, fp32 accumulator, bf16 | fp32 output
  • fp16 inputs, fp32 accumulator, fp16 | fp32 output

It also fixes PagedKV cache support when used with variable-length sequences.
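A minimal sketch of what the split means in the epilogue, based on the final_out_reg line quoted in the review below (out_reg, tOgO, and params.xe_store_o are names from the diff; the explicit conversion loop is an assumption, not the merged code):

// Device-side epilogue sketch: the attention accumulator out_reg stays in
// ElementAccumulator (fp32); the store to global memory goes out in the
// user-selected ElementOutput (bf16 | fp16 | fp32).
Tensor final_out_reg = make_fragment_like<ElementOutput>(out_reg);
CUTLASS_PRAGMA_UNROLL
for (int i = 0; i < size(out_reg); ++i) {
  final_out_reg(i) = static_cast<ElementOutput>(out_reg(i)); // fp32 -> ElementOutput
}
copy(params.xe_store_o, final_out_reg, tOgO); // store the converted fragment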


@joeatodd left a comment


LGTM - some suggestions.

using TiledMmaOutput = typename TiledMMAHelper<MMA_Atom<MMAOperation_>, Layout<TileShapeOutput>, SubgroupLayout>::TiledMMA;
using GmemTiledCopyO = CopyOpO;
using ElementOutput = ElementO_;
using ElementCompute = ElementO_;


Suggested change
using ElementCompute = ElementO_;
using ElementCompute = ElementCompute_;

Tensor tOgO = thread_xe_store_o.partition_D(gO);

copy(params.xe_store_o, out_reg, tOgO);
Tensor final_out_reg = make_fragment_like<ElementOutput>(out_reg);


as in #443, I think a comment explaining this if/else would be useful.


Added the comment after this line.

bool mode_implementable = args.mode == gemm::GemmUniversalMode::kGemm or
    (args.mode == gemm::GemmUniversalMode::kBatched && rank(ProblemShape{}) == 4);
return mode_implementable;
bool valid_page_size = !PagedKV ? true : args.mainloop.page_size >= QK_BLK_N && args.mainloop.page_size % QK_BLK_N == 0;


Suggested change
bool valid_page_size = !PagedKV ? true : args.mainloop.page_size >= QK_BLK_N && args.mainloop.page_size % QK_BLK_N == 0;
bool valid_page_size = !PagedKV || (args.mainloop.page_size >= QK_BLK_N && args.mainloop.page_size % QK_BLK_N == 0);
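The two forms are logically equivalent, but the suggested one avoids the degenerate ternary. A tiny standalone check of the page-size constraint, with hypothetical values (QK_BLK_N = 64 and page_size = 128 are made up for illustration):

#include <cassert>

int main() {
  constexpr bool PagedKV = true;
  constexpr int QK_BLK_N = 64; // hypothetical K/V block size
  int page_size = 128;         // hypothetical page size
  // A page must cover at least one K/V block and be a multiple of the block size.
  bool valid_page_size = !PagedKV || (page_size >= QK_BLK_N && page_size % QK_BLK_N == 0);
  assert(valid_page_size);
  return 0;
}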

Comment on lines +549 to +552
int seq_len_cache = isVarLen ? cumulative_seqlen_kv_cache[b + 1] - cumulative_seqlen_kv_cache[b] : seq_len_kv_cache;
int pages_per_seq = ceil_div(seq_len_cache, paged_kv_cache.page_size);
num_pages_per_seq.push_back(num_pages_per_seq.back() + pages_per_seq);
num_pages += pages_per_seq;


Am I understanding correctly that num_pages_per_seq actually stores the offset of the first page of each sequence? If so, I'd say it's misnamed; something like seq_page_offsets would fit better.


For variable-length sequences, seq_len_cache is not constant, so each batch has its own set of pages. num_pages_per_seq stores the indices of the pages across all of the batches; its size would be batch * num_pages_for_all_batches.
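A standalone sketch of what the quoted loop computes under variable-length sequences (the page size and per-batch lengths below are made up for illustration). num_pages_per_seq ends up as a running prefix sum, so num_pages_per_seq[b] is the offset of batch b's first page:

#include <cstdio>
#include <vector>

static int ceil_div(int a, int b) { return (a + b - 1) / b; }

int main() {
  int page_size = 64;                         // hypothetical page size
  std::vector<int> seq_len_cache = {96, 160}; // hypothetical per-batch cached lengths
  std::vector<int> num_pages_per_seq = {0};   // prefix sums: first-page offset per batch
  int num_pages = 0;
  for (int len : seq_len_cache) {
    int pages_per_seq = ceil_div(len, page_size);
    num_pages_per_seq.push_back(num_pages_per_seq.back() + pages_per_seq);
    num_pages += pages_per_seq;
  }
  // num_pages_per_seq == {0, 2, 5}, num_pages == 5
  for (int v : num_pages_per_seq) printf("%d ", v);
  printf("| total pages: %d\n", num_pages);
  return 0;
}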

Comment on lines +559 to +566
std::vector<int> physical_pages(num_pages_per_seq[b + 1] - num_pages_per_seq[b]);
std::iota(physical_pages.begin(), physical_pages.end(), 0);
// shuffle physical pages
std::shuffle(physical_pages.begin(), physical_pages.end(), std::mt19937{ std::random_device{}() });
for (int blk = 0; blk < physical_pages.size(); ++blk) {
  int logical_idx = num_pages_per_seq[b] + blk;
  page_mapping[logical_idx] = physical_pages[blk];
}


Maybe this is correct, but I think the page_mapping for each seq will contain the same indices (though shuffled differently).

In other words, a value such as 0 will appear cute::get<0>(problem_shape) times in the final page_mapping.

Is this expected?


Yes, this is true; we may get repeated indices. This is needed for variable-length sequences, where each batch can have a different number of pages and the physical page mapping for each batch can vary, so we need this bigger vector to hold the mapping information.
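To make the repetition concrete, here is a standalone sketch of the mapping construction, reusing the hypothetical offsets {0, 2, 5} from the sketch above. Each batch shuffles its own 0..n-1 range, so a physical index like 0 shows up once per batch:

#include <algorithm>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
  std::vector<int> num_pages_per_seq = {0, 2, 5}; // hypothetical first-page offsets per batch
  std::vector<int> page_mapping(num_pages_per_seq.back());
  std::mt19937 rng{std::random_device{}()};
  for (size_t b = 0; b + 1 < num_pages_per_seq.size(); ++b) {
    int n = num_pages_per_seq[b + 1] - num_pages_per_seq[b];
    std::vector<int> physical_pages(n);
    std::iota(physical_pages.begin(), physical_pages.end(), 0); // each batch gets 0..n-1
    std::shuffle(physical_pages.begin(), physical_pages.end(), rng);
    for (int blk = 0; blk < n; ++blk)
      page_mapping[num_pages_per_seq[b] + blk] = physical_pages[blk];
  }
  // Possible output: 1 0 2 0 1 -- index 0 appears once per batch.
  for (int v : page_mapping) printf("%d ", v);
  printf("\n");
  return 0;
}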

int max_idx = row;
for (int col = 0; col < seq_len_kv_total; col++, idx++) {
host_S[idx] = expf((host_S[idx] - max_vec[max_idx]) * softmax_scale);
host_S[idx] = expf((host_S[idx] - max_vec[max_idx]) / std::sqrt(static_cast<ElementAccumulator>((head_size_qk))));


This change leaves the softmax_scale argument of verify unused. There are also extra parentheses around head_size_qk.
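If softmax_scale is expected to equal 1 / sqrt(head_size_qk) here, one way to address both points would be to keep the argument in use and drop the doubled parentheses (a suggestion under that assumption, not the merged code):

// Assumes softmax_scale == 1.0f / std::sqrt(static_cast<ElementAccumulator>(head_size_qk)).
host_S[idx] = expf((host_S[idx] - max_vec[max_idx]) * softmax_scale);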

@aacostadiaz merged commit 5377d14 into intel:sycl-develop on Jun 30, 2025
15 of 21 checks passed