[`FA`] Remaining Cleanup #40424

vasqu · 2025-08-25T11:08:08Z

Parts of the flash attention generation was moved to the generate preparation and was reverted in #40161

The prepare in generation had more ripple effects tho:

Flash Attention fails with non aligned position_ids #39814 was now happening with the subsequent fix in FA2 can continue generation from cache #39843
However, FA2 can continue generation from cache #39843 introduces a somewhat breaking change in the API as seen in [API] Current query_length API can break ulysses-sp patch #40399

This PR reverts these changes completely to be aligned with the changes in #40161:

Revert the breaking change in FA2 can continue generation from cache #39843
Prepare from pos ids is only for training!

Additional context

The long version explanation:

[FA] Remaining Cleanup #40424 (comment)

The short version explanation:

Flash attention uses the base flash_fn when we have no padding making anything related to varlen combined with generate obsolete

Fixes #40399
Closes #40412

cc @ArthurZucker @zucchini-nlp @Cyrilvallez

vasqu · 2025-08-25T11:19:41Z

src/transformers/modeling_flash_attention_utils.py

    tensor_kwargs = {"dtype": torch.int32, "device": position_ids.device}
-    if not is_packed_sequence:


This path was meant for generation and is no longer valid, it's a dead path that is no longer used as of #40161

HuggingFaceDocBuilderDev · 2025-08-25T11:21:02Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

vasqu · 2025-08-25T11:31:52Z

Can you check if this is working for you, as for #39814? @maxjeblick @alessiodevoto

I ran the reproducer and can no longer get the error so I assume it's not an issue even with these changes here.

zucchini-nlp

The deleted path was meant to help Nvidia KVPress or users that perform manual generation loop to use FA2 (#39814). Otherwise the cu_seq_lens always assume that we have a packed sequence in input. We need to keep it somewhere and allow users to do custom generation with FA2

vasqu · 2025-08-25T13:46:11Z

Trying to give a history of what happened and why ultimately don't need that path anymore:

Kernels flash attn #39474 introduces a refactor for FA integrating kernels

Preparation during generation is introduced

transformers/src/transformers/generation/utils.py

Lines 680 to 697 in c0f4f09

    
           if "flash" in self.config._attn_implementation and self._supports_attention_backend: 
        
               tensor_kws = {"dtype": torch.int32, "device": self.device} 
        
               pos = model_inputs["position_ids"][:, -1] 
        
               cu_seq_lens_k = torch.cat([torch.zeros(1, **tensor_kws), pos.cumsum(0).add(1)], 0) 
        
               max_length_k = int(pos.max()) + 1 
        
               bs, seq_len = input_ids.size() 
        
               q_len = torch.ones(bs, **tensor_kws) if seq_len == 1 else pos.to(torch.int32).add(1) 
        
               cu_seq_lens_q = torch.cat([torch.zeros(1, **tensor_kws), q_len.cumsum(0)], 0) 
        
               max_length_q = int(q_len.max()) 
        
               model_inputs.update( 
        
                   cu_seq_lens_q=cu_seq_lens_q.to(self.device), 
        
                   cu_seq_lens_k=cu_seq_lens_k.to(self.device), 
        
                   max_length_q=max_length_q, 
        
                   max_length_k=max_length_k, 
        
               )

We always enter this path (when no mask) as the fa kwargs are prepared

transformers/src/transformers/modeling_flash_attention_utils.py

Line 399 in efceeaf

use_mask = position_ids is not None or all([cu_seq_lens_q, cu_seq_lens_k, max_length_q, max_length_k])

This has not changed but it is important that if we provide the kwargs ourselves we need to be sure that everything is correctly prepared, i.e.

transformers/src/transformers/modeling_flash_attention_utils.py

Lines 623 to 624 in 3b5b9f6

    
           # Case 2. Some models pass directly pre-computed `cu_seqlens` so we don't need to infer it from position ids. It is safe to 
        
           # use `flash_varlen_fn` knowing we already have all necessary the kwargs.

[FA2] Fix it finally - revert fa kwargs preparation #40161 removes this as it does not have real benefits and causes more breaks instead

Own attempt at what happens in our fa forward:

If we have an attention mask then we always enter the first path at

transformers/src/transformers/modeling_flash_attention_utils.py

Lines 634 to 657 in 3b5b9f6

    
           if attention_mask is not None: 
        
               q, k, v, indices_q, (cu_seq_lens_q, cu_seq_lens_k), (max_length_q, max_length_k) = _upad_input( 
        
                   query_states, key_states, value_states, attention_mask, query_length, unpad_fn 
        
               ) 
        
               # TODO for now this is required to work with 
        
               # https://huggingface.co/kernels-community/metal-flash-sdpa/blob/main/torch-ext/metal_flash_sdpa/__init__.py 
        
               if "mps" in str(q.device): 
        
                   cu_seq_lens_k = cu_seq_lens_k.clone() 
        
               out_unpad = flash_varlen_fn( 
        
                   q, 
        
                   k, 
        
                   v, 
        
                   cu_seqlens_q=cu_seq_lens_q, 
        
                   cu_seqlens_k=cu_seq_lens_k, 
        
                   max_seqlen_q=max_length_q, 
        
                   max_seqlen_k=max_length_k, 
        
                   **flash_kwargs, 
        
               ) 
        
               if isinstance(out_unpad, tuple): 
        
                   out_unpad = out_unpad[0] 
        
               out = pad_fn(out_unpad, indices_q, query_states.size(0), query_length)

(this did not change with the refactors)

If no padding or batches of the same length ( == no attention mask (needed))

Before [FA2] Fix it finally - revert fa kwargs preparation #40161: We always enter the varlen path and introduce overhead (e.g. prep during generate)
- Flash Attention fails with non aligned position_ids #39814 fixes a bug here in case we have unusual position ids. This is necessary as this is now part of the generate logic.

After [FA2] Fix it finally - revert fa kwargs preparation #40161: We no longer enter the varlen path and use the basic fa function

transformers/src/transformers/modeling_flash_attention_utils.py

Line 692 in 3b5b9f6

out = flash_fn(query_states, key_states, value_states, **flash_kwargs)

This makes sense as we don't have padding - varlen is not needed.

The previous varlen path is no longer part of any generation (except for a few special cases like qwen2.5 vl in their first forward).

transformers/src/transformers/modeling_flash_attention_utils.py

Lines 659 to 660 in 3b5b9f6

    
           # Padding free, i.e. sequences flattened into one total sequence 
        
           elif is_fa_with_varlen_kwargs or is_fa_with_position_ids:

is not meant for generation and we could see that it broke more stuff than it helped.

In essence, this path could only be entered for non-padded input which the basic fa function already covers more easily without any overhead.

This PR removes these artefacts that were based on Kernels flash attn #39474 and Flash Attention fails with non aligned position_ids #39814 as varlen for generation is no longer valid.
- We should not enter the varlen with no padding.

vasqu · 2025-08-25T13:49:17Z

@zucchini-nlp Tried to explain things above ^

tl;dr: We no longer enter varlen during generate (except when we use attention mask which hasn't changed) - this was inefficient and broke more things; this cleans the rest up connected to #40161

zucchini-nlp · 2025-08-25T13:52:46Z

We no longer enter varlen during generate

Sorry, I am a bit lazy to read through all PRs 🙃

Just to make sure, if users have a custom generation where they call forward several times, which path of FA2 does it lead to? Do we still take non-varlen path if attention mask isn't provided in forward call? AFAIK it all depended on presence of attention mask, which was the reason above linked issue failed in KVPress

If the linked issue isn't reproducible anymore, it should be fine. But I'd like us to have a test to avoid regression

vasqu · 2025-08-25T14:01:39Z

Sorry, I am a bit lazy to read through all PRs 🙃

No worries 😆

Just to make sure, if users have a custom generation where they call forward several times, which path of FA2 does it lead to?

Depends on the input either a) No padding then we enter the basic fa function path (last in if else) b) Varlen path with attention mask where we manually do things (hasn't changed)

Do we still take non-varlen path when attention mask isn't provided in forward call? AFAIK it all depended on presence of attention mask, which was the reason above linked issue failed in KVPress

We circumvent that issue entirely by not going the varlen path here. It wasn't necessary to enter varlen here as we have no padding and we made our lives significantly harder with this 😓 When we entered the varlen path (for input with no padding), we needed that workaround that you made but with the removal of the prep during generate, we no longer need it.

If the linked issue isn't reproducible anymore, it should be fine. But I'd like us to have a test to avoid regression

I can revert the removal of the test - it should catch that error in case we decide to change things up there

ArthurZucker

LGTM but:

let's make sure we don't break ulyss as well
let's check our run slow on this PR the docker should ahve fa2 now
happy to have a small TLDR because I did not know either that now padding -> flash_fn instead (did not know it made a difference)

vasqu · 2025-08-25T14:29:02Z

run-slow: llama,mistral,bart

github-actions · 2025-08-25T14:30:16Z

This comment contains run-slow, running the specified jobs:

models: ['models/bart', 'models/llama', 'models/mistral']
quantizations: [] ...

vasqu · 2025-08-25T14:31:58Z

Can you check if this PR works for you @ETOgaosion @kisseternity? (ulysses-sp)

vasqu · 2025-08-25T14:35:52Z

run-slow: llama,mistral,bart

github-actions · 2025-08-25T14:37:21Z

This comment contains run-slow, running the specified jobs:

models: ['models/bart', 'models/llama', 'models/mistral']
quantizations: [] ...

vasqu · 2025-08-25T14:50:09Z

Identical failures to main (+ the dola tests which are known)

Waiting for feedback on ulysses and kvpress, then I'd merge

maxjeblick · 2025-08-25T15:54:38Z

Thanls for the heads up, I'll report later today.

ETOgaosion · 2025-08-25T16:10:12Z

I think it works for verl's ulysses patch, we use a special handling method for current 4.55 query length API, luckily _flash_attn_forward still has this API unchanged, so it's compatible.

vasqu · 2025-08-25T16:15:14Z

Gotcha @ETOgaosion, I'd be interested if this worked without the patch for 4.55? I assume this PR is safe then on your side

maxjeblick · 2025-08-25T18:09:42Z

Thanks a lot for the PR; from kvpress side, there are no issues!

kisseternity · 2025-08-26T02:01:42Z

Can you check if this PR works for you @ETOgaosion @kisseternity? (ulysses-sp)

Indeed I'm using https://github.com/huggingface/transformers/pull/40412/files for a quick fix and it looks good so far, when the training is done I'll give a try.

vasqu · 2025-08-26T15:46:36Z

@kisseternity The problem is that #40412 will use logic that is no longer valid for us, so this PR will supersede #40412. Just want to make sure that I don't break things here instead as well 👀

vasqu · 2025-08-26T15:47:06Z

Thanks for checking @maxjeblick 🤗

Cyrilvallez · 2025-08-27T14:25:59Z

Thanks for reverting the remaining dead code @vasqu! Indeed, we should NEVER take varlen path when we don't have attention mask or native packed format! This was a mistake that it was ever added

vasqu added 2 commits August 25, 2025 12:57

fa cleanup

21cd8a1

Merge branch 'main' into fa-cleanup

e35d171

This was referenced Aug 25, 2025

[API] Current query_length API can break ulysses-sp patch #40399

Closed

Fix Flash Attention query_length validation to be compile-friendly#40399 #40412

Closed

vasqu commented Aug 25, 2025

View reviewed changes

flaky tests

8f972e0

zucchini-nlp reviewed Aug 25, 2025

View reviewed changes

readd removed test and changeup comments to reflect the purpose

300c7e8

ArthurZucker approved these changes Aug 25, 2025

View reviewed changes

Merge branch 'main' into fa-cleanup

a067f72

Merge branch 'main' into fa-cleanup

81d2b75

vasqu and others added 3 commits August 28, 2025 12:14

Merge branch 'main' into fa-cleanup

a3bb42e

Merge branch 'main' into fa-cleanup

968329b

flaky tests

7ffd47c

ArthurZucker approved these changes Aug 28, 2025

View reviewed changes

vasqu merged commit 7e1aee4 into huggingface:main Aug 28, 2025
24 checks passed

vasqu deleted the fa-cleanup branch August 28, 2025 13:01

FightingZhen mentioned this pull request Aug 29, 2025

build(deps): bump transformers from 4.52.4 to 4.55.4 volcengine/verl#3224

Closed

		tensor_kwargs = {"dtype": torch.int32, "device": position_ids.device}
		if not is_packed_sequence:

[FA] Remaining Cleanup #40424

[FA] Remaining Cleanup #40424

Conversation

vasqu commented Aug 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Additional context

Uh oh!

vasqu Aug 25, 2025

Choose a reason for hiding this comment

Uh oh!

HuggingFaceDocBuilderDev commented Aug 25, 2025

Uh oh!

vasqu commented Aug 25, 2025

Uh oh!

zucchini-nlp left a comment

Choose a reason for hiding this comment

Uh oh!

vasqu commented Aug 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vasqu commented Aug 25, 2025

Uh oh!

zucchini-nlp commented Aug 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vasqu commented Aug 25, 2025

Uh oh!

ArthurZucker left a comment

Choose a reason for hiding this comment

Uh oh!

vasqu commented Aug 25, 2025

Uh oh!

github-actions bot commented Aug 25, 2025

Uh oh!

vasqu commented Aug 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vasqu commented Aug 25, 2025

Uh oh!

github-actions bot commented Aug 25, 2025

Uh oh!

vasqu commented Aug 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

maxjeblick commented Aug 25, 2025

Uh oh!

ETOgaosion commented Aug 25, 2025

Uh oh!

vasqu commented Aug 25, 2025

Uh oh!

maxjeblick commented Aug 25, 2025

Uh oh!

kisseternity commented Aug 26, 2025

Uh oh!

vasqu commented Aug 26, 2025

Uh oh!

vasqu commented Aug 26, 2025

Uh oh!

Cyrilvallez commented Aug 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

[`FA`] Remaining Cleanup #40424

[`FA`] Remaining Cleanup #40424

vasqu commented Aug 25, 2025 •

edited

Loading

vasqu commented Aug 25, 2025 •

edited

Loading

zucchini-nlp commented Aug 25, 2025 •

edited

Loading

vasqu commented Aug 25, 2025 •

edited

Loading

vasqu commented Aug 25, 2025 •

edited

Loading

Cyrilvallez commented Aug 27, 2025 •

edited

Loading