
Conversation

@EduardDurech (Contributor) commented Jun 23, 2025

Supports Flash Attention 3 for `_flash_attention_forward`

Previous #36190 @ArthurZucker

Parity test for Flash Attention {2,3}, based on https://github.com/sgl-project/sglang/blob/main/test/srt/models/test_generation_models.py

$ RUN_SLOW=1 pytest -s tests/generation/test_flash_attention_parity.py
> ============================================================================================================ test session starts ============================================================================================================
platform linux -- Python 3.12.3, pytest-8.1.1, pluggy-1.6.0
rootdir: /workspace/transformers
configfile: pyproject.toml
plugins: hydra-core-1.3.2, xdist-3.6.1, rerunfailures-15.1, hypothesis-6.130.8, shard-0.1.2, xdoctest-1.0.2, flakefinder-1.1.0, anyio-4.9.0, typeguard-4.3.0
collected 1 item                                                                                                                                                                                                                            
Running 1 items in this shard

tests/generation/test_flash_attention_parity.py::FlashAttentionParityTest::test_flash_attention_2_3_parity You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 3 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.

--- Flash Attention (2, 3) Parity Test on meta-llama/Llama-3.2-1B-Instruct ---
Prompt: 'The ETH AI Center is'
Generated text with Flash Attention 2: The ETH AI Center is a research center that focuses on the development of artificial intelligence and its applications in various fields. The center
Generated text with Flash Attention 3: The ETH AI Center is a research center that focuses on the development of artificial intelligence and its applications in various fields. The center
ROUGE-L: 1.0
Max absolute difference in logprobs: 0.00000e+00
Flash Attention 2 latency: 287.42 ms
Flash Attention 3 latency: 272.10 ms
Speed-up: 1.06x
---
PASSED

============================================================================================================= warnings summary ==============================================================================================================
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: Type google._upb._message.MessageMapContainer uses PyType_Spec with a metaclass that has custom tp_new. This is deprecated and will no longer be allowed in Python 3.14.

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: Type google._upb._message.ScalarMapContainer uses PyType_Spec with a metaclass that has custom tp_new. This is deprecated and will no longer be allowed in Python 3.14.

../../usr/local/lib/python3.12/dist-packages/google/protobuf/internal/well_known_types.py:93
  /usr/local/lib/python3.12/dist-packages/google/protobuf/internal/well_known_types.py:93: DeprecationWarning: datetime.datetime.utcfromtimestamp() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.fromtimestamp(timestamp, datetime.UTC).
    _EPOCH_DATETIME_NAIVE = datetime.datetime.utcfromtimestamp(0)

../../usr/local/lib/python3.12/dist-packages/_pytest/config/__init__.py:1439
  /usr/local/lib/python3.12/dist-packages/_pytest/config/__init__.py:1439: PytestConfigWarning: Unknown config option: asyncio_default_fixture_loop_scope
  
    self._warn_or_fail_if_strict(f"Unknown config option: {key}\n")

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
======================================================================================================= 1 passed, 4 warnings in 8.18s =======================================================================================================
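
The test is roughly shaped like the sketch below (illustrative only, not the contents of `tests/generation/test_flash_attention_parity.py`; the real test also compares per-token logprobs and ROUGE-L and times both runs):

# Sketch of the FA2/FA3 parity check (illustrative, not the merged test file).
# Assumes a Hopper GPU with both flash-attn 2 and the flash-attn 3 hopper build installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B-Instruct"
prompt = "The ETH AI Center is"

tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

sequences = {}
for impl in ("flash_attention_2", "flash_attention_3"):
    model = AutoModelForCausalLM.from_pretrained(
        model_id, attn_implementation=impl, torch_dtype=torch.bfloat16
    ).to("cuda")
    with torch.no_grad():
        sequences[impl] = model.generate(**inputs, max_new_tokens=24, do_sample=False)

# Greedy decoding should produce identical generations for the two backends.
assert torch.equal(sequences["flash_attention_2"], sequences["flash_attention_3"])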

Closes #32219, #33373

Implements the forward pass and tests for Flash Attention 3 (https://github.com/Dao-AILab/flash-attention/commits/main/hopper)

- Includes checks for dropout > 0 and ALiBi in `modeling_utils.PreTrainedModel._check_and_enable_flash_attn_3`; a rough sketch of the intent is below. (Dropout will likely be supported soon, so this will need to be updated, along with `modeling_flash_attention_utils._flash_attention_forward` at the `if _IS_FLASH_ATTN_3_AVAILABLE: ...` branch.)
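
A rough sketch of what that gate boils down to (intent only, not the exact code in `modeling_utils.py`; attribute names are illustrative):

# Sketch of the intent behind _check_and_enable_flash_attn_3 (not the exact implementation).
# FA3 currently supports neither attention dropout nor ALiBi, so both are rejected up front.
def _fa3_compatibility_check_sketch(config):
    if getattr(config, "attention_dropout", 0.0) > 0.0:
        raise ValueError("Flash Attention 3 does not (yet) support attention dropout.")
    if getattr(config, "alibi", False):
        raise ValueError("Flash Attention 3 does not support ALiBi positional biases.")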

An example Llama implementation is included in `modeling_llama.py`, but other models still need to be updated

Based on huggingface#36190, which has model implementations and examples that could be merged
@EduardDurech marked this pull request as draft on June 23, 2025 00:14
@EduardDurech force-pushed the FA3 branch 3 times, most recently from 629deca to 230f64f on June 23, 2025 01:28
@EduardDurech marked this pull request as ready for review on June 23, 2025 01:51
@github-actions bot requested review from ArthurZucker and ydshieh on June 23, 2025 01:51
@ArthurZucker (Collaborator) left a comment

thanks for taking the time!

- `_prepare_flash_attention_from_position_ids` -> `prepare_fa2_from_position_ids`
- Remove bettertransformer check in Flash Attention 3
- Merge tests
- Add licensing
@EduardDurech (Contributor, Author)

@ArthurZucker all comments resolved

@EduardDurech (Contributor, Author)

Re: @tridao, you mentioned dropout may be supported (pytorch/pytorch#148891 (comment)); if I could be pinged when that's done, I can submit a new PR

@ArthurZucker (Collaborator) left a comment

Just deprecate the one method you are renaming and good to go!

@ArthurZucker merged commit a2eb75c into huggingface:main on Jun 25, 2025
18 checks passed
@ArthurZucker (Collaborator)

Thanks a lot for the contribution! 🤗

@1ytic (Contributor) commented Jun 27, 2025

@EduardDurech thank you for the contribution! I'm trying to use it with FSDP2, but get this error:

NotImplementedError: flash_attn_3::fwd: attempted to run this operator with Meta tensors, but there was no fake impl or Meta kernel registered.

It's triggered from here. Any ideas how to fix this? Is it even possible?

@1ytic (Contributor) commented Jun 28, 2025

Feel free to ignore my previous question. I missed that my tp plan uses ColwiseParallel(use_local_output=False) for the query/key/value states. Switching to standard torch tensors works fine.
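
In other words, with use_local_output=True, ColwiseParallel hands plain sharded torch.Tensors to the next op instead of DTensors, so the flash_attn_3 custom op never sees a DTensor/meta input. A minimal illustrative fragment (not the exact plan used here; import path assumes a recent PyTorch):

# Illustrative fragment: use_local_output=True makes ColwiseParallel return plain
# torch.Tensor shards, which the flash_attn_3 kernel can consume directly.
from torch.distributed.tensor.parallel import ColwiseParallel

attn_tp_fragment = {
    "model.layers.*.self_attn.q_proj": ColwiseParallel(use_local_output=True),
    "model.layers.*.self_attn.k_proj": ColwiseParallel(use_local_output=True),
    "model.layers.*.self_attn.v_proj": ColwiseParallel(use_local_output=True),
}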

@EduardDurech (Contributor, Author)

Haven't tested with FSDP2 but glad you got it sorted out 😃

@ArthurZucker (Collaborator)

@1ytic We actually want it to work with ColwiseParallel; if you have a reproducer, can you open an issue?

@EduardDurech (Contributor, Author)

@1ytic @ArthurZucker this shouldn't be too difficult to fix. I won't have the time, but if anyone wants to pick it up, it seems you need to:

* Register a fake/meta kernel, see https://gist.github.com/a-r-r-o-w/d08c37e8bd3e9c26b4ce80360be148c6#file-benchmark_kontext_cp-py-L169
* Create a flash_fwd DTensor wrapper, see https://dev-discuss.pytorch.org/t/dtensor-status-design-and-looking-forward/2749
* Include the original and the DTensor variant within the dispatcher

This is more of a low-level flash_attn_3 and PyTorch thing, but it seems possible to patch in Transformers.

@EduardDurech (Contributor, Author)

btw, anyone using Ascend NPU see #39166, thanks @FightingZhen

@EduardDurech (Contributor, Author) commented Jul 7, 2025

> @1ytic @ArthurZucker this shouldn't be too difficult to fix. I won't have the time, but if anyone wants to pick it up, it seems you need to:
>
> * Register a fake/meta kernel, see https://gist.github.com/a-r-r-o-w/d08c37e8bd3e9c26b4ce80360be148c6#file-benchmark_kontext_cp-py-L169
> * Create a flash_fwd DTensor wrapper, see https://dev-discuss.pytorch.org/t/dtensor-status-design-and-looking-forward/2749
> * Include the original and the DTensor variant within the dispatcher
>
> This is more of a low-level flash_attn_3 and PyTorch thing, but it seems possible to patch in Transformers.

Following up, @1ytic: what PyTorch version are you using? Does ColwiseParallel(use_local_output=True) work?

Maybe try the first point; it may be enough.

# Rough sketch: register a fake/meta implementation so the FA3 custom op can be
# traced with meta tensors (FSDP2, torch.compile). The op name below is assumed;
# it must match whatever flash_attn_interface actually registers (the error above
# mentions flash_attn_3::fwd).
import torch
from flash_attn_interface import flash_attn_func as flash_attn_3_func  # importing registers the custom op

@torch.library.register_fake("flash_attn_3::_flash_attn_forward")
def _fake_fa3(q, k, v, *, is_causal=False):
    # Return empty tensors with the expected output shapes: an attention output
    # shaped like q, plus a (batch, seqlen, heads) softmax LSE.
    B, S, H, D = q.shape
    return torch.empty_like(q), q.new_empty((B, S, H))
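
And for the second point, a purely hypothetical shape of the DTensor wrapper (names and the FA3 return signature are assumptions, not the real transformers or flash_attn API):

# Hypothetical sketch of the DTensor wrapper idea (point 2 above); fa3_func and the
# return handling are assumptions, not the actual dispatcher code.
from torch.distributed.tensor import DTensor  # torch.distributed._tensor on older PyTorch

def fa3_with_dtensor(fa3_func, q, k, v, **kwargs):
    # If q/k/v arrive as DTensors (e.g. ColwiseParallel(use_local_output=False)),
    # run the kernel on the local shards and re-wrap the attention output.
    if isinstance(q, DTensor):
        mesh, placements = q.device_mesh, q.placements
        out = fa3_func(q.to_local(), k.to_local(), v.to_local(), **kwargs)
        # FA3 may also return the softmax LSE; only the attention output is re-wrapped here.
        if isinstance(out, tuple):
            return (DTensor.from_local(out[0], mesh, placements), *out[1:])
        return DTensor.from_local(out, mesh, placements)
    return fa3_func(q, k, v, **kwargs)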

@EduardDurech (Contributor, Author)

Maybe @a-r-r-o-w could update

@a-r-r-o-w (Contributor)

I'm not sure of the approach to follow since I haven't tried FA3 with PyTorch TP and DTensor. I think what @EduardDurech mentioned in his comment sounds good. We might not need anything DTensor-specific here; if my memory from similar tests with SageAttention serves right, the meta registration alone may allow it to work with FSDP2.

Fake/meta registrations should probably live within flash-attn (there's a PR: Dao-AILab/flash-attention#1590), but for the time being they could maybe be added to transformers if the maintainers are okay with it. Without the registration, torch.compile should also fail with FA3, so it's important to have.

@1ytic (Contributor) commented Jul 7, 2025

> This is more of a low-level flash_attn_3 and PyTorch thing, but it seems possible to patch in Transformers.

Agree, it should be done on the flash_attn side.

> Does ColwiseParallel(use_local_output=True) work?

Yes, it works.

Just for context, I tried to use NeMo-RL for the Qwen3 model with this tp plan, but with flash_attention_3 I changed it to this:

# Standard parallel styles come from torch; RotaryEmbedParallel and Qwen3QKNorm
# are custom plans from the NeMo-RL setup, not part of torch.
from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel, SequenceParallel
from torch.distributed.tensor import Replicate, Shard

base_model_tp_plan = {
    "lm_head": ColwiseParallel(
        input_layouts=Shard(1),
        output_layouts=Shard(-1),
        use_local_output=False,
    ),
    "model.embed_tokens": RowwiseParallel(
        input_layouts=Replicate(),
        output_layouts=Shard(1),
    ),
    "model.rotary_emb": RotaryEmbedParallel(use_local_output=True),
    "model.norm": SequenceParallel(),
    "model.layers.*.input_layernorm": SequenceParallel(),
    "model.layers.*.self_attn.q_proj": ColwiseParallel(use_local_output=False),
    "model.layers.*.self_attn.k_proj": ColwiseParallel(use_local_output=False),
    "model.layers.*.self_attn.v_proj": ColwiseParallel(use_local_output=True),
    "model.layers.*.self_attn.o_proj": RowwiseParallel(output_layouts=Shard(1)),
    "model.layers.*.self_attn.q_norm": Qwen3QKNorm(use_local_output=True),
    "model.layers.*.self_attn.k_norm": Qwen3QKNorm(use_local_output=True),
    "model.layers.*.post_attention_layernorm": SequenceParallel(),
    "model.layers.*.mlp.up_proj": ColwiseParallel(),
    "model.layers.*.mlp.gate_proj": ColwiseParallel(),
    "model.layers.*.mlp.down_proj": RowwiseParallel(output_layouts=Shard(1)),
}

@ArthurZucker (Collaborator)

This can probably be added easily to the flash attention kernel on the Hub if you need it fast: https://huggingface.co/kernels-community/flash-attn3. Feel free to open a PR there!
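
For reference, pulling that kernel from the Hub looks roughly like this (a sketch; the exposed function name and its return value are assumptions about the kernels-community/flash-attn3 repo):

# Sketch of using the Hub kernel via the `kernels` library (function name assumed).
import torch
from kernels import get_kernel

flash_attn3 = get_kernel("kernels-community/flash-attn3")

q = torch.randn(1, 128, 8, 64, dtype=torch.bfloat16, device="cuda")
k = torch.randn(1, 128, 8, 64, dtype=torch.bfloat16, device="cuda")
v = torch.randn(1, 128, 8, 64, dtype=torch.bfloat16, device="cuda")
# Depending on the version, this may return just the output or (output, softmax_lse).
outputs = flash_attn3.flash_attn_func(q, k, v, causal=True)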

@kisseternity

Thanks for supporting FA3. I'm now using FA3 with Ulysses SP, but the forward logits come out as NaN most of the time. Could you please check whether FA3 works with Ulysses?

@ArthurZucker (Collaborator)

Yep, we have a PR for that: #40412. Tell us if it fixes it!

@kisseternity

> Yep, we have a PR for that: #40412. Tell us if it fixes it!

Impressive! It works, thanks!
