Kernels flash attn #39474
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
tf install:
env: Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.
- `transformers` version: 4.54.0.dev0
- Platform: Linux-6.11.0-29-generic-x86_64-with-glibc2.40
- Python version: 3.10.16
- Huggingface_hub version: 0.33.4
- Safetensors version: 0.5.3
- Accelerate version: 1.9.0
- Accelerate config:
  - compute_environment: LOCAL_MACHINE

Error Message:
[rank0]: ValueError: Specified `attn_implementation="https://huggingface.co/kernels-community/flash-attn3:flash_attention"` is not supported. The only possible arguments are `attn_implementation="eager"` (manual attention implementation), `"attn_implementation=flash_attention_3"` (implementation using flash attention 3), `"attn_implementation=flash_attention_2"` (implementation using flash attention 2), `"attn_implementation=sdpa"` (implementation using torch.nn.functional.scaled_dot_product_attention), `"attn_implementation=flex_attention"` (implementation using torch's flex_attention).
You can't pass the full URL! You need to pass
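For context, a minimal sketch of what the suggested call presumably looks like, assuming the value is the kernels repo reference from the error above with the `https://huggingface.co/` prefix stripped (the checkpoint name is only an illustrative placeholder, not part of this PR):

```python
from transformers import AutoModelForCausalLM

# Sketch only: pass the kernels repo reference, not the full hub URL.
# The string below is inferred from the error message above; the checkpoint
# id is an illustrative placeholder.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",  # illustrative checkpoint
    attn_implementation="kernels-community/flash-attn3:flash_attention",
)
```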
I think writing the URL is silly too. However, since you shared it like this on Twitter, I gave it a try. New Error Message:
Should I wait for you to finish your development?
Ah, that's weird. Can you share a small reproducer?
run-slow: llama,mistral,gemma
This comment contains run-slow, running the specified jobs: models: ['models/gemma', 'models/llama', 'models/mistral']
@ArthurZucker I tried it with a different LLM model, and it worked. It seems that the dataset of the Qwen model is faulty. I will fix this and provide feedback on the performance.
Thanks @kadirnar!
TRANSFORMERS_TEST_DEVICE="mps" RUN_SLOW=1 pytest tests/models/llama/test_modeling_llama.py -k kernels_mps -s
Added a new test for MPS!
* update docs
* Apply suggestions from code review (Co-authored-by: Steven Liu <[email protected]>)
* apply suggestions

Co-authored-by: Steven Liu <[email protected]>
…huggingface/transformers into kernels-flash-attn
@ArthurZucker This method only supports LLM models, right? What should we do to add kernel support for speech models? Example: https://huggingface.co/docs/transformers/main/en/model_doc/dia
This should be supported by all models as long as they have the
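A sketch of what that would presumably look like for a non-LLM architecture, assuming the model class routes attention through the shared attention interface; the checkpoint id below is a hypothetical placeholder, not a tested example:

```python
from transformers import AutoModel

# Sketch only: any model whose attention goes through the shared attention
# interface should accept a kernels repo reference here. The checkpoint id
# is a hypothetical placeholder (e.g. a Dia or other speech checkpoint).
model = AutoModel.from_pretrained(
    "your-org/your-speech-model",  # placeholder id
    attn_implementation="kernels-community/flash-attn3:flash_attention",
)
```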
What does this PR do?