Switch DeepSeekV3 to Use FlexAttention by Default #1610
Conversation
With the current implementation, DeepSeekV3 will use neither FlashAttention nor cuDNN attention; it falls back to efficient attention, which has lower performance. The motivation for defaulting the models to SDPA in TorchTitan is that FlexCP is not ready, but SDPA + CP + DeepSeekV3 doesn't work either. So this PR makes all DeepSeekV3 configurations use FlexAttention, which gives us higher MFU.
However, the combination of SDPA + CP + DeepSeekV3 is also not functional.
I do want to understand this a bit more.
What stops MemEfficientAttention from working with CP? According to #1522 (comment), it is caused by precision issues when load balancing + AC are enabled.
Also, today PP doesn't work with FlexAttention block-causal masking, because PP can't receive eos_id as a non-Tensor input (nor can it receive a mask function).
https://github.com/pytorch/torchtitan/blob/main/torchtitan/train.py#L433
Context: this regression comes from a recent refactor (#1424) that moved eos_id out of ModelArgs (to remove the dependency from model to tokenizer).
PP is indeed very important for DSV3 training at large scale, so we should make sure it runs. Created #1612 to track.
cc @H-Huang
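For readers unfamiliar with the FlexAttention setup being discussed, below is a minimal sketch (not TorchTitan's actual code; the function and variable names are made up) of how a block-causal document mask is derived from eos_id. The mask is a Python callable over token positions rather than a Tensor, which is why PP's stage inputs cannot currently forward it.

```python
# Sketch only: illustrates why the mask is a callable built from eos_id,
# not a Tensor that PP stages could pass along.
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

def build_block_causal_mask(tokens: torch.Tensor, eos_id: int):
    # Label each position with a document id by counting EOS tokens seen so far.
    doc_ids = (tokens == eos_id).cumsum(dim=-1)

    def mask_mod(b, h, q_idx, kv_idx):
        # Causal, and restricted to positions within the same document.
        same_doc = doc_ids[b, q_idx] == doc_ids[b, kv_idx]
        return (q_idx >= kv_idx) & same_doc

    B, S = tokens.shape
    return create_block_mask(mask_mod, B, None, S, S, device=tokens.device)

# Hypothetical usage: tokens has shape (B, S); q, k, v have shape (B, H, S, D).
# block_mask = build_block_causal_mask(tokens, eos_id)
# out = flex_attention(q, k, v, block_mask=block_mask)
```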
The precision issue is a hypothesis, but we could not confirm it. The only evidence is that if we convert Q, K, and V to float32, it works. I'm working with a researcher to see if his debugging tool can help confirm the hypothesis.
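For reference, a hedged sketch of the float32 workaround described above (the helper name and call pattern are illustrative, not the actual TorchTitan code path):

```python
import torch
import torch.nn.functional as F

def sdpa_fp32(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
              is_causal: bool = True) -> torch.Tensor:
    """Run SDPA with Q/K/V upcast to float32, then cast back."""
    orig_dtype = q.dtype  # e.g. torch.bfloat16 under mixed precision
    out = F.scaled_dot_product_attention(
        q.float(), k.float(), v.float(), is_causal=is_causal
    )
    # Restore the original dtype so downstream layers are unaffected.
    return out.to(orig_dtype)
```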
I have another PR that may fix the PP issue. Will submit it today.
@fegin could you share some more details about the platform(s) you're testing on? In theory, cutting-edge cuDNN should support larger head dims, including DeepSeekV3's, but we have been conservative in enabling it due to a somewhat uneven support surface.
@eqy H100, CUDA 12.6, cuDNN 9.10.2.21.
@fegin thanks, I think we can try enabling this case for H100 via pytorch/pytorch#161210
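For anyone wanting to reproduce the backend behavior on their own platform, here is a rough probe (not from this PR; the 192/128 head dims for Q/K and V are assumed from DeepSeekV3's attention configuration) that restricts SDPA to the cuDNN backend and reports whether it accepts the shape:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Assumed DeepSeekV3-like shapes; adjust to match the model config being tested.
B, H, S, D_QK, D_V = 1, 16, 512, 192, 128
q = torch.randn(B, H, S, D_QK, device="cuda", dtype=torch.bfloat16)
k = torch.randn(B, H, S, D_QK, device="cuda", dtype=torch.bfloat16)
v = torch.randn(B, H, S, D_V, device="cuda", dtype=torch.bfloat16)

try:
    # Allow only the cuDNN backend; SDPA raises if it cannot dispatch to it.
    with sdpa_kernel([SDPBackend.CUDNN_ATTENTION]):
        F.scaled_dot_product_attention(q, k, v, is_causal=True)
    print("cuDNN attention accepted this head-dim configuration")
except RuntimeError as err:
    print(f"cuDNN attention rejected this case: {err}")
```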
LGTM
Reference: pytorch/torchtitan#1610. 9.10 only for now; we would want to hold off on upgrading to either cuDNN frontend 1.14+ or cuDNN 9.11+ due to some head-dim > 128 handling issues.
Pull Request resolved: #161210
Approved by: https://github.com/Skylion007
Currently, the only SDPA backend available for DeepSeekV3 is the efficient-attention kernel. For FlashAttention-2 (what the current SDPA supports), the V head dimension must match that of Q and K; cuDNN attention complains that the head dimension is too large.
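As a concrete illustration of those constraints (a sketch, assuming DeepSeekV3-like head dims of 192 for Q/K and 128 for V; shapes are illustrative), SDPA can only dispatch this shape to the efficient-attention kernel, while FlexAttention handles the mismatched head dims directly:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel
from torch.nn.attention.flex_attention import flex_attention

B, H, S, D_QK, D_V = 2, 16, 1024, 192, 128
q = torch.randn(B, H, S, D_QK, device="cuda", dtype=torch.bfloat16)
k = torch.randn(B, H, S, D_QK, device="cuda", dtype=torch.bfloat16)
v = torch.randn(B, H, S, D_V, device="cuda", dtype=torch.bfloat16)

# Flash / cuDNN backends reject this shape today, so SDPA ends up on the
# (slower) efficient-attention kernel.
with sdpa_kernel([SDPBackend.EFFICIENT_ATTENTION]):
    out_sdpa = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# FlexAttention accepts the mismatched Q/K vs. V head dims; in practice a
# block-causal document mask would also be passed via block_mask.
out_flex = flex_attention(q, k, v)
```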
The reason for defaulting the attention to SDPA in TorchTitan is that FlexCP is not yet ready. However, the combination of SDPA + CP + DeepSeekV3 is also not functional.
This PR updates all DeepSeekV3 configurations to use FlexAttention, which significantly improves overall performance. Document masking also contributes to the MFU improvement, but the majority comes from FlexAttention itself.
Performance comparison (screenshots in the original PR): SDPA vs. FlexAttention (now).