@imh966 imh966 commented Jan 13, 2024

Hi, I found that the attention mask tensor is created on the CPU, which leads to inefficient operations on the mask and an extra host-to-device (H2D) copy.
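A minimal sketch of the kind of fix described above: build the causal attention mask directly on the target device so no CPU-side tensor (and no follow-up H2D copy) is needed. The helper name `build_attention_mask` and the exact mask convention (boolean, `True` = masked out, as Megatron-style code commonly uses) are assumptions for illustration, not the PR's actual diff.

```python
import torch

def build_attention_mask(seq_len: int, device: torch.device) -> torch.Tensor:
    # Hypothetical helper: allocate the mask on `device` from the start,
    # instead of building it on CPU and copying it to the GPU afterwards.
    allowed = torch.tril(
        torch.ones(seq_len, seq_len, dtype=torch.bool, device=device)
    )
    # Invert so True marks positions to mask out (upper triangle),
    # and add batch/head broadcast dims: [1, 1, seq_len, seq_len].
    return (~allowed).unsqueeze(0).unsqueeze(0)

# Falls back to CPU when no GPU is available, so the sketch stays runnable.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
mask = build_attention_mask(4, device)
```

Because the mask is created with `device=device`, subsequent elementwise operations on it run on the GPU and the separate transfer disappears.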

XZQshiyu pushed a commit to XZQshiyu/Megatron-DeepSpeed that referenced this pull request Jan 15, 2025
Co-authored-by: Hyeongmin Moon <[email protected]>
Co-authored-by: Zhewei Yao <[email protected]>