Conversation

@fmassa (Contributor) commented on Jun 12, 2025:

For now, this assumes the routing is balanced.

Note that AutoParallel doesn't work in this case: the sort operator is not implemented, and there may be other gaps as well involving the indexing.

Note: for now I'm picking a single expert per token, but that can be changed, possibly at the expense of replicating the tokens; whether that memory increase is reasonable still needs to be assessed.
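The single-expert-per-token choice described above can be sketched as plain top-1 routing: take the argmax of each token's router scores and bucket tokens by expert. This is a hypothetical, pure-Python illustration (the names `route_top1`, `num_experts`, and the example scores are mine, not from the PR), and like the PR it relies on the assignment happening to be balanced, with no capacity handling or token replication.

```python
def route_top1(scores, num_experts):
    """Top-1 expert routing sketch.

    scores: list of per-token score lists, one score per expert.
    Returns, for each expert, the indices of the tokens routed to it.
    Assumes the resulting assignment is balanced; no capacity factor.
    """
    # Argmax over experts for each token (single expert per token).
    assignment = [max(range(num_experts), key=lambda e: tok[e]) for tok in scores]
    # Bucket token indices by their chosen expert.
    buckets = [[] for _ in range(num_experts)]
    for tok_idx, expert in enumerate(assignment):
        buckets[expert].append(tok_idx)
    return buckets

# Illustrative router scores for 4 tokens and 2 experts.
scores = [
    [0.9, 0.1],  # token 0 -> expert 0
    [0.2, 0.8],  # token 1 -> expert 1
    [0.7, 0.3],  # token 2 -> expert 0
    [0.4, 0.6],  # token 3 -> expert 1
]
print(route_top1(scores, 2))  # [[0, 2], [1, 3]]
```

An unbalanced score matrix would break the balanced assumption (e.g. all tokens picking expert 0), which is exactly the case the PR defers.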

@facebook-github-bot added the "CLA Signed" label (managed by the Meta Open Source bot) on Jun 12, 2025.
@wconstab (Contributor) left a comment:

Kinda curious, do you have the solution handy for this?

fmassa added a commit referencing this pull request on Sep 29, 2025:
Taken from #3 and #29. Decomposing softmax_backward leads to prims.fma, which doesn't have a sharding rule, so Replicate ends up being the only possible sharding.
fmassa added a commit referencing this pull request on Oct 1, 2025:
Taken from #3 and #29. Decomposing softmax_backward leads to prims.fma, which doesn't have a sharding rule, so Replicate ends up being the only possible sharding.
Very slow; need to try Sinkhorn-Knopp.
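For context on the Sinkhorn-Knopp suggestion: the algorithm alternately normalizes the rows and columns of a positive matrix until it is (approximately) doubly stochastic, and is a common way to push a token-to-expert score matrix toward a balanced assignment without a hard sort. The sketch below is a minimal pure-Python illustration under that framing, not the PR's implementation.

```python
def sinkhorn_knopp(mat, num_iters=50):
    """Sinkhorn-Knopp normalization sketch.

    mat: list-of-lists of strictly positive scores (tokens x experts).
    Alternates row and column normalization; at convergence each row
    sums to 1 and each column sums to n_rows / n_cols, i.e. every
    expert receives an equal share of token mass.
    """
    n_rows, n_cols = len(mat), len(mat[0])
    m = [row[:] for row in mat]
    col_target = n_rows / n_cols  # equal token mass per expert
    for _ in range(num_iters):
        # Normalize each row to sum to 1.
        for r in range(n_rows):
            s = sum(m[r])
            m[r] = [v / s for v in m[r]]
        # Rescale each column to sum to the balanced target.
        for c in range(n_cols):
            s = sum(m[r][c] for r in range(n_rows))
            for r in range(n_rows):
                m[r][c] *= col_target / s
    return m

# A skewed score matrix: both tokens prefer expert 0.
balanced = sinkhorn_knopp([[0.9, 0.1], [0.6, 0.4]])
# After normalization, each column sums to ~1.0, so neither expert
# is starved even though the raw scores were unbalanced.
```

The fixed iteration count is the simplest stopping rule; a tolerance on the row/column sums would be the usual refinement, and this iterative scheme is also what makes it a candidate replacement for the slow exact assignment noted above.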