
Conversation

@masahi
Collaborator

@masahi masahi commented Nov 21, 2025

There are two ways to copy scales into TMEM: tmem_copy and tmem_store. For the latter, making the MMA asynchronous while scales are being copied into TMEM requires that the scales be double-buffered in TMEM. So far neither SWP nor WS implements such double buffering, so using an MMA op whose scales are copied via tmem_store forces the MMA to be synchronous.

Motivated by applications for which tmem_copy might be difficult to apply, such as MoE with activation quantization, this PR enables double buffering of TMEM scales in WS when the scales are copied by tmem_store. We introduce a new predefined partition in partition-schedule that is responsible for storing scales into double-buffered TMEM. The MMA can now be made asynchronous when its scale operand is double-buffered in TMEM, in addition to the tmem_copy case.
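To make the double-buffering idea concrete, here is a minimal conceptual sketch (plain Python, not Triton's actual implementation — `NUM_BUFFERS`, `schedule`, and the slot arithmetic are illustrative assumptions). It shows why two TMEM slots are enough for the scale-store partition to fill one slot while the MMA consumes the other, so the two stages never touch the same slot in the same iteration:

```python
# Conceptual sketch of double-buffered TMEM scales (hypothetical names).
# The store partition fills slot i % 2 at iteration i; the MMA consumes
# the slot that was filled in the previous iteration. With two buffers,
# the producer and consumer slots always differ, so the MMA can run
# asynchronously with the next tmem_store.

NUM_BUFFERS = 2  # two TMEM slots for the scale operand

def schedule(num_iters):
    """Return (iter, store_slot, mma_slot) triples for each iteration.

    mma_slot is None at iteration 0, since there is nothing to consume yet.
    """
    events = []
    for i in range(num_iters):
        store_slot = i % NUM_BUFFERS               # scale-store partition writes here
        mma_slot = (i - 1) % NUM_BUFFERS if i > 0 else None  # MMA reads last iter's slot
        events.append((i, store_slot, mma_slot))
    return events
```

Note that with a single buffer the two slot indices would always collide, which is exactly why an un-double-buffered tmem_store forces the MMA to be synchronous.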

@masahi masahi requested a review from ptillet as a code owner November 21, 2025 09:41
@masahi masahi removed the request for review from ptillet November 21, 2025 09:53
@masahi masahi requested a review from 3gx November 21, 2025 10:13
@ThomasRaoux
Collaborator

I wonder if we need/want this case in practice. Having scales going through tmem store will still break the pipeline as we need to load from smem to register. Do we have cases where we cannot swizzle?

@ThomasRaoux
Collaborator

Based on discussion with @ptillet, we think we don't want this optimization. The result would still be very suboptimal, and the user should always be able to swizzle in HBM.

@masahi
Collaborator Author

masahi commented Nov 21, 2025

The unswizzled path is expected to be more performant for small inputs, for which runtime swizzling is relatively expensive, especially considering that for MoE the padding to a multiple of 128 needs to be done per expert. UPDATE: I think more relevant is the cost of tmem_copy on the padded tensor vs tmem_store on the unpadded one, when the ratio of padded elements to real ones is large.

Having scales going through tmem store will still break the pipeline as we need to load from smem to register

This might be the case for SWP, but I don't think it presents any difficulty for WS.

@3gx
Collaborator

3gx commented Nov 21, 2025

Having scales going through tmem store will still break the pipeline as we need to load from smem to register

This might be the case for SWP, but I don't think it presents any difficulty for WS.

+1, if local_load + tmem_store is done in a dedicated partition, then I don't think it would be an issue with WS, since that partition can run in parallel with the other partitions.

@ThomasRaoux
Collaborator

Having scales going through tmem store will still break the pipeline as we need to load from smem to register

This might be the case for SWP, but I don't think it presents any difficulty for WS.

+1, if local_load + tmem_store is done in a dedicated partition, then I don't think it would be an issue with WS, since that partition can run in parallel with the other partitions.

This is true for pipeliner as well.

The unswizzled path is expected to be more performant for small inputs, for which runtime swizzling is relatively expensive, especially considering that for MoE the padding to a multiple of 128 needs to be done per expert. UPDATE: I think more relevant is the cost of tmem_copy on the padded tensor vs tmem_store on the unpadded one, when the ratio of padded elements to real ones is large.

Are you saying that because we need to pad scales to 32 bits along K, it is more efficient to not do tmem_copy? tmem_store also works at 32-bit granularity, right?

@masahi
Collaborator Author

masahi commented Nov 21, 2025

Are you saying that because we need to pad scales to 32 bits along K, it is more efficient to not do tmem_copy? tmem_store also works at 32-bit granularity, right?

No, in the low-latency case, where inputs are small and we use a small block size like 8 (with A/B swap), tmem_store working on the unpadded, small tensor might copy just one column into TMEM, while tmem_copy ends up over-copying four columns.
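The over-copy argument above is just padding arithmetic; a tiny sketch makes it explicit (the function name and the padding-granularity parameter are illustrative, not part of the Triton API — the figures plugged in are the ones from this discussion):

```python
import math

def overcopy_ratio(real_cols, pad_to):
    """Ratio of TMEM columns moved when a copy must operate on data
    padded up to `pad_to` columns, versus copying only `real_cols`."""
    padded_cols = math.ceil(real_cols / pad_to) * pad_to
    return padded_cols / real_cols

# With 1 real column of scales and a copy granularity of 4 columns
# (the numbers cited above), the padded copy moves 4x the data.
ratio = overcopy_ratio(1, 4)
```

The larger `pad_to` is relative to the real column count, the worse the padded path looks, which is the "ratio of padded elements to real ones" point from the earlier comment.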

This is true for pipeliner as well.

Not sure what you mean by this? Are you saying that WS actually has an issue with local_load? Scales are copied into smem by the load partition, so the local_load in the "tmem copy partition" can work with that.

@masahi masahi mentioned this pull request Dec 9, 2025