
Conversation

@masahi
Collaborator

@masahi masahi commented Nov 21, 2025

There are two ways to copy scales into TMEM: tmem_copy and tmem_store. For the latter, making the MMA asynchronous while scales are being copied into TMEM requires that the scales be double-buffered in TMEM. So far neither SWP nor WS implements such double buffering, so using an MMA op whose scales are copied via tmem_store forces the MMA to be synchronous.

Motivated by applications for which tmem_copy might be difficult to apply, such as MoE with activation quantization, this PR enables double buffering of TMEM scales in WS when the scales are copied by tmem_store. We introduce a new predefined partition in partition-schedule that is responsible for storing scales into double-buffered TMEM. The MMA can now be made asynchronous when its scale operand is double-buffered in TMEM, in addition to the tmem_copy case.
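To make the double-buffering idea concrete, here is a minimal conceptual sketch (plain Python, not Triton's actual implementation — `NUM_BUFFERS`, `schedule`, and the slot arithmetic are illustrative assumptions). It shows why two TMEM slots are enough for the scale-store partition to fill one slot while the MMA consumes the other, so the two stages never touch the same slot in the same iteration:

```python
# Conceptual sketch of double-buffered TMEM scales (hypothetical names).
# The store partition fills slot i % 2 at iteration i; the MMA consumes
# the slot that was filled in the previous iteration. With two buffers,
# the producer and consumer slots always differ, so the MMA can run
# asynchronously with the next tmem_store.

NUM_BUFFERS = 2  # two TMEM slots for the scale operand

def schedule(num_iters):
    """Return (iter, store_slot, mma_slot) triples for each iteration.

    mma_slot is None at iteration 0, since there is nothing to consume yet.
    """
    events = []
    for i in range(num_iters):
        store_slot = i % NUM_BUFFERS               # scale-store partition writes here
        mma_slot = (i - 1) % NUM_BUFFERS if i > 0 else None  # MMA reads last iter's slot
        events.append((i, store_slot, mma_slot))
    return events
```

Note that with a single buffer the two slot indices would always collide, which is exactly why an un-double-buffered tmem_store forces the MMA to be synchronous.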

@masahi masahi requested a review from ptillet as a code owner November 21, 2025 09:41
@masahi masahi removed the request for review from ptillet November 21, 2025 09:53
@masahi masahi requested a review from 3gx November 21, 2025 10:13
@ThomasRaoux
Collaborator

I wonder if we need/want this case in practice. Having scales going through tmem store will still break the pipeline as we need to load from smem to register. Do we have cases where we cannot swizzle?

@ThomasRaoux
Collaborator

Based on discussion with @ptillet, we think we don't want this optimization. The result would still be very suboptimal, and the user should always be able to swizzle in HBM.

@masahi
Collaborator Author

masahi commented Nov 21, 2025

The unswizzled path is expected to be more performant for small inputs, for which runtime swizzling is relatively expensive, especially considering that for MoE the padding to a multiple of 128 needs to be done per expert. UPDATE: I think more relevant is the cost of tmem_copy on the padded tensor vs tmem_store on the unpadded one, when the ratio of padded elements to real ones is large.

Having scales going through tmem store will still break the pipeline as we need to load from smem to register

This might be the case for SWP, but I don't think it presents any difficulty for WS.

@3gx
Collaborator

3gx commented Nov 21, 2025

Having scales going through tmem store will still break the pipeline as we need to load from smem to register

This might be the case for SWP, but I don't think it presents any difficulty for WS.

+1, if local_load + tmem_store is done in a dedicated partition, then I don't think it would be an issue with WS, since that partition can run in parallel with the other partitions.

@ThomasRaoux
Collaborator

Having scales going through tmem store will still break the pipeline as we need to load from smem to register

This might be the case for SWP, but I don't think it presents any difficulty for WS.

+1, if local_load + tmem_store is done in a dedicated partition, then I don't think it would be an issue with WS, since that partition can run in parallel with the other partitions.

This is true for pipeliner as well.

The unswizzled path is expected to be more performant for small inputs, for which runtime swizzling is relatively expensive, especially considering that for MoE the padding to a multiple of 128 needs to be done per expert. UPDATE: I think more relevant is the cost of tmem_copy on the padded tensor vs tmem_store on the unpadded one, when the ratio of padded elements to real ones is large.

Are you saying that because we need to pad scales to 32 bits along K, it is more efficient to not do tmem_copy? tmem_store also works at 32-bit granularity, right?

@masahi
Collaborator Author

masahi commented Nov 21, 2025

Are you saying that because we need to pad scales to 32 bits along K, it is more efficient to not do tmem_copy? tmem_store also works at 32-bit granularity, right?

No, in the low-latency case, where inputs are small and we use a small block size like 8 (with A/B swap), tmem_store working on the unpadded, small tensor might copy just one column into TMEM, while tmem_copy ends up over-copying four columns.
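The over-copy argument above is just padding arithmetic; a tiny sketch makes it explicit (the function name and the padding-granularity parameter are illustrative, not part of the Triton API — the figures plugged in are the ones from this discussion):

```python
import math

def overcopy_ratio(real_cols, pad_to):
    """Ratio of TMEM columns moved when a copy must operate on data
    padded up to `pad_to` columns, versus copying only `real_cols`."""
    padded_cols = math.ceil(real_cols / pad_to) * pad_to
    return padded_cols / real_cols

# With 1 real column of scales and a copy granularity of 4 columns
# (the numbers cited above), the padded copy moves 4x the data.
ratio = overcopy_ratio(1, 4)
```

The larger `pad_to` is relative to the real column count, the worse the padded path looks, which is the "ratio of padded elements to real ones" point from the earlier comment.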

This is true for pipeliner as well.

Not sure what you mean by this? Are you saying that WS actually has an issue with local_load? Scales are copied into smem by the load partition, so the local_load in the "tmem copy partition" can work with that.

@masahi masahi mentioned this pull request Dec 9, 2025