Optimize layout for SubgroupMatrixLoad on Intel #25384
Conversation
On LNL with the latest driver 32.0.101.6913, Prefill can reach 828 tps.
This introduces a new LayoutProgram to pre-process the input matrix A, converting it to a layout that is more efficient for the SubgroupMatrixLoad operation on Intel GPUs.
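As a rough illustration of what the re-layout does (a Python sketch of the tiling idea only, not the actual LayoutProgram shader; the `relayout` helper and tile constants are hypothetical names, with the 8x16 subgroup-matrix shape and the [32x64] → [128x16] example taken from the discussion below):

```python
# Illustrative sketch (not the real WGSL/C++ code): re-tile matrix A so that
# each 8x16 subgroup-matrix tile is stored contiguously in memory, with tiles
# emitted in row-major order (s0, s1, s2, ...).

TILE_R, TILE_C = 8, 16  # subgroup matrix shape from the discussion

def relayout(a, rows, cols):
    """a: flat row-major list of rows*cols elements.
    Returns a flat list where each TILE_R x TILE_C tile is contiguous,
    tiles ordered row-major across the input."""
    out = []
    for tr in range(rows // TILE_R):        # tile row
        for tc in range(cols // TILE_C):    # tile column
            for r in range(TILE_R):         # row within the tile
                base = (tr * TILE_R + r) * cols + tc * TILE_C
                out.extend(a[base : base + TILE_C])
    return out

rows, cols = 32, 64
a = list(range(rows * cols))
b = relayout(a, rows, cols)  # logically a [128, 16] layout
# Tile s1 (tile row 0, tile column 1) now occupies one contiguous
# 128-element run starting at b[128]; its first row was a[16:32].
```

With this layout, one SubgroupMatrixLoad reads a single contiguous 128-element block instead of eight 16-element strips separated by a 64-element stride.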
Excellent work, Jie!
If I understand correctly, your layout shader is like below (assume s0 is subgroup matrix 0):
Input: [32x64] with subgroup matrix [8x16]
s0,  s1,  s2,  s3,
s4,  s5,  s6,  s7,
s8,  s9,  s10, s11,
s12, s13, s14, s15
Output: [128, 16] with subgroup matrix [8x16]
s0,
s1,
s2,
s3,
s4,
s5,
s6,
s7,
s8,
s9,
s10,
s11,
s12,
s13,
s14,
s15,
This change ensures that each subgroup's data is contiguous in memory. I wonder whether it would further help performance if we reassigned the layout like below:
s0,
s4,
s8,
s12,
s1,
s5,
s9,
s13,
s2,
s6,
s10,
s14,
s3,
s7,
s11,
s15
This would ensure that all subgroups in one workgroup access contiguous data in memory, rather than just each subgroup individually. Just curious about the result.
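The alternative ordering proposed above can be sketched as a column-major traversal of the 4x4 tile grid (Python, illustrative only; tiles s0..s15 are numbered row-major in the input as in the first diagram):

```python
# Reordering the 4x4 grid of tiles s0..s15: instead of emitting tiles
# row-major (s0, s1, s2, ...), emit them column-major
# (s0, s4, s8, s12, s1, s5, ...), so that consecutive tiles in the output
# belong to the different subgroups of one workgroup rather than one subgroup.

TILE_ROWS, TILE_COLS = 4, 4

row_major = list(range(TILE_ROWS * TILE_COLS))  # s0, s1, s2, ..., s15
col_major = [r * TILE_COLS + c
             for c in range(TILE_COLS)
             for r in range(TILE_ROWS)]
```

Here `col_major` enumerates the tile indices in the order shown in the second diagram, which interleaves the tiles so the workgroup as a whole walks memory contiguously.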
Good point. I tried this approach; unfortunately it only improved things slightly, by less than 20 tps.
Without this PR, prefill reaches 628 tps, so the improvement is 628 → 828 tps, about 32%.
LGTM, thanks!
@sushraja-msft PTAL
LGTM otherwise
CI nagging: run 'lintrunner -a'
291f374
Done, thanks!
One more comment. Still LGTM otherwise.