Optimize layout for SubgroupMatrixLoad on Intel #25384
Conversation
On LNL with the latest driver 32.0.101.6913, Prefill can reach 828 tps.
This introduces a new LayoutProgram to pre-process the input matrix A, converting it to a layout that is more efficient for the SubgroupMatrixLoad operation on Intel GPUs.
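As a rough illustration of what the re-layout does (a Python sketch of the tiling idea only, not the actual LayoutProgram shader; the `relayout` helper and tile constants are hypothetical names, with the 8x16 subgroup-matrix shape and the [32x64] → [128x16] example taken from the discussion below):

```python
# Illustrative sketch (not the real WGSL/C++ code): re-tile matrix A so that
# each 8x16 subgroup-matrix tile is stored contiguously in memory, with tiles
# emitted in row-major order (s0, s1, s2, ...).

TILE_R, TILE_C = 8, 16  # subgroup matrix shape from the discussion

def relayout(a, rows, cols):
    """a: flat row-major list of rows*cols elements.
    Returns a flat list where each TILE_R x TILE_C tile is contiguous,
    tiles ordered row-major across the input."""
    out = []
    for tr in range(rows // TILE_R):        # tile row
        for tc in range(cols // TILE_C):    # tile column
            for r in range(TILE_R):         # row within the tile
                base = (tr * TILE_R + r) * cols + tc * TILE_C
                out.extend(a[base : base + TILE_C])
    return out

rows, cols = 32, 64
a = list(range(rows * cols))
b = relayout(a, rows, cols)  # logically a [128, 16] layout
# Tile s1 (tile row 0, tile column 1) now occupies one contiguous
# 128-element run starting at b[128]; its first row was a[16:32].
```

With this layout, one SubgroupMatrixLoad reads a single contiguous 128-element block instead of eight 16-element strips separated by a 64-element stride.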
Excellent work, Jie!
If I understand correctly, your layout shader is like below (assume s0 is subgroup matrix 0):
Input: [32x64] with subgroup matrix [8x16]
s0,  s1,  s2,  s3,
s4,  s5,  s6,  s7,
s8,  s9,  s10, s11,
s12, s13, s14, s15
Output: [128, 16] with subgroup matrix [8x16]
s0,
s1,
s2,
s3,
s4,
s5,
s6,
s7,
s8,
s9,
s10,
s11,
s12,
s13,
s14,
s15,
This change ensures that each subgroup's data is contiguous in memory. I wonder whether it would further help performance if we reassigned the layout like below:
s0,
s4,
s8,
s12,
s1,
s5,
s9,
s13,
s2,
s6,
s10,
s14,
s3,
s7,
s11,
s15
This would ensure that all subgroups in one workgroup access contiguous data in memory, rather than just each subgroup individually. Just curious about the result.
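The alternative ordering proposed above can be sketched as a column-major traversal of the 4x4 tile grid (Python, illustrative only; tiles s0..s15 are numbered row-major in the input as in the first diagram):

```python
# Reordering the 4x4 grid of tiles s0..s15: instead of emitting tiles
# row-major (s0, s1, s2, ...), emit them column-major
# (s0, s4, s8, s12, s1, s5, ...), so that consecutive tiles in the output
# belong to the different subgroups of one workgroup rather than one subgroup.

TILE_ROWS, TILE_COLS = 4, 4

row_major = list(range(TILE_ROWS * TILE_COLS))  # s0, s1, s2, ..., s15
col_major = [r * TILE_COLS + c
             for c in range(TILE_COLS)
             for r in range(TILE_ROWS)]
```

Here `col_major` enumerates the tile indices in the order shown in the second diagram, which interleaves the tiles so the workgroup as a whole walks memory contiguously.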
Good point. I tried this approach; unfortunately it only improved things slightly, by less than 20 tps.
Without this PR, prefill reaches 628 tps, so the improvement is 628 → 828 tps, about 32%.
LGTM, thanks!
@sushraja-msft PTAL
LGTM otherwise
CI nagging: run 'lintrunner -a'
291f374
Done, thanks!
One more comment. Still LGTM otherwise.