Remove unnecessary Load and Store pair #8433
Conversation
I need some help understanding the failing test. The test expects the result to be
But with my change, I get
The 3rd value comes from
I think the problem is in this function,
Note that
Without this PR, Slang emits the following when targeting CPP.
Note that
I am not sure whether this is intentional or a bug. With this PR, Slang now emits the following when targeting CPP.
Note that
I went ahead and tested this fix in the falcor2 codebase. This already helps a lot, CUDA compile times for the main pathtrace shader go from 100s to around 5s and the framerate goes from "frames per minute" to "frames per second", so that's a good start! However, looking at the generated code we still get expensive copies sometimes. One example is:
Which is compiled to something like this:
It is intentional that s60 is not copied back after calling the prop function. If your change is breaking this test, then it is making wrong assumptions and making changes that it shouldn't be making. You can only remove a load/store pair if you can prove it is safe to do so, which means the behavior of the code won't change. This means it is only possible when the address being loaded can be proved to be immutable between the load and the use, and the callee isn't modifying the data through the address.
Basically you should look for Store(var, load(address)) and change it to call(address) only when:
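To make that rule concrete, here is a minimal C++ sketch against Slang's IR API; isImmutableBetween and calleeDoesNotWriteThrough are hypothetical predicates standing in for those conditions, not functions from this PR.

#include "slang-ir.h"
#include "slang-ir-insts.h"

using namespace Slang;

// Hypothetical predicates for the two safety conditions described above.
static bool isImmutableBetween(IRInst* address, IRInst* from, IRInst* to);
static bool calleeDoesNotWriteThrough(IRInst* var);

// Sketch: for store(var, load(address)) followed by call(..., var, ...),
// forward `address` into the call sites when it is provably safe.
static void tryForwardLoadedAddress(IRStore* store)
{
    auto load = as<IRLoad>(store->getVal());
    if (!load)
        return;
    auto var = as<IRVar>(store->getPtr());
    if (!var)
        return;

    // Only rewrite when the loaded address cannot change between the load
    // and every use of `var`, and no callee writes through it.
    if (!isImmutableBetween(load->getPtr(), load, var))
        return;
    if (!calleeDoesNotWriteThrough(var))
        return;

    // Replace all uses of the temporary with the original address; the dead
    // store (and the load, if otherwise unused) can then be removed or left
    // for DCE.
    var->replaceUsesWith(load->getPtr());
}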
Thanks for the review. It makes sense that the optimization should be limited to immutable cases. I will make the changes and test them.
Force-pushed from f7dd0f9 to bf10743
I pushed a new implementation that applies the optimization to more functions. This new change should address the problem @skallweitNV mentioned above. However, the PR is still in draft because one of the tests is failing/crashing with this change.
I pushed another fix that addresses all of the known problems.
I am not sure that this PR is optimizing out everything that Simon is seeing. Can you double check that access to nested elements in a parameter block can also be eliminated? Your code doesn't seem to be able to recognize that (I have commented).
source/slang/slang-ir-dce.cpp (Outdated)
if (!loadInst)
    continue;

// Do not optimize the primitive types because the legalization step may assume the
I don't understand why we need this. Does not seem principled.
As I was trying to explain more with examples, I ran into more questions about why I needed the change.
I will leave a comment later, after I debug and understand it better.
But for now, here is a quick answer for why it was needed: this change is required to avoid breaking one of the tests, tests/spirv/nested-entrypoint.slang.
I ended up using a slightly different workaround for the same reason.
Here is an explanation.
The normal operation for the test, tests/spirv/nested-entrypoint.slang, shows the following IR dump:
[nameHint("outerMain")]
[layout(%1)]
func %outerMain : Func(Void, ConstRef(Int, 1 : UInt64, 2147483647 : UInt64))
{
block %29(
[layout(%7)]
[nameHint("id")]
[semantic("SV_DispatchThreadID", 0 : Int)]
param %id2 : ConstRef(Int, 1 : UInt64, 2147483647 : UInt64)):
let %30 : Int = load(%id2)
let %31 : Ptr(Int) = var
store(%31, %30)
call %innerMain(%31)
return_val(void_constant)
}
Then, it gets legalized to the following,
[import("gl_GlobalInvocationID")]
[layout(%1)]
let %34 : Ptr(Vec(UInt, 3 : Int), 0 : UInt64, 7 : UInt64) = global_param
[entryPoint(6 : Int, "outerMain", "nested-entrypoint")]
[keepAlive]
[numThreads(1 : Int, 1 : Int, 1 : Int)]
[export("_SR20nested_2Dxentrypoint9outerMainp1pi_iV")]
[nameHint("outerMain")]
[layout(%6)]
func %outerMain : Func(Void)
{
block %35:
let %36 : Vec(UInt, 3 : Int) = load(%34)
let %37 : UInt = swizzle(%36, 0 : Int)
let %38 : Int = intCast(%37)
let %39 : Ptr(Int) = var
store(%39, %38)
call %innerMain(%39)
return_val(void_constant)
}
Because my load/store optimization was happening before the legalization, the legalization step got broken.
My load/store optimization changes the first dump to the following,
[nameHint("outerMain")]
[layout(%1)]
func %outerMain : Func(Void, ConstRef(Int, 1 : UInt64, 2147483647 : UInt64))
{
block %29(
[layout(%7)]
[nameHint("id")]
[semantic("SV_DispatchThreadID", 0 : Int)]
param %id2 : ConstRef(Int, 1 : UInt64, 2147483647 : UInt64)):
call %innerMain(%id2)
return_val(void_constant)
}
Note that "innerMain" is called directly with "%id2", which is a good/expected behavior.
But then the legalization happens later and the following "broken" IR dump is generated,
[import("gl_GlobalInvocationID")]
[layout(%1)]
let %34 : Ptr(Vec(UInt, 3 : Int), 0 : UInt64, 7 : UInt64) = global_param
[entryPoint(6 : Int, "outerMain", "nested-entrypoint")]
[keepAlive]
[numThreads(1 : Int, 1 : Int, 1 : Int)]
[export("_SR20nested_2Dxentrypoint9outerMainp1pi_iV")]
[nameHint("outerMain")]
[layout(%6)]
func %outerMain : Func(Void)
{
block %35:
let %36 : Vec(UInt, 3 : Int) = load(%34)
let %37 : UInt = swizzle(%36, 0 : Int)
let %38 : Int = intCast(%37)
call %innerMain(%38)
return_val(void_constant)
}
The problem here is that now "innerMain" is not getting a pointer value.
The legalization related to this is in slang-ir-glsl-legalization.cpp.
It has a concept of an "actualType" and a "pretendType": it converts a builtin variable whose type is a vector (uint3 in this test) to a scalar value.
The legalization is meant to remove/hide the load instruction from the vector and make the result pretend to be a scalar.
I may be able to make some changes in the legalization step, but it became pretty complicated.
So I decided to skip the load/store optimization when the value is loaded from a variable with a semantic, as sketched below.
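For reference, a minimal C++ sketch of that workaround, assuming Slang's findDecoration API; the helper name is hypothetical.

#include "slang-ir.h"
#include "slang-ir-insts.h"

using namespace Slang;

// Hypothetical helper: true when the load reads from a variable that carries
// a semantic decoration (e.g. SV_DispatchThreadID). In that case the
// load/store optimization is skipped, so GLSL legalization can still rewrite
// the load into its scalar "pretend" value instead of receiving a pointer.
static bool loadsFromSemanticVar(IRLoad* loadInst)
{
    IRInst* addr = loadInst->getPtr();
    return addr && addr->findDecoration<IRSemanticDecoration>() != nullptr;
}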
Besides the optimization implemented in this PR, we should also find the places where we are inserting the temp vars and see if we can avoid emitting the temp copies in the first place, as Tess suggested.
Force-pushed from 4ead542 to 0fe6fd6
I pushed a new change that addresses all comments. It looks like the Metal tests are failing too; I will investigate more tomorrow.
Force-pushed from 4badd55 to 6322903
All failing tests are now passing. I applied a few refactorings based on the code review feedback. I assume that the PR is verbally approved.
if (auto paramPtrType = as<IRConstRefType>(param->getFullType()))
{
    if (paramPtrType->getAddressSpace() != loadAddressSpace)
        goto unsafeToOptimize; // incompatible address space
}
This check isn't necessary.
This was needed on macOS, where a "constant*" pointer was not compatible with a "thread*" pointer.
The test is tests/compute/cbuffer-legalize.slang, and the emitted Metal shader is as follows, for quick reference.
#include <metal_stdlib>
#include <metal_math>
#include <metal_texture>
using namespace metal;
struct P_0
{
uint4 c_0;
};
float4 test_0(const P_0 thread* p_0, texture2d<float, access::sample> p_t_0, sampler p_s_0)
{
return ((p_t_0).sample((p_s_0), (float2(0.0) ), level((0.0)))) + float4(p_0->c_0);
}
struct SLANG_ParameterGroup_C_0
{
P_0 p_1;
};
struct KernelContext_0
{
SLANG_ParameterGroup_C_0 constant* C_0;
texture2d<float, access::sample> C_p_t_0;
sampler C_p_s_0;
float device* outputBuffer_0;
};
[[kernel]] void computeMain(uint3 dispatchThreadID_0 [[thread_position_in_grid]], SLANG_ParameterGroup_C_0 constant* C_1 [[buffer(0)]], texture2d<float, access::sample> C_p_t_1 [[texture(0)]], sampler C_p_s_1 [[sampler(0)]], float device* outputBuffer_1 [[buffer(1)]])
{
KernelContext_0 kernelContext_0;
(&kernelContext_0)->C_0 = C_1;
(&kernelContext_0)->C_p_t_0 = C_p_t_1;
(&kernelContext_0)->C_p_s_0 = C_p_s_1;
(&kernelContext_0)->outputBuffer_0 = outputBuffer_1;
P_0 _S1 = (&kernelContext_0)->C_0->p_1;
float4 _S2 = test_0(&_S1, (&kernelContext_0)->C_p_t_0, (&kernelContext_0)->C_p_s_0);
*(outputBuffer_1+int(0)) = _S2.x;
*((&kernelContext_0)->outputBuffer_0+int(1)) = _S2.y;
*((&kernelContext_0)->outputBuffer_0+int(2)) = _S2.z;
*((&kernelContext_0)->outputBuffer_0+int(3)) = _S2.w;
return;
}
The trouble arises when test_0() is called as follows:
float4 _S2 = test_0(&((&kernelContext_0)->C_0->p_1), (&kernelContext_0)->C_p_t_0, (&kernelContext_0)->C_p_s_0);
The constant* pointer was not compatible with thread*.
I made the change to treat a few more cases as mutable.
Commit 9c2024a
This commit removes unnecessary Load and Store pairs in IR.
When the IR looks like the following pattern:
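(A representative sketch; the Int type and the callee %foo are illustrative.)

let %2 : Int = load(%ptr)
let %1 : Ptr(Int) = var
store(%1, %2)
call %foo(%1)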
This PR will replace all uses of %1 with %ptr, and the load and store instructions will be removed.
But I found that there can be cases where %2 is still used later by other instructions.
For those cases, the removal of the load instruction relies on DCE, as sketched below.
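After the rewrite, the IR would look like the following sketch, with the load left in place for DCE to clean up once %2 has no remaining uses:

let %2 : Int = load(%ptr)
call %foo(%ptr)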