
Conversation

@jkwak-work (Collaborator) commented Sep 11, 2025

This commit removes unnecessary Load and Store pairs in the IR.

When the IR looks like

let %1 = var
let %2 = load(%ptr)
store(%1, %2)

this PR replaces all uses of %1 with %ptr, and the load and store instructions are removed.

I found that there can be cases where %2 is still used later by other instructions.
In those cases, removal of the load instruction relies on DCE.
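For illustration, the core rewrite looks roughly like this (a simplified sketch written in the style of Slang's IR utilities, not the exact code in this PR; the safety checks discussed later in this thread are omitted):

// Simplified sketch of the rewrite (illustrative only).
// Pattern:   %1 = var;  %2 = load(%ptr);  store(%1, %2)
// Rewrite:   replace uses of %1 with %ptr; the store is dropped and the
//            load is left for DCE if %2 still has other users.
static void tryForwardLoadStorePair(IRVar* var)
{
    // Find the single store into `var`.
    IRStore* onlyStore = nullptr;
    for (auto use = var->firstUse; use; use = use->nextUse)
    {
        if (auto store = as<IRStore>(use->getUser()))
        {
            if (store->getPtr() != var)
                continue;
            if (onlyStore)
                return; // more than one store: give up
            onlyStore = store;
        }
    }
    if (!onlyStore)
        return;

    // The stored value must come directly from a load.
    auto load = as<IRLoad>(onlyStore->getVal());
    if (!load)
        return;

    // Forward the original address and drop the now-redundant store.
    var->replaceUsesWith(load->getPtr());
    onlyStore->removeAndDeallocate();
}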

@jkwak-work jkwak-work self-assigned this Sep 11, 2025
@jkwak-work jkwak-work added the pr: non-breaking PRs without breaking changes label Sep 11, 2025
@jkwak-work jkwak-work changed the title Remove unnecessary Load and Store pair [WIP] Remove unnecessary Load and Store pair Sep 11, 2025
@shader-slang shader-slang deleted a comment from slangbot Sep 11, 2025
@jkwak-work (Collaborator, Author) commented Sep 11, 2025

I need some help understanding the failing test:
https://github.com/shader-slang/slang/tree/master/tests/autodiff/reverse-inout-param-2.slang

The test expects the result to be

  • 5, 9, 1, 2, 1

But with my change, I get

  • 5, 9, 9, 3, 1

The 3rd value comes from p.m and the 4th from p.n.

I think the problem is in this function:

[BackwardDifferentiable]
void f(inout no_diff D p, out no_diff D p0, out ND v1, inout ND v2, float x, out float y)
{
    // v2.nd is 3.
    g(p, p0, v1, v2, x, y);
    // v2.nd is now 4, now g is equivalent to detach(4+x)*x, so g' = 9.
    g(p, p0, v1, v2, x, y);
}

Note that p has the inout modifier, and I think the intention is for p to be modified after the calls to g.

Without this PR, Slang emits the following when targeting CPP.

static void s_bwd_f_0(D_0 * _S50, ND_0 * _S51, DiffPair_float_0 * _S52, float _S53)
{
    D_0 _S54 = *_S50;
    ND_0 _S55 = *_S51;
    D_0 _S56;
    ND_0 _S57;
    float _S58;
    s_bwd_prop_f_Intermediates_0 _S59;
    s_primal_ctx_f_0(&_S54, &_S56, &_S57, &_S55, (*_S52).primal_0, &_S58, &_S59);
    D_0 _S60 = *_S50;
    ND_0 _S61 = *_S51;
    s_bwd_prop_f_Intermediates_0 _S62 = _S59;
    s_bwd_prop_f_0(&_S60, &_S61, _S52, _S53, &_S62);
    return;
}

Note that *_S50 remains unchanged, because local copies are used for the calls to s_primal_ctx_f_0() and s_bwd_prop_f_0().
Strangely, _S54 and _S60 are not copied back to *_S50.
I expect that we should have something like

s_primal_ctx_f_0(&_S54, &_S56, &_S57, &_S55, (*_S52).primal_0, &_S58, &_S59);
*_S50 = _S54;

I am not sure if this is intentional or a bug.

With this PR, Slang now emits the following when targeting CPP.

static void s_bwd_f_0(D_0 * _S50, ND_0 * _S51, DiffPair_float_0 * _S52, float _S53)
{
    D_0 _S54;
    ND_0 _S55;
    float _S56;
    s_bwd_prop_f_Intermediates_0 _S57;
    s_primal_ctx_f_0(_S50, &_S54, &_S55, _S51, (*_S52).primal_0, &_S56, &_S57);
    s_bwd_prop_f_Intermediates_0 _S58 = _S57;
    s_bwd_prop_f_0(_S50, _S51, _S52, _S53, &_S58);
    return;
}

Note that *_S50 can now be modified, because the pointer is passed directly to the function calls.

@skallweitNV (Collaborator)

I went ahead and tested this fix in the falcor2 codebase. This already helps a lot, CUDA compile times for the main pathtrace shader go from 100s to around 5s and the framerate goes from "frames per minute" to "frames per second", so that's a good start!

However, looking at the generated code we still get expensive copies sometimes. One example is:

public struct Scene { ... } // large scene struct

public ParameterBlock<Scene> g_scene;

public struct EnvMap
{
    ...
    public float3 eval(float2 uv)
    {
        return g_scene.sample_texture_linear(texture_id, uv) * scaling_factor;
    }
}

This is compiled to something like:

__device__ float3  EnvMap_eval_0(EnvMap_0 * this_31, float2  * uv_9)
{
    Scene_0 _S182 = *globalParams_0->g_scene_0;
    float3  _S183 = Scene_sample_texture_linear_0(&_S182, this_31->texture_id_0, uv_9);
    return _S183 * this_31->scaling_factor_0;
}

_S182 is still a value type and does an expensive copy of the full scene struct. What we need is something like:

__device__ float3  EnvMap_eval_0(EnvMap_0 * this_31, float2  * uv_9)
{
    Scene_0* _S182 = globalParams_0->g_scene_0;
    float3  _S183 = Scene_sample_texture_linear_0(_S182, this_31->texture_id_0, uv_9);
    return _S183 * this_31->scaling_factor_0;
}

@csyonghe (Collaborator)

It is intentional that _S60 is not copied back after calling the prop function. If your change is breaking this test, then it is making wrong assumptions and making changes that it shouldn't be making.

You can only remove a load/store pair if you can prove it is safe to do so, which means the behavior of the code won't change. This is only possible when the address being loaded can be proved to be immutable between the load and the use, and the callee isn't modifying the data through the address.

@csyonghe (Collaborator) commented Sep 14, 2025

Basically you should look for

store(var, load(address))
call(var)

and change it to call(address) only when (a rough sketch of these checks follows the list):

  1. The parameter type of the callee is an IRConstRef, not IRInOut or anything else.
  2. The var is only used in the store and the call.
  3. The address is immutable, i.e. its root is an IRConstRef or comes from a read-only buffer.
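
A minimal sketch of those conditions (illustrative only; the helpers marked "hypothetical" below are placeholders, not actual Slang functions):

// Illustrative sketch of the three safety conditions above (not the PR's exact code).
static bool isSafeToForwardAddress(IRCall* call, IRVar* var, IRStore* store, IRLoad* load)
{
    // 1. The callee parameter receiving `var` must be constref.
    //    (getCalleeParamTypeForArg is a hypothetical helper that looks up the
    //    parameter type matching this argument position.)
    IRType* paramType = getCalleeParamTypeForArg(call, var);
    if (!as<IRConstRefType>(paramType))
        return false;

    // 2. `var` must only be used by this store and this call.
    for (auto use = var->firstUse; use; use = use->nextUse)
    {
        auto user = use->getUser();
        if (user != store && user != call)
            return false;
    }

    // 3. The loaded address must be immutable: its root should be a constref
    //    parameter or a read-only buffer. (getRootAddressInst and
    //    isReadOnlyResource are hypothetical helpers; a real check must also
    //    ensure nothing writes through the address between the load and the call.)
    IRInst* root = getRootAddressInst(load->getPtr());
    if (!as<IRConstRefType>(root->getDataType()) && !isReadOnlyResource(root))
        return false;

    return true;
}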

@jkwak-work (Collaborator, Author)

Thanks for the review.
The code was mostly written by an LLM. I am sorry about that, but I was rushing to get it out for review and get help as early as possible, given the urgency of the matter.

It makes sense that the optimization should be limited to immutable cases. I will make the changes and get them tested.

@jkwak-work jkwak-work force-pushed the fix/avoid-unnecessary-copy-of-param branch from f7dd0f9 to bf10743 Compare September 16, 2025 17:47
@jkwak-work (Collaborator, Author) commented Sep 16, 2025

I pushed a new implementation that applies the optimization to more functions.
The previous version applied only to certain functions that got the const-ref fixup, and only on the CUDA/CPP targets.
The new change applies to all functions on all targets, which I expect to improve performance further.

This new change should address the problem @skallweitNV mentioned above.

However, the PR is still in draft because one of the tests, tests/compute/kernel-context-threading.slang, is failing/crashing with this PR.
I am debugging it.

@jkwak-work jkwak-work marked this pull request as ready for review September 16, 2025 23:20
@jkwak-work jkwak-work requested a review from a team as a code owner September 16, 2025 23:20
@jkwak-work jkwak-work changed the title [WIP] Remove unnecessary Load and Store pair Remove unnecessary Load and Store pair Sep 16, 2025
@shader-slang shader-slang deleted a comment from slangbot Sep 16, 2025
@jkwak-work (Collaborator, Author)

I pushed another fix that addresses all of the known problems.
It is ready for another review.

@csyonghe (Collaborator) left a review comment

I am not sure that this PR is optimizing out everything that Simon is seeing. Can you double check that access to nested elements in a parameter block can also be eliminated? Your code doesn't seem to be able to recognize that (I have commented).

if (!loadInst)
    continue;

// Do not optimize the primitive types because the legalization step may assume the
Collaborator
I don't understand why we need this. Does not seem principled.

Collaborator Author

As I was trying to explain this in more detail with examples, I ran into more questions about why I needed the change.
I will leave a comment later, after I debug and understand it better.

For now, here is a quick answer as to why it was needed:
this change is required to avoid breaking one of the tests, tests/spirv/nested-entrypoint.slang.

Collaborator Author

I ended up using a slightly different workaround for the same reason.
Here is an explanation.

Under normal operation, the test tests/spirv/nested-entrypoint.slang produces the following IR dump:

[nameHint("outerMain")]
[layout(%1)]
func %outerMain : Func(Void, ConstRef(Int, 1 : UInt64, 2147483647 : UInt64))
{
block %29(
                [layout(%7)]
                [nameHint("id")]
                [semantic("SV_DispatchThreadID", 0 : Int)]
                param %id2      : ConstRef(Int, 1 : UInt64, 2147483647 : UInt64)):
        let  %30        : Int   = load(%id2)
        let  %31        : Ptr(Int)      = var
        store(%31, %30)
        call %innerMain(%31)
        return_val(void_constant)
}

Then it gets legalized to the following:

[import("gl_GlobalInvocationID")]
[layout(%1)]
let  %34        : Ptr(Vec(UInt, 3 : Int), 0 : UInt64, 7 : UInt64)       = global_param
[entryPoint(6 : Int, "outerMain", "nested-entrypoint")]
[keepAlive]
[numThreads(1 : Int, 1 : Int, 1 : Int)]
[export("_SR20nested_2Dxentrypoint9outerMainp1pi_iV")]
[nameHint("outerMain")]
[layout(%6)]
func %outerMain : Func(Void)
{
block %35:
        let  %36        : Vec(UInt, 3 : Int)    = load(%34)
        let  %37        : UInt  = swizzle(%36, 0 : Int)
        let  %38        : Int   = intCast(%37)
        let  %39        : Ptr(Int)      = var
        store(%39, %38)
        call %innerMain(%39)
        return_val(void_constant)
}

Because my load/store optimization happens before this legalization, the legalization step gets broken.
My load/store optimization changes the first dump to the following:

[nameHint("outerMain")]
[layout(%1)]
func %outerMain : Func(Void, ConstRef(Int, 1 : UInt64, 2147483647 : UInt64))
{
block %29(
                [layout(%7)]
                [nameHint("id")]
                [semantic("SV_DispatchThreadID", 0 : Int)]
                param %id2      : ConstRef(Int, 1 : UInt64, 2147483647 : UInt64)):
        call %innerMain(%id2)
        return_val(void_constant)
}

Note that "innerMain" is called directly with "%id2", which is a good/expected behavior.
But then the legalization happens later and the following "broken" IR dump is generated,

[import("gl_GlobalInvocationID")]
[layout(%1)]
let  %34        : Ptr(Vec(UInt, 3 : Int), 0 : UInt64, 7 : UInt64)       = global_param
[entryPoint(6 : Int, "outerMain", "nested-entrypoint")]
[keepAlive]
[numThreads(1 : Int, 1 : Int, 1 : Int)]
[export("_SR20nested_2Dxentrypoint9outerMainp1pi_iV")]
[nameHint("outerMain")]
[layout(%6)]
func %outerMain : Func(Void)
{
block %35:
        let  %36        : Vec(UInt, 3 : Int)    = load(%34)
        let  %37        : UInt  = swizzle(%36, 0 : Int)
        let  %38        : Int   = intCast(%37)
        call %innerMain(%38)
        return_val(void_constant)
}

The problem is that "innerMain" no longer receives a pointer value.

The relevant legalization is in slang-ir-glsl-legalization.cpp.
It has a concept of an "actualType" and a "pretendType": it converts a builtin variable whose actual type is a vector (e.g. float3) into the scalar value the code pretends to use, removing/hiding the load from the vector so it appears to be a scalar.

I may be able to make some changes in the legalization step, but it became pretty complicated.
So I decided to skip the load/store optimization when the value is loaded from a variable that has a semantic.
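
For reference, the guard amounts to something like this (a simplified sketch; the address-chain walk and the decoration check are written with Slang-IR-style names and may not match the exact code in the PR):

// Simplified sketch: skip the load/store forwarding when the loaded address
// traces back to a variable/parameter carrying a semantic decoration
// (e.g. SV_DispatchThreadID), since GLSL/SPIR-V legalization rewrites such
// loads later and expects to find them.
static bool addrHasSemanticRoot(IRInst* addr)
{
    // Walk through field/element address chains back to the root address.
    for (;;)
    {
        if (auto fieldAddr = as<IRFieldAddress>(addr))
        {
            addr = fieldAddr->getBase();
            continue;
        }
        if (auto elemAddr = as<IRGetElementPtr>(addr))
        {
            addr = elemAddr->getBase();
            continue;
        }
        break;
    }
    return addr->findDecoration<IRSemanticDecoration>() != nullptr;
}

The pass then skips the rewrite whenever addrHasSemanticRoot(load->getPtr()) is true.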

@csyonghe (Collaborator)

Besides the optimization implemented in this PR, we should also find out the places where we are inserting the temp vars and see if we can avoid emitting the temp copies in the first place, as Tess suggested.

@jkwak-work jkwak-work force-pushed the fix/avoid-unnecessary-copy-of-param branch 2 times, most recently from 4ead542 to 0fe6fd6 Compare September 23, 2025 03:32
@jkwak-work (Collaborator, Author) commented Sep 23, 2025

I pushed a new change that addresses all comments.

It looks like the Metal tests are failing too:

tests/autodiff/dynamic-dispatch-material.slang.3 syn (mtl)
tests/bugs/dyn-dispatch-single-conformance.slang.1 syn (mtl)
tests/bugs/dynamic-interface-property.slang.1 syn (mtl)
tests/compute/cbuffer-legalize.slang.2 syn (mtl)
tests/hlsl-intrinsic/packed/pack-unpack.slang.2 (mtl)
tests/hlsl-intrinsic/packed/pack-unpack.slang.8 (mtl)
tests/language-feature/anyvalue-matrix-layout.slang.1 syn (mtl)

I will investigate more tomorrow.

@jkwak-work jkwak-work force-pushed the fix/avoid-unnecessary-copy-of-param branch 2 times, most recently from 4badd55 to 6322903 Compare September 23, 2025 07:40
@jkwak-work (Collaborator, Author) commented Sep 24, 2025

All failing tests are now passing.
The problem on Metal was that there were cases where the address spaces were not compatible.

I applied some refactoring based on the code review feedback.

I assume that the PR is verbally approved.
I will merge as soon as CI passes.

if (auto paramPtrType = as<IRConstRefType>(param->getFullType()))
{
    if (paramPtrType->getAddressSpace() != loadAddressSpace)
        goto unsafeToOptimize; // incompatible address space
Collaborator
This check isn't necessary.

Collaborator Author

This was needed on macOS, where a "constant*" pointer was not compatible with a "thread*" pointer.

Collaborator Author

The test is tests/compute/cbuffer-legalize.slang, and the emitted Metal shader is as follows, for quick reference.

#include <metal_stdlib>
#include <metal_math>
#include <metal_texture>
using namespace metal;
struct P_0
{
    uint4 c_0;
};

float4 test_0(const P_0 thread* p_0, texture2d<float, access::sample> p_t_0, sampler p_s_0)
{
    return ((p_t_0).sample((p_s_0), (float2(0.0) ), level((0.0)))) + float4(p_0->c_0);
}

struct SLANG_ParameterGroup_C_0
{
    P_0 p_1;
};

struct KernelContext_0
{
    SLANG_ParameterGroup_C_0 constant* C_0;
    texture2d<float, access::sample> C_p_t_0;
    sampler C_p_s_0;
    float device* outputBuffer_0;
};

[[kernel]] void computeMain(uint3 dispatchThreadID_0 [[thread_position_in_grid]], SLANG_ParameterGroup_C_0 constant* C_1 [[buffer(0)]], texture2d<float, access::sample> C_p_t_1 [[texture(0)]], sampler C_p_s_1 [[sampler(0)]], float device* outputBuffer_1 [[buffer(1)]])
{
    KernelContext_0 kernelContext_0;
    (&kernelContext_0)->C_0 = C_1;
    (&kernelContext_0)->C_p_t_0 = C_p_t_1;
    (&kernelContext_0)->C_p_s_0 = C_p_s_1;
    (&kernelContext_0)->outputBuffer_0 = outputBuffer_1;
    P_0 _S1 = (&kernelContext_0)->C_0->p_1;
    float4 _S2 = test_0(&_S1, (&kernelContext_0)->C_p_t_0, (&kernelContext_0)->C_p_s_0);
    *(outputBuffer_1+int(0)) = _S2.x;
    *((&kernelContext_0)->outputBuffer_0+int(1)) = _S2.y;
    *((&kernelContext_0)->outputBuffer_0+int(2)) = _S2.z;
    *((&kernelContext_0)->outputBuffer_0+int(3)) = _S2.w;
    return;
}

The trouble occurs when test_0() is called as follows:

    float4 _S2 = test_0(&((&kernelContext_0)->C_0->p_1), (&kernelContext_0)->C_p_t_0, (&kernelContext_0)->C_p_s_0);

The constant* pointer was not compatible with the thread* pointer.

@jkwak-work (Collaborator, Author) commented Sep 24, 2025

I made the change to treat a few more cases as mutable.
The performance result with Falcor2 hasn't changed; it is still 10x slower for the specific test we were talking about.

@shader-slang shader-slang deleted a comment from slangbot Sep 24, 2025
@jkwak-work jkwak-work added this pull request to the merge queue Sep 24, 2025
Merged via the queue into shader-slang:master with commit 9c2024a Sep 24, 2025
64 of 66 checks passed