
Conversation

@jkwak-work (Collaborator) commented Sep 11, 2025

This commit removes unnecessary Load and Store pairs in the IR.

When the IR looks like

let %1 = var
let %2 = load(%ptr)
store(%1, %2)

this PR replaces all uses of %1 with %ptr, and the load and store instructions are removed.

I found that there can be cases where %2 is still used later by other instructions.
In those cases, removal of the load instruction relies on DCE.
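For illustration, the core rewrite looks roughly like this (a simplified sketch written in the style of Slang's IR utilities, not the exact code in this PR; the safety checks discussed later in this thread are omitted):

// Simplified sketch of the rewrite (illustrative only).
// Pattern:   %1 = var;  %2 = load(%ptr);  store(%1, %2)
// Rewrite:   replace uses of %1 with %ptr; the store is dropped and the
//            load is left for DCE if %2 still has other users.
static void tryForwardLoadStorePair(IRVar* var)
{
    // Find the single store into `var`.
    IRStore* onlyStore = nullptr;
    for (auto use = var->firstUse; use; use = use->nextUse)
    {
        if (auto store = as<IRStore>(use->getUser()))
        {
            if (store->getPtr() != var)
                continue;
            if (onlyStore)
                return; // more than one store: give up
            onlyStore = store;
        }
    }
    if (!onlyStore)
        return;

    // The stored value must come directly from a load.
    auto load = as<IRLoad>(onlyStore->getVal());
    if (!load)
        return;

    // Forward the original address and drop the now-redundant store.
    var->replaceUsesWith(load->getPtr());
    onlyStore->removeAndDeallocate();
}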

@jkwak-work jkwak-work self-assigned this Sep 11, 2025
@jkwak-work jkwak-work added the pr: non-breaking PRs without breaking changes label Sep 11, 2025
@jkwak-work jkwak-work changed the title Remove unnecessary Load and Store pair [WIP] Remove unnecessary Load and Store pair Sep 11, 2025
@shader-slang shader-slang deleted a comment from slangbot Sep 11, 2025
@jkwak-work (Collaborator, Author) commented Sep 11, 2025

I need some help understanding the failing test:
https://github.com/shader-slang/slang/tree/master/tests/autodiff/reverse-inout-param-2.slang

The test expects the result to be

  • 5, 9, 1, 2, 1

But with my change, I get

  • 5, 9, 9, 3, 1

The 3rd value comes from p.m and the 4th from p.n.

I think the problem is in this function:

[BackwardDifferentiable]
void f(inout no_diff D p, out no_diff D p0, out ND v1, inout ND v2, float x, out float y)
{
    // v2.nd is 3.
    g(p, p0, v1, v2, x, y);
    // v2.nd is now 4, now g is equivalent to detach(4+x)*x, so g' = 9.
    g(p, p0, v1, v2, x, y);
}

Note that p has the inout modifier, and I think the intention is for p to be modified after the calls to g.

Without this PR, Slang emits the following when targeting CPP.

static void s_bwd_f_0(D_0 * _S50, ND_0 * _S51, DiffPair_float_0 * _S52, float _S53)
{
    D_0 _S54 = *_S50;
    ND_0 _S55 = *_S51;
    D_0 _S56;
    ND_0 _S57;
    float _S58;
    s_bwd_prop_f_Intermediates_0 _S59;
    s_primal_ctx_f_0(&_S54, &_S56, &_S57, &_S55, (*_S52).primal_0, &_S58, &_S59);
    D_0 _S60 = *_S50;
    ND_0 _S61 = *_S51;
    s_bwd_prop_f_Intermediates_0 _S62 = _S59;
    s_bwd_prop_f_0(&_S60, &_S61, _S52, _S53, &_S62);
    return;
}

Note that *_S50 remains unchanged, because local copies are used for the calls to s_primal_ctx_f_0() and s_bwd_prop_f_0().
Strangely, _S54 and _S60 are not copied back to *_S50.
I expect that we should have something like

s_primal_ctx_f_0(&_S54, &_S56, &_S57, &_S55, (*_S52).primal_0, &_S58, &_S59);
*_S50 = _S54;

I am not sure if this is intentional or a bug.

With this PR, Slang now emits the following when targeting CPP.

static void s_bwd_f_0(D_0 * _S50, ND_0 * _S51, DiffPair_float_0 * _S52, float _S53)
{
    D_0 _S54;
    ND_0 _S55;
    float _S56;
    s_bwd_prop_f_Intermediates_0 _S57;
    s_primal_ctx_f_0(_S50, &_S54, &_S55, _S51, (*_S52).primal_0, &_S56, &_S57);
    s_bwd_prop_f_Intermediates_0 _S58 = _S57;
    s_bwd_prop_f_0(_S50, _S51, _S52, _S53, &_S58);
    return;
}

Note that *_S50 can now be modified, because the pointer is passed directly to the function calls.

@skallweitNV (Collaborator)

I went ahead and tested this fix in the falcor2 codebase. This already helps a lot, CUDA compile times for the main pathtrace shader go from 100s to around 5s and the framerate goes from "frames per minute" to "frames per second", so that's a good start!

However, looking at the generated code we still get expensive copies sometimes. One example is:

public struct Scene { ... } // large scene struct

public ParameterBlock<Scene> g_scene;

public struct EnvMap
{
    ...
    public float3 eval(float2 uv)
    {
        return g_scene.sample_texture_linear(texture_id, uv) * scaling_factor;
    }
}

This is compiled to something like:

__device__ float3  EnvMap_eval_0(EnvMap_0 * this_31, float2  * uv_9)
{
    Scene_0 _S182 = *globalParams_0->g_scene_0;
    float3  _S183 = Scene_sample_texture_linear_0(&_S182, this_31->texture_id_0, uv_9);
    return _S183 * this_31->scaling_factor_0;
}

_S182 is still a value type and does an expensive copy of the full scene struct. What we need is something like:

__device__ float3  EnvMap_eval_0(EnvMap_0 * this_31, float2  * uv_9)
{
    Scene_0* _S182 = globalParams_0->g_scene_0;
    float3  _S183 = Scene_sample_texture_linear_0(_S182, this_31->texture_id_0, uv_9);
    return _S183 * this_31->scaling_factor_0;
}

@csyonghe (Collaborator)

It is intentional that _S60 is not copied back after calling the prop function. If your change is breaking this test, then it is making wrong assumptions and making changes that it shouldn't be making.

You can only remove a load/store pair if you can prove it is safe to do so, which means the behavior of the code won't change. This is only possible when the address being loaded can be proved to be immutable between the load and the use, and the callee isn't modifying the data through the address.

@csyonghe (Collaborator) commented Sep 14, 2025

Basically you should look for

store(var, load(address))
call(var)

and change it to call(address) only when (a rough sketch of these checks follows the list):

  1. The parameter type of the callee is an IRConstRef, not IRInOut or anything else.
  2. The var is only used in the store and the call.
  3. The address is immutable, i.e. its root is an IRConstRef or comes from a read-only buffer.
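
A minimal sketch of those conditions (illustrative only; the helpers marked "hypothetical" below are placeholders, not actual Slang functions):

// Illustrative sketch of the three safety conditions above (not the PR's exact code).
static bool isSafeToForwardAddress(IRCall* call, IRVar* var, IRStore* store, IRLoad* load)
{
    // 1. The callee parameter receiving `var` must be constref.
    //    (getCalleeParamTypeForArg is a hypothetical helper that looks up the
    //    parameter type matching this argument position.)
    IRType* paramType = getCalleeParamTypeForArg(call, var);
    if (!as<IRConstRefType>(paramType))
        return false;

    // 2. `var` must only be used by this store and this call.
    for (auto use = var->firstUse; use; use = use->nextUse)
    {
        auto user = use->getUser();
        if (user != store && user != call)
            return false;
    }

    // 3. The loaded address must be immutable: its root should be a constref
    //    parameter or a read-only buffer. (getRootAddressInst and
    //    isReadOnlyResource are hypothetical helpers; a real check must also
    //    ensure nothing writes through the address between the load and the call.)
    IRInst* root = getRootAddressInst(load->getPtr());
    if (!as<IRConstRefType>(root->getDataType()) && !isReadOnlyResource(root))
        return false;

    return true;
}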

@jkwak-work (Collaborator, Author)

Thanks for the review.
The code was mostly written by an LLM. I am sorry about that, but I was rushing to get it out for review and get help as early as possible, given the urgency of the matter.

It makes sense that the optimization should be limited to immutable cases. I will make the changes and get them tested.

@jkwak-work jkwak-work force-pushed the fix/avoid-unnecessary-copy-of-param branch from f7dd0f9 to bf10743 Compare September 16, 2025 17:47
@jkwak-work (Collaborator, Author) commented Sep 16, 2025

I pushed a new implementation that applies the optimization to more functions.
The previous version applied only to certain functions that got the const-ref fixup, and only on the CUDA/CPP targets.
The new change applies to all functions on all targets, which I expect to improve performance further.

This new change should address the problem @skallweitNV mentioned above.

However, the PR is still in draft because one of the tests, tests/compute/kernel-context-threading.slang, is failing/crashing with this PR.
I am debugging it.

@jkwak-work jkwak-work marked this pull request as ready for review September 16, 2025 23:20
@jkwak-work jkwak-work requested a review from a team as a code owner September 16, 2025 23:20
@jkwak-work jkwak-work changed the title [WIP] Remove unnecessary Load and Store pair Remove unnecessary Load and Store pair Sep 16, 2025
@shader-slang shader-slang deleted a comment from slangbot Sep 16, 2025
@jkwak-work (Collaborator, Author)

I pushed another fix that addresses all of the known problems.
It is ready for another review.

@csyonghe (Collaborator) left a review comment

I am not sure that this PR is optimizing out everything that Simon is seeing. Can you double check that access to nested elements in a parameter block can also be eliminated? Your code doesn't seem to be able to recognize that (I have commented).

if (!loadInst)
    continue;

// Do not optimize the primitive types because the legalization step may assume the
Collaborator
I don't understand why we need this. Does not seem principled.

Collaborator Author

As I was trying to explain this in more detail with examples, I ran into more questions about why I needed the change.
I will leave a comment later, after I debug and understand it better.

For now, here is a quick answer as to why it was needed:
this change is required to avoid breaking one of the tests, tests/spirv/nested-entrypoint.slang.

Collaborator Author

I ended up using a slightly different workaround for the same reason.
Here is an explanation.

Under normal operation, the test tests/spirv/nested-entrypoint.slang produces the following IR dump:

[nameHint("outerMain")]
[layout(%1)]
func %outerMain : Func(Void, ConstRef(Int, 1 : UInt64, 2147483647 : UInt64))
{
block %29(
                [layout(%7)]
                [nameHint("id")]
                [semantic("SV_DispatchThreadID", 0 : Int)]
                param %id2      : ConstRef(Int, 1 : UInt64, 2147483647 : UInt64)):
        let  %30        : Int   = load(%id2)
        let  %31        : Ptr(Int)      = var
        store(%31, %30)
        call %innerMain(%31)
        return_val(void_constant)
}

Then it gets legalized to the following:

[import("gl_GlobalInvocationID")]
[layout(%1)]
let  %34        : Ptr(Vec(UInt, 3 : Int), 0 : UInt64, 7 : UInt64)       = global_param
[entryPoint(6 : Int, "outerMain", "nested-entrypoint")]
[keepAlive]
[numThreads(1 : Int, 1 : Int, 1 : Int)]
[export("_SR20nested_2Dxentrypoint9outerMainp1pi_iV")]
[nameHint("outerMain")]
[layout(%6)]
func %outerMain : Func(Void)
{
block %35:
        let  %36        : Vec(UInt, 3 : Int)    = load(%34)
        let  %37        : UInt  = swizzle(%36, 0 : Int)
        let  %38        : Int   = intCast(%37)
        let  %39        : Ptr(Int)      = var
        store(%39, %38)
        call %innerMain(%39)
        return_val(void_constant)
}

Because my load/store optimization happens before this legalization, the legalization step gets broken.
My load/store optimization changes the first dump to the following:

[nameHint("outerMain")]
[layout(%1)]
func %outerMain : Func(Void, ConstRef(Int, 1 : UInt64, 2147483647 : UInt64))
{
block %29(
                [layout(%7)]
                [nameHint("id")]
                [semantic("SV_DispatchThreadID", 0 : Int)]
                param %id2      : ConstRef(Int, 1 : UInt64, 2147483647 : UInt64)):
        call %innerMain(%id2)
        return_val(void_constant)
}

Note that "innerMain" is called directly with "%id2", which is a good/expected behavior.
But then the legalization happens later and the following "broken" IR dump is generated,

[import("gl_GlobalInvocationID")]
[layout(%1)]
let  %34        : Ptr(Vec(UInt, 3 : Int), 0 : UInt64, 7 : UInt64)       = global_param
[entryPoint(6 : Int, "outerMain", "nested-entrypoint")]
[keepAlive]
[numThreads(1 : Int, 1 : Int, 1 : Int)]
[export("_SR20nested_2Dxentrypoint9outerMainp1pi_iV")]
[nameHint("outerMain")]
[layout(%6)]
func %outerMain : Func(Void)
{
block %35:
        let  %36        : Vec(UInt, 3 : Int)    = load(%34)
        let  %37        : UInt  = swizzle(%36, 0 : Int)
        let  %38        : Int   = intCast(%37)
        call %innerMain(%38)
        return_val(void_constant)
}

The problem is that "innerMain" no longer receives a pointer value.

The relevant legalization is in slang-ir-glsl-legalization.cpp.
It has a concept of an "actualType" and a "pretendType": it converts a builtin variable whose actual type is a vector (e.g. float3) into the scalar value the code pretends to use, removing/hiding the load from the vector so it appears to be a scalar.

I may be able to make some changes in the legalization step, but it became pretty complicated.
So I decided to skip the load/store optimization when the value is loaded from a variable that has a semantic.
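
For reference, the guard amounts to something like this (a simplified sketch; the address-chain walk and the decoration check are written with Slang-IR-style names and may not match the exact code in the PR):

// Simplified sketch: skip the load/store forwarding when the loaded address
// traces back to a variable/parameter carrying a semantic decoration
// (e.g. SV_DispatchThreadID), since GLSL/SPIR-V legalization rewrites such
// loads later and expects to find them.
static bool addrHasSemanticRoot(IRInst* addr)
{
    // Walk through field/element address chains back to the root address.
    for (;;)
    {
        if (auto fieldAddr = as<IRFieldAddress>(addr))
        {
            addr = fieldAddr->getBase();
            continue;
        }
        if (auto elemAddr = as<IRGetElementPtr>(addr))
        {
            addr = elemAddr->getBase();
            continue;
        }
        break;
    }
    return addr->findDecoration<IRSemanticDecoration>() != nullptr;
}

The pass then skips the rewrite whenever addrHasSemanticRoot(load->getPtr()) is true.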

@csyonghe (Collaborator)

Besides the optimization implemented in this PR, we should also find out the places where we are inserting the temp vars and see if we can avoid emitting the temp copies in the first place, as Tess suggested.

@jkwak-work jkwak-work force-pushed the fix/avoid-unnecessary-copy-of-param branch 2 times, most recently from 4ead542 to 0fe6fd6 Compare September 23, 2025 03:32
@jkwak-work (Collaborator, Author) commented Sep 23, 2025

I pushed a new change that addresses all comments.

It looks like the Metal tests are failing too:

tests/autodiff/dynamic-dispatch-material.slang.3 syn (mtl)
tests/bugs/dyn-dispatch-single-conformance.slang.1 syn (mtl)
tests/bugs/dynamic-interface-property.slang.1 syn (mtl)
tests/compute/cbuffer-legalize.slang.2 syn (mtl)
tests/hlsl-intrinsic/packed/pack-unpack.slang.2 (mtl)
tests/hlsl-intrinsic/packed/pack-unpack.slang.8 (mtl)
tests/language-feature/anyvalue-matrix-layout.slang.1 syn (mtl)

I will investigate more tomorrow.

@jkwak-work jkwak-work force-pushed the fix/avoid-unnecessary-copy-of-param branch 2 times, most recently from 4badd55 to 6322903 Compare September 23, 2025 07:40
@jkwak-work (Collaborator, Author) commented Sep 24, 2025

All failing tests are now passing.
The problem on Metal was that there were cases where the address spaces were not compatible.

I applied some refactoring based on the code review feedback.

I assume that the PR is verbally approved.
I will merge as soon as CI passes.

if (auto paramPtrType = as<IRConstRefType>(param->getFullType()))
{
    if (paramPtrType->getAddressSpace() != loadAddressSpace)
        goto unsafeToOptimize; // incompatible address space
Collaborator
This check isn't necessary.

Collaborator Author

This was needed on macOS, where a "constant*" pointer was not compatible with a "thread*" pointer.

Collaborator Author

The test is tests/compute/cbuffer-legalize.slang, and the emitted Metal shader is as follows, for quick reference.

#include <metal_stdlib>
#include <metal_math>
#include <metal_texture>
using namespace metal;
struct P_0
{
    uint4 c_0;
};

float4 test_0(const P_0 thread* p_0, texture2d<float, access::sample> p_t_0, sampler p_s_0)
{
    return ((p_t_0).sample((p_s_0), (float2(0.0) ), level((0.0)))) + float4(p_0->c_0);
}

struct SLANG_ParameterGroup_C_0
{
    P_0 p_1;
};

struct KernelContext_0
{
    SLANG_ParameterGroup_C_0 constant* C_0;
    texture2d<float, access::sample> C_p_t_0;
    sampler C_p_s_0;
    float device* outputBuffer_0;
};

[[kernel]] void computeMain(uint3 dispatchThreadID_0 [[thread_position_in_grid]], SLANG_ParameterGroup_C_0 constant* C_1 [[buffer(0)]], texture2d<float, access::sample> C_p_t_1 [[texture(0)]], sampler C_p_s_1 [[sampler(0)]], float device* outputBuffer_1 [[buffer(1)]])
{
    KernelContext_0 kernelContext_0;
    (&kernelContext_0)->C_0 = C_1;
    (&kernelContext_0)->C_p_t_0 = C_p_t_1;
    (&kernelContext_0)->C_p_s_0 = C_p_s_1;
    (&kernelContext_0)->outputBuffer_0 = outputBuffer_1;
    P_0 _S1 = (&kernelContext_0)->C_0->p_1;
    float4 _S2 = test_0(&_S1, (&kernelContext_0)->C_p_t_0, (&kernelContext_0)->C_p_s_0);
    *(outputBuffer_1+int(0)) = _S2.x;
    *((&kernelContext_0)->outputBuffer_0+int(1)) = _S2.y;
    *((&kernelContext_0)->outputBuffer_0+int(2)) = _S2.z;
    *((&kernelContext_0)->outputBuffer_0+int(3)) = _S2.w;
    return;
}

The trouble occurs when test_0() is called as follows:

    float4 _S2 = test_0(&((&kernelContext_0)->C_0->p_1), (&kernelContext_0)->C_p_t_0, (&kernelContext_0)->C_p_s_0);

The constant* pointer was not compatible with the thread* pointer.

@jkwak-work (Collaborator, Author) commented Sep 24, 2025

I made the change to treat a few more cases as mutable.
The performance result with Falcor2 hasn't changed; it is still 10x slower for the specific test we were talking about.

@shader-slang shader-slang deleted a comment from slangbot Sep 24, 2025
@jkwak-work jkwak-work added this pull request to the merge queue Sep 24, 2025
Merged via the queue into shader-slang:master with commit 9c2024a Sep 24, 2025
64 of 66 checks passed