
Conversation

jz009
Contributor

@jz009 jz009 commented Aug 28, 2025

Objective

Solution

  • Compute the transpose when LightProbeInfo is created and reference it from there
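
The cached-transpose pattern can be sketched as follows (standalone stand-in types; Bevy's actual `LightProbeInfo` stores glam's `Affine3A`/`Mat4`, and the struct name here is hypothetical):

```rust
// Standalone sketch of the caching pattern (hypothetical types; Bevy's
// actual LightProbeInfo stores glam's Affine3A/Mat4).
type Mat4 = [[f32; 4]; 4]; // four columns of four components each

fn transpose(m: &Mat4) -> Mat4 {
    let mut t = [[0.0_f32; 4]; 4];
    for c in 0..4 {
        for r in 0..4 {
            t[r][c] = m[c][r];
        }
    }
    t
}

#[allow(dead_code)]
struct LightProbeInfoSketch {
    light_from_world: Mat4,
    // Cached once at construction so per-frame readers never re-transpose.
    light_from_world_transposed: Mat4,
}

impl LightProbeInfoSketch {
    fn new(light_from_world: Mat4) -> Self {
        Self {
            light_from_world_transposed: transpose(&light_from_world),
            light_from_world,
        }
    }
}

fn main() {
    let m: Mat4 = [
        [1.0, 2.0, 3.0, 4.0],
        [5.0, 6.0, 7.0, 8.0],
        [9.0, 10.0, 11.0, 12.0],
        [13.0, 14.0, 15.0, 16.0],
    ];
    let info = LightProbeInfoSketch::new(m);
    // Row 0 of the transpose is column 0 of the original.
    assert_eq!(info.light_from_world_transposed[0], [1.0, 5.0, 9.0, 13.0]);
}
```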

Testing

cargo run -p ci

Contributor

Welcome, new contributor!

Please make sure you've read our contributing guide; we look forward to reviewing your pull request shortly ✨

@alice-i-cecile alice-i-cecile added A-Rendering Drawing game state to the screen C-Performance A change motivated by improving speed, memory usage or compile times S-Needs-Benchmarking This set of changes needs performance benchmarking to double-check that they help labels Aug 28, 2025
@alice-i-cecile
Member

This is a reasonable change, but I generally think we should be measuring this sort of perf-oriented change. Caching is great, but it takes up memory and can lead to bugs, so it's not free.

I expect that this is valuable, since light probes are often static, but I've learned not to trust my intuitions very far on this stuff.

Also, since I'm still a relative rendering noob: how will this cache get updated?

@@ -145,6 +145,9 @@ where
// The transform from world space to light probe space.
light_from_world: Affine3A,

// The transpose of the inverse of [`light_from_world`].
light_from_world_transposed: Mat4,
Member

If we are to improve performance, this would need to replace light_from_world with the three axes used in maybe_gather_light_probes. light_from_world otherwise goes completely unused.

@jz009 jz009 requested a review from james7132 August 28, 2025 18:51
@jz009
Contributor Author

jz009 commented Aug 29, 2025

I expect that this is valuable, since light probes are often static, but I've learned not to trust my intuitions very far on this stuff.

I did some testing, and it seems this basically only avoids one .inverse().transpose() per frame per view in the scene, at least in the example I was using to test, irradiance_volumes.rs.

Also, since I'm still a relative rendering noob: how will this cache get updated?

The cache gets updated every frame since LightProbeInfo only persists for a single frame before being recreated.

@mate-h
Contributor

mate-h commented Aug 29, 2025

The code and the goal of the PR look good to me. I'm thinking of profiling this with Tracy and the Metal debugger to test whether there is a performance improvement, as Alice suggested.

@atlv24
Contributor

atlv24 commented Sep 2, 2025

This doesn't avoid an inverse per frame; the inverse is already avoided. This just avoids a transpose, to be clear. I'm curious how much time we even save by transposing it to avoid sending a Vec4 to the GPU, to be honest. We send full Mat4s in a lot of places in Bevy that we could 'optimize' similarly by sending transposes and omitting the last row. Anyway, I'm not really for or against this PR one way or another; my main motivation for touching this code was precision-related, w.r.t. the inverses. I'd need to see benches.

Member

@james7132 james7132 left a comment

Even if it doesn't save compute time, this does reduce the CPU size of LightProbeInfo by one Vec4.

@alice-i-cecile alice-i-cecile added S-Needs-Review Needs reviewer attention (from anyone!) to move forward and removed S-Needs-Benchmarking This set of changes needs performance benchmarking to double-check that they help labels Sep 3, 2025
@atlv24
Contributor

atlv24 commented Sep 4, 2025

I'm not convinced the readability loss is worth it. I'm not going to block this, but I'm not giving it the second approve either; I'd have to see benches to feel that this matters enough to be fiddling around with Vec4 arrays. If anyone else feels strongly enough to approve, go for it.

We send full Mat4s everywhere. If this is actually a worthwhile optimization, let's have evidence of it first and then do it elsewhere too, probably with a TransposedAffine3A component: a thin wrapper around [Vec4; 3] with a constructor that handles the transpose, plus centralized docs explaining the reasoning behind it.
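
A rough sketch of what such a wrapper might look like (the name, layout, and constructor are assumptions, not existing Bevy or glam API; plain arrays stand in for glam's vector types):

```rust
// Hypothetical sketch of the proposed TransposedAffine3A wrapper; the name,
// layout, and constructor are assumptions, not existing Bevy or glam API.
type Vec3 = [f32; 3];
type Vec4 = [f32; 4];

struct TransposedAffine3A([Vec4; 3]);

impl TransposedAffine3A {
    // Takes the four columns of a 3x4 affine transform (x/y/z axes plus
    // translation, the layout of glam's Affine3A) and stores the three
    // rows of its transpose; the constant (0, 0, 0, 1) row is omitted.
    fn from_cols(x: Vec3, y: Vec3, z: Vec3, w: Vec3) -> Self {
        let row = |i: usize| [x[i], y[i], z[i], w[i]];
        Self([row(0), row(1), row(2)])
    }
}

fn main() {
    // Identity axes with translation (7, 8, 9).
    let t = TransposedAffine3A::from_cols(
        [1.0, 0.0, 0.0],
        [0.0, 1.0, 0.0],
        [0.0, 0.0, 1.0],
        [7.0, 8.0, 9.0],
    );
    assert_eq!(t.0[0], [1.0, 0.0, 0.0, 7.0]);
    assert_eq!(t.0[2], [0.0, 0.0, 1.0, 9.0]);
}
```

The constructor centralizes the transpose, so call sites never hand-roll it.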

@james7132
Member

We send full Mat4s everywhere. If this is actually a worthwhile optimization, let's have evidence of it first and then do it elsewhere too, probably with a TransposedAffine3A component: a thin wrapper around [Vec4; 3] with a constructor that handles the transpose, plus centralized docs explaining the reasoning behind it.

From the GPU side, I believe @superdump was the one who brought up that transposes on modern GPUs are basically free, which is where this optimization comes from.

From the CPU side, in general, data-oriented design prioritizes shaping code to match the target hardware. For modern CPUs, this often means minimizing padding and shrinking (or, in other cases, splitting) our memory fetches to make every RAM fetch worth it, so I think we should keep that in mind for all CPU-side operations, engine-wide. A bit of extra in-register computation tends to be rather cheap.

I agree that a helper type would be useful here. There are definitely areas (e.g. skinned mesh bind poses) where we store and send many more Mat4s/Affine3As to the GPU, and shaving 25% off the memory usage and bandwidth there is nothing to scoff at.
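
The 25% figure can be checked directly: a Mat4 is sixteen f32s (64 bytes), while three row Vec4s are twelve f32s (48 bytes). A quick std-only check, with plain arrays standing in for glam's types (glam's SIMD Vec4 is also 16 bytes, so the arithmetic carries over):

```rust
use std::mem::size_of;

fn main() {
    // Plain arrays standing in for glam's Mat4 and [Vec4; 3].
    let mat4_bytes = size_of::<[[f32; 4]; 4]>(); // 16 floats
    let rows3_bytes = size_of::<[[f32; 4]; 3]>(); // 12 floats
    assert_eq!(mat4_bytes, 64);
    assert_eq!(rows3_bytes, 48);
    // Dropping the constant fourth row saves a quarter of the bytes.
    assert_eq!(mat4_bytes - rows3_bytes, mat4_bytes / 4);
}
```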

@atlv24
Contributor

atlv24 commented Sep 5, 2025

Okay, I'm convinced. I presume a helper type is left as follow-up work? Or should we do it in this PR?

@tychedelia
Member

From the CPU side, in general, data-oriented design prioritizes shaping code to match the target hardware. For modern CPUs, this often means minimizing padding and shrinking (or, in other cases, splitting) our memory fetches to make every RAM fetch worth it, so I think we should keep that in mind for all CPU-side operations, engine-wide. A bit of extra in-register computation tends to be rather cheap.

The theory here makes absolute sense to me, but I'll also articulate @atlv24's more general concern about empiricism with respect to these changes. I don't want to blame this particular PR, though; this is something we struggle with in general, and we should have more robust mechanisms for running benches automatically so we can answer these questions. In the meantime, I am also very convinced by your explanation here!

@alice-i-cecile alice-i-cecile added S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it and removed S-Needs-Review Needs reviewer attention (from anyone!) to move forward labels Sep 5, 2025
@alice-i-cecile alice-i-cecile added this pull request to the merge queue Sep 11, 2025
Merged via the queue into bevyengine:main with commit e78b62d Sep 11, 2025
36 checks passed
Labels
A-Rendering Drawing game state to the screen C-Performance A change motivated by improving speed, memory usage or compile times S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Light probe matrix caching could be accelerated
6 participants