Compute light probe matrix earlier and cache in LightProbeInfo #20738 #20782
Conversation
Welcome, new contributor! Please make sure you've read our contributing guide and we look forward to reviewing your pull request shortly ✨
This is a reasonable change, but I generally think we should be measuring this sort of perf-oriented change. Caching is great, but it takes up memory and can lead to bugs, so it's not free. I expect that this is valuable, since light probes are often static, but I've learned not to trust my intuitions very far on this stuff. Also, since I'm still a relative rendering noob: how will this cache get updated?
```diff
@@ -145,6 +145,9 @@ where
     // The transform from world space to light probe space.
     light_from_world: Affine3A,
+
+    // The transpose of the inverse of [`light_from_world`].
+    light_from_world_transposed: Mat4,
```
If we are to improve performance, this would need to replace `light_from_world` with the three axes used in `maybe_gather_light_probes`. `light_from_world` otherwise goes completely unused.
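For context on why three axes are enough: an affine 4x4 matrix always has `[0, 0, 0, 1]` as its bottom row, so its transpose is fully determined by three `Vec4`s and nothing is lost by omitting the fourth. A minimal sketch of the packing, using plain arrays in place of glam's `Mat4`/`Vec4` so it is self-contained; `pack_transposed` and `unpack` are illustrative names, not Bevy's actual API:

```rust
// Column-major 4x4, matching glam's layout: m[c][r] is element (row r, col c).
type Mat4 = [[f32; 4]; 4];

/// Keep only the first three columns of the transpose of `m`
/// (equivalently, the first three rows of `m`).
fn pack_transposed(m: &Mat4) -> [[f32; 4]; 3] {
    let row = |r: usize| [m[0][r], m[1][r], m[2][r], m[3][r]];
    [row(0), row(1), row(2)]
}

/// Rebuild the original affine matrix from the packed transpose.
fn unpack(rows: &[[f32; 4]; 3]) -> Mat4 {
    let mut m = [[0.0; 4]; 4];
    for c in 0..4 {
        for r in 0..3 {
            m[c][r] = rows[r][c];
        }
    }
    m[3][3] = 1.0; // the omitted bottom row of an affine matrix is always [0, 0, 0, 1]
    m
}
```

The round trip is exact for any affine matrix, which is why only the three packed rows need to be stored and uploaded.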
I did some testing and it seems that this basically only avoids one inverse per frame.
The cache gets updated every frame, since LightProbeInfo only persists for a single frame before being recreated.
The code and the goal of the PR look good to me. I am thinking of profiling this with Tracy and the Metal debugger to test whether there is a performance improvement, as Alice suggested.
This doesn't avoid an inverse per frame - the inverse is already avoided. This just avoids a transpose, to be clear. I'm curious how much time we even save by transposing it to avoid sending a vec4 to the GPU, tbh. We send full mat4s in a lot of places in Bevy that we could 'optimize' similarly by sending transposes and omitting the last row. Anyway, I'm not really for or against this PR one way or another; my main motivation for touching this code was precision-related w.r.t. the inverses. I'd need to see benches /shrug
Even if it doesn't save compute time, this does reduce the CPU size of `LightProbeInfo` by one `Vec4`.
I'm not convinced the readability loss is worth it. Not gonna block this, but I'm not giving it the second approve either; I'd have to see benches to feel that this matters enough to be fuddling around with `Vec4` arrays. If anyone else feels strongly enough to approve, go for it. We send full mat4s everywhere; if this is actually a worthwhile optimization, let's have evidence of it first and then do it elsewhere too, probably with a `TransposedAffine3A` component which is a thin wrapper around `[Vec4; 3]`, with a constructor that handles the transpose etc. and centralized docs explaining the reasoning behind it.
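A sketch of what such a helper could look like, with plain arrays standing in for glam's `Vec4`/`Mat4` so the example stands alone. The name `TransposedAffine3A` is taken from the comment above, but the field and constructor names are assumptions, not an agreed design:

```rust
/// Hypothetical thin wrapper storing the transpose of an affine 4x4 matrix
/// as three Vec4-sized rows. The fourth column of the transpose is always
/// [0, 0, 0, 1] for an affine matrix, so it is omitted, saving 16 bytes
/// per matrix (48 bytes instead of 64).
#[derive(Clone, Copy, Debug, PartialEq)]
struct TransposedAffine3A {
    cols: [[f32; 4]; 3],
}

impl TransposedAffine3A {
    /// Build from a column-major affine matrix (bottom row [0, 0, 0, 1]).
    /// The constructor centralizes the transpose so call sites don't repeat it.
    fn from_affine(m: &[[f32; 4]; 4]) -> Self {
        let mut cols = [[0.0; 4]; 3];
        for c in 0..3 {
            for r in 0..4 {
                // Column `c` of the transpose is row `c` of `m`.
                cols[c][r] = m[r][c];
            }
        }
        Self { cols }
    }
}
```

Since `[[f32; 4]; 3]` is 48 bytes versus 64 for a full 4x4 of f32s, this is also where the 25% size and bandwidth saving comes from.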
From the GPU side, I believe @superdump was the one who brought up that transposes on modern GPUs are basically free, which is where this optimization comes from. From the CPU side, data-oriented design generally means shaping code to match the target hardware. For modern CPUs, this often means minimizing padding and shrinking (or, in other cases, splitting) our memory accesses to make every RAM fetch worth it, so I think we should keep that in mind for all CPU-side operations, engine-wide. A bit of extra in-register computation tends to be rather cheap. I agree that a helper type would be useful here. There are definitely areas (e.g. skinned mesh bindposes) where we store and send many more `Mat4`s/`Affine3A`s to the GPU, and shaving 25% off of the memory usage and bandwidth there is nothing to scoff at.
Okay, I'm convinced. I presume a helper type is left as follow-up work? Or should we do it in this PR?
The theory here makes absolute sense to me, but I'll also echo @atlv24's more general concern about empiricism with respect to these changes. I don't want to blame this particular PR though; this is something we struggle with in general, and we should have more robust mechanisms for running benches automatically so we can answer these questions. In the meantime, I am also very convinced by your explanation here!
Objective
Solution
Testing
`cargo run -p ci`