Write only lower triangle in _JTDAJ_sparse #1275

Merged
adenzler-nvidia merged 2 commits into google-deepmind:main from adenzler-nvidia:adenzler/jtdaj-sparse-single-triangle on Apr 2, 2026

Conversation

@adenzler-nvidia
Collaborator

Summary

  • _JTDAJ_sparse was writing both triangles of the H matrix (two atomic_add per off-diagonal pair), but the Cholesky factorization only reads the lower triangle
  • Write a single entry to the lower triangle instead, halving atomic adds for off-diagonal elements
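The correctness argument can be checked with a minimal NumPy sketch (the helper `assemble_h_lower` is hypothetical, not the actual Warp kernel): `np.linalg.cholesky` reads only the lower triangle of its input, so leaving the upper triangle unwritten changes nothing downstream.

```python
import numpy as np

def assemble_h_lower(J, d):
    """Accumulate only the lower triangle of H = J^T diag(d) J.

    Illustrative sketch of the idea behind the change: one write per
    (i, j) pair with j <= i, instead of mirrored writes to both
    triangles. The real _JTDAJ_sparse is a Warp kernel using
    atomic adds over sparse Jacobian rows.
    """
    nv = J.shape[1]
    H = np.zeros((nv, nv))
    for k in range(J.shape[0]):        # one constraint row at a time
        row = J[k]
        for i in range(nv):
            for j in range(i + 1):     # j <= i: lower triangle only
                H[i, j] += d[k] * row[i] * row[j]
    return H

rng = np.random.default_rng(0)
J = rng.standard_normal((8, 5))        # toy dense Jacobian
d = rng.uniform(1.0, 2.0, 8)           # positive diagonal weights

H_lower = assemble_h_lower(J, d)       # upper triangle stays zero
H_full = J.T @ (d[:, None] * J)        # reference: both triangles

# np.linalg.cholesky ignores the upper triangle, so both factorizations agree.
L_lower = np.linalg.cholesky(H_lower)
L_full = np.linalg.cholesky(H_full)
assert np.allclose(L_lower, L_full)
```

Since the factor is identical, solver iterates and convergence are untouched; only the redundant upper-triangle atomics are saved.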

Benchmark

three_humanoids (8192 worlds, nconmax=100, njmax=192, RTX PRO 6000 Blackwell):

|        | Run 1 | Run 2 | Run 3 | Mean          |
|--------|-------|-------|-------|---------------|
| Before | 972K  | 901K  | 904K  | 925K steps/s  |
| After  | 996K  | 1073K | 1073K | 1047K steps/s |
| Delta  |       |       |       | +13%          |

Solver convergence is unchanged (mean 2.784 iters, p95=5, 8192/8192 converged).

Nsight Systems kernel-level profile confirms _JTDAJ_sparse drops from 1,248ms to 731ms per 1000 steps (-41%).

Test plan

  • solver_test.py (40/40 passed)
  • smooth_test.py (59/59 passed)
  • forward_test.py (60/60 passed)

The H matrix assembly in _JTDAJ_sparse was writing both triangles
(two atomic adds per off-diagonal pair), but the Cholesky
factorization only reads the lower triangle. Write a single entry
to the lower triangle instead, halving the number of atomic adds
for off-diagonal elements.
@adenzler-nvidia adenzler-nvidia requested a review from thowell on April 2, 2026 at 11:34
@thowell (Collaborator) left a comment:


nice!

@adenzler-nvidia adenzler-nvidia merged commit 5253652 into google-deepmind:main Apr 2, 2026
10 checks passed
