Updating Healpix CUDA primitive #290

ASKabalan · 2025-03-26T16:23:14Z

Adding a few updates

Updating to the newest custom call API (API 4) using FFI
implementing a grad rule for healpix cuda FFT
Implementing a Batching rule

A batching rule seems to be very important for two things
Being able to jacrev/ jacfwd
and because in most cases .. the size of a healpix map can fit on a single GPU but sometimes we want to batch the spherical transform

I will be doing that next

ASKabalan · 2025-03-28T16:16:58Z

Hello @matt-graham @jasonmcewen @CosmoMatt

Just a quick PR to wrap up a few stuff

Updated the binding API to the newest FFI
Added a vmap implementation of the cuda primitive
Added a transpose rule which allows jacfwd and jacrev (consequently grad aswell)
added more tests https://github.com/astro-informatics/s2fft/blob/ASKabalan/tests/test_healpix_ffts.py#L100
Removed two files which are now no longer needed with the FFI API (kernel helpers) (so maybe they should be removed from the license section)
Constrained nanobind to be nanobind >=2.0,<2.6" because of a regression [BUG]: Regression when using scikit build tools and nanobind wjakob/nanobind#982

And finally I added cudastreamhandler which is used to split the XLA provided stream for the VMAP lowering (this header is my own work)

There is an issue with building pyssht not sure that this is my fault

I will check the failing worflows when I get the chance, but in the meantime a review is appreciated

matt-graham

Hello @matt-graham @jasonmcewen @CosmoMatt

Just a quick PR to wrap up a few stuff
1. Updated the binding API to the newest [FFI](https://docs.jax.dev/en/latest/ffi.html)

2. Added a vmap implementation of the cuda primitive

3. Added a transpose rule which allows jacfwd and jacrev (consequently grad aswell)

4. added more tests https://github.com/astro-informatics/s2fft/blob/ASKabalan/tests/test_healpix_ffts.py#L100

5. Removed two files which are now no longer needed with the FFI API ([kernel helpers](https://github.com/astro-informatics/s2fft/blob/main/lib/include/kernel_helpers.h)) (so maybe they should be removed from the license section)

6. Constrained nanobind to be nanobind >=2.0,<2.6" because of a regression [[BUG]: Regression when using scikit build tools and nanobind wjakob/nanobind#982](https://github.com/wjakob/nanobind/issues/982)
And finally I added cudastreamhandler which is used to split the XLA provided stream for the VMAP lowering (this header is my own work)

There is an issue with building pyssht not sure that this is my fault

I will check the failing worflows when I get the chance, but in the meantime a review is appreciated

Hi @ASKabalan, sorry for the delay in getting back to you.

This all sounds great - thanks for picking up #237 in particular and for the updates to use the newer FFI interface.

With regards to the failing workflows - this was probably due to #292 which was fixed in #293. If you merge in latest main here that should hopefully resolve the upstream dependency build problems that were causing the test workflows to fail.

I've added some initial review comments below. Will have a closer look next week and try testing this out, but don't have access to GPU machine atm.

tests/test_healpix_ffts.py

matt-graham · 2025-04-11T17:06:56Z

tests/test_healpix_ffts.py

+        flm_hp = samples.flm_2d_to_hp(flm, L)
+        f = hp.sphtfunc.alm2map(flm_hp, nside, lmax=L - 1)


I think we could use s2fft.inverse(flm, L=L, reality=False, method="jax", sampling="healpix") here instead of going via healpy? Rationale being that I would have a slight preference for minimising the number of additional tests that depend on healpy as it we are no longer requiring it as direct dependency for package and in the long run it might be possible to also remove it as a test dependency.

tests/test_healpix_ffts.py

s2fft/utils/healpix_ffts.py

Co-authored-by: Matt Graham <[email protected]>

matt-graham · 2025-04-16T10:45:25Z

I've tried building, installing and running this on a system with CUDA 12.6 + a NVIDIA A100, and running the HEALPix FFT tests with

pytest tests/test_healpix_ffts.py

consistently the tests hang when trying to run the first test_healpix_fft_cuda instance.

Running just the IFFT tests with

pytest tests/test_healpix_ffts.py::test_healpix_ifft_cuda

the tests for both set of test parameters pass.

Trying to dig into this a bit, running the following locally

import healpy
import jax
import s2fft
import numpy

jax.config.update("jax_enable_x64", True)

seed = 20250416
nside = 4
L = 2 * nside
reality = False

rng = numpy.random.default_rng(seed)
flm = s2fft.utils.signal_generator.generate_flm(rng=rng, L=L, reality=reality)
flm_hp = s2fft.sampling.s2_samples.flm_2d_to_hp(flm, L)
f = healpy.sphtfunc.alm2map(flm_hp, nside, lmax=L - 1)
flm_cuda = s2fft.utils.healpix_ffts.healpix_fft_cuda(f=f, L=L, nside=nside, reality=reality).block_until_ready()

raises an error

jaxlib.xla_extension.XlaRuntimeError: INTERNAL: CUDA error: : CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered

so it looks like there is some memory addressing issue somewhere in the healpix_fft_cuda implementation?

ASKabalan · 2025-04-18T09:09:59Z

Thank you

I was able to reproduce with 12.4.1 but not locally with 12.4

I will take a look

codecov · 2025-06-19T22:23:31Z

Codecov Report

❌ Patch coverage is 75.00000% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 96.14%. Comparing base (b239a11) to head (3bcb69a).

Files with missing lines	Patch %	Lines
s2fft/utils/healpix_ffts.py	66.66%	3 Missing ⚠️
s2fft/utils/jax_primitive.py	75.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #290      +/-   ##
==========================================
- Coverage   96.50%   96.14%   -0.36%     
==========================================
  Files          32       32              
  Lines        3434     3453      +19     
==========================================
+ Hits         3314     3320       +6     
- Misses        120      133      +13

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

ASKabalan · 2025-06-20T09:22:02Z

@matt-graham Hey
I am picking were i left off
So it seems that there is an error when building with python3.8
Doesn't seem to be coming from my code.
It seems to be because of some compile error when compiling sht

I would suggest dropping python3.8 from the test suite since JAX no longer supports it anyway

matt-graham · 2025-06-23T16:09:47Z

@matt-graham Hey I am picking were i left off So it seems that there is an error when building with python3.8 Doesn't seem to be coming from my code. It seems to be because of some compile error when compiling sht

Hi @ASKabalan. Do you mean so3 rather than (py)ssht? From a quick look at the logs of the failing Actions workflow job on Python 3.8 / ubuntu-latest it appears like it's an error with building so3 (ERROR: Failed building wheel for so3). If so this is likely the same issue as described in #308. I've opened a PR to try to fix this upstream in so3 (astro-informatics/so3#31).

I would suggest dropping python3.8 from the test suite since JAX no longer supports it anyway

Yes agreed we should drop Python 3.8 from test matrix - we have an open pull request #305 to update to only supporting Python 3.11+ but this is partially blocked by #212 as the tests currently exit with fatal errors when running on MacOS / Python 3.9+ due to an incompatibility between the OpenMP runtime's the MacOS wheels for healpy and PyTorch are built for (healpy/healpy#1012)

Add comprehensive documentation and fix dependency issues for CUDA FFT integration. This commit introduces extensive docstrings and inline comments across the C++ and Python codebase, particularly for the CUDA FFT implementation. It also addresses a dependency issue in to ensure proper installation and functionality. Key changes include: - no more CUDA Malloc .. all memory is allocated in Python by XLA - Added detailed docstrings to C++ header files - Enhanced inline comments in C++ source files to explain complex logic and algorithms. - Updated to relax JAX version dependency, resolving installation issues. - Refined docstrings and comments in Python files for clarity and consistency. - Cleaned up debug print statements

ASKabalan · 2025-07-02T17:02:34Z

@matt-graham I fixed the issue with CUDA 12.4 and above
I cleaned up the code and added docstrings everywhere
There is still a small issue with mac (since the JAX version is not the same)
But I think this is pretty mature to be merged

matt-graham

Hi @ASKabalan. Thanks for the updates, this is looking great.

I have added some initial comments from a quick high level check.

I also tried running some of our benchmarks via the scripts in the benchmarks directory to check everything is running as expected. Results are below (with method="jax" corresponding to current native JAX implementation and method="jax_cuda" using the CUDA primitive).

forward
(method: jax, L: 64, L_lower: 0, sampling: healpix, spin: 0, L_to_nside_ratio: 2, reality: True, spmd: False, n_iter: None):
    min(run times):  0.0016s, max(run times):  0.0016s, compile time:     3.9s, peak memory: 1.1e+04B, max(abs(error)):   0.025, floating point ops: 1.2e+06, mem access: 4.3e+06B
(method: jax, L: 128, L_lower: 0, sampling: healpix, spin: 0, L_to_nside_ratio: 2, reality: True, spmd: False, n_iter: None):
    min(run times):  0.0035s, max(run times):  0.0036s, compile time:     6.9s, peak memory: 1.1e+04B, max(abs(error)):   0.033, floating point ops: 5.1e+06, mem access: 1.9e+07B
(method: jax, L: 256, L_lower: 0, sampling: healpix, spin: 0, L_to_nside_ratio: 2, reality: True, spmd: False, n_iter: None):
    min(run times):   0.010s, max(run times):   0.010s, compile time:     13.s, peak memory: 1.1e+04B, max(abs(error)):   0.032, floating point ops: 2.2e+07, mem access: 8.2e+07B
(method: jax, L: 512, L_lower: 0, sampling: healpix, spin: 0, L_to_nside_ratio: 2, reality: True, spmd: False, n_iter: None):
    min(run times):   0.050s, max(run times):   0.050s, compile time:     26.s, peak memory: 1.0e+04B, max(abs(error)):  0.0095, floating point ops: 9.2e+07, mem access: 3.2e+08B
(method: jax_cuda, L: 64, L_lower: 0, sampling: healpix, spin: 0, L_to_nside_ratio: 2, reality: True, spmd: False, n_iter: None):
    min(run times):  0.0015s, max(run times):  0.0015s, compile time:    0.73s, peak memory: 8.6e+03B, max(abs(error)):   0.036, floating point ops: 5.3e+05, mem access: 3.9e+06B
(method: jax_cuda, L: 128, L_lower: 0, sampling: healpix, spin: 0, L_to_nside_ratio: 2, reality: True, spmd: False, n_iter: None):
    min(run times):  0.0032s, max(run times):  0.0033s, compile time:    0.87s, peak memory: 8.6e+03B, max(abs(error)):    0.67, floating point ops: 2.1e+06, mem access: 1.5e+07B
(method: jax_cuda, L: 256, L_lower: 0, sampling: healpix, spin: 0, L_to_nside_ratio: 2, reality: True, spmd: False, n_iter: None):
    min(run times):  0.0096s, max(run times):  0.0096s, compile time:    0.92s, peak memory: 8.6e+03B, max(abs(error)):    0.61, floating point ops: 8.1e+06, mem access: 6.6e+07B
(method: jax_cuda, L: 512, L_lower: 0, sampling: healpix, spin: 0, L_to_nside_ratio: 2, reality: True, spmd: False, n_iter: None):
    min(run times):   0.049s, max(run times):   0.049s, compile time:     1.3s, peak memory: 8.6e+03B, max(abs(error)):    0.71, floating point ops: 3.1e+07, mem access: 2.5e+08B
inverse
(method: jax, L: 64, L_lower: 0, sampling: healpix, spin: 0, L_to_nside_ratio: 2, reality: True, spmd: False):
    min(run times):  0.0021s, max(run times):  0.0021s, compile time:     5.3s, peak memory: 8.8e+03B, floating point ops: 1.2e+06, mem access: 4.5e+07B
(method: jax, L: 128, L_lower: 0, sampling: healpix, spin: 0, L_to_nside_ratio: 2, reality: True, spmd: False):
    min(run times):  0.0046s, max(run times):  0.0046s, compile time:     12.s, peak memory: 8.8e+03B, floating point ops: 5.3e+06, mem access: 3.5e+08B
(method: jax, L: 256, L_lower: 0, sampling: healpix, spin: 0, L_to_nside_ratio: 2, reality: True, spmd: False):
    min(run times):   0.011s, max(run times):   0.011s, compile time:     27.s, peak memory: 8.8e+03B, floating point ops: 2.2e+07, mem access: 2.7e+09B
(method: jax, L: 512, L_lower: 0, sampling: healpix, spin: 0, L_to_nside_ratio: 2, reality: True, spmd: False):
    min(run times):   0.056s, max(run times):   0.056s, compile time:     66.s, peak memory: 8.8e+03B, floating point ops: 9.4e+07, mem access: 2.2e+10B
(method: jax_cuda, L: 64, L_lower: 0, sampling: healpix, spin: 0, L_to_nside_ratio: 2, reality: True, spmd: False):
    min(run times):  0.0017s, max(run times):  0.0017s, compile time:    0.66s, peak memory: 8.6e+03B, floating point ops: 5.1e+05, mem access: 3.7e+06B
(method: jax_cuda, L: 128, L_lower: 0, sampling: healpix, spin: 0, L_to_nside_ratio: 2, reality: True, spmd: False):
    min(run times):  0.0037s, max(run times):  0.0038s, compile time:    0.67s, peak memory: 8.6e+03B, floating point ops: 2.0e+06, mem access: 1.5e+07B
(method: jax_cuda, L: 256, L_lower: 0, sampling: healpix, spin: 0, L_to_nside_ratio: 2, reality: True, spmd: False):
    min(run times):  0.0094s, max(run times):  0.0094s, compile time:    0.78s, peak memory: 8.6e+03B, floating point ops: 7.5e+06, mem access: 5.8e+07B
(method: jax_cuda, L: 512, L_lower: 0, sampling: healpix, spin: 0, L_to_nside_ratio: 2, reality: True, spmd: False):
    min(run times):   0.052s, max(run times):   0.052s, compile time:    0.88s, peak memory: 8.6e+03B, floating point ops: 3.0e+07, mem access: 2.3e+08B

In terms of the run and compilation times this all looks great - run times the same or slightly improved for a corresponding bandlimit L and compilation times massively reduced (and close to constant in L over the range tested). However, there seems to be something slightly odd going on with the round-trip errors (indicated by max(abs(error)) entries) which are significantly larger for the CUDA primitive version than for the native JAX implementation. For the HEALPix sampling scheme we would expect a non-negligible round-trip error, particularly without iterative refinement, but the size of the errors here seems to be too large and in particular I wouldn't expect there to be a significant difference in the errors compared to the native JAX version. Further when we use iterative refinement with 3 iterations the errors seem to be getting larger when using the CUDA primitive version rather than getting smaller as for the native JAX version.

forward
(method: jax_cuda, L: 64, L_lower: 0, sampling: healpix, spin: 0, L_to_nside_ratio: 2, reality: True, spmd: False, n_iter: 3):
    min(run times):   0.012s, max(run times):   0.012s, compile time:     2.4s, peak memory: 1.1e+04B, max(abs(error)):     49., floating point ops: 3.9e+06, mem access: 2.9e+07B
(method: jax_cuda, L: 128, L_lower: 0, sampling: healpix, spin: 0, L_to_nside_ratio: 2, reality: True, spmd: False, n_iter: 3):
    min(run times):   0.027s, max(run times):   0.027s, compile time:     2.6s, peak memory: 1.1e+04B, max(abs(error)):     68., floating point ops: 1.5e+07, mem access: 1.2e+08B
(method: jax_cuda, L: 256, L_lower: 0, sampling: healpix, spin: 0, L_to_nside_ratio: 2, reality: True, spmd: False, n_iter: 3):
    min(run times):   0.074s, max(run times):   0.074s, compile time:     2.7s, peak memory: 1.1e+04B, max(abs(error)): 1.1e+02, floating point ops: 5.5e+07, mem access: 4.9e+08B
(method: jax_cuda, L: 512, L_lower: 0, sampling: healpix, spin: 0, L_to_nside_ratio: 2, reality: True, spmd: False, n_iter: 3):
    min(run times):    0.39s, max(run times):    0.39s, compile time:     3.4s, peak memory: 1.0e+04B, max(abs(error)): 1.7e+02, floating point ops: 2.1e+08, mem access: 1.9e+09B

I haven't yet figured out what is causing this. The errors seem to be larger at higher bandlimits which might explain why the tests are not picking this up, but need to investigate this in more detail.

CMakeLists.txt

matt-graham · 2025-07-09T09:02:01Z

lib/include/kernel_helpers.h

With this file removed we can remove the comment in README

s2fft/README.md

Lines 350 to 352 in d77e9cb

The file [`lib/include/kernel_helpers.h`](https://github.com/astro-informatics/s2fft/blob/main/lib/include/kernel_helpers.h) is adapted from

[code](https://github.com/dfm/extending-jax/blob/c33869665236877a2ae281f3f5dbff579e8f5b00/lib/kernel_helpers.h) in [a tutorial on extending JAX](https://github.com/dfm/extending-jax) by

[Dan Foreman-Mackey](https://github.com/dfm) and licensed under a [MIT license](https://github.com/dfm/extending-jax/blob/371dca93c6405368fa8e71690afd3968d75f4bac/LICENSE).

matt-graham · 2025-07-09T09:02:22Z

lib/include/kernel_nanobind_helpers.h

With this file removed we can remove comment in README

s2fft/README.md

Lines 354 to 357 in d77e9cb

The file [`lib/include/kernel_nanobind_helpers.h`](https://github.com/astro-informatics/s2fft/blob/main/lib/include/kernel_nanobind_helpers.h)

is adapted from [code](https://github.com/jax-ml/jax/blob/3d389a7fb440c412d95a1f70ffb91d58408247d0/jaxlib/kernel_nanobind_helpers.h)

by the [JAX](https://github.com/jax-ml/jax) authors

and licensed under a [Apache-2.0 license](https://github.com/jax-ml/jax/blob/3d389a7fb440c412d95a1f70ffb91d58408247d0/LICENSE).

matt-graham · 2025-07-25T15:51:48Z

I have been trying to diagnose what is causing the numerical issues here. Not isolated the precise cause yet, but have somewhat narrowed things down.

This only seems to affect the forward tranform s2fft.utils.healpix_ffts.healpix_fft_cuda and not the backward transform s2fft.utils.healpix_ffts.healpix_ifft_cuda. Changing nsides_to_test in tests/test_healpix_fft.py to nside_to_test = list(range(2, 16)) + [16, 32, 64] and running pytest tests/test_healpix_ffts.py::test_healpix_ifft_cuda, the tests all pass consistently over multiple tries.
For s2fft.utils.healpix_ffts.healpix_fft_cuda, there are localised differences in output compared to s2fft.utils.healpix_ffts.healpix_fft_jax for specific nside / L values, with differences seeming to become more likely for larger nside. For example for nside = 12 we get that indices that the outputs differ in are
```
>>> np.nonzero((abs(ftm_jax - ftm_cuda) > 1e-7))
(array([10, 10, 10, 10, 10, 10, 10, 10, 21, 21, 21, 21, 21, 21, 21, 21]), 
 array([16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]))
```

There seems to be some non-determinism in whether there are differences or not when running pytest tests/test_healpix_ffts.py::test_healpix_fft_cuda for given nside values, despite all input data being the same (the tests used a fixed random seed to generate data). For example for consecutive test runs I got

tests/test_healpix_ffts.py::test_healpix_fft_cuda[8966433580120847635-2] PASSED                                                                                                                                                       [  5%]
tests/test_healpix_ffts.py::test_healpix_fft_cuda[8966433580120847635-3] PASSED                                                                                                                                                       [ 11%]
tests/test_healpix_ffts.py::test_healpix_fft_cuda[8966433580120847635-4] PASSED                                                                                                                                                       [ 17%]
tests/test_healpix_ffts.py::test_healpix_fft_cuda[8966433580120847635-5] FAILED                                                                                                                                                       [ 23%]
tests/test_healpix_ffts.py::test_healpix_fft_cuda[8966433580120847635-6] PASSED                                                                                                                                                       [ 29%]
tests/test_healpix_ffts.py::test_healpix_fft_cuda[8966433580120847635-7] PASSED                                                                                                                                                       [ 35%]
tests/test_healpix_ffts.py::test_healpix_fft_cuda[8966433580120847635-8] PASSED                                                                                                                                                       [ 41%]
tests/test_healpix_ffts.py::test_healpix_fft_cuda[8966433580120847635-9] PASSED                                                                                                                                                       [ 47%]
tests/test_healpix_ffts.py::test_healpix_fft_cuda[8966433580120847635-10] PASSED                                                                                                                                                      [ 52%]
tests/test_healpix_ffts.py::test_healpix_fft_cuda[8966433580120847635-11] PASSED                                                                                                                                                      [ 58%]
tests/test_healpix_ffts.py::test_healpix_fft_cuda[8966433580120847635-12] FAILED                                                                                                                                                      [ 64%]
tests/test_healpix_ffts.py::test_healpix_fft_cuda[8966433580120847635-13] PASSED                                                                                                                                                      [ 70%]
tests/test_healpix_ffts.py::test_healpix_fft_cuda[8966433580120847635-14] FAILED                                                                                                                                                      [ 76%]
tests/test_healpix_ffts.py::test_healpix_fft_cuda[8966433580120847635-15] FAILED                                                                                                                                                      [ 82%]
tests/test_healpix_ffts.py::test_healpix_fft_cuda[8966433580120847635-16] FAILED                                                                                                                                                      [ 88%]
tests/test_healpix_ffts.py::test_healpix_fft_cuda[8966433580120847635-32] FAILED                                                                                                                                                      [ 94%]
tests/test_healpix_ffts.py::test_healpix_fft_cuda[8966433580120847635-64] FAILED                                                                                                                                                      [100%]

and then

tests/test_healpix_ffts.py::test_healpix_fft_cuda[8966433580120847635-2] PASSED                                                                                                                                                       [  5%]
tests/test_healpix_ffts.py::test_healpix_fft_cuda[8966433580120847635-3] PASSED                                                                                                                                                       [ 11%]
tests/test_healpix_ffts.py::test_healpix_fft_cuda[8966433580120847635-4] PASSED                                                                                                                                                       [ 17%]
tests/test_healpix_ffts.py::test_healpix_fft_cuda[8966433580120847635-5] PASSED                                                                                                                                                       [ 23%]
tests/test_healpix_ffts.py::test_healpix_fft_cuda[8966433580120847635-6] PASSED                                                                                                                                                       [ 29%]
tests/test_healpix_ffts.py::test_healpix_fft_cuda[8966433580120847635-7] PASSED                                                                                                                                                       [ 35%]
tests/test_healpix_ffts.py::test_healpix_fft_cuda[8966433580120847635-8] PASSED                                                                                                                                                       [ 41%]
tests/test_healpix_ffts.py::test_healpix_fft_cuda[8966433580120847635-9] PASSED                                                                                                                                                       [ 47%]
tests/test_healpix_ffts.py::test_healpix_fft_cuda[8966433580120847635-10] PASSED                                                                                                                                                      [ 52%]
tests/test_healpix_ffts.py::test_healpix_fft_cuda[8966433580120847635-11] PASSED                                                                                                                                                      [ 58%]
tests/test_healpix_ffts.py::test_healpix_fft_cuda[8966433580120847635-12] FAILED                                                                                                                                                      [ 64%]
tests/test_healpix_ffts.py::test_healpix_fft_cuda[8966433580120847635-13] FAILED                                                                                                                                                      [ 70%]
tests/test_healpix_ffts.py::test_healpix_fft_cuda[8966433580120847635-14] PASSED                                                                                                                                                      [ 76%]
tests/test_healpix_ffts.py::test_healpix_fft_cuda[8966433580120847635-15] FAILED                                                                                                                                                      [ 82%]
tests/test_healpix_ffts.py::test_healpix_fft_cuda[8966433580120847635-16] PASSED                                                                                                                                                      [ 88%]
tests/test_healpix_ffts.py::test_healpix_fft_cuda[8966433580120847635-32] FAILED                                                                                                                                                      [ 94%]
tests/test_healpix_ffts.py::test_healpix_fft_cuda[8966433580120847635-64] FAILED                                                                                                                                                      [100%]

with for example nside=5 failing in the first run and not the second, and nside=13 failing in the second run but not the first.

The last point in particular makes me think this is something to do with unsafe memory access. I have been looking through the code but haven't spotted anything obvious so far. Unfortunately I also don't seem to be able to get the version before changes in this PR to run - while I can build the CUDA extension module, when running the tests on a GPU the process hangs indefinitely when reaching any calls to the custom primitives.

matt-graham · 2025-07-28T17:18:44Z

After a bit more investigation I suspect the non-determinancy and inconsistency issues are arising due a race condition in the application of FFT shifting in the forward transform using the shift_normalize_kernel in s2fft_kernels.cu.

Specifically in

s2fft/lib/src/s2fft_kernels.cu

Lines 339 to 344 in c27dc7e

    
           // Step 4a: Compute shifted position within ring 
        
           long long int shifted_o = (o + nphi / 2) % nphi; 
        
           shifted_o = shifted_o < 0 ? nphi + shifted_o : shifted_o; 
        
           long long int dest_p = r_start + shifted_o; 
        
           // printf(" -> CUDA: Applying shift: p=%lld, dest_p=%lld, shifted_o=%lld\n", p, dest_p, shifted_o); 
        
           data[dest_p] = element;

the (normalized) complex value stored in element is written to index dest_p of the data array. While within a block, as each thread reads data[p] in to element earlier in

s2fft/lib/src/s2fft_kernels.cu

Line 322 in c27dc7e

complex element = data[p];

and so can safely write to dest_p indices corresponding to other threads within the same block due to the thread synchronisation operation between the read in to element and write to data[dest_p],

s2fft/lib/src/s2fft_kernels.cu

Line 335 in c27dc7e

__syncthreads(); // Ensure all threads have completed normalization

if the dest_p indices instead maps to a pixel index corresponding to p in a different block, then writing to data[dest_p] may occur before the threads in that block have read dest[p] into element, meaning the final values in data will depend on the block execution order.

Supporting this hypothesis - if we increase the block size to 1024 in launch_shift_normalize_kernel in

s2fft/lib/src/s2fft_kernels.cu

Line 438 in c27dc7e

int block_size = 256;

thus decreasing the number of blocks in grid and so cross-block race conditions, the test_healpix_ffts.py::test_healpix_fft_cuda test passes for all nside in list(range(2, 16)) + [16, 32, 64] consistently, but still fails sometimes if we set nside large enough - for example I got failures for nside = 256 with high likelihood. Conversely changing the block size to be smaller, for example block_size = 32, we start getting failures even for the smallest nside values.

A simple but potentially memory inefficient solution would be to write out the shifted values to a different array rather than updating in-place.

Bumps [pypa/cibuildwheel](https://github.com/pypa/cibuildwheel) from 2.23.3 to 3.0.0. - [Release notes](https://github.com/pypa/cibuildwheel/releases) - [Changelog](https://github.com/pypa/cibuildwheel/blob/main/docs/changelog.md) - [Commits](pypa/cibuildwheel@v2.23.3...v3.0.0) --- updated-dependencies: - dependency-name: pypa/cibuildwheel dependency-version: 3.0.0 dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Update python_requires and test matrix * Ruff autofixes for type hints with 3.11+ features * Use miniforge to install pytorch / healpy on MacOS * Try using conda-pypi to install dependencies on MacOS * Manually specify dependencies to install with conda * Fix pytorch conda package name and skip PyPI dependency install on MacOS * Add tmate step to allow debugging * Remove tmate and use explicit shell * Set explicit shell options as default for job + relax NumPy requirement * Readd upper bound on NumPy version * Exclude Python 3.13 on MacOS from matrix

* Update Python version used in docs workflow * Trigger docs workflow on pull-requests * Deploy only on push to main

Bumps [pypa/cibuildwheel](https://github.com/pypa/cibuildwheel) from 3.0.0 to 3.0.1. - [Release notes](https://github.com/pypa/cibuildwheel/releases) - [Changelog](https://github.com/pypa/cibuildwheel/blob/main/docs/changelog.md) - [Commits](pypa/cibuildwheel@v3.0.0...v3.0.1) --- updated-dependencies: - dependency-name: pypa/cibuildwheel dependency-version: 3.0.1 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

Bumps [pypa/cibuildwheel](https://github.com/pypa/cibuildwheel) from 3.0.1 to 3.1.3. - [Release notes](https://github.com/pypa/cibuildwheel/releases) - [Changelog](https://github.com/pypa/cibuildwheel/blob/main/docs/changelog.md) - [Commits](pypa/cibuildwheel@v3.0.1...v3.1.3) --- updated-dependencies: - dependency-name: pypa/cibuildwheel dependency-version: 3.1.3 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Update custom_ops.py Small compatibility change which disables jitting on the s2fft side, in turn enables higher level jitting in s2ai. * Update custom_ops.py Removed commented lines for linting purposes * Removing now unused imports --------- Co-authored-by: Matt Graham <[email protected]>

Bumps [actions/checkout](https://github.com/actions/checkout) from 4.2.2 to 5.0.0. - [Release notes](https://github.com/actions/checkout/releases) - [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md) - [Commits](actions/checkout@v4.2.2...v5.0.0) --- updated-dependencies: - dependency-name: actions/checkout dependency-version: 5.0.0 dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

Bumps [actions/download-artifact](https://github.com/actions/download-artifact) from 4 to 5. - [Release notes](https://github.com/actions/download-artifact/releases) - [Commits](actions/download-artifact@v4...v5) --- updated-dependencies: - dependency-name: actions/download-artifact dependency-version: '5' dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

Bumps [pypa/cibuildwheel](https://github.com/pypa/cibuildwheel) from 3.1.3 to 3.1.4. - [Release notes](https://github.com/pypa/cibuildwheel/releases) - [Changelog](https://github.com/pypa/cibuildwheel/blob/main/docs/changelog.md) - [Commits](pypa/cibuildwheel@v3.1.3...v3.1.4) --- updated-dependencies: - dependency-name: pypa/cibuildwheel dependency-version: 3.1.4 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

ASKabalan · 2025-11-11T02:25:36Z

@matt-graham
Took some time for to get back at this
But I think that the race condition is now fixed

…rms.

ASKabalan · 2025-11-11T16:07:21Z

@matt-graham I think this is now good to go
I have issues with mac os but I am not sure I am the root cause

Otherwise I made a much more robust out of place shift in the CUDA kernels by fusing two of the kernels I had earlier

In short there is no longer need for synchronization
This can be merged IMO (it also works with latests JAX)

ASKabalan added 4 commits March 26, 2025 17:18

Update JAX Binding to use FFI

2fd3c8a

Update JAX Primitive to accept is_linear

2b591ca

Update healpix_ffts to use new FFI lowered cuda healpix ffts

8fe86c2

Update benchmarks

933ac2a

ASKabalan marked this pull request as draft March 26, 2025 16:25

ASKabalan added 5 commits March 28, 2025 16:54

Update Pyproject.toml and build to include FFI headers

e2cc68c

Implement VMAP and transpose rules for cuda primitive

b5cbeac

Update JAX binding layer

9e0f121

add vmap jacrev and jacfwd tests

92fe6a0

Fix build without CUDA NVCC

a70b262

ASKabalan marked this pull request as ready for review March 28, 2025 16:08

ASKabalan requested review from CosmoMatt and matt-graham March 28, 2025 16:08

ASKabalan mentioned this pull request Mar 31, 2025

Check autodiff and batching support for healpix_fft_cuda primitive and add if needed #237

Open

matt-graham reviewed Apr 11, 2025

View reviewed changes

ASKabalan and others added 2 commits April 16, 2025 11:17

Implement requested changes

0e03787

Update tests/test_healpix_ffts.py

6f6c07e

Co-authored-by: Matt Graham <[email protected]>

matt-graham linked an issue Apr 16, 2025 that may be closed by this pull request

Check autodiff and batching support for healpix_fft_cuda primitive and add if needed #237

Open

matt-graham mentioned this pull request Apr 23, 2025

Tests failing when running with JAX v0.6.0 due to breaking changes #299

Closed

ASKabalan marked this pull request as draft May 10, 2025 22:09

Merge remote-tracking branch 'origin/main' into ASKabalan

f8a9a6d

don't include ffi headers if cuda is not available

866d1f2

ASKabalan added 2 commits June 28, 2025 13:41

remove strict requirement on JAX being less than 0.6.0

fd7860e

ASKabalan added 4 commits July 2, 2025 18:17

code works

b75c0ce

Updating CUDA extension and removing CUFFT callbacks

00b169c

remvove callback params workspace

9775bba

format

fb8d0df

ASKabalan marked this pull request as ready for review July 2, 2025 16:57

matt-graham reviewed Jul 9, 2025

View reviewed changes

matt-graham mentioned this pull request Sep 15, 2025

Error using s2fft with healpix in GPU #326

Open

dependabot bot and others added 11 commits November 11, 2025 03:13

Update Python version used in docs workflow (#314)

ba5a531

* Update Python version used in docs workflow * Trigger docs workflow on pull-requests * Deploy only on push to main

Fix race condition error and update notebook

928ea12

fix pyproject.toml

bca4837

ASKabalan force-pushed the ASKabalan branch from c27dc7e to bca4837 Compare November 11, 2025 02:16

ASKabalan added 2 commits November 11, 2025 03:17

Merge remote-tracking branch 'upstream/main' into ASKabalan

50e2840

Remove nano_bind helpers reference for License section

757e022

ASKabalan added 4 commits November 11, 2025 16:49

Fuse normalize and shift kernels for both forward and inverse transfo…

6400681

…rms.

format

77cbc96

Update notebooks/JAX_CUDA_HEALPix.ipynb

2a2e6e7

fix pre-commit

3bcb69a

		flm_hp = samples.flm_2d_to_hp(flm, L)
		f = hp.sphtfunc.alm2map(flm_hp, nside, lmax=L - 1)

	The file [`lib/include/kernel_helpers.h`](https://github.com/astro-informatics/s2fft/blob/main/lib/include/kernel_helpers.h) is adapted from
	[code](https://github.com/dfm/extending-jax/blob/c33869665236877a2ae281f3f5dbff579e8f5b00/lib/kernel_helpers.h) in [a tutorial on extending JAX](https://github.com/dfm/extending-jax) by
	[Dan Foreman-Mackey](https://github.com/dfm) and licensed under a [MIT license](https://github.com/dfm/extending-jax/blob/371dca93c6405368fa8e71690afd3968d75f4bac/LICENSE).

	The file [`lib/include/kernel_nanobind_helpers.h`](https://github.com/astro-informatics/s2fft/blob/main/lib/include/kernel_nanobind_helpers.h)
	is adapted from [code](https://github.com/jax-ml/jax/blob/3d389a7fb440c412d95a1f70ffb91d58408247d0/jaxlib/kernel_nanobind_helpers.h)
	by the [JAX](https://github.com/jax-ml/jax) authors
	and licensed under a [Apache-2.0 license](https://github.com/jax-ml/jax/blob/3d389a7fb440c412d95a1f70ffb91d58408247d0/LICENSE).

Updating Healpix CUDA primitive #290

Are you sure you want to change the base?

Updating Healpix CUDA primitive #290

Uh oh!

Conversation

ASKabalan commented Mar 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ASKabalan commented Mar 28, 2025

Uh oh!

matt-graham left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

matt-graham Apr 11, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

matt-graham commented Apr 16, 2025

Uh oh!

ASKabalan commented Apr 18, 2025

Uh oh!

codecov bot commented Jun 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

ASKabalan commented Jun 20, 2025

Uh oh!

matt-graham commented Jun 23, 2025

Uh oh!

ASKabalan commented Jul 2, 2025

Uh oh!

matt-graham left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

matt-graham Jul 9, 2025

Choose a reason for hiding this comment

Uh oh!

matt-graham Jul 9, 2025

Choose a reason for hiding this comment

Uh oh!

matt-graham commented Jul 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

matt-graham commented Jul 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ASKabalan commented Nov 11, 2025

Uh oh!

ASKabalan commented Nov 11, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

ASKabalan commented Mar 26, 2025 •

edited

Loading

codecov bot commented Jun 19, 2025 •

edited

Loading

matt-graham commented Jul 25, 2025 •

edited

Loading

matt-graham commented Jul 28, 2025 •

edited

Loading