Optimize 1D CSR and slice space charge performance by ChristopherMayes · Pull Request #1990 · bmad-sim/bmad-ecosystem

ChristopherMayes · 2026-03-21T07:09:13Z

Optimize 1D CSR and slice space charge performance

Speeds up 1D CSR and slice space charge tracking in csr_and_space_charge_mod.f90 using OpenMP parallelization and algorithmic improvements, measured on a 100K particle / 2000 bin benchmark (bmad-doc/tao_examples/csr_beam_tracking).

Benchmark results

Apples-to-apples comparison: Old = pre-branch code (74a7b68a5~1), New = this PR. Both built from the same repo using Compile.bash (gfortran + OpenMP). macOS, 1M particles, n_shield_images = 0, n_bin = 2000.

Configuration	Old 1T	Old 10T	New 1T	New 10T	Speedup (Old 10T → New 10T)
CSR=off SC=off	4.61s	2.87s	4.61s	2.86s	1.00x
CSR=on SC=off	17.17s	8.82s	16.98s	6.13s	1.44x
CSR=off SC=slice	39.96s	32.18s	41.33s	9.88s	3.26x
CSR=on SC=slice	39.97s	33.03s	40.58s	9.90s	3.34x
CSR=off SC=fft_3d	22.37s	8.40s	22.84s	8.14s	1.03x
CSR=on SC=fft_3d	25.98s	12.16s	26.08s	9.31s	1.31x

At 1 thread the results match within run-to-run noise (no regression). The big improvements are at 10 threads: slice space charge goes from ~32s to ~10s (3.3x), and CSR-only from 8.8s to 6.1s (1.4x). The fft_3d path also benefits when combined with CSR (1.3x).

100K particles, CSR only (various n_bin):

n_bin	Old (wall)	New (wall)	Speedup
200	0.84s	0.69s	1.22x
500	0.88s	0.66s	1.33x
1000	0.94s	0.67s	1.40x
2000	1.11s	0.70s	1.59x
4000	1.69s	0.74s	2.28x
8000	—	0.86s	—
16000	—	1.04s	—

The FFT convolution eliminates the O(n_bin²) scaling of the CSR kick accumulation. Wall time now grows as O(n_bin log n_bin) instead of O(n_bin²), with diminishing returns from root-finding in s_source_calc (which remains O(n_bin)).

Wall-time breakdown by subroutine (1M particles, n_bin=2000, CSR=on SC=off):

Profiled with omp_get_wtime() inside track1_bunch_csr, summed over all 5 elements. Test case: bmad-doc/tao_examples/csr_beam_tracking with beam_init%n_particle = 1000000, csr_method = 1_dim, space_charge_method = off, n_bin = 2000.

Subroutine	Main 1T	Main 10T	PR 1T	PR 10T	Speedup (Main 10T → PR 10T)
`track1_bunch_hom`	10.68	2.18	10.66	2.22	1.0x
`csr_bin_particles`	2.73	2.78	2.60	1.05	2.6x
`csr_bin_kicks`	0.26	0.26	0.04	0.04	6.5x
`csr_and_sc_apply_kicks`	0.73	0.74	0.69	0.16	4.6x
Total	14.64	6.21	14.23	3.71	1.67x

On main, csr_bin_particles and csr_and_sc_apply_kicks have no OpenMP parallelization — they run at the same speed regardless of thread count. This PR parallelizes both, yielding 2.6x and 4.6x improvements at 10T. csr_bin_kicks drops 6.5x thanks to the FFT convolution (O(n_bin log n_bin) vs O(n_bin²)) and negative-bin skip. track1_bunch_hom (element-slice tracking, outside this PR) is unchanged and now dominates at 60% of 10T wall time.

Changes

All changes are in bmad/space_charge/csr_and_space_charge_mod.f90.

1. Skip negative bins in csr_bin_kicks when no image charges

When y_source == 0 (no image charges — the most common case), the loop over kick1 bins now runs from 1 to n_bin instead of -n_bin to n_bin. The negative bins represent source points ahead of the kick point, for which I_csr returns zero. This cuts the number of expensive s_source_calc root-finding calls in half. When image charges are active (n_shield_images > 0), the full range is still used.

2. OpenMP parallelization of csr_bin_particles

Both particle-to-bin loops (charge accumulation and sigma computation) are now parallelized with !$OMP parallel do. Each thread accumulates into private arrays, then merges results in a critical section. The contained function particle_overlap_in_bin is inlined to avoid host-association issues with OpenMP. This is the largest wall-time improvement for CSR-only since these loops iterate over all particles (100K) at every sub-step (~49 per element).

3. FFT-based CSR kick convolution

The O(n_bin²) CSR kick accumulation sum_{k=1}^{i} I_int_csr(k) * edge_dcharge_density_dz(i+1-k) is a linear convolution. It is now computed via FFT using FFTW (fft_1d): zero-pad both arrays to the next power of 2 above 2·n_bin, forward-FFT both, multiply in frequency domain, inverse-FFT, extract first n_bin elements. This reduces the convolution from O(n_bin²) to O(n_bin log n_bin). The struct fields are first extracted into contiguous real(rp) arrays to avoid strided access. Impact scales with n_bin: negligible at n_bin=200, 2.28x total speedup at n_bin=4000 (100K particles).

4. OpenMP parallelization of lsc_kick_params_calc

The O(n_bin²) longitudinal space charge kick calculation is parallelized with !$OMP parallel / !$OMP do. Each thread allocates its own work arrays once and reuses them across iterations. Shared precomputed arrays (z_center_v, charge_v, dcdz_v, a_v, b_v, radix_v, sr_v) are read-only. This is the dominant optimization for slice space charge — the LSC kernel is the most expensive computation, scaling as O(n_bin²). The lsc_kick_transverse_dependence path is also parallelized over the outer (kick-slice) loop; the inner (source-slice) DA2 harmonic-mean accumulation is serial per kick slice but different kick slices are independent.

5. OpenMP parallelization of csr_and_sc_apply_kicks

Both particle kick paths are now parallelized:

CSR-only path: Simple !$OMP parallel do over particles — each reads shared slice data and updates only its own vec(6).
General case (CSR + slice SC): !$OMP parallel do over particles with private local variables including Fortran pointers (p, slice). Each particle independently computes its interpolated CSR kick, LSC kick, transverse space charge kick (via bbi_kick), and energy/beta update (via convert_pc_to). All called subroutines are thread-safe (read-only shared data, no module-level state).

Correctness

csr_and_space_charge_test regression test passes: differences are at the 1e-18 to 1e-20 level (well within the ABS 1e-14 tolerance), caused by floating-point summation ordering in OpenMP reductions.
Single-thread (OMP_NUM_THREADS=1) and 10-thread results are bit-for-bit identical.
Image charge path is unchanged — full negative-to-positive bin range is preserved when n_shield_images > 0.
The particle_overlap_in_bin inlining reproduces the exact same arithmetic as the original contained function.
lsc_kick_transverse_dependence path is now parallelized over the outer (kick-slice) loop. The inner DA2 harmonic-mean accumulation is serial per kick slice, but each kick slice writes to its own coef_lsc_plus/coef_lsc_minus, so different slices are independent. The da2_mult and da2_div functions are pure (no module state).
FFT convolution introduces floating-point differences at the 1e-17 level in CSR kicks due to different summation ordering (atan-based vs FFT multiply-accumulate).

Results: All Optimizations Combined Wall time comparison (primary metric with 10 threads) n_bin Before (wall) After (wall) Speedup 200 0.84s 0.68s 1.24x 500 0.88s 0.68s 1.29x 1000 0.94s 0.72s 1.31x 2000 1.11s 0.83s 1.34x 4000 1.69s 1.25s 1.35x Changes made to csr_and_space_charge_mod.f90: Skip negative bins (line ~680): When y_source == 0 (no image charges — the common case), the csr_bin_kicks loop now starts at i=1 instead of i=-n_bin, cutting root-finding calls in half. OpenMP csr_and_sc_apply_kicks (line ~1370): CSR-only kick application parallelized across particles with !$OMP parallel do. Contiguous convolution arrays (line ~730): The O(n_bin²) dot_product convolution extracts I_int_csr and edge_dcharge_density_dz into contiguous real(rp) arrays before computing, improving cache locality and enabling SIMD. Big effect at large n_bin. OpenMP csr_bin_particles (line ~500): Both particle-to-bin loops (charge accumulation + sigma computation) parallelized using thread-private arrays with critical-section reduction. Biggest wall-time impact.

Deduplicated LSC loop body — Parallel loop always runs for kick_lsc. Transverse dependence DA2 accumulation is now a separate serial post-pass that recomputes intermediates. Eliminates ~100 lines of duplicated code. default(none) on CSR-only parallel do — Compiler will now catch any missing private declarations if the loop body is modified. Deleted dead particle_overlap_in_bin — The contained function was no longer called after inlining for OpenMP. Replaced print * with out_io in !$OMP critical — Both the CSR-only and general-case error paths now use the thread-safe out_io inside a critical section. Added bounds check to CSR-only path — Same r1/i0 validation as the general-case path, consistent error handling.

MichaelEhrlichman

Further performance improvements might be found by checking this code using -fopt-info-vec-optimized (gfortran) or -qopt-report=5 -qopt-report-phase=vec (ifort) to see what's being vectorized. That report might require some sifting, since it would apply to all of Bmad.

Use of where and removal of loops recommended.

Was gprof used to identify hot spots?

MichaelEhrlichman · 2026-04-01T14:26:07Z

+!$OMP end do
+
+!$OMP critical
+do ib = 1, n_bin


Couldn't this be vectorized, thus removing the loop?

Per Claude Opus 4.6:

No, it can't be replaced with a Fortran array statement like csr%slice(1:n_bin)%charge = csr%slice(1:n_bin)%charge + bin_charge(1:n_bin) — at least not usefully. The reason:

The loop is inside !$OMP critical. The purpose of the loop is to merge each thread's private accumulation arrays into the shared struct. The critical section serializes threads so only one writes at a time. The loop body itself is trivial (4 additions per bin, n_bin=2000 → 8000 FLOPs). At ~0.01 microseconds per addition, the entire merge takes ~80μs per thread.

You could write it as array syntax:

csr%slice(1:n_bin)%charge = csr%slice(1:n_bin)%charge + bin_charge(1:n_bin)

But this is strided access (152-byte stride on the LHS) being added to contiguous access (8-byte stride on the RHS). The compiler would generate essentially the same loop — no SIMD vectorization is possible because the destination is strided.

The real optimization lever here would be replacing !$OMP critical with !$OMP atomic on each field, or using OMP reduction arrays — but Fortran doesn't support reductions on struct fields, and atomics on 4 fields × 2000 bins would be slower than the critical section for this bin count.

MichaelEhrlichman · 2026-04-01T14:29:00Z

+  p_x = particle(i)%vec(1)
+  p_y = particle(i)%vec(3)
+  ix0 = nint((zp0 - z_min) / csr%dz_slice)
+  do j = 0, pbs+1


Perhaps the WHERE construct could be used inplace of the conditionals here, and this loop can be vectorized.

Per Claude Opus 4.6:

No, this wouldn't help, for two reasons:

The inner loop is only ~3 iterations long. pbs (particle_bin_span) is typically 2, so j = 0, 1, 2, 3 — only 4 iterations. SIMD registers hold 4 doubles (AVX-256), so you'd need at least 4 iterations just to fill one vector. The overhead of setting up WHERE masks and array temporaries would exceed any benefit.

The conditionals are cheap — it's the scatter-add that blocks vectorization. The real vectorization barrier is:

where ib varies per particle. This is a scatter-add (histogram) pattern — different particles write to different (potentially overlapping) bins. SIMD can't handle this because two particles might update the same bin simultaneously within a vector lane. No amount of WHERE or array syntax changes that fundamental dependency.

You could make the conditionals branchless with merge():

overlap = merge(2 * (...) * inv_dz2, 0.0_rp, z2_over > z1_over)

But the compiler already does this optimization (conditional move instruction) for such simple branches, so it wouldn't change the generated code.

The real performance limiter in this loop is memory latency — each particle reads from csr%slice(ib) (152-byte strided struct) and scatters into 4 different accumulation arrays. The OpenMP parallelization over the outer particle loop is already the correct approach here.

MichaelEhrlichman · 2026-04-01T14:29:43Z

+  p_x = particle(i)%vec(1)
+  p_y = particle(i)%vec(3)
+  ix0 = nint((zp0 - z_min) / csr%dz_slice)
+  do j = 0, pbs+1


Consider use of WHERE and vectorization to remove loop.

Per Claude Opus 4.6:

Same issue. The outer do i = 1, size(particle) loop is the one being parallelized over — the inner do j = 0, pbs+1 loop is per-particle with only ~4 iterations and scatter-adds into bin_charge(ib) where ib depends on the particle. WHERE doesn't help because:

It operates on arrays, but j = 0..pbs+1 is a loop over 4 bin indices per particle, not an array operation
Even if you expressed it as array syntax over a 4-element index vector, the scatter-add (bin_charge(ib) += ... for variable ib) can't be vectorized regardless
The inner loop is already as fast as it can get — 4 iterations of arithmetic + indexed writes. The compiler unrolls it.

MichaelEhrlichman · 2026-04-01T14:31:39Z

-
-  kick1 => csr%kick1(i)
-  kick1%dz_particles = i * csr%dz_slice
+if (csr%y_source == 0) then


Assess for WHERE and vectorization.

Per Claude Opus 4.6:

Each iteration depends on the previous one because:

ix_ele_source chains forward: kick1%ix_ele_source = csr%kick1(i-1)%ix_ele_source — iteration i reads from iteration i-1
dr_match is carried forward: it's updated by s_source_calc and reused as the initial guess for the next bin's root-finding (which makes Brent's method converge faster since adjacent bins have similar source positions)
s_source_calc is a root-finding call (Brent's method, ~10-15 iterations of dspline_len evaluations) — not a simple arithmetic expression that could be vectorized
At n_bin=2000, this loop takes ~0.04s total at 10T (from the profiling data). That's already well under 1% of wall time. The FFT convolution below it already eliminated the O(n_bin²) bottleneck that used to dominate. This serial root-finding loop is the irreducible O(n_bin) cost — there's no way to break the dependency chain without changing the algorithm entirely.

MichaelEhrlichman · 2026-04-01T15:34:03Z

This report from gfortran on vectorization failures might be worth considering.
vec_report.txt

ChristopherMayes · 2026-04-01T17:58:29Z

Claude Opus 4.6:

Vectorization Report Analysis

Build flags: -O2 -fopenmp -fopt-info-vec-optimized -fopt-info-vec-missed (gfortran, Production)

Successfully Vectorized (3 loops)

Line	Location	Description
138	`track1_bunch_csr` init	Trivial memset-like zeroing
786	`csr_bin_kicks` FFT convolution	`fft_a = fft_a * fft_b` — only perf-relevant vectorized loop
1871	`track1_bunch_csr3d` init	Trivial memset-like zeroing

Missed Vectorizations — By Category

1. Noise (~95% of the output)

The vast majority of "missed" messages are not about computational loops at all:

__builtin_malloc/free/memcpy — Every allocate/deallocate of a thread-private array generates "statement clobbers memory" messages (lines 504, 505, 572, 573, 1148, 1149, 1211, 1212, etc.). This is allocation bookkeeping, not loop bodies.
_gfortran_runtime_error_at — Every allocatable generates runtime checks ("Attempting to allocate already allocated", "Attempt to DEALLOCATE unallocated"). Pure error-handling code.
__builtin_GOMP_barrier/critical_start/critical_end — OpenMP synchronization primitives. Expected and unavoidable.
out_io_line12/err_exit/_gfortran_st_write — Error messages and diagnostic I/O.

2. AoS (Array-of-Structs) strided access — the real blocker

The pervasive no vectype for stmt messages like:

no vectype for stmt: _273 = csr_bunch_slice_struct[0:][_272].n_particle
no vectype for stmt: _86 = csr_kick1_struct[0:][_84].i_int_csr
no vectype for stmt: _142 = csr_particle_position_struct[0:][_141].r[0]

gfortran cannot vectorize loads/stores through struct arrays because the fields are at non-unit stride. The PR already works around this in the hot paths — e.g., csr_bin_kicks copies csr%kick1(i)%I_int_csr and csr%slice(i)%edge_dcharge_density_dz into contiguous arrays before the FFT convolution, and lsc_kick_params_calc does the same for the LSC kernel arrays (sx_v, sy_v, charge_v, etc.).

3. Loops that genuinely can't be vectorized (and don't need to be)

Lines	Subroutine	Why	Impact
~510–535, 578–603	`csr_bin_particles` inner loop	`do j = 0, pbs+1` is only ~4 iterations with scatter-add via data-dependent `ib = j + ix0`	Irrelevant — outer `!$OMP do` over particles is where parallelism lives
716, 734	`csr_bin_kicks` / `s_source_calc`	Serial dependency chain with function calls (`spline1`, `dspline_len`, `super_zbrent`)	Only 0.04s at 10T (<1% wall time)
1510, 1531	`csr_and_sc_apply_kicks`	Opaque function calls (`da2_evaluate`, `convert_pc_to`, `bbi_kick`)	Already `!$OMP parallel do`
~1156–1263	`lsc_kick_params_calc` inner arrays	gfortran sees allocatable descriptors (`(*_ptr)[idx]`) and can't prove contiguity/no-alias	Already `!$OMP parallel do` over outer loop; inner arrays are ~2000 elements

4. Allocatable descriptor indirection in `lsc_kick_params_calc`

The whole-array operations on abs_z_v, z1_v, bcd_v, etc. (lines ~1156–1263) are reported as missed because gfortran sees them through allocatable descriptors ((*_ptr)[idx] pattern). These are contiguous arrays, but gfortran can't prove it through the descriptor indirection. Possible mitigations (contiguous attribute, !$GCC ivdep) would yield marginal benefit since the code is already OpenMP-parallelized over the outer loop.

Bottom Line

Nothing actionable. The report is ~95% noise from allocations, runtime checks, and error handling. The real vectorization misses are either:

AoS struct access — already mitigated by copying to contiguous arrays in hot paths
Allocatable descriptor indirection — marginal benefit since already OpenMP-parallelized
Fundamentally serial loops — data dependencies prevent vectorization regardless

The one loop that got vectorized and actually matters — the FFT element-wise product at line 786 (fft_a = fft_a * fft_b) — is the right one.

MichaelEhrlichman

The code is sound. The performance improvement is real. And there are sufficient unit tests.
My only remaining (optional) suggestion is to post to Bmad Slack channel asking if anybody has recent CSR use cases you can validate on.

DavidSagan · 2026-05-13T21:42:21Z

@ChristopherMayes I have not had a chance to go over this in any detail but if you are fine with the PR I can merge it.

ChristopherMayes added 6 commits March 21, 2026 00:06

slice space charge + csr performance improvements

061925c

Use FFT convolution

3ed8eab

Parallelize space_charge_com%lsc_kick_transverse_dependence

b7b8c2a

docstring typos

cb16cf8

ChristopherMayes marked this pull request as ready for review March 22, 2026 18:14

ChristopherMayes changed the title ~~CSR optimizations~~ Optimize 1D CSR and slice space charge performance Mar 22, 2026

ChristopherMayes requested review from DavidSagan and MichaelEhrlichman March 22, 2026 18:58

MichaelEhrlichman reviewed Apr 1, 2026

View reviewed changes

ChristopherMayes added 6 commits April 3, 2026 11:26

Merge branch 'main' into csr_optimizations

981f814

Merge branch 'main' into csr_optimizations

cef19a8

Merge branch 'main' into csr_optimizations

e322dcf

Merge branch 'main' into csr_optimizations

7062ee5

Merge branch 'main' into csr_optimizations

92da1aa

Merge branch 'main' into csr_optimizations

29f8694

MichaelEhrlichman approved these changes Apr 27, 2026

View reviewed changes

Merge branch 'main' into csr_optimizations

f29a310

Uh oh!

Conversation

ChristopherMayes commented Mar 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!