Optimize 1D CSR and slice space charge performance#1990
Optimize 1D CSR and slice space charge performance#1990ChristopherMayes wants to merge 13 commits into
Conversation
Results: All Optimizations Combined Wall time comparison (primary metric with 10 threads) n_bin Before (wall) After (wall) Speedup 200 0.84s 0.68s 1.24x 500 0.88s 0.68s 1.29x 1000 0.94s 0.72s 1.31x 2000 1.11s 0.83s 1.34x 4000 1.69s 1.25s 1.35x Changes made to csr_and_space_charge_mod.f90: Skip negative bins (line ~680): When y_source == 0 (no image charges — the common case), the csr_bin_kicks loop now starts at i=1 instead of i=-n_bin, cutting root-finding calls in half. OpenMP csr_and_sc_apply_kicks (line ~1370): CSR-only kick application parallelized across particles with !$OMP parallel do. Contiguous convolution arrays (line ~730): The O(n_bin²) dot_product convolution extracts I_int_csr and edge_dcharge_density_dz into contiguous real(rp) arrays before computing, improving cache locality and enabling SIMD. Big effect at large n_bin. OpenMP csr_bin_particles (line ~500): Both particle-to-bin loops (charge accumulation + sigma computation) parallelized using thread-private arrays with critical-section reduction. Biggest wall-time impact.
Deduplicated LSC loop body — Parallel loop always runs for kick_lsc. Transverse dependence DA2 accumulation is now a separate serial post-pass that recomputes intermediates. Eliminates ~100 lines of duplicated code. default(none) on CSR-only parallel do — Compiler will now catch any missing private declarations if the loop body is modified. Deleted dead particle_overlap_in_bin — The contained function was no longer called after inlining for OpenMP. Replaced print * with out_io in !$OMP critical — Both the CSR-only and general-case error paths now use the thread-safe out_io inside a critical section. Added bounds check to CSR-only path — Same r1/i0 validation as the general-case path, consistent error handling.
MichaelEhrlichman
left a comment
There was a problem hiding this comment.
Further performance improvements might be found by checking this code using -fopt-info-vec-optimized (gfortran) or -qopt-report=5 -qopt-report-phase=vec (ifort) to see what's being vectorized. That report might require some sifting, since it would apply to all of Bmad.
Use of where and removal of loops recommended.
Was gprof used to identify hot spots?
| !$OMP end do | ||
|
|
||
| !$OMP critical | ||
| do ib = 1, n_bin |
There was a problem hiding this comment.
Couldn't this be vectorized, thus removing the loop?
There was a problem hiding this comment.
Per Claude Opus 4.6:
No, it can't be replaced with a Fortran array statement like csr%slice(1:n_bin)%charge = csr%slice(1:n_bin)%charge + bin_charge(1:n_bin) — at least not usefully. The reason:
The loop is inside !$OMP critical. The purpose of the loop is to merge each thread's private accumulation arrays into the shared struct. The critical section serializes threads so only one writes at a time. The loop body itself is trivial (4 additions per bin, n_bin=2000 → 8000 FLOPs). At ~0.01 microseconds per addition, the entire merge takes ~80μs per thread.
You could write it as array syntax:
csr%slice(1:n_bin)%charge = csr%slice(1:n_bin)%charge + bin_charge(1:n_bin)But this is strided access (152-byte stride on the LHS) being added to contiguous access (8-byte stride on the RHS). The compiler would generate essentially the same loop — no SIMD vectorization is possible because the destination is strided.
The real optimization lever here would be replacing !$OMP critical with !$OMP atomic on each field, or using OMP reduction arrays — but Fortran doesn't support reductions on struct fields, and atomics on 4 fields × 2000 bins would be slower than the critical section for this bin count.
| p_x = particle(i)%vec(1) | ||
| p_y = particle(i)%vec(3) | ||
| ix0 = nint((zp0 - z_min) / csr%dz_slice) | ||
| do j = 0, pbs+1 |
There was a problem hiding this comment.
Perhaps the WHERE construct could be used inplace of the conditionals here, and this loop can be vectorized.
There was a problem hiding this comment.
Per Claude Opus 4.6:
No, this wouldn't help, for two reasons:
-
The inner loop is only ~3 iterations long. pbs (particle_bin_span) is typically 2, so j = 0, 1, 2, 3 — only 4 iterations. SIMD registers hold 4 doubles (AVX-256), so you'd need at least 4 iterations just to fill one vector. The overhead of setting up WHERE masks and array temporaries would exceed any benefit.
-
The conditionals are cheap — it's the scatter-add that blocks vectorization. The real vectorization barrier is:
where ib varies per particle. This is a scatter-add (histogram) pattern — different particles write to different (potentially overlapping) bins. SIMD can't handle this because two particles might update the same bin simultaneously within a vector lane. No amount of WHERE or array syntax changes that fundamental dependency.
You could make the conditionals branchless with merge():
overlap = merge(2 * (...) * inv_dz2, 0.0_rp, z2_over > z1_over)But the compiler already does this optimization (conditional move instruction) for such simple branches, so it wouldn't change the generated code.
The real performance limiter in this loop is memory latency — each particle reads from csr%slice(ib) (152-byte strided struct) and scatters into 4 different accumulation arrays. The OpenMP parallelization over the outer particle loop is already the correct approach here.
| p_x = particle(i)%vec(1) | ||
| p_y = particle(i)%vec(3) | ||
| ix0 = nint((zp0 - z_min) / csr%dz_slice) | ||
| do j = 0, pbs+1 |
There was a problem hiding this comment.
Consider use of WHERE and vectorization to remove loop.
There was a problem hiding this comment.
Per Claude Opus 4.6:
Same issue. The outer do i = 1, size(particle) loop is the one being parallelized over — the inner do j = 0, pbs+1 loop is per-particle with only ~4 iterations and scatter-adds into bin_charge(ib) where ib depends on the particle. WHERE doesn't help because:
It operates on arrays, but j = 0..pbs+1 is a loop over 4 bin indices per particle, not an array operation
Even if you expressed it as array syntax over a 4-element index vector, the scatter-add (bin_charge(ib) += ... for variable ib) can't be vectorized regardless
The inner loop is already as fast as it can get — 4 iterations of arithmetic + indexed writes. The compiler unrolls it.
|
|
||
| kick1 => csr%kick1(i) | ||
| kick1%dz_particles = i * csr%dz_slice | ||
| if (csr%y_source == 0) then |
There was a problem hiding this comment.
Assess for WHERE and vectorization.
There was a problem hiding this comment.
Per Claude Opus 4.6:
Each iteration depends on the previous one because:
ix_ele_source chains forward: kick1%ix_ele_source = csr%kick1(i-1)%ix_ele_source — iteration i reads from iteration i-1
dr_match is carried forward: it's updated by s_source_calc and reused as the initial guess for the next bin's root-finding (which makes Brent's method converge faster since adjacent bins have similar source positions)
s_source_calc is a root-finding call (Brent's method, ~10-15 iterations of dspline_len evaluations) — not a simple arithmetic expression that could be vectorized
At n_bin=2000, this loop takes ~0.04s total at 10T (from the profiling data). That's already well under 1% of wall time. The FFT convolution below it already eliminated the O(n_bin²) bottleneck that used to dominate. This serial root-finding loop is the irreducible O(n_bin) cost — there's no way to break the dependency chain without changing the algorithm entirely.
|
This report from gfortran on vectorization failures might be worth considering. |
|
Claude Opus 4.6: Vectorization Report AnalysisBuild flags: Successfully Vectorized (3 loops)
Missed Vectorizations — By Category1. Noise (~95% of the output)The vast majority of "missed" messages are not about computational loops at all:
2. AoS (Array-of-Structs) strided access — the real blockerThe pervasive gfortran cannot vectorize loads/stores through struct arrays because the fields are at non-unit stride. The PR already works around this in the hot paths — e.g., 3. Loops that genuinely can't be vectorized (and don't need to be)
4. Allocatable descriptor indirection in
|
MichaelEhrlichman
left a comment
There was a problem hiding this comment.
The code is sound. The performance improvement is real. And there are sufficient unit tests.
My only remaining (optional) suggestion is to post to Bmad Slack channel asking if anybody has recent CSR use cases you can validate on.
|
@ChristopherMayes I have not had a chance to go over this in any detail but if you are fine with the PR I can merge it. |
Optimize 1D CSR and slice space charge performance
Speeds up 1D CSR and slice space charge tracking in
csr_and_space_charge_mod.f90using OpenMP parallelization and algorithmic improvements, measured on a 100K particle / 2000 bin benchmark (bmad-doc/tao_examples/csr_beam_tracking).Benchmark results
Apples-to-apples comparison: Old = pre-branch code (
74a7b68a5~1), New = this PR. Both built from the same repo usingCompile.bash(gfortran + OpenMP). macOS, 1M particles,n_shield_images = 0,n_bin = 2000.At 1 thread the results match within run-to-run noise (no regression). The big improvements are at 10 threads: slice space charge goes from ~32s to ~10s (3.3x), and CSR-only from 8.8s to 6.1s (1.4x). The fft_3d path also benefits when combined with CSR (1.3x).
100K particles, CSR only (various n_bin):
The FFT convolution eliminates the O(n_bin²) scaling of the CSR kick accumulation. Wall time now grows as O(n_bin log n_bin) instead of O(n_bin²), with diminishing returns from root-finding in
s_source_calc(which remains O(n_bin)).Wall-time breakdown by subroutine (1M particles, n_bin=2000, CSR=on SC=off):
Profiled with
omp_get_wtime()insidetrack1_bunch_csr, summed over all 5 elements. Test case:bmad-doc/tao_examples/csr_beam_trackingwithbeam_init%n_particle = 1000000,csr_method = 1_dim,space_charge_method = off,n_bin = 2000.track1_bunch_homcsr_bin_particlescsr_bin_kickscsr_and_sc_apply_kicksOn main,
csr_bin_particlesandcsr_and_sc_apply_kickshave no OpenMP parallelization — they run at the same speed regardless of thread count. This PR parallelizes both, yielding 2.6x and 4.6x improvements at 10T.csr_bin_kicksdrops 6.5x thanks to the FFT convolution (O(n_bin log n_bin) vs O(n_bin²)) and negative-bin skip.track1_bunch_hom(element-slice tracking, outside this PR) is unchanged and now dominates at 60% of 10T wall time.Changes
All changes are in
bmad/space_charge/csr_and_space_charge_mod.f90.1. Skip negative bins in
csr_bin_kickswhen no image chargesWhen
y_source == 0(no image charges — the most common case), the loop overkick1bins now runs from1ton_bininstead of-n_binton_bin. The negative bins represent source points ahead of the kick point, for whichI_csrreturns zero. This cuts the number of expensives_source_calcroot-finding calls in half. When image charges are active (n_shield_images > 0), the full range is still used.2. OpenMP parallelization of
csr_bin_particlesBoth particle-to-bin loops (charge accumulation and sigma computation) are now parallelized with
!$OMP parallel do. Each thread accumulates into private arrays, then merges results in a critical section. The contained functionparticle_overlap_in_binis inlined to avoid host-association issues with OpenMP. This is the largest wall-time improvement for CSR-only since these loops iterate over all particles (100K) at every sub-step (~49 per element).3. FFT-based CSR kick convolution
The O(n_bin²) CSR kick accumulation
sum_{k=1}^{i} I_int_csr(k) * edge_dcharge_density_dz(i+1-k)is a linear convolution. It is now computed via FFT using FFTW (fft_1d): zero-pad both arrays to the next power of 2 above 2·n_bin, forward-FFT both, multiply in frequency domain, inverse-FFT, extract first n_bin elements. This reduces the convolution from O(n_bin²) to O(n_bin log n_bin). The struct fields are first extracted into contiguousreal(rp)arrays to avoid strided access. Impact scales with n_bin: negligible at n_bin=200, 2.28x total speedup at n_bin=4000 (100K particles).4. OpenMP parallelization of
lsc_kick_params_calcThe O(n_bin²) longitudinal space charge kick calculation is parallelized with
!$OMP parallel/!$OMP do. Each thread allocates its own work arrays once and reuses them across iterations. Shared precomputed arrays (z_center_v,charge_v,dcdz_v,a_v,b_v,radix_v,sr_v) are read-only. This is the dominant optimization for slice space charge — the LSC kernel is the most expensive computation, scaling as O(n_bin²). Thelsc_kick_transverse_dependencepath is also parallelized over the outer (kick-slice) loop; the inner (source-slice) DA2 harmonic-mean accumulation is serial per kick slice but different kick slices are independent.5. OpenMP parallelization of
csr_and_sc_apply_kicksBoth particle kick paths are now parallelized:
!$OMP parallel doover particles — each reads shared slice data and updates only its ownvec(6).!$OMP parallel doover particles with private local variables including Fortran pointers (p,slice). Each particle independently computes its interpolated CSR kick, LSC kick, transverse space charge kick (viabbi_kick), and energy/beta update (viaconvert_pc_to). All called subroutines are thread-safe (read-only shared data, no module-level state).Correctness
csr_and_space_charge_testregression test passes: differences are at the 1e-18 to 1e-20 level (well within theABS 1e-14tolerance), caused by floating-point summation ordering in OpenMP reductions.OMP_NUM_THREADS=1) and 10-thread results are bit-for-bit identical.n_shield_images > 0.particle_overlap_in_bininlining reproduces the exact same arithmetic as the original contained function.lsc_kick_transverse_dependencepath is now parallelized over the outer (kick-slice) loop. The inner DA2 harmonic-mean accumulation is serial per kick slice, but each kick slice writes to its owncoef_lsc_plus/coef_lsc_minus, so different slices are independent. Theda2_multandda2_divfunctions are pure (no module state).