Skip to content

Optimize 1D CSR and slice space charge performance#1990

Open
ChristopherMayes wants to merge 13 commits into
mainfrom
csr_optimizations
Open

Optimize 1D CSR and slice space charge performance#1990
ChristopherMayes wants to merge 13 commits into
mainfrom
csr_optimizations

Conversation

@ChristopherMayes

@ChristopherMayes ChristopherMayes commented Mar 21, 2026

Copy link
Copy Markdown
Contributor

Optimize 1D CSR and slice space charge performance

Speeds up 1D CSR and slice space charge tracking in csr_and_space_charge_mod.f90 using OpenMP parallelization and algorithmic improvements, measured on a 100K particle / 2000 bin benchmark (bmad-doc/tao_examples/csr_beam_tracking).

Benchmark results

Apples-to-apples comparison: Old = pre-branch code (74a7b68a5~1), New = this PR. Both built from the same repo using Compile.bash (gfortran + OpenMP). macOS, 1M particles, n_shield_images = 0, n_bin = 2000.

Configuration Old 1T Old 10T New 1T New 10T Speedup (Old 10T → New 10T)
CSR=off SC=off 4.61s 2.87s 4.61s 2.86s 1.00x
CSR=on SC=off 17.17s 8.82s 16.98s 6.13s 1.44x
CSR=off SC=slice 39.96s 32.18s 41.33s 9.88s 3.26x
CSR=on SC=slice 39.97s 33.03s 40.58s 9.90s 3.34x
CSR=off SC=fft_3d 22.37s 8.40s 22.84s 8.14s 1.03x
CSR=on SC=fft_3d 25.98s 12.16s 26.08s 9.31s 1.31x

At 1 thread the results match within run-to-run noise (no regression). The big improvements are at 10 threads: slice space charge goes from ~32s to ~10s (3.3x), and CSR-only from 8.8s to 6.1s (1.4x). The fft_3d path also benefits when combined with CSR (1.3x).

100K particles, CSR only (various n_bin):

n_bin Old (wall) New (wall) Speedup
200 0.84s 0.69s 1.22x
500 0.88s 0.66s 1.33x
1000 0.94s 0.67s 1.40x
2000 1.11s 0.70s 1.59x
4000 1.69s 0.74s 2.28x
8000 0.86s
16000 1.04s

The FFT convolution eliminates the O(n_bin²) scaling of the CSR kick accumulation. Wall time now grows as O(n_bin log n_bin) instead of O(n_bin²), with diminishing returns from root-finding in s_source_calc (which remains O(n_bin)).

Wall-time breakdown by subroutine (1M particles, n_bin=2000, CSR=on SC=off):

Profiled with omp_get_wtime() inside track1_bunch_csr, summed over all 5 elements. Test case: bmad-doc/tao_examples/csr_beam_tracking with beam_init%n_particle = 1000000, csr_method = 1_dim, space_charge_method = off, n_bin = 2000.

Subroutine Main 1T Main 10T PR 1T PR 10T Speedup (Main 10T → PR 10T)
track1_bunch_hom 10.68 2.18 10.66 2.22 1.0x
csr_bin_particles 2.73 2.78 2.60 1.05 2.6x
csr_bin_kicks 0.26 0.26 0.04 0.04 6.5x
csr_and_sc_apply_kicks 0.73 0.74 0.69 0.16 4.6x
Total 14.64 6.21 14.23 3.71 1.67x

On main, csr_bin_particles and csr_and_sc_apply_kicks have no OpenMP parallelization — they run at the same speed regardless of thread count. This PR parallelizes both, yielding 2.6x and 4.6x improvements at 10T. csr_bin_kicks drops 6.5x thanks to the FFT convolution (O(n_bin log n_bin) vs O(n_bin²)) and negative-bin skip. track1_bunch_hom (element-slice tracking, outside this PR) is unchanged and now dominates at 60% of 10T wall time.

Changes

All changes are in bmad/space_charge/csr_and_space_charge_mod.f90.

1. Skip negative bins in csr_bin_kicks when no image charges

When y_source == 0 (no image charges — the most common case), the loop over kick1 bins now runs from 1 to n_bin instead of -n_bin to n_bin. The negative bins represent source points ahead of the kick point, for which I_csr returns zero. This cuts the number of expensive s_source_calc root-finding calls in half. When image charges are active (n_shield_images > 0), the full range is still used.

2. OpenMP parallelization of csr_bin_particles

Both particle-to-bin loops (charge accumulation and sigma computation) are now parallelized with !$OMP parallel do. Each thread accumulates into private arrays, then merges results in a critical section. The contained function particle_overlap_in_bin is inlined to avoid host-association issues with OpenMP. This is the largest wall-time improvement for CSR-only since these loops iterate over all particles (100K) at every sub-step (~49 per element).

3. FFT-based CSR kick convolution

The O(n_bin²) CSR kick accumulation sum_{k=1}^{i} I_int_csr(k) * edge_dcharge_density_dz(i+1-k) is a linear convolution. It is now computed via FFT using FFTW (fft_1d): zero-pad both arrays to the next power of 2 above 2·n_bin, forward-FFT both, multiply in frequency domain, inverse-FFT, extract first n_bin elements. This reduces the convolution from O(n_bin²) to O(n_bin log n_bin). The struct fields are first extracted into contiguous real(rp) arrays to avoid strided access. Impact scales with n_bin: negligible at n_bin=200, 2.28x total speedup at n_bin=4000 (100K particles).

4. OpenMP parallelization of lsc_kick_params_calc

The O(n_bin²) longitudinal space charge kick calculation is parallelized with !$OMP parallel / !$OMP do. Each thread allocates its own work arrays once and reuses them across iterations. Shared precomputed arrays (z_center_v, charge_v, dcdz_v, a_v, b_v, radix_v, sr_v) are read-only. This is the dominant optimization for slice space charge — the LSC kernel is the most expensive computation, scaling as O(n_bin²). The lsc_kick_transverse_dependence path is also parallelized over the outer (kick-slice) loop; the inner (source-slice) DA2 harmonic-mean accumulation is serial per kick slice but different kick slices are independent.

5. OpenMP parallelization of csr_and_sc_apply_kicks

Both particle kick paths are now parallelized:

  • CSR-only path: Simple !$OMP parallel do over particles — each reads shared slice data and updates only its own vec(6).
  • General case (CSR + slice SC): !$OMP parallel do over particles with private local variables including Fortran pointers (p, slice). Each particle independently computes its interpolated CSR kick, LSC kick, transverse space charge kick (via bbi_kick), and energy/beta update (via convert_pc_to). All called subroutines are thread-safe (read-only shared data, no module-level state).

Correctness

  • csr_and_space_charge_test regression test passes: differences are at the 1e-18 to 1e-20 level (well within the ABS 1e-14 tolerance), caused by floating-point summation ordering in OpenMP reductions.
  • Single-thread (OMP_NUM_THREADS=1) and 10-thread results are bit-for-bit identical.
  • Image charge path is unchanged — full negative-to-positive bin range is preserved when n_shield_images > 0.
  • The particle_overlap_in_bin inlining reproduces the exact same arithmetic as the original contained function.
  • lsc_kick_transverse_dependence path is now parallelized over the outer (kick-slice) loop. The inner DA2 harmonic-mean accumulation is serial per kick slice, but each kick slice writes to its own coef_lsc_plus/coef_lsc_minus, so different slices are independent. The da2_mult and da2_div functions are pure (no module state).
  • FFT convolution introduces floating-point differences at the 1e-17 level in CSR kicks due to different summation ordering (atan-based vs FFT multiply-accumulate).

Results: All Optimizations Combined
Wall time comparison (primary metric with 10 threads)
n_bin	Before (wall)	After (wall)	Speedup
200	0.84s	0.68s	1.24x
500	0.88s	0.68s	1.29x
1000	0.94s	0.72s	1.31x
2000	1.11s	0.83s	1.34x
4000	1.69s	1.25s	1.35x

Changes made to csr_and_space_charge_mod.f90:
Skip negative bins (line ~680): When y_source == 0 (no image charges — the common case), the csr_bin_kicks loop now starts at i=1 instead of i=-n_bin, cutting root-finding calls in half.

OpenMP csr_and_sc_apply_kicks (line ~1370): CSR-only kick application parallelized across particles with !$OMP parallel do.

Contiguous convolution arrays (line ~730): The O(n_bin²) dot_product convolution extracts I_int_csr and edge_dcharge_density_dz into contiguous real(rp) arrays before computing, improving cache locality and enabling SIMD. Big effect at large n_bin.

OpenMP csr_bin_particles (line ~500): Both particle-to-bin loops (charge accumulation + sigma computation) parallelized using thread-private arrays with critical-section reduction. Biggest wall-time impact.
Deduplicated LSC loop body — Parallel loop always runs for kick_lsc. Transverse dependence DA2 accumulation is now a separate serial post-pass that recomputes intermediates. Eliminates ~100 lines of duplicated code.

default(none) on CSR-only parallel do — Compiler will now catch any missing private declarations if the loop body is modified.

Deleted dead particle_overlap_in_bin — The contained function was no longer called after inlining for OpenMP.

Replaced print * with out_io in !$OMP critical — Both the CSR-only and general-case error paths now use the thread-safe out_io inside a critical section.

Added bounds check to CSR-only path — Same r1/i0 validation as the general-case path, consistent error handling.
@ChristopherMayes ChristopherMayes marked this pull request as ready for review March 22, 2026 18:14
@ChristopherMayes ChristopherMayes changed the title CSR optimizations Optimize 1D CSR and slice space charge performance Mar 22, 2026

@MichaelEhrlichman MichaelEhrlichman left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Further performance improvements might be found by checking this code using -fopt-info-vec-optimized (gfortran) or -qopt-report=5 -qopt-report-phase=vec (ifort) to see what's being vectorized. That report might require some sifting, since it would apply to all of Bmad.

Use of where and removal of loops recommended.

Was gprof used to identify hot spots?

!$OMP end do

!$OMP critical
do ib = 1, n_bin

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't this be vectorized, thus removing the loop?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per Claude Opus 4.6:

No, it can't be replaced with a Fortran array statement like csr%slice(1:n_bin)%charge = csr%slice(1:n_bin)%charge + bin_charge(1:n_bin) — at least not usefully. The reason:

The loop is inside !$OMP critical. The purpose of the loop is to merge each thread's private accumulation arrays into the shared struct. The critical section serializes threads so only one writes at a time. The loop body itself is trivial (4 additions per bin, n_bin=2000 → 8000 FLOPs). At ~0.01 microseconds per addition, the entire merge takes ~80μs per thread.

You could write it as array syntax:

csr%slice(1:n_bin)%charge = csr%slice(1:n_bin)%charge + bin_charge(1:n_bin)

But this is strided access (152-byte stride on the LHS) being added to contiguous access (8-byte stride on the RHS). The compiler would generate essentially the same loop — no SIMD vectorization is possible because the destination is strided.

The real optimization lever here would be replacing !$OMP critical with !$OMP atomic on each field, or using OMP reduction arrays — but Fortran doesn't support reductions on struct fields, and atomics on 4 fields × 2000 bins would be slower than the critical section for this bin count.

p_x = particle(i)%vec(1)
p_y = particle(i)%vec(3)
ix0 = nint((zp0 - z_min) / csr%dz_slice)
do j = 0, pbs+1

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps the WHERE construct could be used inplace of the conditionals here, and this loop can be vectorized.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per Claude Opus 4.6:

No, this wouldn't help, for two reasons:

  1. The inner loop is only ~3 iterations long. pbs (particle_bin_span) is typically 2, so j = 0, 1, 2, 3 — only 4 iterations. SIMD registers hold 4 doubles (AVX-256), so you'd need at least 4 iterations just to fill one vector. The overhead of setting up WHERE masks and array temporaries would exceed any benefit.

  2. The conditionals are cheap — it's the scatter-add that blocks vectorization. The real vectorization barrier is:

where ib varies per particle. This is a scatter-add (histogram) pattern — different particles write to different (potentially overlapping) bins. SIMD can't handle this because two particles might update the same bin simultaneously within a vector lane. No amount of WHERE or array syntax changes that fundamental dependency.

You could make the conditionals branchless with merge():

overlap = merge(2 * (...) * inv_dz2, 0.0_rp, z2_over > z1_over)

But the compiler already does this optimization (conditional move instruction) for such simple branches, so it wouldn't change the generated code.

The real performance limiter in this loop is memory latency — each particle reads from csr%slice(ib) (152-byte strided struct) and scatters into 4 different accumulation arrays. The OpenMP parallelization over the outer particle loop is already the correct approach here.

p_x = particle(i)%vec(1)
p_y = particle(i)%vec(3)
ix0 = nint((zp0 - z_min) / csr%dz_slice)
do j = 0, pbs+1

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider use of WHERE and vectorization to remove loop.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per Claude Opus 4.6:

Same issue. The outer do i = 1, size(particle) loop is the one being parallelized over — the inner do j = 0, pbs+1 loop is per-particle with only ~4 iterations and scatter-adds into bin_charge(ib) where ib depends on the particle. WHERE doesn't help because:

It operates on arrays, but j = 0..pbs+1 is a loop over 4 bin indices per particle, not an array operation
Even if you expressed it as array syntax over a 4-element index vector, the scatter-add (bin_charge(ib) += ... for variable ib) can't be vectorized regardless
The inner loop is already as fast as it can get — 4 iterations of arithmetic + indexed writes. The compiler unrolls it.


kick1 => csr%kick1(i)
kick1%dz_particles = i * csr%dz_slice
if (csr%y_source == 0) then

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assess for WHERE and vectorization.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per Claude Opus 4.6:

Each iteration depends on the previous one because:

ix_ele_source chains forward: kick1%ix_ele_source = csr%kick1(i-1)%ix_ele_source — iteration i reads from iteration i-1
dr_match is carried forward: it's updated by s_source_calc and reused as the initial guess for the next bin's root-finding (which makes Brent's method converge faster since adjacent bins have similar source positions)
s_source_calc is a root-finding call (Brent's method, ~10-15 iterations of dspline_len evaluations) — not a simple arithmetic expression that could be vectorized
At n_bin=2000, this loop takes ~0.04s total at 10T (from the profiling data). That's already well under 1% of wall time. The FFT convolution below it already eliminated the O(n_bin²) bottleneck that used to dominate. This serial root-finding loop is the irreducible O(n_bin) cost — there's no way to break the dependency chain without changing the algorithm entirely.

@MichaelEhrlichman

Copy link
Copy Markdown
Contributor

This report from gfortran on vectorization failures might be worth considering.
vec_report.txt

@ChristopherMayes

Copy link
Copy Markdown
Contributor Author

Claude Opus 4.6:

Vectorization Report Analysis

Build flags: -O2 -fopenmp -fopt-info-vec-optimized -fopt-info-vec-missed (gfortran, Production)

Successfully Vectorized (3 loops)

Line Location Description
138 track1_bunch_csr init Trivial memset-like zeroing
786 csr_bin_kicks FFT convolution fft_a = fft_a * fft_bonly perf-relevant vectorized loop
1871 track1_bunch_csr3d init Trivial memset-like zeroing

Missed Vectorizations — By Category

1. Noise (~95% of the output)

The vast majority of "missed" messages are not about computational loops at all:

  • __builtin_malloc/free/memcpy — Every allocate/deallocate of a thread-private array generates "statement clobbers memory" messages (lines 504, 505, 572, 573, 1148, 1149, 1211, 1212, etc.). This is allocation bookkeeping, not loop bodies.
  • _gfortran_runtime_error_at — Every allocatable generates runtime checks ("Attempting to allocate already allocated", "Attempt to DEALLOCATE unallocated"). Pure error-handling code.
  • __builtin_GOMP_barrier/critical_start/critical_end — OpenMP synchronization primitives. Expected and unavoidable.
  • out_io_line12/err_exit/_gfortran_st_write — Error messages and diagnostic I/O.

2. AoS (Array-of-Structs) strided access — the real blocker

The pervasive no vectype for stmt messages like:

no vectype for stmt: _273 = csr_bunch_slice_struct[0:][_272].n_particle
no vectype for stmt: _86 = csr_kick1_struct[0:][_84].i_int_csr
no vectype for stmt: _142 = csr_particle_position_struct[0:][_141].r[0]

gfortran cannot vectorize loads/stores through struct arrays because the fields are at non-unit stride. The PR already works around this in the hot paths — e.g., csr_bin_kicks copies csr%kick1(i)%I_int_csr and csr%slice(i)%edge_dcharge_density_dz into contiguous arrays before the FFT convolution, and lsc_kick_params_calc does the same for the LSC kernel arrays (sx_v, sy_v, charge_v, etc.).

3. Loops that genuinely can't be vectorized (and don't need to be)

Lines Subroutine Why Impact
~510–535, 578–603 csr_bin_particles inner loop do j = 0, pbs+1 is only ~4 iterations with scatter-add via data-dependent ib = j + ix0 Irrelevant — outer !$OMP do over particles is where parallelism lives
716, 734 csr_bin_kicks / s_source_calc Serial dependency chain with function calls (spline1, dspline_len, super_zbrent) Only 0.04s at 10T (<1% wall time)
1510, 1531 csr_and_sc_apply_kicks Opaque function calls (da2_evaluate, convert_pc_to, bbi_kick) Already !$OMP parallel do
~1156–1263 lsc_kick_params_calc inner arrays gfortran sees allocatable descriptors ((*_ptr)[idx]) and can't prove contiguity/no-alias Already !$OMP parallel do over outer loop; inner arrays are ~2000 elements

4. Allocatable descriptor indirection in lsc_kick_params_calc

The whole-array operations on abs_z_v, z1_v, bcd_v, etc. (lines ~1156–1263) are reported as missed because gfortran sees them through allocatable descriptors ((*_ptr)[idx] pattern). These are contiguous arrays, but gfortran can't prove it through the descriptor indirection. Possible mitigations (contiguous attribute, !$GCC ivdep) would yield marginal benefit since the code is already OpenMP-parallelized over the outer loop.

Bottom Line

Nothing actionable. The report is ~95% noise from allocations, runtime checks, and error handling. The real vectorization misses are either:

  1. AoS struct access — already mitigated by copying to contiguous arrays in hot paths
  2. Allocatable descriptor indirection — marginal benefit since already OpenMP-parallelized
  3. Fundamentally serial loops — data dependencies prevent vectorization regardless

The one loop that got vectorized and actually matters — the FFT element-wise product at line 786 (fft_a = fft_a * fft_b) — is the right one.

@MichaelEhrlichman MichaelEhrlichman left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code is sound.  The performance improvement is real.  And there are sufficient unit tests.
My only remaining (optional) suggestion is to post to Bmad Slack channel asking if anybody has recent CSR use cases you can validate on.

@DavidSagan

Copy link
Copy Markdown
Member

@ChristopherMayes I have not had a chance to go over this in any detail but if you are fine with the PR I can merge it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants