Improve speed of fsum by TylerSagendorf · Pull Request #828 · fastverse/collapse

TylerSagendorf · 2026-05-06T03:35:48Z

Description

Utilized multiple accumulators in fsum to enhance instruction throughput. This led to a 2x performance improvement in certain cases.

Closes #824. Similar to #826 and #827.

Main Changes

Improved speed of fsum by using multiple accumulators

Checklist

I have performed a self-review of my code.
I have commented on my code, particularly in hard-to-understand areas.
I have updated the documentation where applicable.

Additional Context

I tried to modify the integer-based sums and the grouped sums, but their loop bodies were too complex to benefit from this technique, so I reverted my changes. Similarly, there was no improvement in the speed of the weighted sums when na.rm = TRUE.

Benchmarks were performed on an AMD Ryzen 5 7600X CPU (x86_64).

library(collapse)

n <- 1e6L

set.seed(0L)
x <- rnorm(n)
x_int <- sample.int(1e3L, size = n, replace = TRUE)
w <- runif(n)

bench::mark(
  # Doubles
  sum(x, na.rm = TRUE),
  sum(x, na.rm = FALSE),
  fsum(x),
  fsum(x, na.rm = FALSE),
  fsum(x, nthreads = 4),
  fsum(x, na.rm = FALSE, nthreads = 4),
  # Doubles, weighted
  fsum(x, w = w),
  fsum(x, w = w, na.rm = FALSE),
  fsum(x, w = w, nthreads = 4),
  fsum(x, w = w, na.rm = FALSE, nthreads = 4),
  # Integers
  sum(x_int, na.rm = TRUE),
  sum(x_int, na.rm = FALSE),
  fsum(x_int),
  fsum(x_int, na.rm = FALSE),
  fsum(x_int, nthreads = 4),
  fsum(x_int, na.rm = FALSE, nthreads = 4),
  iterations = 1e3L,
  check = FALSE
)

Only showing iterations/second for comparison purposes:

                                                Before  After Improvement
 1 sum(x, na.rm = TRUE)                            738.     -
 2 sum(x, na.rm = FALSE)                           700.     -
 3 fsum(x)                                        3070.  5942.      1.94x
 4 fsum(x, na.rm = FALSE)                         3074.  6070.      1.97x
 5 fsum(x, nthreads = 4)                         11570. 21874.      1.89x
 6 fsum(x, na.rm = FALSE, nthreads = 4)          11764. 22379.      1.90x
 7 fsum(x, w = w)                                 1543.     -
 8 fsum(x, w = w, na.rm = FALSE)                  3046.  5883.      1.93x
 9 fsum(x, w = w, nthreads = 4)                   5995.     -
10 fsum(x, w = w, na.rm = FALSE, nthreads = 4)   11695. 22041.      1.88x
11 sum(x_int, na.rm = TRUE)                       2331.     -
12 sum(x_int, na.rm = FALSE)                      2333.     -
13 fsum(x_int)                                    2318.     -
14 fsum(x_int, na.rm = FALSE)                     2325.     -
15 fsum(x_int, nthreads = 4)                     18980.     -
16 fsum(x_int, na.rm = FALSE, nthreads = 4)      32257.     -

TylerSagendorf · 2026-05-08T04:30:09Z

Tested on a Macbook with an M4 Max chip (arm64). The use of multiple accumulators in fsum increased performance (iterations/second) by a factor of 1.77 to 3.84.

                                            Original   New
1 fsum(x)                                      2008.  4930.  2.46x
2 fsum(x, na.rm = FALSE)                       2104.  8077.  3.84x
3 fsum(x, nthreads = 4L)                       2006.  4856.  2.42x
4 fsum(x, na.rm = FALSE, nthreads = 4L)        2111.  7856.  3.72x
5 fsum(x, w = w, na.rm = FALSE)                2031.  3606.  1.78x
6 fsum(x, w = w, na.rm = FALSE, nthreads = 4L) 2028.  3585.  1.77x

However, after enabling OpenMP (https://mac.r-project.org/openmp/) by updating the global Makevars to include the flags below, performance decreased by ~20-40% relative to the original fsum code.

CPPFLAGS += -Xclang -fopenmp
LDFLAGS += -lomp -fexperimental-library

                                             Original    New
1 fsum(x)                                      10458.   6422.  0.61x
2 fsum(x, na.rm = FALSE)                       13098.  10775.  0.82x
3 fsum(x, nthreads = 4L)                       23276.  16448.  0.71x
4 fsum(x, na.rm = FALSE, nthreads = 4L)        26826.  23375.  0.87x
5 fsum(x, w = w, na.rm = FALSE)                 7513.   5399.  0.72x
6 fsum(x, w = w, na.rm = FALSE, nthreads = 4L) 18392.  15335.  0.83x

SebKrantz · 2026-05-12T20:04:13Z

Thanks @TylerSagendorf! You should have tagged me to see this earlier. Just one question before I merge: Any reason you did no include the weighted case with missing values?

TylerSagendorf · 2026-05-13T03:31:04Z

No problem @SebKrantz. Sorry, I thought that you would be automatically notified when I opened the pull request. I skipped fsum(x, w = w, na.rm = TRUE) because the runtime didn't improve. Although fsum(x, w = w, na.rm = FALSE) was ~2x faster with multiple accumulators, the na.rm = TRUE version has to do about twice as much work by checking for nonmissing values and weights, so that cancels out the speed up. Here is a version of fsum_weights_impl that uses multiple accumulators if you want to test it out:

double fsum_weights_impl(const double *restrict px, const double *restrict pw, const int narm, const int l) {
  double sum, partial_sums[N_ACC] = {0.0};
  int remainder;
  if(narm == 1) {
    int j = 0, end = l-1;
    while((ISNAN(px[j]) || ISNAN(pw[j])) && j!=end) ++j;
    sum = px[j] * pw[j];
    if(j != end) {
      ++j;
      remainder = j + (l - j) % N_ACC;
      for(int i = j; i < remainder; ++i) sum += (NISNAN(px[i]) && NISNAN(pw[i])) ? px[i] * pw[i] : 0.0;
      #pragma omp simd reduction(+:partial_sums[:N_ACC])
      for(int i = remainder; i < l; i += N_ACC) {
        for(int k = 0; k < N_ACC; ++k) partial_sums[k] += (NISNAN(px[i + k]) && NISNAN(pw[i + k])) ? px[i + k] * pw[i + k] : 0.0;
      }
      for(int k = 0; k < N_ACC; ++k) sum += partial_sums[k];
    }
  } else {
    sum = 0;
    remainder = l % N_ACC;
    if(narm) {
      for(int i = 0; i < remainder; ++i) sum += (NISNAN(px[i]) && NISNAN(pw[i])) ? px[i] * pw[i] : 0.0;
      #pragma omp simd reduction(+:partial_sums[:N_ACC])
      for(int i = remainder; i < l; i += N_ACC) {
        for(int k = 0; k < N_ACC; ++k) partial_sums[k] += (NISNAN(px[i + k]) && NISNAN(pw[i + k])) ? px[i + k] * pw[i + k] : 0.0;
      }
    } else {
      // Also here speed is key...
      for(int i = 0; i < remainder; ++i) sum += px[i] * pw[i];
      #pragma omp simd reduction(+:partial_sums[:N_ACC])
      for(int i = remainder; i < l; i += N_ACC) {
        for(int k = 0; k < N_ACC; ++k) partial_sums[k] += px[i + k] * pw[i + k];
      }
    }
    for(int k = 0; k < N_ACC; ++k) sum += partial_sums[k];
  }
  return sum;
}

Also, should I update fmean in this PR, as well?

Speed up fsum with multiple accumulators.

064248c

SebKrantz merged commit be973cf into fastverse:development May 13, 2026

This was referenced May 13, 2026

[FEAT] Improve speed of fsum when OpenMP support is unavailable (macOS) #824

Closed

perf: apply multiple-accumulator SIMD optimization to fmean.c (#824) #829

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve speed of fsum#828

Improve speed of fsum#828
SebKrantz merged 1 commit into
fastverse:developmentfrom
TylerSagendorf:development

TylerSagendorf commented May 6, 2026

Uh oh!

TylerSagendorf commented May 8, 2026

Uh oh!

SebKrantz commented May 12, 2026

Uh oh!

TylerSagendorf commented May 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

TylerSagendorf commented May 6, 2026

Description

Main Changes

Checklist

Additional Context

Uh oh!

TylerSagendorf commented May 8, 2026

Uh oh!

SebKrantz commented May 12, 2026

Uh oh!

TylerSagendorf commented May 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants