Skip to content

Improve speed of fsum#828

Merged
SebKrantz merged 1 commit into
fastverse:developmentfrom
TylerSagendorf:development
May 13, 2026
Merged

Improve speed of fsum#828
SebKrantz merged 1 commit into
fastverse:developmentfrom
TylerSagendorf:development

Conversation

@TylerSagendorf
Copy link
Copy Markdown
Contributor

Description

Utilized multiple accumulators in fsum to enhance instruction throughput. This led to a 2x performance improvement in certain cases.

Closes #824. Similar to #826 and #827.

Main Changes

  • Improved speed of fsum by using multiple accumulators

Checklist

  • I have performed a self-review of my code.
  • I have commented on my code, particularly in hard-to-understand areas.
  • I have updated the documentation where applicable.

Additional Context

I tried to modify the integer-based sums and the grouped sums, but their loop bodies were too complex to benefit from this technique, so I reverted my changes. Similarly, there was no improvement in the speed of the weighted sums when na.rm = TRUE.

Benchmarks were performed on an AMD Ryzen 5 7600X CPU (x86_64).

library(collapse)

n <- 1e6L

set.seed(0L)
x <- rnorm(n)
x_int <- sample.int(1e3L, size = n, replace = TRUE)
w <- runif(n)

bench::mark(
  # Doubles
  sum(x, na.rm = TRUE),
  sum(x, na.rm = FALSE),
  fsum(x),
  fsum(x, na.rm = FALSE),
  fsum(x, nthreads = 4),
  fsum(x, na.rm = FALSE, nthreads = 4),
  # Doubles, weighted
  fsum(x, w = w),
  fsum(x, w = w, na.rm = FALSE),
  fsum(x, w = w, nthreads = 4),
  fsum(x, w = w, na.rm = FALSE, nthreads = 4),
  # Integers
  sum(x_int, na.rm = TRUE),
  sum(x_int, na.rm = FALSE),
  fsum(x_int),
  fsum(x_int, na.rm = FALSE),
  fsum(x_int, nthreads = 4),
  fsum(x_int, na.rm = FALSE, nthreads = 4),
  iterations = 1e3L,
  check = FALSE
)

Only showing iterations/second for comparison purposes:

                                                Before  After Improvement
 1 sum(x, na.rm = TRUE)                            738.     -
 2 sum(x, na.rm = FALSE)                           700.     -
 3 fsum(x)                                        3070.  5942.      1.94x
 4 fsum(x, na.rm = FALSE)                         3074.  6070.      1.97x
 5 fsum(x, nthreads = 4)                         11570. 21874.      1.89x
 6 fsum(x, na.rm = FALSE, nthreads = 4)          11764. 22379.      1.90x
 7 fsum(x, w = w)                                 1543.     -
 8 fsum(x, w = w, na.rm = FALSE)                  3046.  5883.      1.93x
 9 fsum(x, w = w, nthreads = 4)                   5995.     -
10 fsum(x, w = w, na.rm = FALSE, nthreads = 4)   11695. 22041.      1.88x
11 sum(x_int, na.rm = TRUE)                       2331.     -
12 sum(x_int, na.rm = FALSE)                      2333.     -
13 fsum(x_int)                                    2318.     -
14 fsum(x_int, na.rm = FALSE)                     2325.     -
15 fsum(x_int, nthreads = 4)                     18980.     -
16 fsum(x_int, na.rm = FALSE, nthreads = 4)      32257.     -

@TylerSagendorf
Copy link
Copy Markdown
Contributor Author

Tested on a Macbook with an M4 Max chip (arm64). The use of multiple accumulators in fsum increased performance (iterations/second) by a factor of 1.77 to 3.84.

                                            Original   New
1 fsum(x)                                      2008.  4930.  2.46x
2 fsum(x, na.rm = FALSE)                       2104.  8077.  3.84x
3 fsum(x, nthreads = 4L)                       2006.  4856.  2.42x
4 fsum(x, na.rm = FALSE, nthreads = 4L)        2111.  7856.  3.72x
5 fsum(x, w = w, na.rm = FALSE)                2031.  3606.  1.78x
6 fsum(x, w = w, na.rm = FALSE, nthreads = 4L) 2028.  3585.  1.77x

However, after enabling OpenMP (https://mac.r-project.org/openmp/) by updating the global Makevars to include the flags below, performance decreased by ~20-40% relative to the original fsum code.

CPPFLAGS += -Xclang -fopenmp
LDFLAGS += -lomp -fexperimental-library
                                             Original    New
1 fsum(x)                                      10458.   6422.  0.61x
2 fsum(x, na.rm = FALSE)                       13098.  10775.  0.82x
3 fsum(x, nthreads = 4L)                       23276.  16448.  0.71x
4 fsum(x, na.rm = FALSE, nthreads = 4L)        26826.  23375.  0.87x
5 fsum(x, w = w, na.rm = FALSE)                 7513.   5399.  0.72x
6 fsum(x, w = w, na.rm = FALSE, nthreads = 4L) 18392.  15335.  0.83x

@SebKrantz
Copy link
Copy Markdown
Member

Thanks @TylerSagendorf! You should have tagged me to see this earlier. Just one question before I merge: Any reason you did no include the weighted case with missing values?

@TylerSagendorf
Copy link
Copy Markdown
Contributor Author

No problem @SebKrantz. Sorry, I thought that you would be automatically notified when I opened the pull request. I skipped fsum(x, w = w, na.rm = TRUE) because the runtime didn't improve. Although fsum(x, w = w, na.rm = FALSE) was ~2x faster with multiple accumulators, the na.rm = TRUE version has to do about twice as much work by checking for nonmissing values and weights, so that cancels out the speed up. Here is a version of fsum_weights_impl that uses multiple accumulators if you want to test it out:

double fsum_weights_impl(const double *restrict px, const double *restrict pw, const int narm, const int l) {
  double sum, partial_sums[N_ACC] = {0.0};
  int remainder;
  if(narm == 1) {
    int j = 0, end = l-1;
    while((ISNAN(px[j]) || ISNAN(pw[j])) && j!=end) ++j;
    sum = px[j] * pw[j];
    if(j != end) {
      ++j;
      remainder = j + (l - j) % N_ACC;
      for(int i = j; i < remainder; ++i) sum += (NISNAN(px[i]) && NISNAN(pw[i])) ? px[i] * pw[i] : 0.0;
      #pragma omp simd reduction(+:partial_sums[:N_ACC])
      for(int i = remainder; i < l; i += N_ACC) {
        for(int k = 0; k < N_ACC; ++k) partial_sums[k] += (NISNAN(px[i + k]) && NISNAN(pw[i + k])) ? px[i + k] * pw[i + k] : 0.0;
      }
      for(int k = 0; k < N_ACC; ++k) sum += partial_sums[k];
    }
  } else {
    sum = 0;
    remainder = l % N_ACC;
    if(narm) {
      for(int i = 0; i < remainder; ++i) sum += (NISNAN(px[i]) && NISNAN(pw[i])) ? px[i] * pw[i] : 0.0;
      #pragma omp simd reduction(+:partial_sums[:N_ACC])
      for(int i = remainder; i < l; i += N_ACC) {
        for(int k = 0; k < N_ACC; ++k) partial_sums[k] += (NISNAN(px[i + k]) && NISNAN(pw[i + k])) ? px[i + k] * pw[i + k] : 0.0;
      }
    } else {
      // Also here speed is key...
      for(int i = 0; i < remainder; ++i) sum += px[i] * pw[i];
      #pragma omp simd reduction(+:partial_sums[:N_ACC])
      for(int i = remainder; i < l; i += N_ACC) {
        for(int k = 0; k < N_ACC; ++k) partial_sums[k] += px[i + k] * pw[i + k];
      }
    }
    for(int k = 0; k < N_ACC; ++k) sum += partial_sums[k];
  }
  return sum;
}

Also, should I update fmean in this PR, as well?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants