Improve speed of fsum #828
Conversation
|
Tested on a MacBook with an M4 Max chip (arm64). The use of multiple accumulators in However, after enabling OpenMP (https://mac.r-project.org/openmp/) by updating the global Makevars to include the flags below, performance decreased by ~20-40% relative to the original |
|
Thanks @TylerSagendorf! You should have tagged me so that I would see this earlier. Just one question before I merge: any reason you did not include the weighted case with missing values? |
|
No problem @SebKrantz. Sorry, I thought that you would be automatically notified when I opened the pull request. I skipped
double fsum_weights_impl(const double *restrict px, const double *restrict pw, const int narm, const int l) {
  double sum, partial_sums[N_ACC] = {0.0};
  int remainder;
  if(narm == 1) {
    int j = 0, end = l-1;
    while((ISNAN(px[j]) || ISNAN(pw[j])) && j!=end) ++j;
    sum = px[j] * pw[j];
    if(j != end) {
      ++j;
      remainder = j + (l - j) % N_ACC;
      for(int i = j; i < remainder; ++i) sum += (NISNAN(px[i]) && NISNAN(pw[i])) ? px[i] * pw[i] : 0.0;
      #pragma omp simd reduction(+:partial_sums[:N_ACC])
      for(int i = remainder; i < l; i += N_ACC) {
        for(int k = 0; k < N_ACC; ++k) partial_sums[k] += (NISNAN(px[i + k]) && NISNAN(pw[i + k])) ? px[i + k] * pw[i + k] : 0.0;
      }
      for(int k = 0; k < N_ACC; ++k) sum += partial_sums[k];
    }
  } else {
    sum = 0;
    remainder = l % N_ACC;
    if(narm) {
      for(int i = 0; i < remainder; ++i) sum += (NISNAN(px[i]) && NISNAN(pw[i])) ? px[i] * pw[i] : 0.0;
      #pragma omp simd reduction(+:partial_sums[:N_ACC])
      for(int i = remainder; i < l; i += N_ACC) {
        for(int k = 0; k < N_ACC; ++k) partial_sums[k] += (NISNAN(px[i + k]) && NISNAN(pw[i + k])) ? px[i + k] * pw[i + k] : 0.0;
      }
    } else {
      // Also here speed is key...
      for(int i = 0; i < remainder; ++i) sum += px[i] * pw[i];
      #pragma omp simd reduction(+:partial_sums[:N_ACC])
      for(int i = remainder; i < l; i += N_ACC) {
        for(int k = 0; k < N_ACC; ++k) partial_sums[k] += px[i + k] * pw[i + k];
      }
    }
    for(int k = 0; k < N_ACC; ++k) sum += partial_sums[k];
  }
  return sum;
}
Also, should I update |
Description
Utilized multiple accumulators in fsum to enhance instruction throughput. This led to a 2x performance improvement in certain cases.
Closes #824. Similar to #826 and #827.
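For readers unfamiliar with the technique, here is a minimal sketch of a multiple-accumulator sum, modeled on the loop structure of the fsum_weights_impl snippet posted in the conversation. The function names sum_single and sum_multi and the value N_ACC = 4 are assumptions made for illustration; they are not taken from the PR's actual diff.

```c
#include <stddef.h>

/* Assumed accumulator count for this sketch; the value used in the PR is not shown here. */
#define N_ACC 4

/* Baseline: a single accumulator, so every addition depends on the previous one. */
double sum_single(const double *px, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) sum += px[i];
    return sum;
}

/* Sketch: N_ACC independent partial sums, so consecutive additions
   do not form one long dependency chain. */
double sum_multi(const double *px, size_t n) {
    double acc[N_ACC] = {0.0};
    double sum = 0.0;
    size_t remainder = n % N_ACC;

    /* Handle the first n % N_ACC elements with a plain loop so that the
       remaining element count is an exact multiple of N_ACC. */
    for (size_t i = 0; i < remainder; ++i) sum += px[i];

    /* Main loop: each accumulator takes every N_ACC-th element. */
    for (size_t i = remainder; i < n; i += N_ACC) {
        for (size_t k = 0; k < N_ACC; ++k) acc[k] += px[i + k];
    }

    /* Combine the partial sums at the end. */
    for (size_t k = 0; k < N_ACC; ++k) sum += acc[k];
    return sum;
}
```

Because the additions are performed in a different order, sum_multi can differ from sum_single in the last bits for general floating-point input, even though both compute the same mathematical sum.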
Main Changes
fsum by using multiple accumulators
Checklist
Additional Context
I tried to modify the integer-based sums and the grouped sums, but their loop bodies were too complex to benefit from this technique, so I reverted my changes. Similarly, there was no improvement in the speed of the weighted sums when na.rm = TRUE.
Benchmarks were performed on an AMD Ryzen 5 7600X CPU (x86_64).
Only showing iterations/second for comparison purposes: