Bandwidth and latency
Before getting into details, here is some benchmark data from a development system with 3 low-end, old graphics cards (PCIe v2, x4 + x8 + x4 lanes) and 8 CPU cores at 2.1 GHz (FX-8150). The source code is at the end of this page:
- random write bandwidth (object size = 512kB, 8 threads): 4233 MB/s
- random read bandwidth (object size = 512kB, 63 threads): 6055 MB/s
- random write latency (object size = 8 bytes, 64 threads): 25697 nanoseconds
- random read latency (object size = 8 bytes, 64 threads): 17765 nanoseconds
Not all object types have the same read/write efficiency, not all objects have the same size, and not all algorithms have the same access pattern, so there is always a trade-off between bandwidth and latency, which can be optimized by tuning the parameters of the virtual array.
A sequential access pattern has some locality after the first access. Once any element of a page is fetched or written, the whole page of elements is cached in RAM (and in the L1/L2/L3 caches if the page/object is small enough). The following elements are accessed at the cost of only a page lock, which is held while the page index, element index, graphics card index, etc. are computed until the relevant data is reached, all in constant time. PCIe data copies are also constant-time, but the system topology causes different pages to have different latencies/bandwidths. To keep this per-page performance difference as local as possible, sequential page indices are interleaved across all graphics cards: the first page is in the first card, the second page is in the second card, and so on. Since there are multiple OpenCL channels per card, the interleaving repeats over a longer cycle. If there are cards a, b, c, each with 2 data channels, the cycle is: a1 b1 c1 a2 b2 c2 a1 b1 c1 a2 b2 c2 ... so it takes 3 pages to reach the same graphics card again and 6 pages to reach the same channel of that card again. This enables high-volume threading for I/O and spreads the access-time differences over the whole virtual array instead of causing a slow-down in one part of the array.
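The a1 b1 c1 a2 b2 c2 cycle described above can be sketched as a small index calculation. This is only an illustration: `PageLocation` and `locatePage` are hypothetical names that are not part of the library, and the sketch assumes that every card has the same number of channels, which the real array does not require.

```cpp
#include <cstddef>

// Hypothetical helper type for illustration; not part of the library's API.
struct PageLocation {
    std::size_t card;     // 0-based card index (a = 0, b = 1, c = 2)
    std::size_t channel;  // 0-based channel index within that card
};

// Maps a page index to a card and channel for the round-robin order
// a1 b1 c1 a2 b2 c2 a1 ...
// Assumption: every card has the same number of channels, and
// numCards and channelsPerCard are both greater than zero.
constexpr PageLocation locatePage(std::size_t pageIndex,
                                  std::size_t numCards,
                                  std::size_t channelsPerCard) {
    return PageLocation{ pageIndex % numCards,
                         (pageIndex / numCards) % channelsPerCard };
}
```

With 3 cards and 2 channels per card, page 3 maps to the second channel of the first card (a2) and page 6 wraps around to a1 again.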
Caching is done at page level, and a larger page size means lower element access latency on average. But in multi-threaded access, page locking causes heavy contention and performance decreases. To overcome this, each thread should access a different region of the array instead of the same page. For example, thread 1 can access the first page while thread 2 accesses the second page concurrently. If more concurrent accesses are needed, the memMult parameter of the virtual array constructor needs higher values. For example, if memMult contains {10,10,20}, then 40 pages can be accessed/updated concurrently.
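One way to keep threads off each other's pages is to give every thread a contiguous block of whole pages. The sketch below is not taken from the library; `Range` and `threadRange` are hypothetical names, and it only computes the index range a thread would then pass to its own get/set calls.

```cpp
// Hypothetical helper type for illustration; not part of the library's API.
struct Range {
    long long begin;  // first element index owned by the thread
    long long end;    // one past the last element index owned by the thread
};

// Splits n elements among numThreads threads so that each thread's range
// starts on a page boundary and no page is shared by two threads.
// Assumption: pageSize > 0, numThreads > 0, 0 <= t < numThreads.
constexpr Range threadRange(long long n, long long pageSize,
                            int numThreads, int t) {
    const long long numPages = (n + pageSize - 1) / pageSize;
    const long long pagesPerThread = (numPages + numThreads - 1) / numThreads;
    const long long b = t * pagesPerThread * pageSize;
    const long long e = (t + 1) * pagesPerThread * pageSize;
    // Clamp to n: trailing threads may get a shorter or empty range.
    return Range{ b < n ? b : n, e < n ? e : n };
}
```

For 1000 elements with 100 elements per page and 4 threads, the threads get [0,300), [300,600), [600,900) and [900,1000).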
Similar to scalar access, bulk access consists of PCIe data transfers and page locking, but it performs many of them at once, which greatly reduces the average latency. In the tiled nbody example, the average latency per x,y,z position float is 0.08 nanoseconds. This is orders of magnitude faster than the scalar get/set methods. Since each bulk access is still single-threaded, it scales to multiple threads and can hide I/O latencies behind math operations in a many-threaded application.
Sequential access can achieve both moderate bandwidth and moderate latency regardless of object/element size, but it needs proper tuning of the page size.
A random access pattern is the weak point of this virtual array's caching. If random indexing does not land on close-enough elements, then paging with more than 1 element per page causes latency problems because unnecessary elements are fetched from the cards. But random access is a strong point of VRAM (video memory): a card can access any element within its VRAM in constant time regardless of the element's position. So random access is not much of a problem if the object (element type) has enough bytes to make up for the minimum PCIe access latency.
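The cost of fetching unnecessary elements can be put into numbers with a short sketch. It assumes the worst case for random access, where no other element of a fetched page is used before the page is replaced; both function names are made up for this illustration.

```cpp
#include <cstddef>

// Bytes moved over PCIe when a whole page is fetched to serve one element.
constexpr std::size_t bytesFetchedPerMiss(std::size_t elementsPerPage,
                                          std::size_t elementBytes) {
    return elementsPerPage * elementBytes;
}

// Fraction of the fetched bytes that was actually requested.
// Assumption: only one element of the page is used before eviction.
constexpr double usefulFraction(std::size_t elementsPerPage) {
    return 1.0 / static_cast<double>(elementsPerPage);
}
```

With 1024 elements of 8 bytes per page, every random miss moves 8192 bytes to deliver 8 of them, while a page size of 1 moves only the requested element.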
Random access cannot have both high bandwidth and low latency at the same time; object size decides the performance. Single char elements are randomly accessed at the minimum PCIe latency (several microseconds or more) and 1-2 MB/s bandwidth. Objects of 512kB are accessed with 5-10 GB/s bandwidth and 100-150 microseconds latency. Randomly accessing 4kB objects (as in HDD benchmarking) is very close to the low-latency side (~16 microseconds and 260 MB/s on the development computer). Usually the PCIe bandwidth-efficiency gain diminishes at 256kB page size, so there is no need to go bigger than this except for benchmarking.
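The relation between object size, latency and bandwidth for a single stream of accesses is plain arithmetic, sketched below. The function name is hypothetical, and the model ignores concurrency: several threads accessing different pages raise the aggregate bandwidth above this per-stream figure.

```cpp
// Bandwidth of one stream of back-to-back accesses.
// Bytes per microsecond equals MB/s when 1 MB = 10^6 bytes.
constexpr double streamBandwidthMBps(double objectBytes,
                                     double latencyMicroseconds) {
    return objectBytes / latencyMicroseconds;
}

// 4kB objects at ~16 microseconds per access, as quoted above.
constexpr double fourKBExample = streamBandwidthMBps(4096.0, 16.0);  // 256 MB/s
```

The 256 MB/s result is consistent with the roughly 260 MB/s measured for 4kB objects on the development computer.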
Random bulk access is not very different from the scalar version except for the reduced lock latency and increased bandwidth. Since it does multiple data copies, it works better with page size > 1. With page size > 1, it is a combination of sequential access and random access and has good performance, but not as good as sequential bulk access. In the sequential version, a thread may update a cache and the next thread can directly benefit from that cache, especially when bulk access regions overlap.
If a virtual array is needed as a read-only data source for both sequential and random access patterns concurrently, it can be duplicated with different settings: one copy with M elements per page (sequential access) and another with only 1 element per page (random access).
Source code for random-access max bandwidth & min latency test:
#include "GraphicsCardSupplyDepot.h"
#include "VirtualMultiArray.h"
#include "PcieBandwidthBenchmarker.h"
#include "CpuBenchmarker.h"

// testing
#include <iostream>
#include <random>
#include <stdexcept>
#include <string>
#include <omp.h>

// selects the test: large objects for bandwidth, tiny objects for latency
constexpr bool TEST_BANDWIDTH=true;
constexpr bool TEST_LATENCY=false;
constexpr bool testType = TEST_BANDWIDTH;

class Object
{
public:
    Object():id(-1){}
    Object(int p):id(p){}
    int getId() const {return id;}
private:
    // payload: 512kB per object (with id) in the bandwidth test, 8 bytes in the latency test
    char data[testType?(1024*512 - 4):(4)];
    int id;
};

int main()
{
    const long long pageSize = 1; // 1 element per page: best case for random access
    const long long n = pageSize*(testType?1000:100000);
    const int numTestsPerThread = 25;

    VirtualMultiArray<Object> test(n,GraphicsCardSupplyDepot().requestGpus(),pageSize,3,PcieBandwidthBenchmarker().bestBandwidth(10));

    // initialize every element so that the get test can verify ids
    #pragma omp parallel for
    for(long long j=0;j<n;j++)
    {
        test.set(j,Object(static_cast<int>(j)));
    }

    // repeat both tests with 1 to 64 threads
    for(int i=1;i<=64;i++)
    {
        {
            // prints elapsed time, bandwidth and time per iteration when it goes out of scope
            CpuBenchmarker bench(i*numTestsPerThread*sizeof(Object),std::string("scalar set, ")+std::to_string(i)+std::string("threads"),i*numTestsPerThread);
            #pragma omp parallel for num_threads(i)
            for(long long j=0;j<i;j++)
            {
                std::random_device rd;
                std::mt19937 rng(rd());
                // integer distribution over all valid indices [0, n-1]
                std::uniform_int_distribution<long long> rnd(0,n-1);
                for(int k=0;k<numTestsPerThread;k++)
                {
                    const int rndv = static_cast<int>(rnd(rng));
                    test.set(rndv,Object(rndv));
                }
            }
        }

        {
            CpuBenchmarker bench(i*numTestsPerThread*sizeof(Object),std::string("scalar get, ")+std::to_string(i)+std::string("threads"),i*numTestsPerThread);
            #pragma omp parallel for num_threads(i)
            for(long long j=0;j<i;j++)
            {
                std::random_device rd;
                std::mt19937 rng(rd());
                std::uniform_int_distribution<long long> rnd(0,n-1);
                for(int k=0;k<numTestsPerThread;k++)
                {
                    const int rndv = static_cast<int>(rnd(rng));
                    const auto obj = test.get(rndv);
                    if(obj.getId()!=rndv)
                    {
                        // not caught inside the parallel region, so a mismatch terminates the program
                        throw std::invalid_argument("Error: set/get");
                    }
                }
            }
        }
        std::cout<<"==================================================================="<<std::endl;
    }
    return 0;
}
output:
(The CPU has 8 logical cores, but the benchmark code does no math, so I/O latencies can be hidden by using many more threads.)
scalar set, 1threads: 12182103 nanoseconds (bandwidth = 1075.94 MB/s) (throughput = 487284.12 nanoseconds per iteration)
scalar get, 1threads: 10874521 nanoseconds (bandwidth = 1205.31 MB/s) (throughput = 434980.84 nanoseconds per iteration)
===================================================================
scalar set, 2threads: 15642385 nanoseconds (bandwidth = 1675.86 MB/s) (throughput = 312847.70 nanoseconds per iteration)
scalar get, 2threads: 16902450 nanoseconds (bandwidth = 1550.92 MB/s) (throughput = 338049.00 nanoseconds per iteration)
===================================================================
scalar set, 3threads: 15004388 nanoseconds (bandwidth = 2620.67 MB/s) (throughput = 200058.51 nanoseconds per iteration)
scalar get, 3threads: 16687201 nanoseconds (bandwidth = 2356.39 MB/s) (throughput = 222496.01 nanoseconds per iteration)
===================================================================
scalar set, 4threads: 15365448 nanoseconds (bandwidth = 3412.12 MB/s) (throughput = 153654.48 nanoseconds per iteration)
scalar get, 4threads: 16627608 nanoseconds (bandwidth = 3153.12 MB/s) (throughput = 166276.08 nanoseconds per iteration)
===================================================================
scalar set, 5threads: 17260459 nanoseconds (bandwidth = 3796.89 MB/s) (throughput = 138083.67 nanoseconds per iteration)
scalar get, 5threads: 19474227 nanoseconds (bandwidth = 3365.27 MB/s) (throughput = 155793.82 nanoseconds per iteration)
===================================================================
scalar set, 6threads: 20665345 nanoseconds (bandwidth = 3805.56 MB/s) (throughput = 137768.97 nanoseconds per iteration)
scalar get, 6threads: 18258798 nanoseconds (bandwidth = 4307.14 MB/s) (throughput = 121725.32 nanoseconds per iteration)
===================================================================
scalar set, 7threads: 27059813 nanoseconds (bandwidth = 3390.65 MB/s) (throughput = 154627.50 nanoseconds per iteration)
scalar get, 7threads: 25572897 nanoseconds (bandwidth = 3587.80 MB/s) (throughput = 146130.84 nanoseconds per iteration)
===================================================================
scalar set, 8threads: 24771182 nanoseconds (bandwidth = 4233.05 MB/s) (throughput = 123855.91 nanoseconds per iteration)
scalar get, 8threads: 26381672 nanoseconds (bandwidth = 3974.64 MB/s) (throughput = 131908.36 nanoseconds per iteration)
===================================================================
scalar set, 9threads: 32832302 nanoseconds (bandwidth = 3592.95 MB/s) (throughput = 145921.34 nanoseconds per iteration)
scalar get, 9threads: 28048196 nanoseconds (bandwidth = 4205.79 MB/s) (throughput = 124658.65 nanoseconds per iteration)
===================================================================
scalar set, 10threads: 31755705 nanoseconds (bandwidth = 4127.51 MB/s) (throughput = 127022.82 nanoseconds per iteration)
scalar get, 10threads: 27368949 nanoseconds (bandwidth = 4789.08 MB/s) (throughput = 109475.80 nanoseconds per iteration)
===================================================================
scalar set, 11threads: 36046360 nanoseconds (bandwidth = 3999.83 MB/s) (throughput = 131077.67 nanoseconds per iteration)
scalar get, 11threads: 32916058 nanoseconds (bandwidth = 4380.21 MB/s) (throughput = 119694.76 nanoseconds per iteration)
===================================================================
scalar set, 12threads: 39184425 nanoseconds (bandwidth = 4014.00 MB/s) (throughput = 130614.75 nanoseconds per iteration)
scalar get, 12threads: 34810264 nanoseconds (bandwidth = 4518.39 MB/s) (throughput = 116034.21 nanoseconds per iteration)
===================================================================
scalar set, 13threads: 39739881 nanoseconds (bandwidth = 4287.72 MB/s) (throughput = 122276.56 nanoseconds per iteration)
scalar get, 13threads: 36382637 nanoseconds (bandwidth = 4683.38 MB/s) (throughput = 111946.58 nanoseconds per iteration)
===================================================================
scalar set, 14threads: 41955657 nanoseconds (bandwidth = 4373.68 MB/s) (throughput = 119873.31 nanoseconds per iteration)
scalar get, 14threads: 42688385 nanoseconds (bandwidth = 4298.61 MB/s) (throughput = 121966.81 nanoseconds per iteration)
===================================================================
scalar set, 15threads: 49674598 nanoseconds (bandwidth = 3957.92 MB/s) (throughput = 132465.59 nanoseconds per iteration)
scalar get, 15threads: 37840712 nanoseconds (bandwidth = 5195.67 MB/s) (throughput = 100908.57 nanoseconds per iteration)
===================================================================
scalar set, 16threads: 49234693 nanoseconds (bandwidth = 4259.50 MB/s) (throughput = 123086.73 nanoseconds per iteration)
scalar get, 16threads: 41954226 nanoseconds (bandwidth = 4998.67 MB/s) (throughput = 104885.57 nanoseconds per iteration)
===================================================================
scalar set, 17threads: 58050902 nanoseconds (bandwidth = 3838.40 MB/s) (throughput = 136590.36 nanoseconds per iteration)
scalar get, 17threads: 44725579 nanoseconds (bandwidth = 4981.99 MB/s) (throughput = 105236.66 nanoseconds per iteration)
===================================================================
scalar set, 18threads: 62093088 nanoseconds (bandwidth = 3799.61 MB/s) (throughput = 137984.64 nanoseconds per iteration)
scalar get, 18threads: 47545793 nanoseconds (bandwidth = 4962.16 MB/s) (throughput = 105657.32 nanoseconds per iteration)
===================================================================
scalar set, 19threads: 66391697 nanoseconds (bandwidth = 3751.02 MB/s) (throughput = 139771.99 nanoseconds per iteration)
scalar get, 19threads: 46291924 nanoseconds (bandwidth = 5379.70 MB/s) (throughput = 97456.68 nanoseconds per iteration)
===================================================================
scalar set, 20threads: 61831319 nanoseconds (bandwidth = 4239.66 MB/s) (throughput = 123662.64 nanoseconds per iteration)
scalar get, 20threads: 53938296 nanoseconds (bandwidth = 4860.07 MB/s) (throughput = 107876.59 nanoseconds per iteration)
===================================================================
scalar set, 21threads: 65719846 nanoseconds (bandwidth = 4188.25 MB/s) (throughput = 125180.66 nanoseconds per iteration)
scalar get, 21threads: 53187908 nanoseconds (bandwidth = 5175.07 MB/s) (throughput = 101310.30 nanoseconds per iteration)
===================================================================
scalar set, 22threads: 74213840 nanoseconds (bandwidth = 3885.51 MB/s) (throughput = 134934.25 nanoseconds per iteration)
scalar get, 22threads: 59970725 nanoseconds (bandwidth = 4808.32 MB/s) (throughput = 109037.68 nanoseconds per iteration)
===================================================================
scalar set, 23threads: 80761504 nanoseconds (bandwidth = 3732.79 MB/s) (throughput = 140454.79 nanoseconds per iteration)
scalar get, 23threads: 67459480 nanoseconds (bandwidth = 4468.84 MB/s) (throughput = 117320.83 nanoseconds per iteration)
===================================================================
scalar set, 24threads: 83123313 nanoseconds (bandwidth = 3784.41 MB/s) (throughput = 138538.86 nanoseconds per iteration)
scalar get, 24threads: 58627497 nanoseconds (bandwidth = 5365.62 MB/s) (throughput = 97712.49 nanoseconds per iteration)
===================================================================
scalar set, 25threads: 89617121 nanoseconds (bandwidth = 3656.44 MB/s) (throughput = 143387.39 nanoseconds per iteration)
scalar get, 25threads: 64073021 nanoseconds (bandwidth = 5114.16 MB/s) (throughput = 102516.83 nanoseconds per iteration)
===================================================================
scalar set, 26threads: 94729104 nanoseconds (bandwidth = 3597.49 MB/s) (throughput = 145737.08 nanoseconds per iteration)
scalar get, 26threads: 64473181 nanoseconds (bandwidth = 5285.72 MB/s) (throughput = 99189.51 nanoseconds per iteration)
===================================================================
scalar set, 27threads: 88653671 nanoseconds (bandwidth = 3991.88 MB/s) (throughput = 131338.77 nanoseconds per iteration)
scalar get, 27threads: 65464920 nanoseconds (bandwidth = 5405.86 MB/s) (throughput = 96985.07 nanoseconds per iteration)
===================================================================
scalar set, 28threads: 106957871 nanoseconds (bandwidth = 3431.27 MB/s) (throughput = 152796.96 nanoseconds per iteration)
scalar get, 28threads: 68406113 nanoseconds (bandwidth = 5365.04 MB/s) (throughput = 97723.02 nanoseconds per iteration)
===================================================================
scalar set, 29threads: 101169295 nanoseconds (bandwidth = 3757.16 MB/s) (throughput = 139543.86 nanoseconds per iteration)
scalar get, 29threads: 72749046 nanoseconds (bandwidth = 5224.93 MB/s) (throughput = 100343.51 nanoseconds per iteration)
===================================================================
scalar set, 30threads: 102969757 nanoseconds (bandwidth = 3818.75 MB/s) (throughput = 137293.01 nanoseconds per iteration)
scalar get, 30threads: 69073363 nanoseconds (bandwidth = 5692.73 MB/s) (throughput = 92097.82 nanoseconds per iteration)
===================================================================
scalar set, 31threads: 102545796 nanoseconds (bandwidth = 3962.36 MB/s) (throughput = 132317.16 nanoseconds per iteration)
scalar get, 31threads: 74178601 nanoseconds (bandwidth = 5477.63 MB/s) (throughput = 95714.32 nanoseconds per iteration)
===================================================================
scalar set, 32threads: 121212725 nanoseconds (bandwidth = 3460.28 MB/s) (throughput = 151515.91 nanoseconds per iteration)
scalar get, 32threads: 75708289 nanoseconds (bandwidth = 5540.09 MB/s) (throughput = 94635.36 nanoseconds per iteration)
===================================================================
scalar set, 33threads: 121707339 nanoseconds (bandwidth = 3553.92 MB/s) (throughput = 147524.05 nanoseconds per iteration)
scalar get, 33threads: 82355109 nanoseconds (bandwidth = 5252.10 MB/s) (throughput = 99824.37 nanoseconds per iteration)
===================================================================
scalar set, 34threads: 115999103 nanoseconds (bandwidth = 3841.80 MB/s) (throughput = 136469.53 nanoseconds per iteration)
scalar get, 34threads: 81548960 nanoseconds (bandwidth = 5464.75 MB/s) (throughput = 95939.95 nanoseconds per iteration)
===================================================================
scalar set, 35threads: 126456682 nanoseconds (bandwidth = 3627.74 MB/s) (throughput = 144521.92 nanoseconds per iteration)
scalar get, 35threads: 82506846 nanoseconds (bandwidth = 5560.17 MB/s) (throughput = 94293.54 nanoseconds per iteration)
===================================================================
scalar set, 36threads: 126509707 nanoseconds (bandwidth = 3729.83 MB/s) (throughput = 140566.34 nanoseconds per iteration)
scalar get, 36threads: 82615055 nanoseconds (bandwidth = 5711.54 MB/s) (throughput = 91794.51 nanoseconds per iteration)
===================================================================
scalar set, 37threads: 124248461 nanoseconds (bandwidth = 3903.20 MB/s) (throughput = 134322.66 nanoseconds per iteration)
scalar get, 37threads: 90909746 nanoseconds (bandwidth = 5334.59 MB/s) (throughput = 98280.81 nanoseconds per iteration)
===================================================================
scalar set, 38threads: 123308020 nanoseconds (bandwidth = 4039.26 MB/s) (throughput = 129797.92 nanoseconds per iteration)
scalar get, 38threads: 86288094 nanoseconds (bandwidth = 5772.22 MB/s) (throughput = 90829.57 nanoseconds per iteration)
===================================================================
scalar set, 39threads: 126580873 nanoseconds (bandwidth = 4038.37 MB/s) (throughput = 129826.54 nanoseconds per iteration)
scalar get, 39threads: 92881646 nanoseconds (bandwidth = 5503.57 MB/s) (throughput = 95263.23 nanoseconds per iteration)
===================================================================
scalar set, 40threads: 129104678 nanoseconds (bandwidth = 4060.95 MB/s) (throughput = 129104.68 nanoseconds per iteration)
scalar get, 40threads: 94784883 nanoseconds (bandwidth = 5531.35 MB/s) (throughput = 94784.88 nanoseconds per iteration)
===================================================================
scalar set, 41threads: 150574384 nanoseconds (bandwidth = 3568.97 MB/s) (throughput = 146901.84 nanoseconds per iteration)
scalar get, 41threads: 95062193 nanoseconds (bandwidth = 5653.09 MB/s) (throughput = 92743.60 nanoseconds per iteration)
===================================================================
scalar set, 42threads: 142750698 nanoseconds (bandwidth = 3856.39 MB/s) (throughput = 135953.05 nanoseconds per iteration)
scalar get, 42threads: 103810699 nanoseconds (bandwidth = 5302.94 MB/s) (throughput = 98867.33 nanoseconds per iteration)
===================================================================
scalar set, 43threads: 148257443 nanoseconds (bandwidth = 3801.56 MB/s) (throughput = 137913.90 nanoseconds per iteration)
scalar get, 43threads: 99934263 nanoseconds (bandwidth = 5639.80 MB/s) (throughput = 92962.11 nanoseconds per iteration)
===================================================================
scalar set, 44threads: 145943812 nanoseconds (bandwidth = 3951.64 MB/s) (throughput = 132676.19 nanoseconds per iteration)
scalar get, 44threads: 97961979 nanoseconds (bandwidth = 5887.15 MB/s) (throughput = 89056.34 nanoseconds per iteration)
===================================================================
scalar set, 45threads: 153050638 nanoseconds (bandwidth = 3853.78 MB/s) (throughput = 136045.01 nanoseconds per iteration)
scalar get, 45threads: 116817171 nanoseconds (bandwidth = 5049.12 MB/s) (throughput = 103837.49 nanoseconds per iteration)
===================================================================
scalar set, 46threads: 153257247 nanoseconds (bandwidth = 3934.11 MB/s) (throughput = 133267.17 nanoseconds per iteration)
scalar get, 46threads: 105978283 nanoseconds (bandwidth = 5689.20 MB/s) (throughput = 92155.03 nanoseconds per iteration)
===================================================================
scalar set, 47threads: 176966061 nanoseconds (bandwidth = 3481.11 MB/s) (throughput = 150609.41 nanoseconds per iteration)
scalar get, 47threads: 111462897 nanoseconds (bandwidth = 5526.85 MB/s) (throughput = 94862.04 nanoseconds per iteration)
===================================================================
scalar set, 48threads: 160817945 nanoseconds (bandwidth = 3912.16 MB/s) (throughput = 134014.95 nanoseconds per iteration)
scalar get, 48threads: 109834806 nanoseconds (bandwidth = 5728.11 MB/s) (throughput = 91529.01 nanoseconds per iteration)
===================================================================
scalar set, 49threads: 203512394 nanoseconds (bandwidth = 3155.84 MB/s) (throughput = 166132.57 nanoseconds per iteration)
scalar get, 49threads: 113705539 nanoseconds (bandwidth = 5648.39 MB/s) (throughput = 92820.85 nanoseconds per iteration)
===================================================================
scalar set, 50threads: 197681320 nanoseconds (bandwidth = 3315.23 MB/s) (throughput = 158145.06 nanoseconds per iteration)
scalar get, 50threads: 109450668 nanoseconds (bandwidth = 5987.72 MB/s) (throughput = 87560.53 nanoseconds per iteration)
===================================================================
scalar set, 51threads: 176160323 nanoseconds (bandwidth = 3794.65 MB/s) (throughput = 138164.96 nanoseconds per iteration)
scalar get, 51threads: 115236569 nanoseconds (bandwidth = 5800.83 MB/s) (throughput = 90381.62 nanoseconds per iteration)
===================================================================
scalar set, 52threads: 191726803 nanoseconds (bandwidth = 3554.92 MB/s) (throughput = 147482.16 nanoseconds per iteration)
scalar get, 52threads: 125493816 nanoseconds (bandwidth = 5431.14 MB/s) (throughput = 96533.70 nanoseconds per iteration)
===================================================================
scalar set, 53threads: 207172128 nanoseconds (bandwidth = 3353.16 MB/s) (throughput = 156356.32 nanoseconds per iteration)
scalar get, 53threads: 124740074 nanoseconds (bandwidth = 5569.03 MB/s) (throughput = 94143.45 nanoseconds per iteration)
===================================================================
scalar set, 54threads: 197679033 nanoseconds (bandwidth = 3580.50 MB/s) (throughput = 146428.91 nanoseconds per iteration)
scalar get, 54threads: 126355723 nanoseconds (bandwidth = 5601.56 MB/s) (throughput = 93596.83 nanoseconds per iteration)
===================================================================
scalar set, 55threads: 187352730 nanoseconds (bandwidth = 3847.80 MB/s) (throughput = 136256.53 nanoseconds per iteration)
scalar get, 55threads: 122236470 nanoseconds (bandwidth = 5897.55 MB/s) (throughput = 88899.25 nanoseconds per iteration)
===================================================================
scalar set, 56threads: 218713924 nanoseconds (bandwidth = 3356.00 MB/s) (throughput = 156224.23 nanoseconds per iteration)
scalar get, 56threads: 131795271 nanoseconds (bandwidth = 5569.27 MB/s) (throughput = 94139.48 nanoseconds per iteration)
===================================================================
scalar set, 57threads: 196726340 nanoseconds (bandwidth = 3797.71 MB/s) (throughput = 138053.57 nanoseconds per iteration)
scalar get, 57threads: 131650873 nanoseconds (bandwidth = 5674.94 MB/s) (throughput = 92386.58 nanoseconds per iteration)
===================================================================
scalar set, 58threads: 216416164 nanoseconds (bandwidth = 3512.76 MB/s) (throughput = 149252.53 nanoseconds per iteration)
scalar get, 58threads: 129668376 nanoseconds (bandwidth = 5862.78 MB/s) (throughput = 89426.47 nanoseconds per iteration)
===================================================================
scalar set, 59threads: 210443814 nanoseconds (bandwidth = 3674.73 MB/s) (throughput = 142673.77 nanoseconds per iteration)
scalar get, 59threads: 138240232 nanoseconds (bandwidth = 5594.06 MB/s) (throughput = 93722.19 nanoseconds per iteration)
===================================================================
scalar set, 60threads: 207668288 nanoseconds (bandwidth = 3786.96 MB/s) (throughput = 138445.53 nanoseconds per iteration)
scalar get, 60threads: 137999012 nanoseconds (bandwidth = 5698.82 MB/s) (throughput = 91999.34 nanoseconds per iteration)
===================================================================
scalar set, 61threads: 220998752 nanoseconds (bandwidth = 3617.84 MB/s) (throughput = 144917.21 nanoseconds per iteration)
scalar get, 61threads: 136010648 nanoseconds (bandwidth = 5878.50 MB/s) (throughput = 89187.31 nanoseconds per iteration)
===================================================================
scalar set, 62threads: 218475192 nanoseconds (bandwidth = 3719.63 MB/s) (throughput = 140951.74 nanoseconds per iteration)
scalar get, 62threads: 142948067 nanoseconds (bandwidth = 5684.91 MB/s) (throughput = 92224.56 nanoseconds per iteration)
===================================================================
scalar set, 63threads: 212283324 nanoseconds (bandwidth = 3889.87 MB/s) (throughput = 134783.06 nanoseconds per iteration)
scalar get, 63threads: 136367146 nanoseconds (bandwidth = 6055.37 MB/s) (throughput = 86582.31 nanoseconds per iteration)
===================================================================
scalar set, 64threads: 229655008 nanoseconds (bandwidth = 3652.70 MB/s) (throughput = 143534.38 nanoseconds per iteration)
scalar get, 64threads: 149184573 nanoseconds (bandwidth = 5622.97 MB/s) (throughput = 93240.36 nanoseconds per iteration)
===================================================================
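The bandwidth and per-iteration figures in this log can be reproduced from the elapsed times. The sketch below assumes that CpuBenchmarker divides the byte count and the iteration count passed to its constructor by the elapsed nanoseconds (inferred from the numbers, not from its source) and that sizeof(Object) is 524288 bytes, i.e. 524284 bytes of payload plus a 4-byte int. The function names are made up for this illustration.

```cpp
// MB/s with 1 MB = 10^6 bytes: bytes per nanosecond times 1000.
constexpr double reportedBandwidthMBps(double totalBytes,
                                       double elapsedNanoseconds) {
    return totalBytes / elapsedNanoseconds * 1000.0;
}

// Average time per get/set call over all threads.
constexpr double nanosecondsPerIteration(double elapsedNanoseconds,
                                         double iterations) {
    return elapsedNanoseconds / iterations;
}

// Assumed object size in the bandwidth test: 512kB.
constexpr double objectBytes = 1024.0 * 512.0;

// "scalar set, 1threads": 1 thread x 25 calls in 12182103 ns.
constexpr double setOneThreadMBps =
    reportedBandwidthMBps(1 * 25 * objectBytes, 12182103.0);   // ~1075.94
constexpr double setOneThreadNs =
    nanosecondsPerIteration(12182103.0, 1 * 25);               // ~487284.12

// "scalar set, 8threads": 8 threads x 25 calls in 24771182 ns.
constexpr double setEightThreadsMBps =
    reportedBandwidthMBps(8 * 25 * objectBytes, 24771182.0);   // ~4233.05
```

Note that the "nanoseconds per iteration" value is the total elapsed time divided by the number of calls across all threads, so with many threads it is lower than the latency a single call experiences.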