Bandwidth and latency
Before getting into details, here is some benchmark data from a development system with 3 old low-end graphics cards (PCIe v2 at x4 + x8 + x4 lanes) and 8 CPU cores (source code is at the end of this page):
- random write bandwidth (object size = 512kB, 8 threads): 4233 MB/s
- random read bandwidth (object size = 512kB, 63 threads): 6055 MB/s
- random write latency (object size = 8 bytes, 64 threads): 25697 nanoseconds
- random read latency (object size = 8 bytes, 64 threads): 17765 nanoseconds
- RAM: 10600 MB/s (single channel ddr3 at 1333 MHz), ~80 nanoseconds latency
- L1 cache bandwidth: ~200 GB/s (fx8150 at 2.1GHz)
Not all object types have the same read/write efficiency, not all objects have the same size, and not all algorithms have the same access pattern, so there is always a trade-off between bandwidth and latency, which can be optimized by tuning the parameters of the virtual array.
A sequential access pattern has locality after the first access. Once any element of a page is fetched or written, the whole page of elements is cached in RAM (and in L1/L2/L3 caches if the page/object is small enough). The following elements are accessed at the cost of only a page lock, which is held just while the page index, element index, graphics card index, etc. are calculated to reach the relevant data, all in constant time. PCIe data copies are also constant-time, but system topology causes different pages to have different latencies/bandwidths. To keep this per-page performance difference as local as possible, sequential page indices are interleaved across all graphics cards: the first page is in the first card, the second page is in the second card, and so on. Since there are multiple OpenCL channels per card, the interleaving repeats on a longer cycle. If there are cards a, b, c, each with 2 data channels, the cycle is a1 b1 c1 a2 b2 c2 a1 b1 c1 a2 b2 c2 ..., so it takes 6 pages to reach the same channel of the same graphics card again. This enables high-volume threading for I/O and spreads access-time differences over the whole virtual array instead of slowing down one part of it.
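The interleaving described above can be sketched as a pure index computation. This is only a minimal illustration, not the library's actual code: `PageLocation` and `locatePage` are hypothetical names, and the sketch assumes every card has the same number of channels.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical type for illustration only (not part of the library).
struct PageLocation
{
    std::size_t card;    // 0-based card index (a=0, b=1, c=2, ...)
    std::size_t channel; // 0-based channel index within that card
};

// Maps a page index to a (card, channel) pair so that consecutive pages
// land on consecutive cards and the channel advances once per full pass
// over the cards: a1 b1 c1 a2 b2 c2 a1 b1 ...
// Assumption: all cards have the same number of channels.
inline PageLocation locatePage(std::size_t pageIndex,
                               std::size_t numCards,
                               std::size_t channelsPerCard)
{
    assert(numCards > 0 && channelsPerCard > 0);
    // position inside one full cycle of numCards * channelsPerCard pages
    const std::size_t slot = pageIndex % (numCards * channelsPerCard);
    return PageLocation{ slot % numCards, slot / numCards };
}
```

With 3 cards and 2 channels per card, pages 0 to 5 map to a1 b1 c1 a2 b2 c2, and page 6 wraps around to a1 again.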
Caching is done at page level, and a larger page size means lower average element access latency. But in multi-threaded access, page locking causes heavy contention and performance decreases. To overcome this, each thread should access a different region of the array instead of the same page. For example, thread 1 can access the first page while thread 2 accesses the second page concurrently. If more concurrent accesses are needed, the memMult parameter of the virtual array constructor needs higher values. For example, if memMult contains {10,10,20}, then the array can have 40 pages active concurrently for updating/accessing.
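One way to keep threads off each other's pages is to hand every thread a contiguous block of whole pages. The sketch below shows that partitioning together with the memMult sum from the example above; `totalActivePages`, `ElementRange` and `rangeForThread` are hypothetical helper names, not part of the library.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Sum of the memMult entries; following the text, {10,10,20} gives 40.
inline int totalActivePages(const std::vector<int>& memMult)
{
    return std::accumulate(memMult.begin(), memMult.end(), 0);
}

// Half-open element range [begin, end) assigned to one thread.
struct ElementRange
{
    std::size_t begin;
    std::size_t end;
};

// Gives each thread a contiguous block of whole pages, so range boundaries
// always fall on page boundaries and no page is shared by two threads.
inline ElementRange rangeForThread(std::size_t threadId,
                                   std::size_t numThreads,
                                   std::size_t numElements,
                                   std::size_t pageSize)
{
    assert(numThreads > 0 && pageSize > 0 && threadId < numThreads);
    const std::size_t numPages = (numElements + pageSize - 1) / pageSize;
    const std::size_t pagesPerThread = (numPages + numThreads - 1) / numThreads;
    const std::size_t first = std::min(threadId * pagesPerThread * pageSize, numElements);
    const std::size_t last = std::min(first + pagesPerThread * pageSize, numElements);
    return ElementRange{ first, last };
}
```

For 1000 elements, a page size of 100 and 4 threads, thread 0 gets elements [0, 300) and thread 3 gets the remainder [900, 1000). A thread whose block would start past the end of the array receives an empty range.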
Like a scalar access, a bulk access consists of PCIe data transfer and page locking, but it performs many of them at once, which greatly reduces average latency. In the tiled nbody example, the average latency per x,y,z position float is 0.08 nanoseconds, orders of magnitude faster than the scalar get/set methods. Since each bulk access is still single-threaded, it scales to multiple threads and can hide I/O latency behind math operations in a many-threaded application.
Sequential access can achieve both moderate bandwidth and moderate latency regardless of object/element size, but it needs proper tuning of the page size.
The random access pattern is the weak point of this virtual array's caching. If random indices do not land on close-enough elements, then paging with more than 1 element per page causes latency problems because unnecessary elements are fetched from the cards. But random access is a strong point of VRAM (video memory): a card can access any element within its VRAM in constant time regardless of the element's position. So random access is not much of a problem if the object (element type) has enough bytes to make up for the minimum PCIe access latency.
A random access pattern cannot have both high bandwidth and low latency at the same time; object size decides its performance. Single char elements are randomly accessed at the minimum PCIe latency (several microseconds or more) and 1-2 MB/s bandwidth. Objects of 512kB are accessed with 5-10 GB/s bandwidth and 100-150 microseconds latency. Randomly accessing 4kB objects (as in HDD benchmarking) is very close to the low-latency side (~16 microseconds and 260 MB/s on the development computer). The PCIe bandwidth efficiency gain usually diminishes at around 256kB page size, so there is no need to go bigger than this except for benchmarking.
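The relation between object size, latency and bandwidth in the figures above is plain arithmetic for a single thread that issues one access at a time. A minimal sketch, assuming 1 MB = 10^6 bytes; `serialBandwidthMBps` is a hypothetical helper name:

```cpp
#include <cassert>

// Bandwidth in MB/s (1 MB = 10^6 bytes) of one thread that completes one
// access of objectBytes every latencyMicroseconds. Bytes per microsecond
// is numerically equal to MB/s, so no extra scaling is needed.
inline double serialBandwidthMBps(double objectBytes, double latencyMicroseconds)
{
    assert(latencyMicroseconds > 0.0);
    return objectBytes / latencyMicroseconds;
}
```

A 4kB object at 16 microseconds gives 4096 / 16 = 256 MB/s, in line with the ~260 MB/s quoted above. The same arithmetic applied to the single-threaded set result in the bandwidth output further down (524288 bytes in about 487.28 microseconds per iteration) gives about 1076 MB/s, which agrees with the reported 1075.94 MB/s. Higher figures are only reachable by overlapping accesses from multiple threads.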
Random bulk access is not very different from the scalar version, except for the reduced lock latency and increased bandwidth. Since it does multiple data copies, it works better with page size > 1. With page size > 1, it is a combination of sequential and random access and performs well, but not as well as sequential bulk access. In the sequential version, one thread may update a cache and the next thread can benefit directly from that cache, especially when bulk access regions overlap.
If the virtual array is needed as a read-only data source for both sequential and random access patterns concurrently, it can be duplicated with different settings: one copy with M elements per page (sequential access) and another with only 1 element per page (random access). Alternatively, the index list of the random accesses can be prepared and sorted before accessing the array. For example, in a ray tracer with randomly scattered rays, the rays can be sorted by location or direction so that neighboring rays in the list point to nearby elements in the virtual array.
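The effect of sorting the index list can be sketched without the library by counting how often consecutive accesses move to a different page. The names `countPageSwitches` and `sortedForLocality` are hypothetical, and a page switch is used here only as a rough proxy for a cache miss.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Counts how many times two consecutive accesses fall on different pages.
inline std::size_t countPageSwitches(const std::vector<std::size_t>& indices,
                                     std::size_t pageSize)
{
    assert(pageSize > 0);
    std::size_t switches = 0;
    for (std::size_t i = 1; i < indices.size(); ++i)
    {
        if (indices[i] / pageSize != indices[i - 1] / pageSize)
        {
            ++switches;
        }
    }
    return switches;
}

// Returns a sorted copy of the index list, so that accesses to the same
// page become neighbors in the list.
inline std::vector<std::size_t> sortedForLocality(std::vector<std::size_t> indices)
{
    std::sort(indices.begin(), indices.end());
    return indices;
}
```

For the index list {900, 5, 910, 7, 905, 3} with 100 elements per page, the original order switches pages 5 times, while the sorted order {3, 5, 7, 900, 905, 910} switches only once.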
- Even though the virtual array can serve many more threads than the CPU has logical cores, DMA capability per graphics card is still limited. Some cards have only 1 async copy engine, some cards have 5. Using too many OpenCL channels (in the memMult vector elements) may not help much.
- Some algorithms may not have enough work to hide the I/O latency.
- There may not be enough RAM bandwidth for all tasks, and any DMA between RAM and VRAM can only steal bandwidth from other tasks, unless they run in L1/L2/L3 caches. The effect of this is visible on the development computer with 1-channel low-frequency RAM (6 GB/s achieved instead of 8 GB/s).
- Too big a page combined with sequential multithreaded access means lock contention and lowered performance.
- Very few active pages (in the memMult vector parameter of the constructor) cannot overlap enough I/O.
- Virtual arrays have their own OpenCL contexts, so multiple virtual arrays may not enable extra concurrency on the same graphics card, unless they are backed by different cards.
Source code for random-access max bandwidth & min latency test:

    #include "GraphicsCardSupplyDepot.h"
    #include "VirtualMultiArray.h"
    #include "PcieBandwidthBenchmarker.h"
    #include "CpuBenchmarker.h"

    // testing
    #include <iostream>
    #include <random>
    #include <stdexcept>
    #include <string>
    #include <omp.h>

    constexpr bool TEST_BANDWIDTH = true;
    constexpr bool TEST_LATENCY = false;
    // select the test here: TEST_BANDWIDTH or TEST_LATENCY
    constexpr bool testType = TEST_BANDWIDTH;

    class Object
    {
    public:
        Object() : id(-1) {}
        Object(int p) : id(p) {}
        int getId() const { return id; }
    private:
        // padding: with a 4-byte int, the object is 512kB in the bandwidth test
        // and 8 bytes in the latency test
        char data[testType ? (1024 * 512 - 4) : (4)];
        int id;
    };

    int main()
    {
        const long long pageSize = 1; // 1 element per page (random-access setting)
        const long long n = pageSize * (testType ? 1000 : 100000);
        const int numTestsPerThread = 25;

        VirtualMultiArray<Object> test(n, GraphicsCardSupplyDepot().requestGpus(), pageSize, 3, PcieBandwidthBenchmarker().bestBandwidth(10));

        // initialize every element so that the get test can verify the ids
        #pragma omp parallel for
        for (long long j = 0; j < n; j++)
        {
            test.set(j, Object(static_cast<int>(j)));
        }

        // repeat both tests with 1 to 64 threads
        for (int i = 1; i <= 64; i++)
        {
            {
                CpuBenchmarker bench(i * numTestsPerThread * sizeof(Object), std::string("scalar set, ") + std::to_string(i) + std::string("threads"), i * numTestsPerThread);
                #pragma omp parallel for num_threads(i)
                for (long long j = 0; j < i; j++)
                {
                    // one generator per thread
                    std::random_device rd;
                    std::mt19937 rng(rd());
                    // integer distribution over the closed range [0, n-1];
                    // the original float distribution could never produce n-1
                    std::uniform_int_distribution<int> rnd(0, static_cast<int>(n - 1));
                    for (int k = 0; k < numTestsPerThread; k++)
                    {
                        const int rndv = rnd(rng);
                        test.set(rndv, Object(rndv));
                    }
                }
            }

            {
                CpuBenchmarker bench(i * numTestsPerThread * sizeof(Object), std::string("scalar get, ") + std::to_string(i) + std::string("threads"), i * numTestsPerThread);
                #pragma omp parallel for num_threads(i)
                for (long long j = 0; j < i; j++)
                {
                    std::random_device rd;
                    std::mt19937 rng(rd());
                    std::uniform_int_distribution<int> rnd(0, static_cast<int>(n - 1));
                    for (int k = 0; k < numTestsPerThread; k++)
                    {
                        const int rndv = rnd(rng);
                        const auto obj = test.get(rndv);
                        if (obj.getId() != rndv)
                        {
                            // not caught inside the parallel region, so in practice
                            // this aborts the test
                            throw std::invalid_argument("Error: set/get");
                        }
                    }
                }
            }
            std::cout << "===================================================================" << std::endl;
        }
        return 0;
    }

output for bandwidth test:
(The CPU has 8 logical cores, but the benchmark code doesn't do any math, so I/O latencies can be hidden by using many more threads.)
scalar set, 1threads: 12182103 nanoseconds (bandwidth = 1075.94 MB/s) (throughput = 487284.12 nanoseconds per iteration)
scalar get, 1threads: 10874521 nanoseconds (bandwidth = 1205.31 MB/s) (throughput = 434980.84 nanoseconds per iteration)
===================================================================
scalar set, 2threads: 15642385 nanoseconds (bandwidth = 1675.86 MB/s) (throughput = 312847.70 nanoseconds per iteration)
scalar get, 2threads: 16902450 nanoseconds (bandwidth = 1550.92 MB/s) (throughput = 338049.00 nanoseconds per iteration)
===================================================================
scalar set, 3threads: 15004388 nanoseconds (bandwidth = 2620.67 MB/s) (throughput = 200058.51 nanoseconds per iteration)
scalar get, 3threads: 16687201 nanoseconds (bandwidth = 2356.39 MB/s) (throughput = 222496.01 nanoseconds per iteration)
===================================================================
scalar set, 4threads: 15365448 nanoseconds (bandwidth = 3412.12 MB/s) (throughput = 153654.48 nanoseconds per iteration)
scalar get, 4threads: 16627608 nanoseconds (bandwidth = 3153.12 MB/s) (throughput = 166276.08 nanoseconds per iteration)
===================================================================
scalar set, 5threads: 17260459 nanoseconds (bandwidth = 3796.89 MB/s) (throughput = 138083.67 nanoseconds per iteration)
scalar get, 5threads: 19474227 nanoseconds (bandwidth = 3365.27 MB/s) (throughput = 155793.82 nanoseconds per iteration)
===================================================================
scalar set, 6threads: 20665345 nanoseconds (bandwidth = 3805.56 MB/s) (throughput = 137768.97 nanoseconds per iteration)
scalar get, 6threads: 18258798 nanoseconds (bandwidth = 4307.14 MB/s) (throughput = 121725.32 nanoseconds per iteration)
===================================================================
scalar set, 7threads: 27059813 nanoseconds (bandwidth = 3390.65 MB/s) (throughput = 154627.50 nanoseconds per iteration)
scalar get, 7threads: 25572897 nanoseconds (bandwidth = 3587.80 MB/s) (throughput = 146130.84 nanoseconds per iteration)
===================================================================
scalar set, 8threads: 24771182 nanoseconds (bandwidth = 4233.05 MB/s) (throughput = 123855.91 nanoseconds per iteration)
scalar get, 8threads: 26381672 nanoseconds (bandwidth = 3974.64 MB/s) (throughput = 131908.36 nanoseconds per iteration)
===================================================================
scalar set, 9threads: 32832302 nanoseconds (bandwidth = 3592.95 MB/s) (throughput = 145921.34 nanoseconds per iteration)
scalar get, 9threads: 28048196 nanoseconds (bandwidth = 4205.79 MB/s) (throughput = 124658.65 nanoseconds per iteration)
===================================================================
scalar set, 10threads: 31755705 nanoseconds (bandwidth = 4127.51 MB/s) (throughput = 127022.82 nanoseconds per iteration)
scalar get, 10threads: 27368949 nanoseconds (bandwidth = 4789.08 MB/s) (throughput = 109475.80 nanoseconds per iteration)
===================================================================
scalar set, 11threads: 36046360 nanoseconds (bandwidth = 3999.83 MB/s) (throughput = 131077.67 nanoseconds per iteration)
scalar get, 11threads: 32916058 nanoseconds (bandwidth = 4380.21 MB/s) (throughput = 119694.76 nanoseconds per iteration)
===================================================================
scalar set, 12threads: 39184425 nanoseconds (bandwidth = 4014.00 MB/s) (throughput = 130614.75 nanoseconds per iteration)
scalar get, 12threads: 34810264 nanoseconds (bandwidth = 4518.39 MB/s) (throughput = 116034.21 nanoseconds per iteration)
===================================================================
scalar set, 13threads: 39739881 nanoseconds (bandwidth = 4287.72 MB/s) (throughput = 122276.56 nanoseconds per iteration)
scalar get, 13threads: 36382637 nanoseconds (bandwidth = 4683.38 MB/s) (throughput = 111946.58 nanoseconds per iteration)
===================================================================
scalar set, 14threads: 41955657 nanoseconds (bandwidth = 4373.68 MB/s) (throughput = 119873.31 nanoseconds per iteration)
scalar get, 14threads: 42688385 nanoseconds (bandwidth = 4298.61 MB/s) (throughput = 121966.81 nanoseconds per iteration)
===================================================================
scalar set, 15threads: 49674598 nanoseconds (bandwidth = 3957.92 MB/s) (throughput = 132465.59 nanoseconds per iteration)
scalar get, 15threads: 37840712 nanoseconds (bandwidth = 5195.67 MB/s) (throughput = 100908.57 nanoseconds per iteration)
===================================================================
scalar set, 16threads: 49234693 nanoseconds (bandwidth = 4259.50 MB/s) (throughput = 123086.73 nanoseconds per iteration)
scalar get, 16threads: 41954226 nanoseconds (bandwidth = 4998.67 MB/s) (throughput = 104885.57 nanoseconds per iteration)
===================================================================
scalar set, 17threads: 58050902 nanoseconds (bandwidth = 3838.40 MB/s) (throughput = 136590.36 nanoseconds per iteration)
scalar get, 17threads: 44725579 nanoseconds (bandwidth = 4981.99 MB/s) (throughput = 105236.66 nanoseconds per iteration)
===================================================================
scalar set, 18threads: 62093088 nanoseconds (bandwidth = 3799.61 MB/s) (throughput = 137984.64 nanoseconds per iteration)
scalar get, 18threads: 47545793 nanoseconds (bandwidth = 4962.16 MB/s) (throughput = 105657.32 nanoseconds per iteration)
===================================================================
scalar set, 19threads: 66391697 nanoseconds (bandwidth = 3751.02 MB/s) (throughput = 139771.99 nanoseconds per iteration)
scalar get, 19threads: 46291924 nanoseconds (bandwidth = 5379.70 MB/s) (throughput = 97456.68 nanoseconds per iteration)
===================================================================
scalar set, 20threads: 61831319 nanoseconds (bandwidth = 4239.66 MB/s) (throughput = 123662.64 nanoseconds per iteration)
scalar get, 20threads: 53938296 nanoseconds (bandwidth = 4860.07 MB/s) (throughput = 107876.59 nanoseconds per iteration)
===================================================================
scalar set, 21threads: 65719846 nanoseconds (bandwidth = 4188.25 MB/s) (throughput = 125180.66 nanoseconds per iteration)
scalar get, 21threads: 53187908 nanoseconds (bandwidth = 5175.07 MB/s) (throughput = 101310.30 nanoseconds per iteration)
===================================================================
scalar set, 22threads: 74213840 nanoseconds (bandwidth = 3885.51 MB/s) (throughput = 134934.25 nanoseconds per iteration)
scalar get, 22threads: 59970725 nanoseconds (bandwidth = 4808.32 MB/s) (throughput = 109037.68 nanoseconds per iteration)
===================================================================
scalar set, 23threads: 80761504 nanoseconds (bandwidth = 3732.79 MB/s) (throughput = 140454.79 nanoseconds per iteration)
scalar get, 23threads: 67459480 nanoseconds (bandwidth = 4468.84 MB/s) (throughput = 117320.83 nanoseconds per iteration)
===================================================================
scalar set, 24threads: 83123313 nanoseconds (bandwidth = 3784.41 MB/s) (throughput = 138538.86 nanoseconds per iteration)
scalar get, 24threads: 58627497 nanoseconds (bandwidth = 5365.62 MB/s) (throughput = 97712.49 nanoseconds per iteration)
===================================================================
scalar set, 25threads: 89617121 nanoseconds (bandwidth = 3656.44 MB/s) (throughput = 143387.39 nanoseconds per iteration)
scalar get, 25threads: 64073021 nanoseconds (bandwidth = 5114.16 MB/s) (throughput = 102516.83 nanoseconds per iteration)
===================================================================
scalar set, 26threads: 94729104 nanoseconds (bandwidth = 3597.49 MB/s) (throughput = 145737.08 nanoseconds per iteration)
scalar get, 26threads: 64473181 nanoseconds (bandwidth = 5285.72 MB/s) (throughput = 99189.51 nanoseconds per iteration)
===================================================================
scalar set, 27threads: 88653671 nanoseconds (bandwidth = 3991.88 MB/s) (throughput = 131338.77 nanoseconds per iteration)
scalar get, 27threads: 65464920 nanoseconds (bandwidth = 5405.86 MB/s) (throughput = 96985.07 nanoseconds per iteration)
===================================================================
scalar set, 28threads: 106957871 nanoseconds (bandwidth = 3431.27 MB/s) (throughput = 152796.96 nanoseconds per iteration)
scalar get, 28threads: 68406113 nanoseconds (bandwidth = 5365.04 MB/s) (throughput = 97723.02 nanoseconds per iteration)
===================================================================
scalar set, 29threads: 101169295 nanoseconds (bandwidth = 3757.16 MB/s) (throughput = 139543.86 nanoseconds per iteration)
scalar get, 29threads: 72749046 nanoseconds (bandwidth = 5224.93 MB/s) (throughput = 100343.51 nanoseconds per iteration)
===================================================================
scalar set, 30threads: 102969757 nanoseconds (bandwidth = 3818.75 MB/s) (throughput = 137293.01 nanoseconds per iteration)
scalar get, 30threads: 69073363 nanoseconds (bandwidth = 5692.73 MB/s) (throughput = 92097.82 nanoseconds per iteration)
===================================================================
scalar set, 31threads: 102545796 nanoseconds (bandwidth = 3962.36 MB/s) (throughput = 132317.16 nanoseconds per iteration)
scalar get, 31threads: 74178601 nanoseconds (bandwidth = 5477.63 MB/s) (throughput = 95714.32 nanoseconds per iteration)
===================================================================
scalar set, 32threads: 121212725 nanoseconds (bandwidth = 3460.28 MB/s) (throughput = 151515.91 nanoseconds per iteration)
scalar get, 32threads: 75708289 nanoseconds (bandwidth = 5540.09 MB/s) (throughput = 94635.36 nanoseconds per iteration)
===================================================================
scalar set, 33threads: 121707339 nanoseconds (bandwidth = 3553.92 MB/s) (throughput = 147524.05 nanoseconds per iteration)
scalar get, 33threads: 82355109 nanoseconds (bandwidth = 5252.10 MB/s) (throughput = 99824.37 nanoseconds per iteration)
===================================================================
scalar set, 34threads: 115999103 nanoseconds (bandwidth = 3841.80 MB/s) (throughput = 136469.53 nanoseconds per iteration)
scalar get, 34threads: 81548960 nanoseconds (bandwidth = 5464.75 MB/s) (throughput = 95939.95 nanoseconds per iteration)
===================================================================
scalar set, 35threads: 126456682 nanoseconds (bandwidth = 3627.74 MB/s) (throughput = 144521.92 nanoseconds per iteration)
scalar get, 35threads: 82506846 nanoseconds (bandwidth = 5560.17 MB/s) (throughput = 94293.54 nanoseconds per iteration)
===================================================================
scalar set, 36threads: 126509707 nanoseconds (bandwidth = 3729.83 MB/s) (throughput = 140566.34 nanoseconds per iteration)
scalar get, 36threads: 82615055 nanoseconds (bandwidth = 5711.54 MB/s) (throughput = 91794.51 nanoseconds per iteration)
===================================================================
scalar set, 37threads: 124248461 nanoseconds (bandwidth = 3903.20 MB/s) (throughput = 134322.66 nanoseconds per iteration)
scalar get, 37threads: 90909746 nanoseconds (bandwidth = 5334.59 MB/s) (throughput = 98280.81 nanoseconds per iteration)
===================================================================
scalar set, 38threads: 123308020 nanoseconds (bandwidth = 4039.26 MB/s) (throughput = 129797.92 nanoseconds per iteration)
scalar get, 38threads: 86288094 nanoseconds (bandwidth = 5772.22 MB/s) (throughput = 90829.57 nanoseconds per iteration)
===================================================================
scalar set, 39threads: 126580873 nanoseconds (bandwidth = 4038.37 MB/s) (throughput = 129826.54 nanoseconds per iteration)
scalar get, 39threads: 92881646 nanoseconds (bandwidth = 5503.57 MB/s) (throughput = 95263.23 nanoseconds per iteration)
===================================================================
scalar set, 40threads: 129104678 nanoseconds (bandwidth = 4060.95 MB/s) (throughput = 129104.68 nanoseconds per iteration)
scalar get, 40threads: 94784883 nanoseconds (bandwidth = 5531.35 MB/s) (throughput = 94784.88 nanoseconds per iteration)
===================================================================
scalar set, 41threads: 150574384 nanoseconds (bandwidth = 3568.97 MB/s) (throughput = 146901.84 nanoseconds per iteration)
scalar get, 41threads: 95062193 nanoseconds (bandwidth = 5653.09 MB/s) (throughput = 92743.60 nanoseconds per iteration)
===================================================================
scalar set, 42threads: 142750698 nanoseconds (bandwidth = 3856.39 MB/s) (throughput = 135953.05 nanoseconds per iteration)
scalar get, 42threads: 103810699 nanoseconds (bandwidth = 5302.94 MB/s) (throughput = 98867.33 nanoseconds per iteration)
===================================================================
scalar set, 43threads: 148257443 nanoseconds (bandwidth = 3801.56 MB/s) (throughput = 137913.90 nanoseconds per iteration)
scalar get, 43threads: 99934263 nanoseconds (bandwidth = 5639.80 MB/s) (throughput = 92962.11 nanoseconds per iteration)
===================================================================
scalar set, 44threads: 145943812 nanoseconds (bandwidth = 3951.64 MB/s) (throughput = 132676.19 nanoseconds per iteration)
scalar get, 44threads: 97961979 nanoseconds (bandwidth = 5887.15 MB/s) (throughput = 89056.34 nanoseconds per iteration)
===================================================================
scalar set, 45threads: 153050638 nanoseconds (bandwidth = 3853.78 MB/s) (throughput = 136045.01 nanoseconds per iteration)
scalar get, 45threads: 116817171 nanoseconds (bandwidth = 5049.12 MB/s) (throughput = 103837.49 nanoseconds per iteration)
===================================================================
scalar set, 46threads: 153257247 nanoseconds (bandwidth = 3934.11 MB/s) (throughput = 133267.17 nanoseconds per iteration)
scalar get, 46threads: 105978283 nanoseconds (bandwidth = 5689.20 MB/s) (throughput = 92155.03 nanoseconds per iteration)
===================================================================
scalar set, 47threads: 176966061 nanoseconds (bandwidth = 3481.11 MB/s) (throughput = 150609.41 nanoseconds per iteration)
scalar get, 47threads: 111462897 nanoseconds (bandwidth = 5526.85 MB/s) (throughput = 94862.04 nanoseconds per iteration)
===================================================================
scalar set, 48threads: 160817945 nanoseconds (bandwidth = 3912.16 MB/s) (throughput = 134014.95 nanoseconds per iteration)
scalar get, 48threads: 109834806 nanoseconds (bandwidth = 5728.11 MB/s) (throughput = 91529.01 nanoseconds per iteration)
===================================================================
scalar set, 49threads: 203512394 nanoseconds (bandwidth = 3155.84 MB/s) (throughput = 166132.57 nanoseconds per iteration)
scalar get, 49threads: 113705539 nanoseconds (bandwidth = 5648.39 MB/s) (throughput = 92820.85 nanoseconds per iteration)
===================================================================
scalar set, 50threads: 197681320 nanoseconds (bandwidth = 3315.23 MB/s) (throughput = 158145.06 nanoseconds per iteration)
scalar get, 50threads: 109450668 nanoseconds (bandwidth = 5987.72 MB/s) (throughput = 87560.53 nanoseconds per iteration)
===================================================================
scalar set, 51threads: 176160323 nanoseconds (bandwidth = 3794.65 MB/s) (throughput = 138164.96 nanoseconds per iteration)
scalar get, 51threads: 115236569 nanoseconds (bandwidth = 5800.83 MB/s) (throughput = 90381.62 nanoseconds per iteration)
===================================================================
scalar set, 52threads: 191726803 nanoseconds (bandwidth = 3554.92 MB/s) (throughput = 147482.16 nanoseconds per iteration)
scalar get, 52threads: 125493816 nanoseconds (bandwidth = 5431.14 MB/s) (throughput = 96533.70 nanoseconds per iteration)
===================================================================
scalar set, 53threads: 207172128 nanoseconds (bandwidth = 3353.16 MB/s) (throughput = 156356.32 nanoseconds per iteration)
scalar get, 53threads: 124740074 nanoseconds (bandwidth = 5569.03 MB/s) (throughput = 94143.45 nanoseconds per iteration)
===================================================================
scalar set, 54threads: 197679033 nanoseconds (bandwidth = 3580.50 MB/s) (throughput = 146428.91 nanoseconds per iteration)
scalar get, 54threads: 126355723 nanoseconds (bandwidth = 5601.56 MB/s) (throughput = 93596.83 nanoseconds per iteration)
===================================================================
scalar set, 55threads: 187352730 nanoseconds (bandwidth = 3847.80 MB/s) (throughput = 136256.53 nanoseconds per iteration)
scalar get, 55threads: 122236470 nanoseconds (bandwidth = 5897.55 MB/s) (throughput = 88899.25 nanoseconds per iteration)
===================================================================
scalar set, 56threads: 218713924 nanoseconds (bandwidth = 3356.00 MB/s) (throughput = 156224.23 nanoseconds per iteration)
scalar get, 56threads: 131795271 nanoseconds (bandwidth = 5569.27 MB/s) (throughput = 94139.48 nanoseconds per iteration)
===================================================================
scalar set, 57threads: 196726340 nanoseconds (bandwidth = 3797.71 MB/s) (throughput = 138053.57 nanoseconds per iteration)
scalar get, 57threads: 131650873 nanoseconds (bandwidth = 5674.94 MB/s) (throughput = 92386.58 nanoseconds per iteration)
===================================================================
scalar set, 58threads: 216416164 nanoseconds (bandwidth = 3512.76 MB/s) (throughput = 149252.53 nanoseconds per iteration)
scalar get, 58threads: 129668376 nanoseconds (bandwidth = 5862.78 MB/s) (throughput = 89426.47 nanoseconds per iteration)
===================================================================
scalar set, 59threads: 210443814 nanoseconds (bandwidth = 3674.73 MB/s) (throughput = 142673.77 nanoseconds per iteration)
scalar get, 59threads: 138240232 nanoseconds (bandwidth = 5594.06 MB/s) (throughput = 93722.19 nanoseconds per iteration)
===================================================================
scalar set, 60threads: 207668288 nanoseconds (bandwidth = 3786.96 MB/s) (throughput = 138445.53 nanoseconds per iteration)
scalar get, 60threads: 137999012 nanoseconds (bandwidth = 5698.82 MB/s) (throughput = 91999.34 nanoseconds per iteration)
===================================================================
scalar set, 61threads: 220998752 nanoseconds (bandwidth = 3617.84 MB/s) (throughput = 144917.21 nanoseconds per iteration)
scalar get, 61threads: 136010648 nanoseconds (bandwidth = 5878.50 MB/s) (throughput = 89187.31 nanoseconds per iteration)
===================================================================
scalar set, 62threads: 218475192 nanoseconds (bandwidth = 3719.63 MB/s) (throughput = 140951.74 nanoseconds per iteration)
scalar get, 62threads: 142948067 nanoseconds (bandwidth = 5684.91 MB/s) (throughput = 92224.56 nanoseconds per iteration)
===================================================================
scalar set, 63threads: 212283324 nanoseconds (bandwidth = 3889.87 MB/s) (throughput = 134783.06 nanoseconds per iteration)
scalar get, 63threads: 136367146 nanoseconds (bandwidth = 6055.37 MB/s) (throughput = 86582.31 nanoseconds per iteration)
===================================================================
scalar set, 64threads: 229655008 nanoseconds (bandwidth = 3652.70 MB/s) (throughput = 143534.38 nanoseconds per iteration)
scalar get, 64threads: 149184573 nanoseconds (bandwidth = 5622.97 MB/s) (throughput = 93240.36 nanoseconds per iteration)
===================================================================
output for latency test:
scalar set, 1threads: 775379 nanoseconds (bandwidth = 0.26 MB/s) (throughput = 31015.16 nanoseconds per iteration)
scalar get, 1threads: 687111 nanoseconds (bandwidth = 0.29 MB/s) (throughput = 27484.44 nanoseconds per iteration)
===================================================================
scalar set, 2threads: 1481587 nanoseconds (bandwidth = 0.27 MB/s) (throughput = 29631.74 nanoseconds per iteration)
scalar get, 2threads: 1340220 nanoseconds (bandwidth = 0.30 MB/s) (throughput = 26804.40 nanoseconds per iteration)
===================================================================
scalar set, 3threads: 1797816 nanoseconds (bandwidth = 0.33 MB/s) (throughput = 23970.88 nanoseconds per iteration)
scalar get, 3threads: 2074213 nanoseconds (bandwidth = 0.29 MB/s) (throughput = 27656.17 nanoseconds per iteration)
===================================================================
scalar set, 4threads: 2395437 nanoseconds (bandwidth = 0.33 MB/s) (throughput = 23954.37 nanoseconds per iteration)
scalar get, 4threads: 2355648 nanoseconds (bandwidth = 0.34 MB/s) (throughput = 23556.48 nanoseconds per iteration)
===================================================================
scalar set, 5threads: 2679745 nanoseconds (bandwidth = 0.37 MB/s) (throughput = 21437.96 nanoseconds per iteration)
scalar get, 5threads: 2531960 nanoseconds (bandwidth = 0.39 MB/s) (throughput = 20255.68 nanoseconds per iteration)
===================================================================
scalar set, 6threads: 3162010 nanoseconds (bandwidth = 0.38 MB/s) (throughput = 21080.07 nanoseconds per iteration)
scalar get, 6threads: 3128207 nanoseconds (bandwidth = 0.38 MB/s) (throughput = 20854.71 nanoseconds per iteration)
===================================================================
scalar set, 7threads: 5707680 nanoseconds (bandwidth = 0.25 MB/s) (throughput = 32615.31 nanoseconds per iteration)
scalar get, 7threads: 4907178 nanoseconds (bandwidth = 0.29 MB/s) (throughput = 28041.02 nanoseconds per iteration)
===================================================================
scalar set, 8threads: 6860253 nanoseconds (bandwidth = 0.23 MB/s) (throughput = 34301.26 nanoseconds per iteration)
scalar get, 8threads: 5315239 nanoseconds (bandwidth = 0.30 MB/s) (throughput = 26576.19 nanoseconds per iteration)
===================================================================
scalar set, 9threads: 6429525 nanoseconds (bandwidth = 0.28 MB/s) (throughput = 28575.67 nanoseconds per iteration)
scalar get, 9threads: 4644863 nanoseconds (bandwidth = 0.39 MB/s) (throughput = 20643.84 nanoseconds per iteration)
===================================================================
scalar set, 10threads: 6719968 nanoseconds (bandwidth = 0.30 MB/s) (throughput = 26879.87 nanoseconds per iteration)
scalar get, 10threads: 5123617 nanoseconds (bandwidth = 0.39 MB/s) (throughput = 20494.47 nanoseconds per iteration)
===================================================================
scalar set, 11threads: 6249698 nanoseconds (bandwidth = 0.35 MB/s) (throughput = 22726.17 nanoseconds per iteration)
scalar get, 11threads: 5820089 nanoseconds (bandwidth = 0.38 MB/s) (throughput = 21163.96 nanoseconds per iteration)
===================================================================
scalar set, 12threads: 6781972 nanoseconds (bandwidth = 0.35 MB/s) (throughput = 22606.57 nanoseconds per iteration)
scalar get, 12threads: 6077518 nanoseconds (bandwidth = 0.39 MB/s) (throughput = 20258.39 nanoseconds per iteration)
===================================================================
scalar set, 13threads: 8207094 nanoseconds (bandwidth = 0.32 MB/s) (throughput = 25252.60 nanoseconds per iteration)
scalar get, 13threads: 6717351 nanoseconds (bandwidth = 0.39 MB/s) (throughput = 20668.77 nanoseconds per iteration)
===================================================================
scalar set, 14threads: 8003595 nanoseconds (bandwidth = 0.35 MB/s) (throughput = 22867.41 nanoseconds per iteration)
scalar get, 14threads: 7916573 nanoseconds (bandwidth = 0.35 MB/s) (throughput = 22618.78 nanoseconds per iteration)
===================================================================
scalar set, 15threads: 10907122 nanoseconds (bandwidth = 0.28 MB/s) (throughput = 29085.66 nanoseconds per iteration)
scalar get, 15threads: 7968167 nanoseconds (bandwidth = 0.38 MB/s) (throughput = 21248.45 nanoseconds per iteration)
===================================================================
scalar set, 16threads: 9558783 nanoseconds (bandwidth = 0.33 MB/s) (throughput = 23896.96 nanoseconds per iteration)
scalar get, 16threads: 8267326 nanoseconds (bandwidth = 0.39 MB/s) (throughput = 20668.31 nanoseconds per iteration)
===================================================================
scalar set, 17threads: 10073492 nanoseconds (bandwidth = 0.34 MB/s) (throughput = 23702.33 nanoseconds per iteration)
scalar get, 17threads: 8438056 nanoseconds (bandwidth = 0.40 MB/s) (throughput = 19854.25 nanoseconds per iteration)
===================================================================
scalar set, 18threads: 10673598 nanoseconds (bandwidth = 0.34 MB/s) (throughput = 23719.11 nanoseconds per iteration)
scalar get, 18threads: 8885470 nanoseconds (bandwidth = 0.41 MB/s) (throughput = 19745.49 nanoseconds per iteration)
===================================================================
scalar set, 19threads: 11552160 nanoseconds (bandwidth = 0.33 MB/s) (throughput = 24320.34 nanoseconds per iteration)
scalar get, 19threads: 9343927 nanoseconds (bandwidth = 0.41 MB/s) (throughput = 19671.43 nanoseconds per iteration)
===================================================================
scalar set, 20threads: 12859166 nanoseconds (bandwidth = 0.31 MB/s) (throughput = 25718.33 nanoseconds per iteration)
scalar get, 20threads: 9409615 nanoseconds (bandwidth = 0.43 MB/s) (throughput = 18819.23 nanoseconds per iteration)
===================================================================
scalar set, 21threads: 13598789 nanoseconds (bandwidth = 0.31 MB/s) (throughput = 25902.46 nanoseconds per iteration)
scalar get, 21threads: 10512361 nanoseconds (bandwidth = 0.40 MB/s) (throughput = 20023.54 nanoseconds per iteration)
===================================================================
scalar set, 22threads: 13383799 nanoseconds (bandwidth = 0.33 MB/s) (throughput = 24334.18 nanoseconds per iteration)
scalar get, 22threads: 10994953 nanoseconds (bandwidth = 0.40 MB/s) (throughput = 19990.82 nanoseconds per iteration)
===================================================================
scalar set, 23threads: 15157911 nanoseconds (bandwidth = 0.30 MB/s) (throughput = 26361.58 nanoseconds per iteration)
scalar get, 23threads: 10704141 nanoseconds (bandwidth = 0.43 MB/s) (throughput = 18615.90 nanoseconds per iteration)
===================================================================
scalar set, 24threads: 14424164 nanoseconds (bandwidth = 0.33 MB/s) (throughput = 24040.27 nanoseconds per iteration)
scalar get, 24threads: 10943030 nanoseconds (bandwidth = 0.44 MB/s) (throughput = 18238.38 nanoseconds per iteration)
===================================================================
scalar set, 25threads: 15539480 nanoseconds (bandwidth = 0.32 MB/s) (throughput = 24863.17 nanoseconds per iteration)
scalar get, 25threads: 11598444 nanoseconds (bandwidth = 0.43 MB/s) (throughput = 18557.51 nanoseconds per iteration)
===================================================================
scalar set, 26threads: 16848910 nanoseconds (bandwidth = 0.31 MB/s) (throughput = 25921.40 nanoseconds per iteration)
scalar get, 26threads: 12475135 nanoseconds (bandwidth = 0.42 MB/s) (throughput = 19192.52 nanoseconds per iteration)
===================================================================
scalar set, 27threads: 17207401 nanoseconds (bandwidth = 0.31 MB/s) (throughput = 25492.45 nanoseconds per iteration)
scalar get, 27threads: 12855699 nanoseconds (bandwidth = 0.42 MB/s) (throughput = 19045.48 nanoseconds per iteration)
===================================================================
scalar set, 28threads: 17406686 nanoseconds (bandwidth = 0.32 MB/s) (throughput = 24866.69 nanoseconds per iteration)
scalar get, 28threads: 13323661 nanoseconds (bandwidth = 0.42 MB/s) (throughput = 19033.80 nanoseconds per iteration)
===================================================================
scalar set, 29threads: 17401004 nanoseconds (bandwidth = 0.33 MB/s) (throughput = 24001.38 nanoseconds per iteration)
scalar get, 29threads: 13861620 nanoseconds (bandwidth = 0.42 MB/s) (throughput = 19119.48 nanoseconds per iteration)
===================================================================
scalar set, 30threads: 18672946 nanoseconds (bandwidth = 0.32 MB/s) (throughput = 24897.26 nanoseconds per iteration)
scalar get, 30threads: 14482698 nanoseconds (bandwidth = 0.41 MB/s) (throughput = 19310.26 nanoseconds per iteration)
===================================================================
scalar set, 31threads: 19652565 nanoseconds (bandwidth = 0.32 MB/s) (throughput = 25358.15 nanoseconds per iteration)
scalar get, 31threads: 14443398 nanoseconds (bandwidth = 0.43 MB/s) (throughput = 18636.64 nanoseconds per iteration)
===================================================================
scalar set, 32threads: 21329163 nanoseconds (bandwidth = 0.30 MB/s) (throughput = 26661.45 nanoseconds per iteration)
scalar get, 32threads: 14717053 nanoseconds (bandwidth = 0.43 MB/s) (throughput = 18396.32 nanoseconds per iteration)
===================================================================
scalar set, 33threads: 20671709 nanoseconds (bandwidth = 0.32 MB/s) (throughput = 25056.62 nanoseconds per iteration)
scalar get, 33threads: 15607363 nanoseconds (bandwidth = 0.42 MB/s) (throughput = 18918.02 nanoseconds per iteration)
===================================================================
scalar set, 34threads: 21466749 nanoseconds (bandwidth = 0.32 MB/s) (throughput = 25255.00 nanoseconds per iteration)
scalar get, 34threads: 15369478 nanoseconds (bandwidth = 0.44 MB/s) (throughput = 18081.74 nanoseconds per iteration)
===================================================================
scalar set, 35threads: 22612456 nanoseconds (bandwidth = 0.31 MB/s) (throughput = 25842.81 nanoseconds per iteration)
scalar get, 35threads: 16459988 nanoseconds (bandwidth = 0.43 MB/s) (throughput = 18811.41 nanoseconds per iteration)
===================================================================
scalar set, 36threads: 24057681 nanoseconds (bandwidth = 0.30 MB/s) (throughput = 26730.76 nanoseconds per iteration)
scalar get, 36threads: 16742066 nanoseconds (bandwidth = 0.43 MB/s) (throughput = 18602.30 nanoseconds per iteration)
===================================================================
scalar set, 37threads: 23602835 nanoseconds (bandwidth = 0.31 MB/s) (throughput = 25516.58 nanoseconds per iteration)
scalar get, 37threads: 17221780 nanoseconds (bandwidth = 0.43 MB/s) (throughput = 18618.14 nanoseconds per iteration)
===================================================================
scalar set, 38threads: 24565287 nanoseconds (bandwidth = 0.31 MB/s) (throughput = 25858.20 nanoseconds per iteration)
scalar get, 38threads: 17165844 nanoseconds (bandwidth = 0.44 MB/s) (throughput = 18069.31 nanoseconds per iteration)
===================================================================
scalar set, 39threads: 25516865 nanoseconds (bandwidth = 0.31 MB/s) (throughput = 26171.14 nanoseconds per iteration)
scalar get, 39threads: 18000459 nanoseconds (bandwidth = 0.43 MB/s) (throughput = 18462.01 nanoseconds per iteration)
===================================================================
scalar set, 40threads: 26223964 nanoseconds (bandwidth = 0.31 MB/s) (throughput = 26223.96 nanoseconds per iteration)
scalar get, 40threads: 18427086 nanoseconds (bandwidth = 0.43 MB/s) (throughput = 18427.09 nanoseconds per iteration)
===================================================================
scalar set, 41threads: 26128392 nanoseconds (bandwidth = 0.31 MB/s) (throughput = 25491.11 nanoseconds per iteration)
scalar get, 41threads: 18313751 nanoseconds (bandwidth = 0.45 MB/s) (throughput = 17867.07 nanoseconds per iteration)
===================================================================
scalar set, 42threads: 26615889 nanoseconds (bandwidth = 0.32 MB/s) (throughput = 25348.47 nanoseconds per iteration)
scalar get, 42threads: 19843875 nanoseconds (bandwidth = 0.42 MB/s) (throughput = 18898.93 nanoseconds per iteration)
===================================================================
scalar set, 43threads: 27027507 nanoseconds (bandwidth = 0.32 MB/s) (throughput = 25141.87 nanoseconds per iteration)
scalar get, 43threads: 20750677 nanoseconds (bandwidth = 0.41 MB/s) (throughput = 19302.96 nanoseconds per iteration)
===================================================================
scalar set, 44threads: 28400287 nanoseconds (bandwidth = 0.31 MB/s) (throughput = 25818.44 nanoseconds per iteration)
scalar get, 44threads: 20710109 nanoseconds (bandwidth = 0.42 MB/s) (throughput = 18827.37 nanoseconds per iteration)
===================================================================
scalar set, 45threads: 29395492 nanoseconds (bandwidth = 0.31 MB/s) (throughput = 26129.33 nanoseconds per iteration)
scalar get, 45threads: 21895064 nanoseconds (bandwidth = 0.41 MB/s) (throughput = 19462.28 nanoseconds per iteration)
===================================================================
scalar set, 46threads: 29247109 nanoseconds (bandwidth = 0.31 MB/s) (throughput = 25432.27 nanoseconds per iteration)
scalar get, 46threads: 21262042 nanoseconds (bandwidth = 0.43 MB/s) (throughput = 18488.73 nanoseconds per iteration)
===================================================================
scalar set, 47threads: 30160650 nanoseconds (bandwidth = 0.31 MB/s) (throughput = 25668.64 nanoseconds per iteration)
scalar get, 47threads: 21535305 nanoseconds (bandwidth = 0.44 MB/s) (throughput = 18327.92 nanoseconds per iteration)
===================================================================
scalar set, 48threads: 30803548 nanoseconds (bandwidth = 0.31 MB/s) (throughput = 25669.62 nanoseconds per iteration)
scalar get, 48threads: 22243446 nanoseconds (bandwidth = 0.43 MB/s) (throughput = 18536.21 nanoseconds per iteration)
===================================================================
scalar set, 49threads: 31935437 nanoseconds (bandwidth = 0.31 MB/s) (throughput = 26069.74 nanoseconds per iteration)
scalar get, 49threads: 22011816 nanoseconds (bandwidth = 0.45 MB/s) (throughput = 17968.83 nanoseconds per iteration)
===================================================================
scalar set, 50threads: 32482311 nanoseconds (bandwidth = 0.31 MB/s) (throughput = 25985.85 nanoseconds per iteration)
scalar get, 50threads: 22274637 nanoseconds (bandwidth = 0.45 MB/s) (throughput = 17819.71 nanoseconds per iteration)
===================================================================
scalar set, 51threads: 32682591 nanoseconds (bandwidth = 0.31 MB/s) (throughput = 25633.40 nanoseconds per iteration)
scalar get, 51threads: 23114274 nanoseconds (bandwidth = 0.44 MB/s) (throughput = 18128.84 nanoseconds per iteration)
===================================================================
scalar set, 52threads: 35620261 nanoseconds (bandwidth = 0.29 MB/s) (throughput = 27400.20 nanoseconds per iteration)
scalar get, 52threads: 22784131 nanoseconds (bandwidth = 0.46 MB/s) (throughput = 17526.25 nanoseconds per iteration)
===================================================================
scalar set, 53threads: 35105602 nanoseconds (bandwidth = 0.30 MB/s) (throughput = 26494.79 nanoseconds per iteration)
scalar get, 53threads: 25174488 nanoseconds (bandwidth = 0.42 MB/s) (throughput = 18999.61 nanoseconds per iteration)
===================================================================
scalar set, 54threads: 35553065 nanoseconds (bandwidth = 0.30 MB/s) (throughput = 26335.60 nanoseconds per iteration)
scalar get, 54threads: 24223966 nanoseconds (bandwidth = 0.45 MB/s) (throughput = 17943.68 nanoseconds per iteration)
===================================================================
scalar set, 55threads: 36305271 nanoseconds (bandwidth = 0.30 MB/s) (throughput = 26403.83 nanoseconds per iteration)
scalar get, 55threads: 26361864 nanoseconds (bandwidth = 0.42 MB/s) (throughput = 19172.26 nanoseconds per iteration)
===================================================================
scalar set, 56threads: 36231076 nanoseconds (bandwidth = 0.31 MB/s) (throughput = 25879.34 nanoseconds per iteration)
scalar get, 56threads: 25167642 nanoseconds (bandwidth = 0.45 MB/s) (throughput = 17976.89 nanoseconds per iteration)
===================================================================
scalar set, 57threads: 36618670 nanoseconds (bandwidth = 0.31 MB/s) (throughput = 25697.31 nanoseconds per iteration)
scalar get, 57threads: 26212777 nanoseconds (bandwidth = 0.43 MB/s) (throughput = 18394.93 nanoseconds per iteration)
===================================================================
scalar set, 58threads: 37641994 nanoseconds (bandwidth = 0.31 MB/s) (throughput = 25960.00 nanoseconds per iteration)
scalar get, 58threads: 27736874 nanoseconds (bandwidth = 0.42 MB/s) (throughput = 19128.88 nanoseconds per iteration)
===================================================================
scalar set, 59threads: 38433436 nanoseconds (bandwidth = 0.31 MB/s) (throughput = 26056.57 nanoseconds per iteration)
scalar get, 59threads: 27168838 nanoseconds (bandwidth = 0.43 MB/s) (throughput = 18419.55 nanoseconds per iteration)
===================================================================
scalar set, 60threads: 40119192 nanoseconds (bandwidth = 0.30 MB/s) (throughput = 26746.13 nanoseconds per iteration)
scalar get, 60threads: 26648771 nanoseconds (bandwidth = 0.45 MB/s) (throughput = 17765.85 nanoseconds per iteration)
===================================================================
scalar set, 61threads: 40107669 nanoseconds (bandwidth = 0.30 MB/s) (throughput = 26300.11 nanoseconds per iteration)
scalar get, 61threads: 27275992 nanoseconds (bandwidth = 0.45 MB/s) (throughput = 17885.90 nanoseconds per iteration)
===================================================================
scalar set, 62threads: 41379521 nanoseconds (bandwidth = 0.30 MB/s) (throughput = 26696.47 nanoseconds per iteration)
scalar get, 62threads: 27047386 nanoseconds (bandwidth = 0.46 MB/s) (throughput = 17449.93 nanoseconds per iteration)
===================================================================
scalar set, 63threads: 41216107 nanoseconds (bandwidth = 0.31 MB/s) (throughput = 26168.96 nanoseconds per iteration)
scalar get, 63threads: 28625041 nanoseconds (bandwidth = 0.44 MB/s) (throughput = 18174.63 nanoseconds per iteration)
===================================================================
scalar set, 64threads: 43973339 nanoseconds (bandwidth = 0.29 MB/s) (throughput = 27483.34 nanoseconds per iteration)
scalar get, 64threads: 30242336 nanoseconds (bandwidth = 0.42 MB/s) (throughput = 18901.46 nanoseconds per iteration)