Bandwidth and latency

Hüseyin Tuğrul BÜYÜKIŞIK edited this page Mar 8, 2021 · 22 revisions

With a sequential access pattern and a page size greater than 1 (much greater than 1 when the object is small), the minimum access latency drops to a few nanoseconds or about a microsecond, depending on whether the access method is bulk or scalar.

The figures below are the maximum average bandwidth and minimum average latency measured with a random-access pattern and page size = 1 on the following graphics card topology:

  • card1: PCIe v2.0 x4 (theoretical maximum bandwidth: 2000 MB/s)
  • card2: PCIe v2.0 x8 (theoretical maximum bandwidth: 4000 MB/s)
  • card3: PCIe v2.0 x4 (theoretical maximum bandwidth: 2000 MB/s)

results:

  • write bandwidth: 4233 MB/s at 8 threads; the run listed below peaks at 4374 MB/s with 14 threads (a page update that writes to the graphics card is followed by a read)

  • read bandwidth: 6055 MB/s

  • write latency: 25697 nanoseconds

  • read latency: 17765 nanoseconds

The development computer has only single-channel DDR3 RAM at 1333 MHz (about 10600 MB/s of bandwidth), so constructing very large objects on the stack lowers the overall DMA bandwidth between graphics card memory and RAM.

Random-access max bandwidth test source code:

#include "GraphicsCardSupplyDepot.h"
#include "VirtualMultiArray.h"
#include "PcieBandwidthBenchmarker.h"
#include "CpuBenchmarker.h"

// testing
#include <random>
#include <iostream>
#include <stdexcept>
#include <string>
#include <omp.h>

constexpr bool TEST_BANDWIDTH=true;
constexpr bool TEST_LATENCY=false;
constexpr bool testType = TEST_BANDWIDTH;

class Object
{
public:
	Object():id(-1){}
	Object(int p):id(p){}
	int getId() const {return id;}
private:
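	// payload sized so that sizeof(Object) is 512 KiB in the bandwidth test and 8 bytes in the latency test
	char data[testType?(1024*512 - 4):(4)];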
	int id;
};

int main()
{
	const long long pageSize = 1;
	const long long n = pageSize*(testType?1000:100000);
	const int numTestsPerThread = 25;
	VirtualMultiArray<Object> test(n,GraphicsCardSupplyDepot().requestGpus(),pageSize,3,PcieBandwidthBenchmarker().bestBandwidth(10));

	#pragma omp parallel for
	for(long long j=0;j<n;j++)
	{
			test.set(j,Object(j));
	}


	for(int i=1;i<=64;i++)
	{
		{
			CpuBenchmarker bench(i*numTestsPerThread*sizeof(Object),std::string("scalar set, ")+std::to_string(i)+std::string("threads"),i*numTestsPerThread);
			#pragma omp parallel for num_threads(i)
			for(long long j=0;j<i;j++)
			{
				std::random_device rd;
				std::mt19937 rng(rd());
				// integer distribution so that every index in [0,n-1] can be drawn
				std::uniform_int_distribution<int> rnd(0,static_cast<int>(n-1));
				for(int k=0;k<numTestsPerThread;k++)
				{
					const int rndv = rnd(rng);
					test.set(rndv,Object(rndv));
				}
			}
		}

		{
			CpuBenchmarker bench(i*numTestsPerThread*sizeof(Object),std::string("scalar get, ")+std::to_string(i)+std::string("threads"),i*numTestsPerThread);
			#pragma omp parallel for num_threads(i)
			for(long long j=0;j<i;j++)
			{
				std::random_device rd;
				std::mt19937 rng(rd());
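				// integer distribution so that every index in [0,n-1] can be drawn
				std::uniform_int_distribution<int> rnd(0,static_cast<int>(n-1));
				for(int k=0;k<numTestsPerThread;k++)
				{
					const int rndv = rnd(rng);
					const auto obj = test.get(rndv);
					if(obj.getId()!=rndv)
					{
						// an exception that escapes an OpenMP parallel region terminates the program
						throw std::invalid_argument("Error: set/get");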
					}
				}
			}
		}

		std::cout<<"==================================================================="<<std::endl;
	}
	return 0;
}

output:

(The CPU has 8 logical cores, but the benchmark code does no math, so I/O latency can be hidden by running many more threads than cores.)

scalar set, 1threads: 12182103 nanoseconds     (bandwidth = 1075.94 MB/s)      (throughput = 487284.12 nanoseconds per iteration) 
scalar get, 1threads: 10874521 nanoseconds     (bandwidth = 1205.31 MB/s)      (throughput = 434980.84 nanoseconds per iteration) 
===================================================================
scalar set, 2threads: 15642385 nanoseconds     (bandwidth = 1675.86 MB/s)      (throughput = 312847.70 nanoseconds per iteration) 
scalar get, 2threads: 16902450 nanoseconds     (bandwidth = 1550.92 MB/s)      (throughput = 338049.00 nanoseconds per iteration) 
===================================================================
scalar set, 3threads: 15004388 nanoseconds     (bandwidth = 2620.67 MB/s)      (throughput = 200058.51 nanoseconds per iteration) 
scalar get, 3threads: 16687201 nanoseconds     (bandwidth = 2356.39 MB/s)      (throughput = 222496.01 nanoseconds per iteration) 
===================================================================
scalar set, 4threads: 15365448 nanoseconds     (bandwidth = 3412.12 MB/s)      (throughput = 153654.48 nanoseconds per iteration) 
scalar get, 4threads: 16627608 nanoseconds     (bandwidth = 3153.12 MB/s)      (throughput = 166276.08 nanoseconds per iteration) 
===================================================================
scalar set, 5threads: 17260459 nanoseconds     (bandwidth = 3796.89 MB/s)      (throughput = 138083.67 nanoseconds per iteration) 
scalar get, 5threads: 19474227 nanoseconds     (bandwidth = 3365.27 MB/s)      (throughput = 155793.82 nanoseconds per iteration) 
===================================================================
scalar set, 6threads: 20665345 nanoseconds     (bandwidth = 3805.56 MB/s)      (throughput = 137768.97 nanoseconds per iteration) 
scalar get, 6threads: 18258798 nanoseconds     (bandwidth = 4307.14 MB/s)      (throughput = 121725.32 nanoseconds per iteration) 
===================================================================
scalar set, 7threads: 27059813 nanoseconds     (bandwidth = 3390.65 MB/s)      (throughput = 154627.50 nanoseconds per iteration) 
scalar get, 7threads: 25572897 nanoseconds     (bandwidth = 3587.80 MB/s)      (throughput = 146130.84 nanoseconds per iteration) 
===================================================================
scalar set, 8threads: 24771182 nanoseconds     (bandwidth = 4233.05 MB/s)      (throughput = 123855.91 nanoseconds per iteration) 
scalar get, 8threads: 26381672 nanoseconds     (bandwidth = 3974.64 MB/s)      (throughput = 131908.36 nanoseconds per iteration) 
===================================================================
scalar set, 9threads: 32832302 nanoseconds     (bandwidth = 3592.95 MB/s)      (throughput = 145921.34 nanoseconds per iteration) 
scalar get, 9threads: 28048196 nanoseconds     (bandwidth = 4205.79 MB/s)      (throughput = 124658.65 nanoseconds per iteration) 
===================================================================
scalar set, 10threads: 31755705 nanoseconds     (bandwidth = 4127.51 MB/s)      (throughput = 127022.82 nanoseconds per iteration) 
scalar get, 10threads: 27368949 nanoseconds     (bandwidth = 4789.08 MB/s)      (throughput = 109475.80 nanoseconds per iteration) 
===================================================================
scalar set, 11threads: 36046360 nanoseconds     (bandwidth = 3999.83 MB/s)      (throughput = 131077.67 nanoseconds per iteration) 
scalar get, 11threads: 32916058 nanoseconds     (bandwidth = 4380.21 MB/s)      (throughput = 119694.76 nanoseconds per iteration) 
===================================================================
scalar set, 12threads: 39184425 nanoseconds     (bandwidth = 4014.00 MB/s)      (throughput = 130614.75 nanoseconds per iteration) 
scalar get, 12threads: 34810264 nanoseconds     (bandwidth = 4518.39 MB/s)      (throughput = 116034.21 nanoseconds per iteration) 
===================================================================
scalar set, 13threads: 39739881 nanoseconds     (bandwidth = 4287.72 MB/s)      (throughput = 122276.56 nanoseconds per iteration) 
scalar get, 13threads: 36382637 nanoseconds     (bandwidth = 4683.38 MB/s)      (throughput = 111946.58 nanoseconds per iteration) 
===================================================================
scalar set, 14threads: 41955657 nanoseconds     (bandwidth = 4373.68 MB/s)      (throughput = 119873.31 nanoseconds per iteration) 
scalar get, 14threads: 42688385 nanoseconds     (bandwidth = 4298.61 MB/s)      (throughput = 121966.81 nanoseconds per iteration) 
===================================================================
scalar set, 15threads: 49674598 nanoseconds     (bandwidth = 3957.92 MB/s)      (throughput = 132465.59 nanoseconds per iteration) 
scalar get, 15threads: 37840712 nanoseconds     (bandwidth = 5195.67 MB/s)      (throughput = 100908.57 nanoseconds per iteration) 
===================================================================
scalar set, 16threads: 49234693 nanoseconds     (bandwidth = 4259.50 MB/s)      (throughput = 123086.73 nanoseconds per iteration) 
scalar get, 16threads: 41954226 nanoseconds     (bandwidth = 4998.67 MB/s)      (throughput = 104885.57 nanoseconds per iteration) 
===================================================================
scalar set, 17threads: 58050902 nanoseconds     (bandwidth = 3838.40 MB/s)      (throughput = 136590.36 nanoseconds per iteration) 
scalar get, 17threads: 44725579 nanoseconds     (bandwidth = 4981.99 MB/s)      (throughput = 105236.66 nanoseconds per iteration) 
===================================================================
scalar set, 18threads: 62093088 nanoseconds     (bandwidth = 3799.61 MB/s)      (throughput = 137984.64 nanoseconds per iteration) 
scalar get, 18threads: 47545793 nanoseconds     (bandwidth = 4962.16 MB/s)      (throughput = 105657.32 nanoseconds per iteration) 
===================================================================
scalar set, 19threads: 66391697 nanoseconds     (bandwidth = 3751.02 MB/s)      (throughput = 139771.99 nanoseconds per iteration) 
scalar get, 19threads: 46291924 nanoseconds     (bandwidth = 5379.70 MB/s)      (throughput = 97456.68 nanoseconds per iteration) 
===================================================================
scalar set, 20threads: 61831319 nanoseconds     (bandwidth = 4239.66 MB/s)      (throughput = 123662.64 nanoseconds per iteration) 
scalar get, 20threads: 53938296 nanoseconds     (bandwidth = 4860.07 MB/s)      (throughput = 107876.59 nanoseconds per iteration) 
===================================================================
scalar set, 21threads: 65719846 nanoseconds     (bandwidth = 4188.25 MB/s)      (throughput = 125180.66 nanoseconds per iteration) 
scalar get, 21threads: 53187908 nanoseconds     (bandwidth = 5175.07 MB/s)      (throughput = 101310.30 nanoseconds per iteration) 
===================================================================
scalar set, 22threads: 74213840 nanoseconds     (bandwidth = 3885.51 MB/s)      (throughput = 134934.25 nanoseconds per iteration) 
scalar get, 22threads: 59970725 nanoseconds     (bandwidth = 4808.32 MB/s)      (throughput = 109037.68 nanoseconds per iteration) 
===================================================================
scalar set, 23threads: 80761504 nanoseconds     (bandwidth = 3732.79 MB/s)      (throughput = 140454.79 nanoseconds per iteration) 
scalar get, 23threads: 67459480 nanoseconds     (bandwidth = 4468.84 MB/s)      (throughput = 117320.83 nanoseconds per iteration) 
===================================================================
scalar set, 24threads: 83123313 nanoseconds     (bandwidth = 3784.41 MB/s)      (throughput = 138538.86 nanoseconds per iteration) 
scalar get, 24threads: 58627497 nanoseconds     (bandwidth = 5365.62 MB/s)      (throughput = 97712.49 nanoseconds per iteration) 
===================================================================
scalar set, 25threads: 89617121 nanoseconds     (bandwidth = 3656.44 MB/s)      (throughput = 143387.39 nanoseconds per iteration) 
scalar get, 25threads: 64073021 nanoseconds     (bandwidth = 5114.16 MB/s)      (throughput = 102516.83 nanoseconds per iteration) 
===================================================================
scalar set, 26threads: 94729104 nanoseconds     (bandwidth = 3597.49 MB/s)      (throughput = 145737.08 nanoseconds per iteration) 
scalar get, 26threads: 64473181 nanoseconds     (bandwidth = 5285.72 MB/s)      (throughput = 99189.51 nanoseconds per iteration) 
===================================================================
scalar set, 27threads: 88653671 nanoseconds     (bandwidth = 3991.88 MB/s)      (throughput = 131338.77 nanoseconds per iteration) 
scalar get, 27threads: 65464920 nanoseconds     (bandwidth = 5405.86 MB/s)      (throughput = 96985.07 nanoseconds per iteration) 
===================================================================
scalar set, 28threads: 106957871 nanoseconds     (bandwidth = 3431.27 MB/s)      (throughput = 152796.96 nanoseconds per iteration) 
scalar get, 28threads: 68406113 nanoseconds     (bandwidth = 5365.04 MB/s)      (throughput = 97723.02 nanoseconds per iteration) 
===================================================================
scalar set, 29threads: 101169295 nanoseconds     (bandwidth = 3757.16 MB/s)      (throughput = 139543.86 nanoseconds per iteration) 
scalar get, 29threads: 72749046 nanoseconds     (bandwidth = 5224.93 MB/s)      (throughput = 100343.51 nanoseconds per iteration) 
===================================================================
scalar set, 30threads: 102969757 nanoseconds     (bandwidth = 3818.75 MB/s)      (throughput = 137293.01 nanoseconds per iteration) 
scalar get, 30threads: 69073363 nanoseconds     (bandwidth = 5692.73 MB/s)      (throughput = 92097.82 nanoseconds per iteration) 
===================================================================
scalar set, 31threads: 102545796 nanoseconds     (bandwidth = 3962.36 MB/s)      (throughput = 132317.16 nanoseconds per iteration) 
scalar get, 31threads: 74178601 nanoseconds     (bandwidth = 5477.63 MB/s)      (throughput = 95714.32 nanoseconds per iteration) 
===================================================================
scalar set, 32threads: 121212725 nanoseconds     (bandwidth = 3460.28 MB/s)      (throughput = 151515.91 nanoseconds per iteration) 
scalar get, 32threads: 75708289 nanoseconds     (bandwidth = 5540.09 MB/s)      (throughput = 94635.36 nanoseconds per iteration) 
===================================================================
scalar set, 33threads: 121707339 nanoseconds     (bandwidth = 3553.92 MB/s)      (throughput = 147524.05 nanoseconds per iteration) 
scalar get, 33threads: 82355109 nanoseconds     (bandwidth = 5252.10 MB/s)      (throughput = 99824.37 nanoseconds per iteration) 
===================================================================
scalar set, 34threads: 115999103 nanoseconds     (bandwidth = 3841.80 MB/s)      (throughput = 136469.53 nanoseconds per iteration) 
scalar get, 34threads: 81548960 nanoseconds     (bandwidth = 5464.75 MB/s)      (throughput = 95939.95 nanoseconds per iteration) 
===================================================================
scalar set, 35threads: 126456682 nanoseconds     (bandwidth = 3627.74 MB/s)      (throughput = 144521.92 nanoseconds per iteration) 
scalar get, 35threads: 82506846 nanoseconds     (bandwidth = 5560.17 MB/s)      (throughput = 94293.54 nanoseconds per iteration) 
===================================================================
scalar set, 36threads: 126509707 nanoseconds     (bandwidth = 3729.83 MB/s)      (throughput = 140566.34 nanoseconds per iteration) 
scalar get, 36threads: 82615055 nanoseconds     (bandwidth = 5711.54 MB/s)      (throughput = 91794.51 nanoseconds per iteration) 
===================================================================
scalar set, 37threads: 124248461 nanoseconds     (bandwidth = 3903.20 MB/s)      (throughput = 134322.66 nanoseconds per iteration) 
scalar get, 37threads: 90909746 nanoseconds     (bandwidth = 5334.59 MB/s)      (throughput = 98280.81 nanoseconds per iteration) 
===================================================================
scalar set, 38threads: 123308020 nanoseconds     (bandwidth = 4039.26 MB/s)      (throughput = 129797.92 nanoseconds per iteration) 
scalar get, 38threads: 86288094 nanoseconds     (bandwidth = 5772.22 MB/s)      (throughput = 90829.57 nanoseconds per iteration) 
===================================================================
scalar set, 39threads: 126580873 nanoseconds     (bandwidth = 4038.37 MB/s)      (throughput = 129826.54 nanoseconds per iteration) 
scalar get, 39threads: 92881646 nanoseconds     (bandwidth = 5503.57 MB/s)      (throughput = 95263.23 nanoseconds per iteration) 
===================================================================
scalar set, 40threads: 129104678 nanoseconds     (bandwidth = 4060.95 MB/s)      (throughput = 129104.68 nanoseconds per iteration) 
scalar get, 40threads: 94784883 nanoseconds     (bandwidth = 5531.35 MB/s)      (throughput = 94784.88 nanoseconds per iteration) 
===================================================================
scalar set, 41threads: 150574384 nanoseconds     (bandwidth = 3568.97 MB/s)      (throughput = 146901.84 nanoseconds per iteration) 
scalar get, 41threads: 95062193 nanoseconds     (bandwidth = 5653.09 MB/s)      (throughput = 92743.60 nanoseconds per iteration) 
===================================================================
scalar set, 42threads: 142750698 nanoseconds     (bandwidth = 3856.39 MB/s)      (throughput = 135953.05 nanoseconds per iteration) 
scalar get, 42threads: 103810699 nanoseconds     (bandwidth = 5302.94 MB/s)      (throughput = 98867.33 nanoseconds per iteration) 
===================================================================
scalar set, 43threads: 148257443 nanoseconds     (bandwidth = 3801.56 MB/s)      (throughput = 137913.90 nanoseconds per iteration) 
scalar get, 43threads: 99934263 nanoseconds     (bandwidth = 5639.80 MB/s)      (throughput = 92962.11 nanoseconds per iteration) 
===================================================================
scalar set, 44threads: 145943812 nanoseconds     (bandwidth = 3951.64 MB/s)      (throughput = 132676.19 nanoseconds per iteration) 
scalar get, 44threads: 97961979 nanoseconds     (bandwidth = 5887.15 MB/s)      (throughput = 89056.34 nanoseconds per iteration) 
===================================================================
scalar set, 45threads: 153050638 nanoseconds     (bandwidth = 3853.78 MB/s)      (throughput = 136045.01 nanoseconds per iteration) 
scalar get, 45threads: 116817171 nanoseconds     (bandwidth = 5049.12 MB/s)      (throughput = 103837.49 nanoseconds per iteration) 
===================================================================
scalar set, 46threads: 153257247 nanoseconds     (bandwidth = 3934.11 MB/s)      (throughput = 133267.17 nanoseconds per iteration) 
scalar get, 46threads: 105978283 nanoseconds     (bandwidth = 5689.20 MB/s)      (throughput = 92155.03 nanoseconds per iteration) 
===================================================================
scalar set, 47threads: 176966061 nanoseconds     (bandwidth = 3481.11 MB/s)      (throughput = 150609.41 nanoseconds per iteration) 
scalar get, 47threads: 111462897 nanoseconds     (bandwidth = 5526.85 MB/s)      (throughput = 94862.04 nanoseconds per iteration) 
===================================================================
scalar set, 48threads: 160817945 nanoseconds     (bandwidth = 3912.16 MB/s)      (throughput = 134014.95 nanoseconds per iteration) 
scalar get, 48threads: 109834806 nanoseconds     (bandwidth = 5728.11 MB/s)      (throughput = 91529.01 nanoseconds per iteration) 
===================================================================
scalar set, 49threads: 203512394 nanoseconds     (bandwidth = 3155.84 MB/s)      (throughput = 166132.57 nanoseconds per iteration) 
scalar get, 49threads: 113705539 nanoseconds     (bandwidth = 5648.39 MB/s)      (throughput = 92820.85 nanoseconds per iteration) 
===================================================================
scalar set, 50threads: 197681320 nanoseconds     (bandwidth = 3315.23 MB/s)      (throughput = 158145.06 nanoseconds per iteration) 
scalar get, 50threads: 109450668 nanoseconds     (bandwidth = 5987.72 MB/s)      (throughput = 87560.53 nanoseconds per iteration) 
===================================================================
scalar set, 51threads: 176160323 nanoseconds     (bandwidth = 3794.65 MB/s)      (throughput = 138164.96 nanoseconds per iteration) 
scalar get, 51threads: 115236569 nanoseconds     (bandwidth = 5800.83 MB/s)      (throughput = 90381.62 nanoseconds per iteration) 
===================================================================
scalar set, 52threads: 191726803 nanoseconds     (bandwidth = 3554.92 MB/s)      (throughput = 147482.16 nanoseconds per iteration) 
scalar get, 52threads: 125493816 nanoseconds     (bandwidth = 5431.14 MB/s)      (throughput = 96533.70 nanoseconds per iteration) 
===================================================================
scalar set, 53threads: 207172128 nanoseconds     (bandwidth = 3353.16 MB/s)      (throughput = 156356.32 nanoseconds per iteration) 
scalar get, 53threads: 124740074 nanoseconds     (bandwidth = 5569.03 MB/s)      (throughput = 94143.45 nanoseconds per iteration) 
===================================================================
scalar set, 54threads: 197679033 nanoseconds     (bandwidth = 3580.50 MB/s)      (throughput = 146428.91 nanoseconds per iteration) 
scalar get, 54threads: 126355723 nanoseconds     (bandwidth = 5601.56 MB/s)      (throughput = 93596.83 nanoseconds per iteration) 
===================================================================
scalar set, 55threads: 187352730 nanoseconds     (bandwidth = 3847.80 MB/s)      (throughput = 136256.53 nanoseconds per iteration) 
scalar get, 55threads: 122236470 nanoseconds     (bandwidth = 5897.55 MB/s)      (throughput = 88899.25 nanoseconds per iteration) 
===================================================================
scalar set, 56threads: 218713924 nanoseconds     (bandwidth = 3356.00 MB/s)      (throughput = 156224.23 nanoseconds per iteration) 
scalar get, 56threads: 131795271 nanoseconds     (bandwidth = 5569.27 MB/s)      (throughput = 94139.48 nanoseconds per iteration) 
===================================================================
scalar set, 57threads: 196726340 nanoseconds     (bandwidth = 3797.71 MB/s)      (throughput = 138053.57 nanoseconds per iteration) 
scalar get, 57threads: 131650873 nanoseconds     (bandwidth = 5674.94 MB/s)      (throughput = 92386.58 nanoseconds per iteration) 
===================================================================
scalar set, 58threads: 216416164 nanoseconds     (bandwidth = 3512.76 MB/s)      (throughput = 149252.53 nanoseconds per iteration) 
scalar get, 58threads: 129668376 nanoseconds     (bandwidth = 5862.78 MB/s)      (throughput = 89426.47 nanoseconds per iteration) 
===================================================================
scalar set, 59threads: 210443814 nanoseconds     (bandwidth = 3674.73 MB/s)      (throughput = 142673.77 nanoseconds per iteration) 
scalar get, 59threads: 138240232 nanoseconds     (bandwidth = 5594.06 MB/s)      (throughput = 93722.19 nanoseconds per iteration) 
===================================================================
scalar set, 60threads: 207668288 nanoseconds     (bandwidth = 3786.96 MB/s)      (throughput = 138445.53 nanoseconds per iteration) 
scalar get, 60threads: 137999012 nanoseconds     (bandwidth = 5698.82 MB/s)      (throughput = 91999.34 nanoseconds per iteration) 
===================================================================
scalar set, 61threads: 220998752 nanoseconds     (bandwidth = 3617.84 MB/s)      (throughput = 144917.21 nanoseconds per iteration) 
scalar get, 61threads: 136010648 nanoseconds     (bandwidth = 5878.50 MB/s)      (throughput = 89187.31 nanoseconds per iteration) 
===================================================================
scalar set, 62threads: 218475192 nanoseconds     (bandwidth = 3719.63 MB/s)      (throughput = 140951.74 nanoseconds per iteration) 
scalar get, 62threads: 142948067 nanoseconds     (bandwidth = 5684.91 MB/s)      (throughput = 92224.56 nanoseconds per iteration) 
===================================================================
scalar set, 63threads: 212283324 nanoseconds     (bandwidth = 3889.87 MB/s)      (throughput = 134783.06 nanoseconds per iteration) 
scalar get, 63threads: 136367146 nanoseconds     (bandwidth = 6055.37 MB/s)      (throughput = 86582.31 nanoseconds per iteration) 
===================================================================
scalar set, 64threads: 229655008 nanoseconds     (bandwidth = 3652.70 MB/s)      (throughput = 143534.38 nanoseconds per iteration) 
scalar get, 64threads: 149184573 nanoseconds     (bandwidth = 5622.97 MB/s)      (throughput = 93240.36 nanoseconds per iteration) 
===================================================================
