This file describes the high-level API for both the CUDA and OpenCL back-end of the CLCudaAPI headers. On top of the described API, each class has a constructor which takes the regular OpenCL or CUDA data-type and transforms it into a CLCudaAPI class. Furthermore, each class also implements a () operator which returns the regular OpenCL or CUDA data-type.
Constructor(s):
Event(): Creates a new event, to be used for example when timing kernels.
Public method(s):
-
void WaitForCompletion() const: Waits for completion of an event (OpenCL) or does nothing (CUDA). -
float GetElapsedTime() const: Retrieves the elapsed time in milliseconds of the last recorded event (e.g. a device kernel). This method first makes sure that the last event is finished before computing the elapsed time.
Constructor(s):
Platform(const size_t platform_id): When using the OpenCL back-end, this initializes a new OpenCL platform (e.g. AMD SDK, Intel SDK, NVIDIA SDK) specified by the integerplatform_id. When using the CUDA back-end, this initializes the CUDA driver API. Theplatform_idargument is ignored: there is only one platform.
Public method(s):
-
std::string Name() const: Retrieves the name of the platform. -
std::string Vendor() const: Retrieves the name of the vendor of the platform. -
std::string Version() const: Retrieves which version of an OpenCL platform is used (OpenCL back-end) or which CUDA driver is used (CUDA back-end). -
size_t NumDevices() const: Retrieves the number of devices on this platform.
Non-member function(s):
std::vector<Platform> GetAllPlatforms(): Retrieves a vector containing all available platforms.
Constructor(s):
Device(const Platform &platform, const size_t device_id): Initializes a new OpenCL or CUDA device on the specified platform. Thedevice_iddefines which device should be selected.
Public method(s):
-
RawPlatformID PlatformID() const: Retrieves the rawcl_platform_idID of the platform used (OpenCL back-end) or a 0size_tin case of the CUDA back-end. -
std::string Version() const: Retrieves which version of the OpenCL standard is supported (OpenCL back-end) or which CUDA driver is used (CUDA back-end). -
size_t VersionNumber() const: The same as theVersion()method, but without text, just the numeric value. -
std::string Vendor() const: Retrieves the name of the vendor of the device. -
std::string Name() const: Retrieves the name of the device. -
std::string Type() const: Retrieves the type of the devices. Possible return values are 'CPU', 'GPU', 'accelerator', or 'default'. -
size_t MaxWorkGroupSize() const: Retrieves the maximum total number of threads in an OpenCL work-group or CUDA thread-block. -
size_t MaxWorkItemDimensions() const: Retrieves the maximum number of dimensions (e.g. 2D or 3D) in an OpenCL work-group or CUDA thread-block. -
unsigned long LocalMemSize() const: Retrieves the maximum amount of on-chip scratchpad memory ('local memory') available to a single OpenCL work-group or CUDA thread-block. -
std::string Capabilities() const: In case of the OpenCL back-end, this returns a list of the OpenCL extensions supported. For CUDA, this returns the device capability (e.g. SM 3.5). -
bool HasExtension(const std::string &extension) const: In case of the OpenCL back-end, queries whether a certain extension is present (as reported byCapabilities()). For CUDA, this always returns false. -
bool SupportsFP64() const: Returns whether or not double-precision floating-point 64-bit is supported by the device. -
bool SupportsFP16() const: Returns whether or not half-precision floating-point 16-bit is supported by the device. -
size_t CoreClock() const: Retrieves the device's core clock frequency in MHz. -
size_t ComputeUnits() const: Retrieves the number of compute units (OpenCL terminology) or multi-processors (CUDA terminology) in the device. -
unsigned long MemorySize() const: Retrieves the total global memory size. -
unsigned long MaxAllocSize() const: Retrieves the maximum amount of allocatable global memory per allocation. -
size_t MemoryClock() const: Retrieves the device's memory clock frequency in MHz (CUDA back-end) or 0 (OpenCL back-end). -
size_t MemoryBusWidth() const: Retrieves the device's memory bus-width in bits (CUDA back-end) or 0 (OpenCL back-end). -
bool IsLocalMemoryValid(const size_t local_mem_usage) const: Given a requested amount of local on-chip scratchpad memory, this method returns whether or not this is a valid configuration for this particular device. -
bool IsThreadConfigValid(const std::vector<size_t> &local) const: Given a requested OpenCL work-group or CUDA thread-block configurationlocal, this method returns whether or not this is a valid configuration for this particular device. -
bool IsCPU() const: Determines whether this device is of the CPU type. -
bool IsGPU() const: Determines whether this device is of the GPU type. -
bool IsAMD() const: Determines whether this device is of the AMD brand. -
bool IsNVIDIA() const: Determines whether this device is of the NVIDIA brand. -
bool IsIntel() const: Determines whether this device is of the Intel brand. -
bool IsARM() const: Determines whether this device is of the ARM brand. -
std::string AMDBoardName() const: Returns the value ofCL_DEVICE_BOARD_NAME_AMDif present. For the CUDA back-end, this always returns an empty string. -
std::string NVIDIAComputeCapability() const: Returns the compute capability of an NVIDIA GPU, e.g. SM3.5. For the CUDA back-end, this returns the same as theCapabilities()method.
Constructor(s):
Context(const Device &device): Initializes a new context on a given device. On top of this context, CLCudaAPI can create new programs, queues and buffers.
Constructor(s):
-
Program(const Context &context, std::string source): Creates a new OpenCL or CUDA program on a given context. A program is a collection of one or more device kernels which form a single compilation unit together. The device-code is passed as a string. Such a string can for example be generated, hard-coded, or read from file at run-time. If passed as an r-value (e.g. usingstd::move), the device-code string is moved instead of copied into the class' member variable. -
Program(const Device &device, const Context &context, const std::string& binary): As above, but now the program is constructed based on an already compiled IR or binary of the device kernels. This requires a context corresponding to the binary. This constructor for OpenCL is based on theclCreateProgramWithBinaryfunction.
Public method(s):
-
void Build(const Device &device, std::vector<std::string> &options): This method invokes the OpenCL or CUDA compiler to build the program at run-time for a specific target device. Depending on the back-end, specific options can be passed to the compiler in the form of theoptionsvector. Compilation errors generated by the run-time compiler result in anstd::runtime_errorexception, which can be caught. -
std::string GetBuildInfo(const Device &device) const: Retrieves all compiler warnings and errors generated by the build process. -
std::string GetIR() const: Retrieves the intermediate representation (IR) of the compiled program. When using the CUDA back-end, this returns the PTX-code. For the OpenCL back-end, this returns either an IR (e.g. PTX) or a binary. This is different per OpenCL implementation.
Constructor(s):
Queue(const Context &context, const Device &device): Creates a new queue to enqueue kernel launches and device memory operations. This is analogous to an OpenCL command queue or a CUDA stream.
Public method(s):
-
void Finish(Event &event) constandvoid Finish() const: Completes all tasks in the queue. In the case of the CUDA back-end, the first form additionally synchronizes on the specified event. -
Context GetContext() const: Retrieves the CUDA/OpenCL context associated with this queue. -
Device GetDevice() const: Retrieves the CUDA/OpenCL device associated with this queue.
Constructor(s):
BufferHost(const Context &, const size_t size): Initializes a new linear 1D memory buffer on the host of type T. This buffer is allocated with a fixed number of elements given bysize. Note that the buffer's elements are not initialized. In the case of the CUDA back-end, this host buffer is implemented as page-locked memory. The OpenCL back-end uses a regularstd::vectorcontainer.
Public method(s):
-
size_t GetSize() const: Retrieves the allocated size in bytes. -
Several
std::vectormethods: Adds some compatibility withstd::vectorby implementing thesize,begin,end,operator[], anddatamethods.
Constants(s):
enum class BufferAccess { kReadOnly, kWriteOnly, kReadWrite, kNotOwned }Defines the different access types for the buffers. Writing to a read-only buffer will throw an error, as will reading from a write-only buffer. A buffer which is of typekNotOwnedwill not be automatically freed afterwards.
Constructor(s):
-
Buffer(const Context &context, const BufferAccess access, const size_t size): Initializes a new linear 1D memory buffer on the device of type T. This buffer is allocated with a fixed number of elements given bysize. Note that the buffer's elements are not initialized. The buffer can be read-only, write-only, read-write, or not-owned as specified by theaccessargument. -
Buffer(const Context &context, const size_t size): As above, but now defaults to read-write access. -
template <typename Iterator> Buffer(const Context &context, const Queue &queue, Iterator start, Iterator end): Creates a new buffer based on data in a linear C++ container (such asstd::vector). The size is determined by the difference between the end and start iterators. This method both creates a new buffer and writes data to it. It synchronises the queue before returning.
Public method(s):
-
void ReadAsync(const Queue &queue, const size_t size, T* host) constandvoid ReadAsync(const Queue &queue, const size_t size, std::vector<T> &host)andvoid ReadAsync(const Queue &queue, const size_t size, BufferHost<T> &host): Copiessizeelements from the current device buffer to the target host buffer. The host buffer has to be pre-allocated with a size of at leastsizeelements. This method is a-synchronous: it can return before the copy operation is completed. -
void Read(const Queue &queue, const size_t size, T* host) constandvoid Read(const Queue &queue, const size_t size, std::vector<T> &host)andvoid Read(const Queue &queue, const size_t size, BufferHost<T> &host): As above, but now completes the operation before returning. -
void WriteAsync(const Queue &queue, const size_t size, const T* host)andvoid WriteAsync(const Queue &queue, const size_t size, const std::vector<T> &host)andvoid WriteAsync(const Queue &queue, const size_t size, const BufferHost<T> &host): Copiessizeelements from a host buffer to the current device buffer. The device buffer has to be pre-allocated with a size of at leastsizeelements. This method is a-synchronous: it can return before the copy operation is completed. -
void Write(const Queue &queue, const size_t size, const T* host)andvoid Write(const Queue &queue, const size_t size, const std::vector<T> &host)andvoid Write(const Queue &queue, const size_t size, const BufferHost<T> &host): As above, but now completes the operation before returning. -
void CopyToAsync(const Queue &queue, const size_t size, const Buffer<T> &destination) const: Copiessizeelements from the current device buffer to another device buffer given bydestination. The destination buffer has to be pre-allocated with a size of at leastsizeelements. This method is a-synchronous: it can return before the copy operation is completed. -
void CopyTo(const Queue &queue, const size_t size, const Buffer<T> &destination) const: As above, but now completes the operation before returning. -
size_t GetSize() const: Retrieves the allocated size in bytes.
Constructor(s):
Kernel(const Program &program, const std::string &name): Retrieves a new kernel from a compiled program. The kernel name is given as the stringname.
Public method(s):
-
template <typename T> void SetArgument(const size_t index, const T &value): Method to set a kernel argument (l-value or r-value). The argumentindexspecifies the position in the list of kernel arguments. The argumentvaluecan also be aCLCudaAPI::Buffer. -
template <typename... Args> void SetArguments(Args&... args): As above, but now sets all arguments in one go, starting at index 0. This overwrites any previous arguments (if any). The parameter packargstakes any number of arguments of different types, includingCLCudaAPI::Buffer. -
unsigned long LocalMemUsage(const Device &device) const: Retrieves the amount of on-chip scratchpad memory (local memory in OpenCL, shared memory in CUDA) required by this specific kernel. -
std::string GetFunctionName() const: Retrieves the name of the kernel (OpenCL only). -
Launch(const Queue &queue, const std::vector<size_t> &global, const std::vector<size_t> &local, Event &event): Launches a kernel onto the specified queue. This kernel launch is a-synchronous: this method can return before the device kernel is completed. The total number of threads launched is equal to theglobalvector; the number of threads per OpenCL work-group or CUDA thread-block is given by thelocalvector. The elapsed time is recorded into theeventargument. -
Launch(const Queue &queue, const std::vector<size_t> &global, const std::vector<size_t> &local, Event &event, std::vector<Event>& waitForEvents): As above, but now this kernel is only launched after the other specified events have finished (OpenCL only). Iflocalis empty, the kernel-size is determined automatically (OpenCL only).