- Goal : Develop 4-bit primitive kernels using CUTLASS
 
- 00_basic_gemm
 - This kernel computes the general matrix product (GEMM) using single-precision floating-point arithmetic and assumes all matrices have column-major layout.
 
- 01_cutlass_utilities
 - These utilities are supporting components for managing tensor and matrix memory allocations, initializing and comparing results, and computing reference output.
 
- 02_dump_reg_shmem
 - Demonstrates the CUTLASS debugging tools for dumping fragments and shared memory.
 - dumping : Recording the state of memory at a specific point in time.
 
- 05_batched_gemm
 - strided batched gemm : The batch is specified by pointers to the first matrices of the batch and the stride between consecutive matrices of the batch.
 - array gemm : The batch is specified by copying pointers to all matrices of the batch to device memory.
 
    cd example_{number}
    mkdir build
    cd build
    cmake ..
    make
    ./main
- CUTLASS : https://github.com/NVIDIA/cutlass