
Matrix-to-matrix multiplication is a very popular algorithm for showing the real performance of any graphics card. This is because the operation is very costly in terms of compute: the number of multiplications required scales with the matrix size as N³. Even a very naive CUDA kernel will perform very well in comparison to a typical CPU implementation.

The goal of this post, however, is to show that GPUs have not only the main "global" memory but other types of memory as well, namely shared memory. The main difference between the two is that shared memory is much smaller and is located directly on the chip (very similar to block RAM in an FPGA). As a result, this memory is several times faster than global memory. This matters because some algorithms' performance is simply bound by the overall memory throughput, so if one can somehow reduce the global-memory traffic, the algorithm will run faster.

Matrix multiplication is a classic example of inefficient memory usage. For one addition and one multiplication we perform two memory accesses, so the compute-to-memory ratio is very low. As a result, the kernel will not reach optimum performance. On the other hand, matrices can be quite large and will not fit into shared memory (roughly 64 KB, depending on the architecture). There are, however, tiling techniques which allow us to do some tricks and compute the matrix product through shared memory. Let's first see the naive kernel example:

```
/***************************************************
 * @brief: Naive Matrix Matrix Multiplication in CUDA
 **************************************************/
__global__ void MM_Mul_Default(int* a, int* b, int* c, int N)
{
    int tix = threadIdx.x + blockDim.x * blockIdx.x; // Global column index
    int tiy = threadIdx.y + blockDim.y * blockIdx.y; // Global row index

    if ((tix < N) && (tiy < N)) {
        int tid = tix + tiy * N; // Linear index of the output element

        int sum = 0;
        for (int i = 0; i < N; i++) {
            int temp1 = a[tiy * N + i]; // Row of a
            int temp2 = b[N * i + tix]; // Col of b
            sum += temp1 * temp2;
        }

        c[tid] = sum;
    }
}
```
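For completeness, this is roughly how such a kernel can be launched from the host. This is a sketch, not code from the post: buffer names are illustrative, initialization and error checking are omitted, and it assumes the naive kernel signature above.

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

int main()
{
    const int N = 1024;
    const size_t bytes = N * N * sizeof(int);

    // Host buffers (filling them with data is omitted here).
    int* h_a = (int*)malloc(bytes);
    int* h_b = (int*)malloc(bytes);
    int* h_c = (int*)malloc(bytes);

    // Device buffers and input transfer.
    int *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // One thread per output element; the grid is rounded up so that
    // N does not have to be a multiple of the block size.
    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
    MM_Mul_Default<<<grid, block>>>(d_a, d_b, d_c, N);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```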

This kernel is pretty easy to understand and probably even self-explanatory. What we can do, however, is tile the operation into several pieces / phases, as shown below:

I.e., we will compute the dot products for each element in chunks. How big? That depends: they must fit into the shared memory. In fact, finding the right tile size and thread count per block is not easy and requires testing and tuning. Still, even after tuning, the result will likely be tuned to a single architecture, so there is no point in tuning to the maximum unless that is clearly our goal. A very good and recommended start is a 16×16 thread block (256 threads per block in total). Unlike the previous kernel, the optimized kernel with shared memory has two loops: the first loops across the tiles, i.e. (Dimension / TILE) times, and the second loops across the shared memory. Note the __syncthreads() calls, which are extremely important and maintain synchronicity across the thread block.

```
/******************************************************************************
 * @brief Optimized Matrix Matrix Multiplication by utilization of Shared Memory
 *****************************************************************************/
__global__ void MM_Mul_Optimized(int* a, int* b, int* c, int N, int Phases)
{
    // -----------------------------
    // ------- Grid Indexing -------
    unsigned int tx  = threadIdx.x;
    unsigned int ty  = threadIdx.y;
    unsigned int tix = tx + blockDim.x * blockIdx.x;
    unsigned int tiy = ty + blockDim.y * blockIdx.y;

    // ------------------------------------------------
    // ------ Statically Declared Shared Memory -------
    __shared__ int matrix_a[TILE][TILE];
    __shared__ int matrix_b[TILE][TILE];

    // ---------------------------------------
    // ------- Check Processing Offset -------
    // N is assumed to be a multiple of TILE, so every thread of a block
    // takes the same branch and the barriers below are safe.
    if ((tix < N) && (tiy < N)) {
        unsigned int Lindex = tiy * N + tix;
        int sum = 0;

        // --------------------------
        // ------- Phase Loop -------
        for (int i = 0; i < Phases; i++) {

            // --------------------------------------------------
            // ------- Load Phase Data into Shared Memory -------
            matrix_a[ty][tx] = a[tiy * N             + i * TILE + tx];
            matrix_b[ty][tx] = b[(i * TILE + ty) * N + tix];
            __syncthreads(); // Wait until the whole tile is loaded

            // -----------------------------------------
            // ------- Loop Across Shared Memory -------
            for (unsigned int j = 0; j < TILE; j++) {
                int Temp1 = matrix_a[ty][j]; // Row
                int Temp2 = matrix_b[j][tx]; // Col
                sum += Temp1 * Temp2;
            }
            __syncthreads(); // Do not overwrite the tile while others still read it
        }

        c[Lindex] = sum;
    }
}
```