Many people wonder how Graphics Processing Units (GPUs) ended up with so much computing power. The answer lies in their history: they have always been used as accelerators for graphics and rendering, especially video games. In fact, GPUs owe their power to all the gamers around the world 🙂. More seriously, most games need to compute a new scene at least 25 times per second. Let's say the resolution is Full HD (1920 x 1080); that gives us roughly 50 million pixels per second to be computed. Unlike CPUs, GPUs are made to process data in parallel. It means that while a CPU will iterate one after another across all 50M pixels, a GPU is capable of "computing all pixels at the same time" – with some limitations.

There are various architectures for the purpose of data processing acceleration:

  • SIMT – Single Instruction, Multiple Threads
  • SIMD – Single Instruction, Multiple Data

CPUs tend to use SIMD on a per-core basis. The most common instruction sets are SSE/AVX, which basically process several data samples simultaneously; this is commonly referred to as vectorized processing. GPUs with their SIMT model work a different way – they utilize thousands of threads, each executing the same instruction on its own data. That's why the performance of a GPU scales roughly linearly with the number of CUDA cores. The main weakness of GPUs is the speed of the communication interface – PCIe – which turns out to be the bottleneck for some applications, and is even the reason why computing small sets of data might not be efficient on GPUs. In fact, the more workload you have for the GPU, the better.
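To get an intuition for vectorized processing, here is a small C++ sketch of my own (plain loops, not real SSE/AVX intrinsics): a scalar loop next to a manually 4-wide loop of the kind a SIMD unit executes as a single instruction per group of lanes.

```cpp
#include <cstddef>

// Scalar: one addition per iteration, like a plain CPU loop.
void add_scalar(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

// "Vectorized": four additions per step. Real SIMD hardware (SSE/AVX)
// performs such a group of additions in a single instruction.
void add_vec4(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {       // process 4 lanes at once
        out[i]     = a[i]     + b[i];
        out[i + 1] = a[i + 1] + b[i + 1];
        out[i + 2] = a[i + 2] + b[i + 2];
        out[i + 3] = a[i + 3] + b[i + 3];
    }
    for (; i < n; ++i)                 // scalar tail for leftover elements
        out[i] = a[i] + b[i];
}
```

Both functions produce identical results; the point is only that the second one exposes the data parallelism that SIMD (and, at a much larger scale, SIMT) exploits.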

Starting with NVIDIA CUDA is quite simple: one basically needs to install the CUDA Toolkit and MS Visual Studio (the Community Edition is sufficient). Take care to select the proper toolkit version – always check that the toolkit supports your hardware, otherwise you may end up searching for "why the hell it doesn't work". I know that beginners like me (back in the day) will ignore this, but at least they will learn it the hard way.

The most important thing to understand in CUDA is the KERNEL GRID of threads. The kernel grid contains blocks in the x, y and z directions (I rarely ever use a z dimension other than 1). What's more, each block consists of threads that may also span these three dimensions. That's why most kernels start with the calculation of thread indexes (the unique index of each thread within the grid) as follows:

unsigned int tix = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int tiy = blockIdx.y * blockDim.y + threadIdx.y;

Some algorithms do, however, require linear indexing across the entire grid (i.e. not indexing as [X][Y], but only as [INDEX]). Therefore, one will have to compute a linear index across the entire thread grid as: tiy * blockDim.x * gridDim.x + tix;. Since it frequently happens that there are more threads running than data elements, most kernels include a check against the real processing dimensions – this will spare you from illegal memory accesses 🙂:

if ((tix < Nx) && (tiy < Ny)) {
  // Do computing
}

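To check that the linear-index formula behaves as expected, here is a small host-side C++ sketch. The grid geometry is mocked with constants of my choosing, since in a real kernel these values come from CUDA's built-in blockDim and gridDim variables:

```cpp
// Mock grid geometry: 2 blocks of 4 threads in x (grid width = 8),
// arbitrary height in y. In a kernel these come from gridDim/blockDim.
constexpr unsigned gridDimX  = 2;
constexpr unsigned blockDimX = 4;

// Linear index across the grid: row (tiy) times grid width, plus column (tix).
unsigned linear_index(unsigned tix, unsigned tiy) {
    return tiy * (blockDimX * gridDimX) + tix;
}
```

Each (tix, tiy) pair maps to a unique flat index, walking row by row across the grid.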
The entire code that actually computes the vector addition is shown below. Note that I am lazy as always: I did not properly calculate the grid size. I defined "N" – the vector size – to be 100 and then launched the kernel with just one block. This block has dimensions 1024x1x1, and that's why it runs as expected with these settings. It will, however, fail if you increase N to 1025 🙂. Also note that I tend to use the cudaHostAlloc function – an alternative to malloc or new[]. The reason is that if you allocate memory this way, the transfer of data will usually be faster, as it allocates non-pageable (pinned) memory.
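For the record, the proper way to size the grid is a ceiling division of N by the block size, so that every element gets a thread and the `if (tix < N)` guard handles the overshoot. A minimal sketch:

```cpp
// Number of blocks needed so that blocks * threadsPerBlock >= n
// (classic ceiling division via integer arithmetic).
int blocks_for(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}
```

With `dim3 Blocks(blocks_for(N, 1024), 1, 1)` the launch below would keep working for any N, not just N ≤ 1024.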

#include <cuda_runtime.h>
#include <iostream>
#include <limits>

/** @brief Vector addition kernel */
__global__ void Vector_Kernel(int* In_A, int* In_B, int* Output, int N) {
    unsigned int tix = threadIdx.x + blockIdx.x * blockDim.x;

    if (tix < N) {
        Output[tix] = In_A[tix] + In_B[tix];
    }
}

int main(int argc, char** argv) {

    int N = 100;

    void* Dev_A = nullptr;
    void* Dev_B = nullptr;
    void* Dev_C = nullptr;

    void* Host_A = nullptr;
    void* Host_B = nullptr;
    void* Host_C = nullptr;

    // ------------------------------------
    // -------- Allocate GPU Memory -------
    cudaMalloc((void**)&Dev_A, N * sizeof(int));
    cudaMalloc((void**)&Dev_B, N * sizeof(int));
    cudaMalloc((void**)&Dev_C, N * sizeof(int));

    // ----------------------------------
    // ------- Allocate CPU Memory ------
    cudaHostAlloc((void**)&Host_A, N * sizeof(int), cudaHostAllocDefault);
    cudaHostAlloc((void**)&Host_B, N * sizeof(int), cudaHostAllocDefault);
    cudaHostAlloc((void**)&Host_C, N * sizeof(int), cudaHostAllocDefault);

    // ------------------------------
    // ------- Recast Pointers ------
    int* HA = static_cast<int*>(Host_A);
    int* HB = static_cast<int*>(Host_B);
    int* HC = static_cast<int*>(Host_C);

    // ----------------------------------
    // ------ Randomize Input Data ------
    for (int i = 0; i < N; i++) {
        HA[i] = i;
        HB[i] = i;
    }

    // -----------------------------------
    // ------- Copy Data To Device -------
    cudaMemcpy(Dev_A, Host_A, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(Dev_B, Host_B, N * sizeof(int), cudaMemcpyHostToDevice);

    // -----------------------------
    // ------- Launch Kernel -------
    dim3 Threads = dim3(1024, 1, 1);
    dim3 Blocks = dim3(1, 1, 1);
    int* DA = static_cast<int*>(Dev_A);
    int* DB = static_cast<int*>(Dev_B);
    int* DC = static_cast<int*>(Dev_C);
    Vector_Kernel<<<Blocks, Threads>>>(DA, DB, DC, N);

    // -------------------------------------
    // ------ Copy Data back to Host -------
    cudaMemcpy(Host_C, Dev_C, N * sizeof(int), cudaMemcpyDeviceToHost);

    // -----------------------------
    // ------ Verify Results -------
    for (int i = 0; i < N; i++) {
        if (HC[i] != (2 * i)) {
            std::cout << "Error: Invalid Result! [ " << HC[i] << " | " << (2 * i) << " ]" << std::endl;
        }
    }

    // ------------------------------------
    // ------- Free Allocated Memory ------
    cudaFree(Dev_A);
    cudaFree(Dev_B);
    cudaFree(Dev_C);
    cudaFreeHost(Host_A);
    cudaFreeHost(Host_B);
    cudaFreeHost(Host_C);

    std::cout << "Press Enter to Exit" << std::endl;
    std::cin.ignore(std::numeric_limits<std::streamsize>::max(), '\n');

    return 0;
}

Finally, for starters, I always recommend having a good book around. Personally, I highly doubt that there is a better book than CUDA by Example. I have read various books on parallel computing and CUDA/OpenCL programming, but I personally believe that whoever reads that book is at least 65% of a good CUDA programmer 🙂, or, if nothing else, gains a very solid background for starting to program in CUDA. Don't be disappointed if your first application runs slower than the CPU implementation! It happened to me several times, but keep going and learn from your mistakes! 🙈