In Chapter 2, Heterogeneous Data Parallel Computing, we learned to write a simple CUDA C++ program that launches a one-dimensional grid of threads by calling a kernel function to operate on elements of one-dimensional arrays. A kernel specifies the statements that are executed by each individual thread in the grid. In this chapter, we will look more generally at how threads are organized and learn how threads and blocks can be used to process multidimensional arrays. Multiple examples will be used throughout the chapter, including converting a colored image to a grayscale image, blurring an image, and matrix multiplication. These examples also serve to familiarize the reader with reasoning about data parallelism before we proceed to discuss the GPU architecture, memory organization, and performance optimizations in the upcoming chapters.

3.1 Multidimensional grid organization

In general, a grid is a 3D array of blocks, and each block is a 3D array of threads. When we call a kernel, we need to specify the size of the grid and of the blocks in each dimension. This is done with the execution configuration parameters, written between <<< and >>> in the kernel call.

The first parameter specifies the dimensions of the grid in number of blocks, and the second specifies the dimensions of each block in number of threads.

Each parameter has type dim3, which is an integer vector type with three elements: x, y, and z.

The programmer can use fewer than three dimensions by setting the size of each unused dimension to 1.

For example, the following host code can be used to call the vecAddKernel() kernel function and generate a 1D grid that consists of 32 blocks, each of which consists of 128 threads. The total number of threads in the grid is 128*32 = 4096:

dim3 dimGrid(32,1,1);
dim3 dimBlock(128,1,1);
vecAddKernel<<<dimGrid, dimBlock>>>();
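
The thread count above is the product of all six dimension sizes. The following is a minimal sketch of that arithmetic; totalThreads is a hypothetical helper introduced here for illustration and is not part of CUDA. It also relies on the fact that the dim3 constructor defaults any unspecified component to 1, so dim3(32) is equivalent to dim3(32, 1, 1).

```cuda
#include <cuda_runtime.h>

// Hypothetical helper (not a CUDA API): total number of threads that a
// given grid/block configuration would launch.
unsigned long long totalThreads(dim3 grid, dim3 block) {
    // Widen before multiplying so large configurations do not overflow.
    unsigned long long blocks =
        (unsigned long long)grid.x * grid.y * grid.z;
    unsigned long long threadsPerBlock =
        (unsigned long long)block.x * block.y * block.z;
    return blocks * threadsPerBlock;
}
```

For the configuration above, totalThreads(dim3(32, 1, 1), dim3(128, 1, 1)) evaluates to 4096. A 2D configuration such as dim3(4, 2) blocks of dim3(16, 16) threads leaves z at its default of 1 in both cases.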

The variable names dimGrid and dimBlock are arbitrary; any legal variable names can be used. The dimensions can also be computed from other variables, such as the number of elements n:

dim3 dimGrid(ceil(n/256.0), 1, 1);
dim3 dimBlock(256, 1, 1);
vecAddKernel<<<dimGrid, dimBlock>>>();
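
Rounding up with ceil(n/256.0) gives enough blocks to cover all n elements, but when n is not a multiple of 256 the last block contains surplus threads. The sketch below shows one way to handle that, under several assumptions: the kernel signature and body are not given in this section and are assumed here, blocksFor and launchVecAdd are hypothetical helpers, and A_d, B_d, and C_d are assumed to be device pointers already allocated with cudaMalloc. The sketch also assumes n is positive, since a grid dimension of 0 is an invalid launch configuration.

```cuda
#include <cuda_runtime.h>

// Assumed kernel: this section shows only the launch, not the kernel itself.
__global__ void vecAddKernel(const float* A, const float* B, float* C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index in a 1D grid
    if (i < n) {  // surplus threads in the last block do nothing
        C[i] = A[i] + B[i];
    }
}

// Hypothetical helper: number of blocks needed to cover n elements.
// Integer ceiling division; for positive n it matches ceil(n / 256.0)
// when threadsPerBlock is 256.
unsigned int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Hypothetical wrapper: A_d, B_d, C_d are assumed to be device pointers
// that each hold at least n floats.
cudaError_t launchVecAdd(const float* A_d, const float* B_d, float* C_d, int n) {
    dim3 dimGrid(blocksFor(n, 256), 1, 1);
    dim3 dimBlock(256, 1, 1);
    vecAddKernel<<<dimGrid, dimBlock>>>(A_d, B_d, C_d, n);
    return cudaGetLastError();  // reports an invalid launch configuration
}
```

With n = 1000, blocksFor(1000, 256) is 4, so 1024 threads are launched and the last 24 fail the i < n test. Integer ceiling division is used here instead of ceil() only to avoid the round trip through floating point.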

3.2 Mapping threads to multidimensional data

3.3 Image blur: a more complex kernel

3.4 Matrix multiplication

3.5 Summary

Exercises