CUDA Programming

[CUDA Programming Information and Resources]

CUDA Tutorial 1: A simple CUDA app

Download:  RIT_CUDA_Tutorial_1.zip

A simple CUDA program:

    This tutorial will walk through a simple CUDA application. To make it as simple as possible, we have chosen to implement element-wise multiplication of two arrays of the same size. Make sure to download the related Visual Studio project to follow along with this document.

    First, you may notice that there is no .cpp file in this project, only a .cu file. CUDA programs are typically written in plain C, but it is possible to interface C++ code with CUDA. Any code that runs on a GPU (commonly referred to as a device) is always part of a kernel. A kernel is essentially a function written in C that also contains some CUDA-specific instructions. Future versions of CUDA are expected to provide more extensive support for other languages such as C++ and Fortran. Note that kernels must be contained in a .cu file, so to keep things simple we will use just one .cu file in this example.
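
    As a rough sketch (the function and parameter names here are illustrative, not taken from the project), a kernel is declared like a C function with an extra qualifier:

        // Illustrative only: a minimal kernel declaration in a .cu file.
        // The __global__ qualifier (covered later in this tutorial) marks a
        // function that runs on the device but is launched from the host.
        __global__ void exampleKernel(int *data)
        {
            // device code operating on data[] goes here
        }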

    Starting at the top of the MultiplyArrays.cu file we find a commonly included header, cutil.h. In this program, it provides us with CUDA timing functions (for benchmarking purposes). Following the #includes, there are a few #defines. The first defines the number of threads per block. CUDA introduces the concept of grids of blocks, blocks of threads, and plain threads. In CUDA, many threads within a block execute concurrently, and multiple blocks, each with the same number of threads, run the same kernel. In addition, CUDA-capable GPUs contain several multiprocessors. Each multiprocessor can only run a certain number of threads at once, so threads are grouped into blocks (a maximum of 512 threads per block), and one or more blocks may run concurrently on a multiprocessor. A block is retired once all of its threads have been processed. In this particular example, we have chosen 256 threads per block and 262144 data elements to be worked on, which results in 1024 blocks. Note that this is a contrived example; frequently there will be a variable number of data elements to be worked on, and the number of blocks is almost always calculated from the amount of data to be computed and the number of threads per block. Blocks executing the same kernel are grouped into grids: each grid contains all of the blocks (of identical configuration, which we will detail in a moment) operating on a particular kernel. This application has only one grid. These threads, blocks, grids, and kernels make up the fundamental structure of CUDA, and we will continually revisit and elaborate on these terms as we move forward.
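
    As a sketch of that arithmetic (the macro names are ours for illustration; the project's #defines may be named differently), the configuration described above boils down to:

        // Assumed names for illustration; the project's #defines may differ.
        #define THREADS_PER_BLOCK 256        // threads in each block
        #define NUM_ELEMENTS      262144     // data elements to process

        // 262144 elements / 256 threads per block = 1024 blocks in the grid.
        #define NUM_BLOCKS (NUM_ELEMENTS / THREADS_PER_BLOCK)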

    Skipping to the main() function, there are a number of pointers to the data arrays used in the computation. We use a standard malloc() call to allocate space on the host, and cudaMalloc() calls to allocate space on the device. This CUDA-specific allocation function reserves memory in global memory, which corresponds to the DRAM on the GPU's circuit board and is the largest memory space available; it also has the slowest access time. Second, we have shared memory, which is private to each block. This memory has very fast access times, but is only available in small amounts (a maximum of 16KB per block). There are other memory spaces as well, which we will discuss later since they are outside the scope of this particular example.
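
    A minimal sketch of the two kinds of allocation, using hypothetical variable names and the #defines assumed above:

        // Hypothetical names; the project's variables may differ.
        int *h_in_1 = (int *)malloc(NUM_ELEMENTS * sizeof(int));   // host (CPU) memory
        int *d_in_1 = NULL;
        cudaMalloc((void **)&d_in_1, NUM_ELEMENTS * sizeof(int));  // device global memory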

    After initializing memory on the host, we start a timer to measure the GPU computation time. We then call cudaMemcpy() to copy data from the host arrays to their counterparts on the device. Note that we pass a pointer to a pointer (a void**) to cudaMalloc(); this is because cudaMalloc() sets the pointer to an address in the appropriate location in global device memory. Now that the device has the necessary data, we can invoke the kernel.
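
    The host-to-device copy looks roughly like this, with the names assumed above:

        // Copy NUM_ELEMENTS integers from the host array into device global memory.
        cudaMemcpy(d_in_1, h_in_1, NUM_ELEMENTS * sizeof(int),
                   cudaMemcpyHostToDevice);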

    The kernel is invoked with two sets of parameters. The first set contains CUDA-specific values and has the form <<<Dg, Db, Ns>>>, where Dg specifies the grid dimensions, Db the block dimensions, and Ns the shared memory size. Ns is an optional parameter: it lets us declare the amount of shared memory available to the kernel outside of the kernel's definition, so the same kernel can be used with different block sizes. This, however, is an advanced topic, and this simple application only requires blocks of a single size. The second set of parameters is entirely up to the programmer and consists of whatever is to be passed to the kernel, much like the arguments of a regular C function. In this case, we pass pointers to the device's output and input arrays.
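
    Under the naming assumptions above, the launch looks something like the following sketch (the actual argument order in the project may differ):

        // Grid of NUM_BLOCKS blocks, THREADS_PER_BLOCK threads each, and one
        // integer of shared memory per thread (the optional Ns argument).
        multiplyArrays<<<NUM_BLOCKS, THREADS_PER_BLOCK,
                         THREADS_PER_BLOCK * sizeof(int)>>>(d_out, d_in_1, d_in_2);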

    Let’s continue with the main function, and then we will cover what happens in the kernel.

    After the kernel call, we call cudaThreadSynchronize(), which prevents the main function from continuing until the device finishes executing its kernel. Once the kernel is finished, we call cudaMemcpy() to copy the result back to the host, stop the timer, and verify the result. We then clean up after ourselves by calling cudaFree() to release the allocated memory on the device. For the purposes of this tutorial, we also compute and time the same array multiplication on the CPU so that we can compare the execution times of the CPU and the GPU. Finally, we print the results and finish by freeing memory on the host.
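
    The wrap-up steps amount to something like the following sketch (h_out is an assumed name for the host-side result array):

        cudaThreadSynchronize();                       // wait for the kernel to finish
        cudaMemcpy(h_out, d_out, NUM_ELEMENTS * sizeof(int),
                   cudaMemcpyDeviceToHost);            // copy the product array back
        cudaFree(d_in_1);                              // release device global memory
        cudaFree(d_in_2);
        cudaFree(d_out);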

    Kernels are easy to spot in code, since their function signature is always preceded by the __global__ qualifier. Within any kernel, it is almost always necessary to obtain the current thread ID and use it to map that thread to a portion of the overall computation. Thread IDs are computed from the special CUDA-specific variables blockDim, blockIdx, and threadIdx. These variables have x, y, and z components, which can be useful when the problem has multiple dimensions; this example only uses the x component. Blocks can be declared with the same number of threads but different dimensions for x, y, and z. Only blocks with the exact same configuration (x, y, and z dimensions) are grouped together in the same grid.
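
    For a one-dimensional problem like this one, the standard index calculation is:

        // Each block covers blockDim.x consecutive elements; blockIdx.x selects
        // the block's chunk and threadIdx.x selects the element within it.
        int thread_id = blockIdx.x * blockDim.x + threadIdx.x;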

    Our particular kernel is named multiplyArrays(), and it takes three integer pointers. The parameters d_in_1 and d_in_2 point to the multiplicand and multiplier arrays, and d_out points to the product array. The first thing we do is find the current thread ID and store it in thread_id. The easiest way to perform the multiplication is simply to use thread_id as a global memory index for both input and output. In this manner, each active thread is assigned one pair of elements to multiply. However, this is not an optimal solution, so that code is left commented out.
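
    A sketch of what that commented-out, global-memory-only approach amounts to (our reconstruction, under the naming assumptions above, not the project's verbatim code):

        // Global-memory-only version: each thread multiplies one pair of elements.
        __global__ void multiplyArraysGlobal(int *d_out, int *d_in_1, int *d_in_2)
        {
            int thread_id = blockIdx.x * blockDim.x + threadIdx.x;
            d_out[thread_id] = d_in_1[thread_id] * d_in_2[thread_id];
        }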

    The better approach is to take advantage of shared memory instead of using global memory alone; think of it as performing operations out of a CPU's cache rather than main memory. We begin by declaring an array of integers, named s_data[], with the __shared__ qualifier. This qualifier ensures that s_data[] resides in shared memory, which is where we want to do our multiplication. Note that we do not need to specify the size of s_data, since it is determined behind the scenes when the kernel is launched with a value for Ns. If we had not specified a value for Ns, we would have to hard-code a size for s_data within the kernel itself. But back to our problem: we now wish to copy data from the incoming arrays into shared memory, and we must pay special attention to the indexing. Since each block has its own chunk of shared memory and contains its own collection of threads, we must map portions of the input arrays to the available space in each block's shared memory. This is done by indexing into s_data using threadIdx.x (whose values range from 0 to 255) and indexing into the input arrays using thread_id (whose values range from 0 to 262143). Using this scheme, we fill s_data with data from d_in_1 and multiply it by the corresponding elements of d_in_2. Note the special __syncthreads() call, which ensures that every thread in the block has finished writing to shared memory before any thread proceeds. Lastly, the results are copied from shared memory to d_out, where they can then be transferred to the host via a cudaMemcpy() call outside of the kernel. This paragraph contains a lot of meat, so it may be worth reading through more than once!
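
    Putting the pieces together, a sketch of the shared-memory version looks like this (again our reconstruction for illustration; the project's multiplyArrays() may differ in details):

        __global__ void multiplyArrays(int *d_out, int *d_in_1, int *d_in_2)
        {
            // The size comes from the Ns launch parameter, so none is given here.
            extern __shared__ int s_data[];

            int thread_id = blockIdx.x * blockDim.x + threadIdx.x;

            // Stage this block's slice of d_in_1 in shared memory.
            s_data[threadIdx.x] = d_in_1[thread_id];
            __syncthreads();    // all writes to s_data finish before we continue

            // Multiply in shared memory, then write the product to global memory.
            s_data[threadIdx.x] *= d_in_2[thread_id];
            d_out[thread_id] = s_data[threadIdx.x];
        }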

Compilation:

    Before compiling and running, make sure the appropriate CUTIL DLL files are in your path. You can select the platform (x86 or x64) and the build configuration (Debug, Release, etc.) in the selection boxes at the top of Visual Studio. Thorough instructions on how to do this are in the Setup section of the site. When ready, press Ctrl+F5 to save, compile, and run the program, or just F5 to debug. When debugging, the console window will not remain visible after program termination; to have it remain visible, run without debugging via Debug --> Start Without Debugging.

Oh no! Where is the speedup?

    Why is the performance of the GPU worse than that of the CPU? Let's take a look at a couple of graphs of CPU versus GPU execution time for varying data sizes.

What happened?

    It seems the GPU suffered a substantial slowdown: the CPU processed the arrays significantly faster than the GPU could process the data and return the results. It turns out that processing data in parallel does not, by itself, guarantee a large speedup. The combination of the amount of work to be done and the temporal locality (reuse of the data elements) is, in this case, not great enough to overcome the overhead of transferring data between the host and the device. Lesson learned: memory transfers between the device and the host take a significant amount of time. Therefore, maximize the amount of work being done on the device and minimize the frequency of data transfers between the host and the device. It is important to understand the relationship between your code and the hardware. Profiling code is vital to achieving optimal performance, so make sure to experiment!

    In the next tutorial we’ll take a look at something with a bit more computational complexity and far more spatial locality to make the work worth the GPU’s time.