CUDA Programming

[CUDA Programming Information and Resources]

 What is CUDA:

    Technology trends and advances in video games and graphics techniques have led to a need for extremely powerful dedicated computational hardware to perform the necessary calculations. Graphics hardware companies such as AMD/ATI and NVIDIA have developed graphics processors capable of massively parallel processing, with large throughput and memory bandwidth typically necessary for displaying high resolution graphics. However, these hardware devices have the potential to be re-purposed and used for other non-graphics-related work. NVIDIA provides a programming interface known as CUDA (Compute Unified Device Architecture) which allows direct programming of the NVIDIA hardware. Using NVIDIA devices to execute massively parallel algorithms will yield a many times speedup over sequential implementations on conventional CPUs.

CUDA Architecture:

Thread Organization

    In the CUDA processing paradigm (as well as other paradigms similar to stream processing) there is a notion of a ‘kernel’. A kernel is essentially a mini-program or subroutine. Kernels are the parallel programs to be run on the device (the NVIDIA graphics card inside the host system). A number of primitive ‘threads’ will simultaneously execute a kernel program. Batches of these primitive threads are organized into ‘thread blocks’. A thread block contains a specific number of primitive threads, chosen based on the amount of available shared memory, as well as the memory access latency hiding characteristics desired. The number of threads in a thread block is also limited by the architecture to a total of 512 threads per block. Each thread within a thread block can communicate efficiently using the shared memory scoped to each thread block. Using this shared memory, all threads can also sync within a thread block. Every thread within a thread block has its own thread ID. Thread blocks are conceptually organized into 1D, 2D or 3D arrays of threads for convenience.

    A ‘grid’ is a collection of thread blocks of the same thread dimensionality which all execute the same kernel. Grids are useful for computing a large number of threads in parallel since thread blocks are physically limited to only 512 threads per block. However, thread blocks within a grid may not communicate via shared memory, and consequently may not synchronize with one another.

    The above diagram demonstrates the thread hierarchy described. Here, a given kernel contains a 3x2 grid of thread blocks. Each thread block is a 4x3 block of threads, yielding a total of 72 threads executing said kernel.

Memory Hierarchy

    There are several levels of memory on the GPU device, each with distinct read and write characteristics. Every primitive thread has access to private ‘local’ memory as well as registers. This ‘local’ memory is really a misnomer; the memory is private to the thread, but is not stored local to the thread’s registers but rather off-chip in the global GDDR memory available on the graphics card. Every thread in a thread block also has access to a unified ‘shared memory’, shared among all threads for the life of that thread block. Finally, all threads have read/write access to ‘global memory’, which is located off-chip on the main GDDR memory module which therefore has the largest capacity but is the most costly to interact with. There also exists a read-only ‘constant’ and ‘texture’ memory, in the same location as the global memory.

    The global, constant and texture memory are optimized for different memory usage models. Global memory is not cached, though memory transactions may be ‘coalesced’ to hide the high memory access latency. These coalescence rules and behaviors are dependent on the particular device used. The read-only constant memory resides in the same location as global memory, but this memory may be cached. On a cache hit, regardless of the number of threads reading, the access time is that of a register access for each address being read. The read-only texture memory also resides in the same location as global memory, and is also cached. Texture memory differs from constant memory in that its caching policy specifically exploits 2D spatial locality. This is due to the use of ‘textures’ in 3D graphics; the use of 2D images to ‘texture’ the surface of 3D polygons are frequently read and benefit from caching the texture spatially.

    The above diagram shows the scope of each of the memory segments in the CUDA memory hierarchy. Registers and local memory are unique to a thread, shared memory is unique to a block, and global, constant, and texture memories exist across all blocks.


    CUDA capable GPUs are constructed with the “Tesla” architecture. CUDA applications may be run on any card which supports this architecture, but each GPU device may have different specifications, and therefore a slightly different set of supported features and a different number of available computational resources. When a kernel is invoked, each thread block executes on a ‘multiprocessor’. This multiprocessor contains the resources to support a certain number of threads. Specifically, each multiprocessor consists of:

    One or more thread blocks are assigned to a multiprocessor during the execution of a kernel. The CUDA runtime handles the dynamic scheduling of thread blocks on a group of multiprocessors. The scheduler will only assign a thread block to a multiprocessor when enough resources are available to support the thread block. Each block is split into SIMD (Single-Instruction Multiple-Data) groups of threads called ‘warps’. The SIMD unit creates, manages, schedules and executes 32 threads simultaneously to create a warp. Every warp is synchronous, and therefore care must be taken to ensure that certain threads within a warp do not take abnormally longer compared to other threads in that same warp, because the warp will only execute as fast as the slowest thread. There are a number of programming hints provided in the CUDA programming guide to help prevent warp divergence.

Compute Model

    Every CUDA-enabled device has a compute compatibility number. This number indicates a standard number of registers, memory size, etc. for all devices of that compatibility number. Compute compatibility numbers are backwards compatible.


Num Multiprocessors

Compute Compatibility

Tesla C870



GeForce 9800GT



GeForce GTX260



    The most recent compute model for the GTX200 has a number of significant improvements over previous compute models, including:


CUDA Programming Model:

API and System Variables

    To manage the thread and memory models described above, a set of API commands and system variables are provided by Nvidia. These include the Runtime API, used for managing host/device interfacing, the thread hierarchy identification variables, and several miscellaneous identifiers. The highest level of control is provided by the function type identifiers. Three such identifiers are provided: __host__, __global__, and __device__.

    The __host__ identifier denotes a function to be run on the general purpose CPU, or host. This is the default type, and therefore is not typically used explicitly. Host functions can do anything a normal C++ function can do, but they also call the Runtime API functions. These functions are primarily concerned with memory management and include cudaMalloc, cudaFree, cudaMemcpy, and many deviations to work with texture memory and to provide tailored functionality. To use these functions, standard pointers are used where the pointer value is an address on the device memory rather than the host system memory.

    Another Runtime API function of particular importance is kernel invocation. The syntax KernelFunctionName <<< GridDim, ThreadBlockDim >>> (@params) is used to specify the dimensions of the thread block and grid of thread blocks as described above. The kernel function is then run on the device and any memory allocated using cudaMalloc can be communicated by passing the device pointer as a parameter.

    Kernel functions use the __global__ identifier. This denotes which functions may be called from the host on the device. The final identifier, __device__, denotes functions that run on the device, but may not be called from the host. The most typical use of the __host__ identifier is when a particular function is needed by both the host and device and both the __device__ and __host__ identifiers are used to instruct the compiler to construct both versions.

    Within global and device functions, several system variables are used to manage threads and provide unique identification. threadIdx is a dim3 type with the unique identification of a thread within a thread block. Likewise, blockIdx provides unique identification of a thread block within a grid. With these identifiers and the __syncthreads() primitive that syncs all the threads in a thread block, execution can be effectively managed.

Shared Memory

    The final important identifier is the __shared__ designator that can be applied to variables declared in device functions. This denotes that the variable should be stored in the shared memory space on the multiprocessor, which is much faster to access than any of the memories located off-chip. While using shared memory is therefore often a good idea, there are several pitfalls which can adversely affect performance when shared memory is not used effectively.

    The most direct way that shared memory affects performance is that the number of concurrent thread blocks is limited by the available shared memory on each multiprocessor. The 16 kb of shared memory is split among each thread block, so if the shared memory used by each thread block is 4 kb, 4 thread blocks may run at a time. If that increases to 8 kb, only 2 thread blocks may run. This can be a mute point if the other limiting factors of threads/thread block and registers/thread already limit the number of concurrent blocks however.

    More obscurely, memory access patterns can have a large impact on performance. To ensure fast execution, the shared memory is organized into 16 memory banks that can be accessed in parallel. Since each concurrently executing group of threads, or warp, contains 32 threads, these banks are accessed in two phases, each time by a half-warp of 16 threads. When each of these threads accesses a separate bank or all threads access the same element, the access is as fast as using registers, however when multiple threads attempt to access different values in the same bank, conflicts occur. The result of conflicts is that accesses must be serialized. Therefore if each thread accesses every fourth 32-bit value, all accesses will occur on 4 of the 16 banks, and 4 levels of serialization will be required, effectively decreasing performance by a factor of 4. This can be avoided by ensuring the step size between memory accesses by threads within a half-warp does not divide evenly into 16. This can be accomplished with any odd step size.