What is CUDA:
Technology trends and advances in video games and graphics techniques
have led to a need for extremely powerful dedicated computational
hardware to perform the necessary calculations. Graphics hardware
companies such as AMD/ATI and NVIDIA have developed graphics processors
capable of massively parallel processing, with large throughput and
memory bandwidth typically necessary for displaying high resolution
graphics. However, these hardware devices have the potential to be
re-purposed and used for other non-graphics-related work. NVIDIA
provides a programming interface known as CUDA (Compute Unified Device
Architecture) which allows direct programming of the NVIDIA hardware.
Executing massively parallel algorithms on NVIDIA devices can yield
speedups of many times over sequential implementations on conventional
CPUs.
CUDA Architecture:
Thread Organization
In the CUDA processing paradigm (as well as other paradigms similar to
stream processing) there is a notion of a ‘kernel’. A kernel is
essentially a mini-program or subroutine. Kernels are the parallel
programs to be run on the device (the NVIDIA graphics card inside the
host system). A number of primitive ‘threads’ will simultaneously
execute a kernel program. Batches of these primitive threads are
organized into ‘thread blocks’. A thread block contains a specific
number of primitive threads, chosen based on the amount of available
shared memory, as well as the memory access latency hiding
characteristics desired. The number of threads in a thread block is also
limited by the architecture to a total of 512 threads per block. Threads
within a thread block can communicate efficiently through the shared
memory scoped to that block, and they can also synchronize with one
another. Every thread within a thread block has its own thread ID.
Thread blocks are conceptually organized into 1D, 2D, or 3D arrays of
threads for convenience.
A ‘grid’ is a collection of thread blocks of the same thread
dimensionality which all execute the same kernel. Grids make it possible
to launch far more threads in parallel than the 512-thread limit of a
single thread block allows. However, thread blocks
within a grid may not communicate via shared memory, and consequently
may not synchronize with one another.
The above diagram demonstrates the thread hierarchy described. Here, a
given kernel contains a 3x2 grid of thread blocks. Each thread block is
a 4x3 block of threads, yielding a total of 72 threads executing said
kernel.
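As an illustration (not part of the original text; the kernel name
exampleKernel and the output array are hypothetical), the 3x2 grid of
4x3 thread blocks described above could be expressed in CUDA C as
follows:

// Hypothetical kernel: each of the 72 threads computes its own unique index.
__global__ void exampleKernel(int *output)
{
    // Position of this thread within its 4x3 thread block
    int tx = threadIdx.x;   // 0..3
    int ty = threadIdx.y;   // 0..2

    // Position of this thread block within the 3x2 grid
    int bx = blockIdx.x;    // 0..2
    int by = blockIdx.y;    // 0..1

    // Flatten the 2D coordinates into one global thread index (0..71)
    int threadsPerBlock = blockDim.x * blockDim.y;      // 12
    int blockIndex      = by * gridDim.x + bx;          // 0..5
    int globalIndex     = blockIndex * threadsPerBlock + ty * blockDim.x + tx;

    output[globalIndex] = globalIndex;
}

// Host-side launch (inside a host function), matching the diagram;
// d_output is assumed to be a device pointer allocated with cudaMalloc.
dim3 gridSize(3, 2);    // 3x2 grid of thread blocks
dim3 blockSize(4, 3);   // 4x3 threads per block
exampleKernel<<<gridSize, blockSize>>>(d_output);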
Memory Hierarchy
There are several levels of memory on the GPU device, each with distinct
read and write characteristics. Every primitive thread has access to
private ‘local’ memory as well as registers. The name ‘local’ is
something of a misnomer: the memory is private to the thread, but it is
not stored on-chip near the thread’s registers; instead it resides
off-chip in the global GDDR memory on the graphics card. Every thread in a thread
block also has access to a unified ‘shared memory’, shared among all
threads for the life of that thread block. Finally, all threads have
read/write access to ‘global memory’, which is located off-chip in the
main GDDR memory module; it therefore has the largest capacity but is
also the most costly to access. There are also read-only ‘constant’ and
‘texture’ memories, which reside in the same location as global memory.
The global, constant and texture memory are optimized for different
memory usage models. Global memory is not cached, though memory accesses
may be ‘coalesced’ into fewer, wider transactions to mitigate the high
memory access latency. These coalescing rules and behaviors depend on the particular
device used. The read-only constant memory resides in the same location
as global memory, but this memory may be cached. On a cache hit,
regardless of the number of threads reading, the access time is that of
a register access for each address being read. The read-only texture
memory also resides in the same location as global memory, and is also
cached. Texture memory differs from constant memory in that its caching
policy specifically exploits 2D spatial locality. This reflects how
‘textures’ are used in 3D graphics: the 2D images used to texture the
surfaces of 3D polygons are read frequently and with strong spatial
locality, so caching them spatially pays off.
The above diagram shows the scope of each of the memory segments in the
CUDA memory hierarchy. Registers and local memory are unique to a
thread, shared memory is unique to a block, and global, constant, and
texture memories exist across all blocks.
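The following fragment is a rough sketch (not part of the original text;
all names are hypothetical) of how these memory spaces typically appear
in CUDA C:

// Read-only constant memory, visible to every thread in every block;
// it would be filled from the host with cudaMemcpyToSymbol.
__constant__ float coefficients[16];

__global__ void memorySpacesExample(float *globalData)
{
    // Registers: ordinary automatic variables live in registers when possible
    float accumulator = 0.0f;

    // Per-thread automatic array: large or dynamically indexed arrays
    // are typically placed in off-chip 'local' memory
    float scratch[64];

    // Shared memory: one on-chip copy per thread block
    // (assumes blockDim.x <= 128)
    __shared__ float tile[128];

    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Global memory: read and written through the pointer passed from the host
    tile[threadIdx.x] = globalData[idx];
    __syncthreads();

    scratch[0] = tile[threadIdx.x] * coefficients[0];
    accumulator += scratch[0];
    globalData[idx] = accumulator;
}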
Multiprocessors
CUDA-capable GPUs are built on the “Tesla” architecture. CUDA
applications may be run on any card which supports this architecture,
but each GPU device may have different specifications, and therefore a
slightly different set of supported features and a different number of
available computational resources. When a kernel is invoked, each thread
block executes on a ‘multiprocessor’. This multiprocessor contains the
resources to support a certain number of threads. Specifically, each
multiprocessor consists of:
- 8 Scalar Processor cores
- 2 special function units for transcendentals
- 1 multithreaded instruction unit
- On-chip shared memory
One or more thread blocks are assigned to a multiprocessor during the
execution of a kernel. The CUDA runtime handles the dynamic scheduling
of thread blocks on a group of multiprocessors. The scheduler will only
assign a thread block to a multiprocessor when enough resources are
available to support the thread block. Each block is split into SIMD
(Single-Instruction, Multiple-Data) groups of threads called ‘warps’;
the multiprocessor creates, manages, schedules, and executes threads in
groups of 32, so each warp contains 32 threads. A warp executes in
lockstep, so care must be taken that some threads within a warp do not
take much longer than the others, because the warp completes only as
fast as its slowest thread. The CUDA programming guide provides a number
of programming hints to help prevent this warp divergence.
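As an illustration (not part of the original text; the kernel names are
hypothetical), the first kernel below diverges within each warp, while
the second branches on warp boundaries so every thread of a warp takes
the same path:

__global__ void divergentKernel(float *data)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Divergent: even and odd threads of the same warp take different
    // paths, so the two paths are serialized within the warp.
    if (threadIdx.x % 2 == 0)
        data[idx] = data[idx] * 2.0f;
    else
        data[idx] = data[idx] + 1.0f;
}

__global__ void uniformKernel(float *data)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Non-divergent: the condition depends on the warp index
    // (threadIdx.x / 32), so every thread in a given warp agrees.
    if ((threadIdx.x / 32) % 2 == 0)
        data[idx] = data[idx] * 2.0f;
    else
        data[idx] = data[idx] + 1.0f;
}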
Compute Model
Every CUDA-enabled device has a compute capability number. This number
indicates a standard number of registers, memory sizes, and other
features common to all devices of that compute capability. Compute
capabilities are backwards compatible.
Device         | Num Multiprocessors | Compute Capability
Tesla C870     | 16                  | 1.0
GeForce 9800GT | 14                  | 1.1
GeForce GTX260 | 24                  | 1.3
The most recent compute capability, introduced with the GTX 200 series,
has a number of significant improvements over previous compute
capabilities, including:
- Double precision support
- Higher memory bandwidth
- Twice the number of available registers
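As a hedged sketch (not part of the original text), a host program can
query the compute capability and multiprocessor count of the installed
device through the Runtime API:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    // Query the properties of device 0
    cudaGetDeviceProperties(&prop, 0);

    printf("Device name:          %s\n", prop.name);
    printf("Compute capability:   %d.%d\n", prop.major, prop.minor);
    printf("Multiprocessors:      %d\n", prop.multiProcessorCount);
    printf("Shared mem per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Registers per block:  %d\n", prop.regsPerBlock);
    return 0;
}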
CUDA Programming Model:
API and System Variables
To manage the thread and memory models described above, a set of API
commands and system variables are provided by NVIDIA. These include the
Runtime API, used for managing host/device interfacing, the thread
hierarchy identification variables, and several miscellaneous
identifiers. The highest level of control is provided by the function
type identifiers. Three such identifiers are provided: __host__,
__global__, and __device__.
The __host__ identifier denotes a function to be run on the general
purpose CPU, or host. This is the default type, and therefore is not
typically used explicitly. Host functions can do anything a normal C++
function can do, but they can also call the Runtime API functions. These
functions are primarily concerned with memory management and include
cudaMalloc, cudaFree, and cudaMemcpy, along with many variations for
working with texture memory and providing tailored functionality. These
functions take standard pointers whose values are addresses in device
memory rather than host system memory.
Another construct of particular importance is kernel invocation. The
syntax KernelFunctionName <<< GridDim, ThreadBlockDim >>> (params)
specifies the dimensions of the grid of thread blocks and of each thread
block, as described above. The kernel function is then run on the
device, and any memory allocated using cudaMalloc can be passed to it by
providing the device pointer as a parameter.
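A minimal host-side sketch (not part of the original text; the kernel
scaleKernel and the sizes are hypothetical) tying these calls together:

#include <cuda_runtime.h>

__global__ void scaleKernel(float *data, float factor)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    data[idx] = data[idx] * factor;
}

int main()
{
    const int n = 1024;
    float hostData[n];
    for (int i = 0; i < n; ++i) hostData[i] = (float)i;

    // Allocate device (global) memory and copy the input to it
    float *deviceData = 0;
    cudaMalloc((void**)&deviceData, n * sizeof(float));
    cudaMemcpy(deviceData, hostData, n * sizeof(float), cudaMemcpyHostToDevice);

    // Launch a grid of 4 thread blocks, 256 threads each (4 * 256 = 1024)
    scaleKernel<<<4, 256>>>(deviceData, 2.0f);

    // Copy the result back and release the device memory
    cudaMemcpy(hostData, deviceData, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(deviceData);
    return 0;
}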
Kernel functions use the __global__ identifier. This marks functions
that are called from the host but executed on the device. The final
identifier, __device__, denotes functions that run on the device but may
not be called from the host. The most typical explicit use of the
__host__ identifier is when a particular function is needed by both the
host and the device; in that case the __device__ and __host__
identifiers are combined to instruct the compiler to construct both
versions.
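A brief illustration (not part of the original text; the function names
are hypothetical) of the three qualifiers:

// Runs on the device, callable only from __global__ or __device__ functions
__device__ float deviceHelper(float x)
{
    return x * x;
}

// Compiled for both host and device, callable from either side
__host__ __device__ float clampToUnit(float x)
{
    if (x < 0.0f) return 0.0f;
    if (x > 1.0f) return 1.0f;
    return x;
}

// Callable from host code, runs on the device; this is a kernel
__global__ void myKernel(float *out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = clampToUnit(deviceHelper(0.5f));
}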
Within global and device functions, several system variables are used to
manage threads and provide unique identification. threadIdx is a
built-in three-component vector (of type uint3) that uniquely identifies
a thread within its thread block. Likewise, blockIdx uniquely identifies
a thread block within a grid, while blockDim and gridDim (of type dim3)
hold the dimensions of each thread block and of the grid. With these
identifiers and the __syncthreads() primitive, which synchronizes all
the threads in a thread block, execution can be managed effectively.
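A minimal sketch (not part of the original text; the kernel name is
hypothetical) of these variables and __syncthreads() working together to
reverse each block's segment of an array:

__global__ void reverseWithinBlock(int *data)
{
    __shared__ int temp[512];               // one element per thread (max 512 threads/block)

    int tid = threadIdx.x;
    int offset = blockIdx.x * blockDim.x;   // start of this block's segment

    // Stage this block's segment in shared memory
    temp[tid] = data[offset + tid];

    // Wait until every thread in the block has written its element
    __syncthreads();

    // Read back in reverse order within the block
    data[offset + tid] = temp[blockDim.x - 1 - tid];
}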
Shared Memory
The final important identifier is the __shared__ designator that can be
applied to variables declared in device functions. This denotes that the
variable should be stored in the shared memory space on the
multiprocessor, which is much faster to access than any of the memories
located off-chip. While using shared memory is therefore often a good
idea, there are several pitfalls which can adversely affect performance
when shared memory is not used effectively.
The most direct way that shared memory affects performance is that the
number of concurrent thread blocks is limited by the available shared
memory on each multiprocessor. The 16 KB of shared memory is divided
among the resident thread blocks, so if each thread block uses 4 KB of
shared memory, 4 thread blocks may run at a time; if that increases to
8 KB, only 2 thread blocks may run. However, this can be a moot point if
the other limiting factors, threads per block and registers per thread,
already restrict the number of concurrent blocks.
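As a small sketch of this arithmetic (not part of the original text; the
buffer size is hypothetical), a kernel's static __shared__ declarations
determine its per-block shared memory footprint:

__global__ void occupancyExample(float *out)
{
    // 1024 floats * 4 bytes = 4 KB of shared memory per thread block.
    // With 16 KB per multiprocessor, at most 16 KB / 4 KB = 4 such blocks
    // can be resident on one multiprocessor at a time (ignoring the
    // separate limits on threads and registers).
    __shared__ float buffer[1024];

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    buffer[threadIdx.x] = (float)idx;
    __syncthreads();
    out[idx] = buffer[threadIdx.x];
}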
More obscurely, memory access patterns can have a large impact on
performance. To ensure fast execution, the shared memory is organized
into 16 memory banks that can be accessed in parallel. Since each
concurrently executing group of threads, or warp, contains 32 threads,
these banks are accessed in two phases, each time by a half-warp of 16
threads. When each of these threads accesses a separate bank, or when
all threads access the same element, the access is as fast as using
registers; however, when multiple threads attempt to access different
values in the same bank, the conflicting accesses must be serialized.
For example, if each thread accesses every fourth 32-bit value, all
accesses fall on only 4 of the 16 banks, four-way serialization is
required, and effective performance drops by a factor of 4. This can be
avoided by ensuring that the stride between the accesses of threads
within a half-warp shares no common factor with 16, which is
accomplished by any odd stride.
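As an illustrative sketch (not part of the original text; all names are
hypothetical), the common padding trick below turns a column stride of
16 words, where every access of a half-warp lands in the same bank, into
an odd stride of 17, which spreads the accesses across all 16 banks:

__global__ void bankConflictExample(float *out)
{
    // 16x16 tile: threads of a half-warp reading a column access
    // addresses 16 words apart, i.e. the same bank 16 times (16-way conflict).
    __shared__ float conflicted[16][16];

    // Padding each row to 17 words makes the column stride odd, so the
    // 16 accesses of a half-warp hit 16 different banks.
    __shared__ float padded[16][17];

    int x = threadIdx.x;   // 0..15
    int y = threadIdx.y;   // 0..15

    conflicted[y][x] = (float)(y * 16 + x);
    padded[y][x]     = (float)(y * 16 + x);
    __syncthreads();

    // Column-wise reads: conflicted[x][y] serializes, padded[x][y] does not.
    out[y * 16 + x] = conflicted[x][y] + padded[x][y];
}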