Programming GPUs for general computing
As the programmability and performance of modern GPUs increase, many application developers are looking to graphics hardware to solve computationally intensive problems previously performed on general-purpose CPUs. Despite the promise of general-purpose GPU computing, traditional graphics APIs still abstract the GPU as a rendering device, involving textures, triangles and pixels. Mapping an algorithm to use these primitives is not a simple operation, even for the most advanced graphics developers.
Fortunately, GPU-based computing is conceptually straightforward, and a variety of high-level languages and software tools can simplify GPU programming. First, however, the developer must understand how a GPU is used in rendering and then identify the components that can be used for computation.
When rendering a frame, the GPU receives geometry data from the host system in the form of triangle vertices. These vertices are processed by a programmable vertex processor that performs any per-vertex computation, such as geometric transformations or lighting calculations. Next, the triangles are converted by a fixed-function rasterizer unit into individual "fragments" to be drawn on the screen. Before it is written to the screen, each fragment goes to a programmable fragment processor that computes the final color value.
The calculations to compute the fragment color typically involve a collection of vector math operations combined with memory fetches from "textures," a type of image that stores the surface material color. The final rendered scene is displayed on the output device or can be copied back to the host processor from the GPU's memory.
The programmable vertex and fragment processors offer many of the same capabilities and instruction sets. However, most GPU programmers use only the fragment processor for general-purpose computing, since it generally provides better performance and outputs directly to memory.
A simple example of computing with the fragment processor is adding two vectors. First, we issue a large triangle that covers as many fragments as the vectors have elements. The generated fragments are processed by the fragment processor, which implicitly executes our code in an SIMD parallel fashion. For each fragment, our vector-add code uses the fragment's position to fetch the two corresponding elements from memory, adds them, and assigns the result to the output color. The output memory then contains the vector sum, which we are free to use in the next computation.
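The vector add described above can be mimicked on the CPU to make the execution model concrete. The following is only a sketch in Python, not GPU code: the names fragment_program and run_pass are hypothetical, plain lists stand in for textures, and the loop that plays the role of the rasterizer runs sequentially where a GPU would run the fragments in parallel.

```python
# CPU emulation of a fragment-processor vector add (illustrative only).
# fragment_program and run_pass are hypothetical names, not a graphics API.

def fragment_program(pos, tex_a, tex_b):
    """Runs once per fragment; pos is the fragment's position."""
    # Two memory fetches addressed by the fragment position, then one add.
    return tex_a[pos] + tex_b[pos]


def run_pass(size, program, *textures):
    """Stands in for the rasterizer: generates one fragment per output element.

    The write address of each result is its position in the returned list;
    the program itself never chooses where its output goes.
    """
    return [program(pos, *textures) for pos in range(size)]


a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
result = run_pass(len(a), fragment_program, a, b)
print(result)
```

Note that fragment_program receives only its position and read access to the inputs, which is what lets every invocation run independently.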
The programmable fragment processor's ISA is similar to DSP or Pentium SSE instruction sets and consists of four-way SIMD instructions and registers. Instructions include standard math operations, memory fetch instructions and a few special-purpose graphics instructions.
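To illustrate what "four-way SIMD" means here, the sketch below models one instruction operating on all four components of a register at once. It is a Python approximation under stated assumptions: the function name mad4 is hypothetical, and multiply-add is used only as a representative math operation, not as a listing of any particular GPU's instruction set.

```python
# Illustrative model of a four-way SIMD instruction: one operation,
# four lanes. mad4 is a hypothetical name chosen for this sketch.

def mad4(a, b, c):
    """Component-wise multiply-add over four-element registers."""
    assert len(a) == len(b) == len(c) == 4
    return tuple(x * y + z for x, y, z in zip(a, b, c))


color = (0.5, 0.25, 1.0, 1.0)   # e.g. red, green, blue, alpha
scale = (2.0, 2.0, 2.0, 1.0)
bias = (0.0, 0.5, -1.0, 0.0)
out = mad4(color, scale, bias)
print(out)
```

A general-purpose program can get the same benefit by packing four scalar values into each register, so that one instruction does four useful operations.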
GPU vs. DSP
GPUs differ in some major ways from DSP architectures. All GPU computation is performed with floating-point arithmetic; currently, there are no bit or integer math instructions. Also, since the GPU is designed to work with images, the memory system is in effect a 2D segmented memory space: each address consists of a segment number (i.e., the image to read from) and a 2D address (the x,y position in the image).
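One practical consequence of the 2D segmented memory space is that a linear array has to be laid out as an image and addressed by (segment, x, y). The Python sketch below shows one way that translation might look. The helper names pack, to_2d and fetch are hypothetical, the row-major layout and the texture width of 4 are assumptions made for the example, and real hardware imposes its own size limits.

```python
# Sketch of addressing a 1D array through a 2D segmented memory space.
# pack, to_2d and fetch are hypothetical helpers; the layout is row-major.

def pack(values, width, fill=0.0):
    """Lay a 1D array out as rows of a 2D image, padding the last row."""
    rows = []
    for start in range(0, len(values), width):
        row = list(values[start:start + width])
        row += [fill] * (width - len(row))
        rows.append(row)
    return rows


def to_2d(index, width):
    """Translate a linear index into an (x, y) position in the image."""
    return index % width, index // width


def fetch(textures, segment, x, y):
    """Memory read: segment selects the image, (x, y) the element in it."""
    return textures[segment][y][x]


data = [float(i) for i in range(10)]
textures = {0: pack(data, width=4)}   # segment 0 holds our array

x, y = to_2d(6, width=4)
value = fetch(textures, 0, x, y)
print((x, y), value)
```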
Moreover, there are no indirect write instructions. The output write address is fixed by the rasterizer and cannot be changed by our program. This can be particularly challenging for algorithms that naturally scatter into memory. Finally, no communication is allowed between the processing of the different fragments. In effect, the fragment processor is an SIMD data-parallel execution unit, independently executing our code on all fragments.
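Because a fragment cannot choose its write address, an algorithm that scatters has to be restated as a gather: each output position reads whatever inputs belong to it. The Python sketch below contrasts the two forms on a small accumulation problem. The function names are hypothetical, and the brute-force search inside gather_fragment is chosen for clarity rather than efficiency; it is one possible restatement, not the only one.

```python
# Scatter versus gather, emulated on the CPU (illustrative only).

def scatter_add(values, targets, out_size):
    """CPU-style scatter: each input picks its own write address."""
    out = [0.0] * out_size
    for v, t in zip(values, targets):
        out[t] += v          # indirect write, not available to a fragment
    return out


def gather_fragment(pos, values, targets):
    """One fragment: writes only to pos, but may read from anywhere."""
    total = 0.0
    for v, t in zip(values, targets):
        if t == pos:         # does this input belong to my output slot?
            total += v
    return total


def gather_add(values, targets, out_size):
    """Runs the gather program once per output position."""
    return [gather_fragment(pos, values, targets) for pos in range(out_size)]


values = [1.0, 2.0, 3.0, 4.0]
targets = [2, 0, 2, 1]
print(scatter_add(values, targets, 3))
print(gather_add(values, targets, 3))
```

Both versions produce the same result, but only the gather form respects the fixed output address and needs no communication between fragments.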
Despite these constraints, a variety of algorithms can be efficiently implemented on the GPU, ranging from linear algebra and signal processing to numerical simulation. Though conceptually simple, computing with GPUs can be frustrating for first-time users because of the need for graphics-specific knowledge. But some software tools can help. Two high-level shading languages, Cg and HLSL, let users write C-like code that compiles to fragment program assembly. Compilers for those languages are freely available from the Nvidia and Microsoft Web sites. Though the languages greatly simplify the writing of shader assembly code, applications still must use the graphics API to set up and issue computation.
Brook is a high-level language explicitly designed for GPU computing without requiring graphics knowledge. Thus, it is a good starting point for first-time GPU developers. Brook is an extension of C. It incorporates simple data-parallel-programming constructs that map directly to the GPU.
Data stored and operated on by the GPU is expressed as "streams," which are similar to standard C arrays. "Kernels" are functions that operate over the streams. Calling a kernel function on a set of input streams performs an implicit loop over the stream elements, invoking the body of the kernel for each element. Brook also provides a mechanism for reductions such as computing the sum, max or product of all of the elements in a stream.
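The stream, kernel and reduction ideas can be approximated in a few lines of ordinary code. The sketch below is written in Python and is not Brook syntax: run_kernel, run_reduction and the saxpy kernel are hypothetical names, lists stand in for streams, and the loops run sequentially on the CPU.

```python
# Emulation of the stream/kernel/reduction model (not Brook syntax).
from functools import reduce


def run_kernel(kernel, *streams):
    """Implicit loop: invoke the kernel body once per stream element."""
    return [kernel(*elems) for elems in zip(*streams)]


def run_reduction(op, stream, initial):
    """Combine all elements of a stream into a single value."""
    return reduce(op, stream, initial)


def saxpy(x, y, alpha=2.0):
    """Kernel body: sees one element from each input stream."""
    return alpha * x + y


xs = [1.0, 2.0, 3.0]
ys = [10.0, 20.0, 30.0]

zs = run_kernel(saxpy, xs, ys)
total = run_reduction(lambda acc, v: acc + v, zs, 0.0)
largest = run_reduction(max, zs, float("-inf"))
print(zs, total, largest)
```

The caller never writes the loop over elements. Leaving the iteration to the runtime is what allows it to be mapped onto parallel fragment processing.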
The Brook compiler is a source-to-source compiler that maps a user's kernel code to the fragment assembly language and generates C++ stub code, which can be linked to a larger application. This permits users to port only the performance-critical portions of their applications to Brook. Brook also completely hides all aspects of the graphics API and virtualizes many of the more unfamiliar aspects of the GPU, such as the 2D-memory system.
Using the ATI X800XT and Nvidia GeForce 6800 Ultra GPUs, we have seen many such applications achieve up to a 7x speedup over their equivalent cache-blocked, SSE-optimized assembly implementations on the Pentium 4.
Until now, users interested in computing with the GPU have struggled to map algorithms to graphics primitives. The advent of high-level programming languages makes it easier for even novice programmers to capture GPU performance benefits. As these tools provide easy access to its computational power, the GPU will keep evolving not just as a rendering engine, but as the PC's principal compute engine.
Researcher, Graphics Laboratory