Over the last few years, there has been a fundamental change in the way manufacturers of general-purpose processors improve the performance of their products. Physical and technical limitations no longer allow manufacturers to increase the clock speeds of their chips as they did in previous decades. Performance improvements will have to come mainly from the higher transistor counts that smaller chip features make possible. Since developments in Instruction-Level Parallelism (ILP) are lagging, parallelism at a coarser level, across multiple threads and cores, is the only way forward.
Intel believes in many-core processors that support tens or hundreds of threads. After testing the waters with hyper-threading and dual-core technologies, CPU manufacturers have now unmistakably entered the multi-core era. In the long term, general-purpose processors will consist of tens, hundreds, or even thousands of cores.
nVidia says it is already there, with its graphics processors containing hundreds of cores and supporting thousands of mini-threads. GPUs, currently separate chips on motherboards or on specialized graphics cards, are increasingly being used by application programmers. For specific problems, these programmers have found mappings onto the graphics engines that result in speedups of two orders of magnitude. Manufacturers of graphics processors have recognized this opportunity and are increasingly making their products accessible to programmers outside graphics.
From a programmer's perspective, CPUs offer a multi-threaded model that allows many control flow instructions, while GPUs offer a rigid stream processing model that puts a large performance penalty on changes in control flow. For the former, the complexity lies in the application logic. New programming languages and extensions to existing languages are currently being developed that support both explicit and implicit parallelism. Stream processing, by contrast, only works for problems that contain massive parallelism with limited communication between elements.
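The control flow penalty can be made concrete with a toy cost model. The sketch below is an illustration only, not a real GPU API: it assumes a group of lanes that share one instruction stream, so the group has to step through every branch that at least one lane takes. The function names and the cost units are hypothetical.

```python
# Toy model of lockstep (stream-style) execution; not a real GPU API.
# All names and cost units here are illustrative assumptions.

def lockstep_cost(predicates, cost_then, cost_else):
    """Cost for a group of lanes that share a single instruction stream.

    The group executes each branch that at least one lane takes,
    so a divergent group pays for both branches.
    """
    cost = 0
    if any(predicates):
        cost += cost_then  # lanes that chose 'else' sit idle here
    if not all(predicates):
        cost += cost_else  # lanes that chose 'then' sit idle here
    return cost


def independent_cost(predicates, cost_then, cost_else):
    """Cost when every lane is an independent thread: the slowest lane."""
    return max(cost_then if p else cost_else for p in predicates)


uniform = [True] * 8           # all lanes take the same branch
divergent = [True, False] * 4  # lanes alternate between branches

uniform_lockstep = lockstep_cost(uniform, 10, 10)
divergent_lockstep = lockstep_cost(divergent, 10, 10)
divergent_independent = independent_cost(divergent, 10, 10)
```

Under these assumptions, the divergent group costs twice as much in lockstep as the uniform one, while independent threads would not pay that penalty. Real hardware is more subtle, but the model shows why branch-heavy code maps poorly onto a stream processor.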
AMD's Fusion strategy brings x86 cores and a GPU together on a single die, possibly extended with other specialized processing engines. Meanwhile, completely new architectures and topologies are being researched by Intel (Larrabee processor), IBM (Cell Broadband Engine), and Sun (UltraSPARC T series), all searching for the next hardware/software paradigm.
In HPC, performance per watt has become the most important design parameter. Current commodity GPUs provide a cheap computing resource in which as few transistors as possible dissipate power without contributing to the actual computation. We are waiting for graphics processors to become a standard part of the product portfolios of manufacturers of high-end computer systems. Standard building blocks can then be bought together with support, training, and other professional services.
Although these developments in hardware bring huge advantages to every research field that uses High-Performance Computing (HPC), they are of particular interest for research in Artificial Intelligence (AI). The speedup of one or two orders of magnitude that is generally reported across research fields when using GPUs is also representative of neural networks, which natively rely on massively parallel processing.
After all, algorithms in AI, which increasingly mimic their biological originals, are massively parallel by nature. This holds for all types of neural networks we simulate on computers, but also for visual systems and for all sorts of object recognition, feature extraction, and other processing that takes place on visual data.
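To illustrate why a simulated neural network parallelizes so naturally, the following minimal sketch computes one layer in two ways. It assumes a simple feed-forward layer with a logistic activation; the function names, weights, and inputs are made up for illustration. Each neuron reads the shared inputs and writes only its own output, so the neurons can be evaluated in any order or at the same time. The thread pool only demonstrates this independence; it is not meant as a performance measurement.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def neuron_output(weights, bias, inputs):
    """Weighted sum followed by a logistic activation for one neuron."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-total))


def layer_sequential(weight_rows, biases, inputs):
    """Evaluate the neurons of one layer one after another."""
    return [neuron_output(w, b, inputs) for w, b in zip(weight_rows, biases)]


def layer_parallel(weight_rows, biases, inputs, workers=4):
    """Evaluate the same neurons concurrently.

    No neuron depends on another neuron's result within the layer,
    so the work can be split across workers without communication.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(
            lambda wb: neuron_output(wb[0], wb[1], inputs),
            zip(weight_rows, biases),
        ))


# Illustrative values only.
weight_rows = [[0.0, 0.0], [1.0, -1.0], [0.5, 0.5]]
biases = [0.0, 0.0, -1.0]
inputs = [1.0, 1.0]

sequential = layer_sequential(weight_rows, biases, inputs)
parallel = layer_parallel(weight_rows, biases, inputs)
```

Both versions produce the same list of outputs. On a GPU, each neuron (or each multiply-add within it) would typically be assigned to its own lightweight thread, which is the mapping that stream processing favors.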
The latter in particular promises to benefit greatly from developments in graphics processors. Some researchers in this area report speedups of up to three orders of magnitude. In HPC terms, an improvement of that size amounts to the next generational step, the kind of step at stake when DARPA (Defense Advanced Research Projects Agency) asks companies like IBM, Cray, and SGI to develop a long-term vision and conduct fundamental research into the next era of HPC.
Another field that can be expected to profit from these developments is robotics. The ability to operate autonomously and independently requires intelligence, compactness, and mobility. These requirements relate directly to higher densities (at both the silicon and the system level), higher performance, and lower power consumption, all of which drive current developments in hardware.
Even with relatively small computer systems, several researchers in this area now report being able to run applications in real time or to provide interactive visualization where this could not be done before, which represents not only a quantitative but also a qualitative breakthrough.
In combination with the continuing pressure on power dissipation and density, general-purpose computing on graphics processing units (GPGPU) provides tremendous opportunities for robotics and for related areas such as the development of intelligent portable devices or prostheses.
At this moment, however, GPGPU is not yet a mature technology. Over the next few years, graphics processors will become better suited to generic stream processing applications. Work still needs to be done on generic memory access and double-precision floating-point operations.
Furthermore, until recently, only proprietary programming toolkits tied to a specific GPU were available. nVidia's CUDA toolkit has become the de facto standard, but it is not portable. Today, all important players in this market, namely AMD, IBM, Intel, and nVidia, support OpenCL, the programming language initiated by Apple. However, its performance is not yet as good as CUDA's. In addition, source code still contains topology-specific programming, which inhibits the portability of applications across hardware platforms.
Despite these limitations, OpenCL will be the standard language for GPGPU (and possibly many-core) computing in the near future. Even if applications are not portable, programmers will have a single language and development platform to work with.