EXPERT VOICE
June 2010

Graphical accelerators: the future of Extreme Computing, by Jean-Pierre Panziera, Director of Extreme Computing Product Strategy, Bull


A graduate of Ecole des Mines de Paris, Jean-Pierre Panziera began his career at Elf-Aquitaine (now Total) in the geophysics research department. After five years, he moved to Silicon Valley in California as a software developer. In 1990, he joined SGI where he worked in the Applications Group, in various roles: Manager, Principal Engineer and Chief Engineer. Jean-Pierre joined Bull in November 2009, to spearhead the Group’s HPC product strategy.

Even though it is currently enjoying significant expansion, High-Performance Computing (HPC) is still a niche market. Its evolution has been boosted by the adoption of new technologies, sometimes those developed for other areas of professional or even mass-market use. To achieve today’s heights of performance – the most powerful computer in the world at the end of 2009 was 18,000 times more powerful than its 1993 counterpart – engineers have demonstrated not only a great deal of skill and ingenuity, but also a good measure of opportunism.

That’s precisely what happened in the early 1990s, when supercomputers took a giant leap forward by exploiting the potential first of RISC processors (which had initially been designed for workstations), and then the x86 processors found in enterprise servers and even business PCs. Nowadays it is another product, seemingly poles apart from the world of scientific and industrial research, that is exciting the interest of HPC experts: GPUs, or Graphics Processing Units, initially designed and developed for computer game addicts!

Players of these games are extremely demanding, and they represent such a substantial market that manufacturers are prepared to invest a great deal to keep them happy. Just over a decade ago, processors dedicated to rendering 3D graphics started to appear on the market, offering extraordinarily realistic results. Capable of dynamically converting the position of each object into pixels of different colors and textures, they offered much greater raw processing power than general-purpose processors; and that’s what has attracted the attention of the HPC specialists.

From GPUs to GP-GPUs

Using GPUs for more traditional processing tasks, other than 3D rendering, has given birth to a whole new acronym, GP-GPUs (General Purpose – Graphics Processing Units). The first experiments involved some impressive feats of programming, because GPUs could only interpret graphical-type instructions written in DirectX or OpenGL. So the data had to be stored in the memory reserved for textures, and the processing tasks had to be broken down into a series of geometric transformations (rotations, translations, dilations…). With the growing demand for better rendering, it became necessary to deal with more sophisticated algorithms, for example to calculate the trajectory, intensity and color of beams of light coming towards the viewer, taking into account the physical laws of refraction and reflection. Rather than adding new instructions to the GPU, it seemed more relevant to standardize the operations performed by the GPU using a broader set of instructions, with the various graphics rendering options programmed by the graphics software. So GPUs became more general-purpose and could potentially be used for computer simulation.

The limitations of earlier generation GPUs

Despite this progress, and as well as being difficult to program, GPUs had other limitations that made them hard to use outside the field for which they were originally designed. Graphics processing operations are carried out in 32-bit (single) precision, whereas HPC more often requires double precision (64-bit). What’s more, GPUs did not conform to the IEEE 754 standard for floating-point calculations. While the resulting approximations are negligible when it comes to graphics rendering, this is far from the case for scientific computing applications. In addition, the smaller internal memory of GPUs was not protected against errors by ECC-type mechanisms. Even if it completely changes the value of one pixel, a random error in a single bit can pass unnoticed in an image; but a single erroneous bit can completely change the result of a computer simulation: an unacceptable risk for HPC applications. At the architecture level, data was exchanged between the main memory and the GPU using a synchronous mechanism, which meant that these transfers could not be overlapped with the processing tasks carried out by the GPU. And the GPUs are connected to PCIe buses, which can result in bottlenecks. Finally, it’s worth mentioning that GPUs consume much more electrical power than CPUs (around twice as much), which naturally poses problems in terms of power supply, cooling and running costs.
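To make the precision and bit-error arguments concrete, here is a minimal Python sketch. It is an illustration only, not a model of any particular GPU: single precision is emulated by rounding every partial sum to 32 bits, and the helper names (to_float32, accumulate, flip_bit), the step value 0.1 and the iteration count are assumptions chosen for the example.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def accumulate(step: float, count: int, single: bool) -> float:
    """Add `step` to a running total `count` times.

    With single=True, the step and every partial sum are rounded to
    32 bits, which emulates single-precision accumulation.
    """
    if single:
        step = to_float32(step)
    total = 0.0
    for _ in range(count):
        total += step
        if single:
            total = to_float32(total)
    return total


def flip_bit(x: float, bit: int) -> float:
    """Flip one bit (0 = least significant) of the 64-bit encoding of x."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


# Hypothetical workload: sum 0.1 one hundred thousand times (exact answer 10000).
COUNT = 100_000
EXACT = 10_000.0
err64 = abs(accumulate(0.1, COUNT, single=False) - EXACT)
err32 = abs(accumulate(0.1, COUNT, single=True) - EXACT)

if __name__ == "__main__":
    print(f"double-precision error: {err64:.3e}")
    print(f"single-precision error: {err32:.3e}")
    # A flipped low-order mantissa bit is invisible in practice...
    print("1.0 with bit 0 flipped: ", flip_bit(1.0, 0))
    # ...but a flipped exponent bit halves the value.
    print("1.0 with bit 52 flipped:", flip_bit(1.0, 52))
```

The single-precision error is several orders of magnitude larger than the double-precision one, and the position of a flipped bit decides whether the damage is negligible or drastic, which is why unprotected memory is a risk for long-running simulations.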

Thankfully, nowadays most of these limitations have been overcome. The latest generations of GPUs have enhanced double-precision processing capacity and are fully compliant with IEEE standards. The GPU’s memory, various caches and registers are now protected by ECC mechanisms. And programming GPUs has become much easier, with the advent of perfectly acceptable development tools. For best results, developers can use NVIDIA’s CUDA environment or the new OpenCL standard. However, there are also now a number of high-level language compilers for C, C++ and Fortran, such as HMPP, developed by French start-up CAPS Entreprise, one of Bull’s partners. To optimize application performance, the programmer can insert instructions that are interpreted by the compiler, which avoids developments that are too specific and means that investments in programming can be re-used.

Impressive performance

So the main weaknesses of GPUs have been corrected, while preserving their remarkable performance. The table below compares the main features of the GPU with those of the latest x86 CPU equipping bullx blade systems:

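| Feature                               | CPU     | GPU   |
|---------------------------------------|---------|-------|
| Frequency (GHz)                       | 2.93    | 1.15  |
| Number of processing cores            | 6       | 448   |
| Double-precision performance (GFlops) | 70      | 515   |
| Single-precision performance (GFlops) | 140     | 1030  |
| Memory bandwidth (GB/s)               | 32      | 148   |
| Memory capacity (GB)                  | 48      | 3     |
| Memory type                           | DDR3    | GDDR5 |
| Power consumption (W)                 | 110 (*) | 225   |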

(*) Estimated consumption includes both the processor itself (95 W) and central memory.

As we can see, the peak performance of the GPU is impressive: over seven times higher than that of the CPU. Memory bandwidth is also much better on GPUs (4.6 times higher). And power consumption is only about doubled. In practice, however, GPU efficiency is often lower than for equivalent CPUs: for example, a matrix multiplication can achieve 95% efficiency on a CPU compared with 65% on a GPU. This reflects a fundamental difference in structure between CPUs and GPUs. The latter effectively use a very large number of processing units (448 compared with just six), and they only reach their full power thanks to a very high level of parallelism. This is why GPUs are used as accelerators, rather than as first-line processing nodes: since there is no question of completely redeveloping applications just in order to parallelize them, we instead assign to them the proportion of tasks that capitalizes fully on their extraordinary capabilities.
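The ratios quoted above, and the effect of the lower GPU efficiency, can be checked with a few lines of Python. This is a rough sketch using only the double-precision figures from the comparison table and the 95% and 65% matrix-multiplication efficiencies. The Amdahl's-law estimate at the end goes beyond the text: the 80% accelerated fraction is a hypothetical value chosen purely for illustration, and the function names are assumptions.

```python
# Figures taken from the CPU/GPU comparison (double precision).
CPU = {"peak_gflops": 70.0, "bandwidth_gbs": 32.0, "power_w": 110.0, "efficiency": 0.95}
GPU = {"peak_gflops": 515.0, "bandwidth_gbs": 148.0, "power_w": 225.0, "efficiency": 0.65}


def ratio(key: str) -> float:
    """GPU-to-CPU ratio for one of the table's figures."""
    return GPU[key] / CPU[key]


def sustained_gflops(device: dict) -> float:
    """Peak performance scaled by the matrix-multiplication efficiency."""
    return device["peak_gflops"] * device["efficiency"]


def amdahl_speedup(accelerated_fraction: float, accelerator_speedup: float) -> float:
    """Overall speedup when only part of the run time is accelerated."""
    serial_part = 1.0 - accelerated_fraction
    return 1.0 / (serial_part + accelerated_fraction / accelerator_speedup)


peak_ratio = ratio("peak_gflops")        # about 7.4
bandwidth_ratio = ratio("bandwidth_gbs")  # about 4.6
power_ratio = ratio("power_w")            # about 2.0
sustained_ratio = sustained_gflops(GPU) / sustained_gflops(CPU)  # about 5.0

# Hypothetical application: 80% of its run time is offloaded to the GPU.
overall = amdahl_speedup(0.80, sustained_ratio)

if __name__ == "__main__":
    print(f"peak ratio:       {peak_ratio:.2f}")
    print(f"bandwidth ratio:  {bandwidth_ratio:.2f}")
    print(f"power ratio:      {power_ratio:.2f}")
    print(f"sustained ratio:  {sustained_ratio:.2f}")
    print(f"overall speedup with 80% offloaded: {overall:.2f}")
```

Once efficiency is taken into account, the sevenfold peak advantage shrinks to about fivefold on matrix multiplication, and the gain for a whole application depends strongly on how much of its run time can actually be offloaded.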

bullx B505: a native hybrid supercomputer

Making the most of a technology which is now mature, and is therefore fully operational, Bull offers a unique Extreme Computing solution that natively combines GPUs and CPUs to exploit this potential for acceleration: the bullx B505 blades, derived from the bullx B500, which is itself based solely on CPUs. Figure 1 shows the comparative structures of the two models. Most of the rival platforms that incorporate hardware accelerators (whether they are GPUs or other kinds of processors) connect two GPUs and the interconnection network (InfiniBand) to the same PCIe bus. The bullx B505 hybrid blades feature a dedicated bus for each of the two GPUs and have two InfiniBand interfaces, providing twice as much bandwidth between the system nodes as traditional configurations. The ‘blade’ format is also very compact, which results in a 30% higher density than other solutions available on the market.

A growing number of applications

bullx B505, like all graphics accelerator systems, performs best on applications that involve a substantial number of relatively simple tasks requiring a great deal of ‘brute force’. In particular:

Graphics rendering: paradoxically, the processing needed on the final production stages of animated films is usually carried out on CPU-based systems (such as the bullx supercomputer used for the movie Planet 51). But the newest GPU-based computers can now handle the complex algorithms involved.

Oil and gas exploration: seismic imaging techniques used to analyze the structure of sub-surface strata (using RTM-type methods) can benefit from the use of graphics accelerators. Bull has supplied Petrobras with a system including 264 GPUs. More broadly, seismic simulation and medical imaging are potentially major areas for the use of GPUs.

Molecular dynamics and astrophysics: whether in chemistry or pharmaceuticals, assessing the interactions between atoms requires a large number of calculations that can easily be speeded up using GPUs. The same kinds of equations, and therefore similar uses, are found in astrophysics. One of the ‘big computing challenges’ for ‘Bull-Titane’, the hybrid CPU-GPU supercomputer installed by Bull at CCRT (the Center for Research and Technology Computing) in France, involves simulating the propagation of radiation from stars.

Financial simulations: all methods used for financial analysis and simulation (Monte-Carlo, Black–Scholes…) are particularly resource-hungry in terms of processing power.

Electromagnetics: simulating electromagnetic phenomena (radars, antennas…) relies on ‘dense’ calculations (many calculations per memory access), which is perfect for GPUs.

Genetics: gene-sequencing algorithms require a large number of like-for-like comparisons. Even though there are virtually no floating-point operations, the intensity of the processing tasks involved means they are well suited to GPUs, and speeds can be increased by up to 30 times for algorithms such as Smith-Waterman and BLAST. This kind of usage is similar to searching for sequences or chains of characters in large-scale databases.

Molecular structures: computing applications aimed at modeling molecular structures using ab initio methods can be accelerated using GPUs, as demonstrated by another of the ‘big challenges’ being tackled by the Bull-Titane supercomputer.
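What these workloads have in common is a large number of simple, mutually independent tasks. As one illustration, here is a minimal Python sketch of Smith-Waterman local alignment scoring, the algorithm cited above for genetics. It runs on the CPU and is not a GPU implementation; the scoring parameters (match, mismatch, gap), the linear gap penalty and the function names are assumptions made for the example.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score between a and b (linear gap penalty).

    Only integer comparisons and additions are involved: no floating point.
    """
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for ca in a:
        curr = [0]  # first column is always zero
        for j, cb in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if ca == cb else mismatch)
            up = prev[j] + gap          # gap in b
            left = curr[j - 1] + gap    # gap in a
            score = max(0, diagonal, up, left)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def rank_database(query: str, database: list[str]) -> list[tuple[int, str]]:
    """Score the query against every sequence, best match first.

    Each comparison is independent of the others, which is the kind of
    parallelism an accelerator can exploit.
    """
    scored = [(smith_waterman_score(query, seq), seq) for seq in database]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    # Hypothetical toy database of short DNA fragments.
    database = ["TTACGTTT", "GGGGGGGG", "ACGAACGT"]
    for score, seq in rank_database("ACGT", database):
        print(score, seq)
```

In a real search, the database holds millions of sequences, so the same small kernel is repeated an enormous number of times on different data.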

Looking to the future

Despite recent improvements, GPUs are still strongly marked by their graphical origins. A more integrated server architecture should improve their efficiency even further, especially at the level of data transfers. GPU manufacturers have already publicly announced their intention to move in that direction. ATI/AMD has promised to achieve the ‘fusion’ between CPUs and GPUs. Bill Dally, CTO of NVIDIA, has talked about the end of the general-purpose Xeon®-type CPU as we know it, and the advent of highly parallel processors, successors to today’s GPUs. And even though it does not yet sell GPUs, Intel recently unveiled the prototype of a ‘many-core’ processor, derived from this kind of technology. The introduction of GPUs as accelerators has turned the HPC world upside-down. What to some researchers is still just a ‘promising technology’ is at the very heart of a revolution in HPC systems. It’s the way of the future, and it is already available in Bull’s Extreme Computing solutions.
