POOMA: A C++ Toolkit for High-Performance Parallel Scientific Computing

Chapter 1. Introduction
The goals for the POOMA Toolkit have remained unchanged since its conception in 1994:
Code portability across serial, distributed, and parallel architectures without any change to the source code.
Development of reusable, cross-problem-domain components to enable rapid application development.
Code efficiency for kernels and components relevant to scientific simulation.
Toolkit design and development driven by applications from a diverse set of scientific problem domains.
Shorter time from problem inception to working parallel simulations.
Below, we discuss how POOMA achieves these goals.
The same POOMA programs run on sequential, distributed, and parallel computers; no change to the source code is required. Two or three lines specify how each container's data should be distributed among the available processors. Using these directives and run-time information about the computer's configuration, the toolkit automatically distributes pieces of the containers' domains, called patches, among the available processors. If a computation needs values from another patch, POOMA automatically passes those values to the patch where they are needed. The same program, and even the same executable, works regardless of the number of available processors and the size of the containers' domains. A programmer interested only in sequential execution can omit the two or three lines specifying how the domains are to be distributed.
The POOMA Toolkit is designed to enable rapid development of scientific and distributed applications. For example, its vector, matrix, and tensor classes model the corresponding mathematical concepts. Its Array and Field classes model the discrete spaces and mathematical arrays frequently found in computational science and mathematics. See Figure 1-1. The left column indicates theoretical science and math concepts, the middle column computational science and math concepts, and the right column computer science implementations. For example, theoretical physics frequently uses continuous fields in three-dimensional space, while algorithms for a corresponding computational physics problem usually use discrete fields. POOMA containers, classes, and functions ease the engineering of computer programs implementing these algorithms. For example, the POOMA Field container models discrete fields: it maps locations in discrete space to values and also permits computations involving spatial distances and values. The POOMA Array container models the mathematical concept of an array, frequently used in numerical analysis.
Figure 1-1. How POOMA Fits Into the Scientific Process
In the translation from theoretical science to computational science to computer programs, POOMA eases the implementation of algorithms as computer programs.
POOMA containers support a variety of computation modes, easing the translation of algorithms into code. For example, many algorithms for solving partial differential equations use stencil-based computations, so POOMA supports stencil-based computations on Arrays and Fields. POOMA also supports data-parallel computation with syntax similar to Fortran 90's. To ease implementing computations where one Field's values are a function of several other Fields' values, the programmer can specify a relation. Relations are lazily evaluated: whenever a dependent Field's values are needed and they depend on a Field whose values have changed, the values are recomputed. Relations also assist correctness by eliminating the frequently forgotten need for the programmer to ensure a Field's values are up-to-date before they are used.
POOMA incorporates a variety of techniques to ensure it produces code that executes as quickly as special-case, hand-written code. These techniques include extensive use of templates, out-of-order evaluation, use of guard layers, and production of fast inner loops.
POOMA's use of C++ templates ensures that as much work as possible occurs at compile time, not run time. This speeds programs' execution. Since more code is produced at compile time, more code is available to the compiler's optimizer, further speeding execution. The POOMA Array container benefits from the use of template parameters. Their use permits specialized data storage classes called engines. An Array's Engine template parameter specifies how data is stored and indexed. Some Arrays expect almost all their values to be used, while others might be mostly empty. In the latter case, using a specialized engine that stores only the few nonzero values greatly reduces storage requirements. Using engines also permits fast creation of container views, known as array sections in Fortran 90. A view's engine is the same as the original container's engine, but the view object's domain is restricted to a subset of the original domain. The space and execution-time costs of using views are minimal.
Using templates also permits containers to support polymorphic indexing, e.g., indexing both by integers and by three-dimensional coordinates. A container uses templatized indexing functions that defer indexing operations to its engine's index operators. Since the container uses templates, an Engine can define indexing functions with different function arguments, without any need to add corresponding container functions. Some of these benefits of using templates can be achieved without them, but only at increased execution time. For example, a container could hold a pointer to an engine object, but this requires a pointer dereference for each operation. Implementing polymorphic indexing without templates would require adding a virtual function corresponding to each of the indexing functions.
To ensure multiprocessor POOMA programs execute quickly, it is important that interprocessor communication overlap with intraprocessor computation as much as possible and that communication be minimized. Asynchronous communication, out-of-order evaluation, and use of guard layers all help achieve these goals. POOMA uses the asynchronous communication facilities of the Cheetah communication library. When a processor needs data that is stored or computed by another processor, a message is sent between the two. If synchronous communication were used, the sender would have to issue an explicit send and the recipient an explicit receive, synchronizing the two processors. Cheetah permits the sender to put and get data without synchronizing with the recipient processor, and it also permits invoking functions at remote sites to ensure desired data is up-to-date. Thus, out-of-order evaluation must be supported. Out-of-order evaluation also has another benefit: only computations directly or indirectly related to values that are printed need occur.
Surrounding a patch with guard layers can help reduce interprocessor communication. For distributed computation, each container's domain is split into pieces distributed among the available processors. Frequently, computing a container value is local, involving just the value itself and a few neighbors, but computing a value near the edge of a processor's domain may require knowing a few values from a neighboring domain. Guard layers permit these values to be copied locally so they need not be repeatedly communicated.
POOMA uses the PETE Library to ensure inner loops involving POOMA's object-oriented containers run as quickly as hand-coded loops. PETE (the Portable Expression Template Engine) uses expression-template technology to convert data-parallel statements into efficient loops without any intermediate computations. For example, consider evaluating the statement
    A += -B + 2 * C;

where A and C are vector<double>s and B is a vector<int>. Naïve evaluation might introduce intermediaries for -B, 2*C, and their sum. The presence of these intermediaries in inner loops can measurably slow performance. To produce a loop without intermediaries, PETE stores each expression as a parse tree. Using its templates, the parse tree is converted, at compile time, to a loop directly evaluating each component of the result without computing intermediate values. For example, the code corresponding to the statement above is
    vector<double>::iterator iterA = A.begin();
    vector<int>::const_iterator iterB = B.begin();
    vector<double>::const_iterator iterC = C.begin();
    while (iterA != A.end()) {
        *iterA += -*iterB + 2 * *iterC;
        ++iterA; ++iterB; ++iterC;
    }

Furthermore, since the code is available at compile time, not run time, it can be further optimized, e.g., by moving any loop-invariant code out of the loop.
POOMA has been used to solve a wide variety of scientific problems. Most recently, physicists at Los Alamos National Laboratory implemented an entire library of hydrodynamics codes as part of the U.S. government's science-based Stockpile Stewardship Program. Other applications include a matrix solver, an accelerator code simulating the dynamics of high-intensity charged particle beams in linear accelerators, and a Monte Carlo neutron transport code.
POOMA's tools greatly reduce the time to implement applications. As we noted above, POOMA's containers and expression syntax model the computational models and algorithms most frequently found in scientific programs. These high-level tools are known to be correct and reduce the time to debug programs. Since the same programs run on one processor and on multiple processors, programmers can write and test programs using their one- or two-processor personal computers. With no additional work, the same program runs on computers with hundreds of processors: the code is exactly the same, and the toolkit automatically handles distribution of the data, all data communication, and all synchronization. The net result is a significant reduction in programming time. For example, a team of two physicists and two support people at Los Alamos National Laboratory implemented a suite of hydrodynamics kernels in six months. Their work replaced a previous suite of less-powerful kernels that had taken sixteen people several years to implement and debug. Despite not having previously implemented any of the kernels, they implemented one new kernel every three days, including the time to read the corresponding scientific papers!