A POOMA program can execute on one or multiple processors. To convert a program designed for uniprocessor execution to a program designed for multiprocessor execution, the programmer need only specify how each container's domain should be split into "patches". The POOMA Toolkit automatically distributes the data among the available processors and handles any required communication among processors. Example 3-5 illustrates how to write a distributed version of the stencil program (Example 3-4).
Example 3-5. Distributed Stencil Array Implementation of Doof2d
#include <iostream>   // has std::cout, ...
#include <stdlib.h>   // has EXIT_SUCCESS
#include "Pooma/Arrays.h"
                      // has POOMA's Array declarations

// Doof2d: POOMA Arrays, stencil, multiple
// processor implementation

// Define the stencil class performing the computation.
class DoofNinePt
{
public:

  // Initialize the constant average weighting.
  DoofNinePt() : weight(1.0/9.0) {}

  // This stencil operator is applied to each interior
  // domain position (i,j).  The "C" template
  // parameter permits use of this stencil
  // operator with both Arrays and Fields.
  template <class C>
  inline
  typename C::Element_t
  operator()(const C& x, int i, int j) const {
    return
      weight *
      (x.read(i+1,j+1)+x.read(i+1,j)+x.read(i+1,j-1) +
       x.read(i  ,j+1)+x.read(i  ,j)+x.read(i  ,j-1) +
       x.read(i-1,j+1)+x.read(i-1,j)+x.read(i-1,j-1));
  }

  inline int lowerExtent(int) const { return 1; }
  inline int upperExtent(int) const { return 1; }

private:

  // In the average, weight elements with this value.
  const double weight;

};


int main(int argc, char *argv[])
{
  // Prepare the POOMA library for execution.
  Pooma::initialize(argc,argv);

  // Since multiple copies of this program may simul-
  // taneously run, we cannot use standard input and
  // output.  Instead we use command-line arguments,
  // which are replicated, for input, and we use an
  // Inform stream for output.
  Inform output;

  // Read the program input from the command-line
  // arguments.
  if (argc != 4) {
    // Incorrect number of command-line arguments.
    output << argv[0]
           << ": number-of-processors number-of-averagings"
           << " number-of-values" << std::endl;
    return EXIT_FAILURE;
  }
  char *tail;

  // Determine the number of processors.
  long nuProcessors;
  nuProcessors = strtol(argv[1], &tail, 0);

  // Determine the number of averagings.
  long nuAveragings, nuIterations;
  nuAveragings = strtol(argv[2], &tail, 0);
  nuIterations = (nuAveragings+1)/2;
                // Each iteration performs two averagings.

  // Ask the user for the number n of values along
  // one dimension of the grid.
  long n;
  n = strtol(argv[3], &tail, 0);
  // The dimension must be a multiple of the number
  // of processors since we are using a
  // UniformGridLayout.
  n = ((n+nuProcessors-1)/nuProcessors)*nuProcessors;

  // Specify the arrays' domains [0,n) x [0,n).
  Interval<1> N(0, n-1);
  Interval<2> vertDomain(N, N);

  // Set up interior domains [1,n-1) x [1,n-1)
  // for computation.
  Interval<1> I(1,n-2);
  Interval<2> interiorDomain(I,I);

  // Create the distributed arrays.

  // Partition the arrays' domains uniformly, i.e.,
  // each patch has the same size.  The first para-
  // meter tells how many patches for each dimension.
  // Guard layers optimize communication between
  // patches.  Internal guards surround each patch.
  // External guards surround the entire array
  // domain.
  UniformGridPartition<2> partition(Loc<2>(nuProcessors, nuProcessors),
                                    GuardLayers<2>(1),  // internal
                                    GuardLayers<2>(0)); // external
  UniformGridLayout<2> layout(vertDomain, partition, DistributedTag());

  // The Array template parameters indicate 2 dims
  // and a 'double' value type.  MultiPatch indicates
  // multiple computation patches, i.e., distributed
  // computation.  The UniformTag indicates the
  // patches should have the same size.  Each patch
  // has Brick type.
  Array<2, double, MultiPatch<UniformTag, Remote<Brick> > > a(layout);
  Array<2, double, MultiPatch<UniformTag, Remote<Brick> > > b(layout);

  // Set up the initial conditions.
  // All grid values should be zero except for the
  // central value.
  a = b = 0.0;
  // Ensure all data-parallel computation finishes
  // before accessing a value.
  Pooma::blockAndEvaluate();
  b(n/2,n/2) = 1000.0;

  // Create the stencil performing the computation.
  Stencil<DoofNinePt> stencil;

  // Perform the simulation.
  for (int k = 0; k < nuIterations; ++k) {
    // Read from b.  Write to a.
    a(interiorDomain) = stencil(b, interiorDomain);
    // Read from a.  Write to b.
    b(interiorDomain) = stencil(a, interiorDomain);
  }

  // Print out the final central value.
  Pooma::blockAndEvaluate();  // Ensure all computation has finished.
  output << (nuAveragings % 2 ? a(n/2,n/2) : b(n/2,n/2)) << std::endl;

  // The arrays are automatically deallocated.

  // Tell the POOMA library execution has finished.
  Pooma::finalize();
  return EXIT_SUCCESS;
}
Supporting distributed computation requires only minor code changes. These changes specify how each container's domain is distributed among the available processors and how input and output occur. The rest of the program, including all the computations, remains the same. When running, the POOMA executable interacts with the run-time library to determine which processors are available, distributes the containers' domains, and automatically handles all necessary interprocessor communication. The same executable runs on one or many processors. Thus, the programmer can write one program, debug it on a uniprocessor computer, and run it on a supercomputer.
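To make the scope of these changes concrete, the following sketch contrasts the declarations. The distributed declarations are taken from Example 3-5; the uniprocessor declarations are assumed to follow the form of Example 3-4, which builds its Arrays directly on a Domain with a Brick Engine.

  // Uniprocessor declarations (assumed form, following Example 3-4):
  // the Arrays are built directly on a Domain and use Brick Engines.
  Array<2, double, Brick> a(vertDomain);
  Array<2, double, Brick> b(vertDomain);

  // Distributed declarations (from Example 3-5): a partition and a
  // layout describe how the same domain is split into patches, and
  // the Arrays use MultiPatch Engines built on that layout.
  UniformGridPartition<2> partition(Loc<2>(nuProcessors, nuProcessors),
                                    GuardLayers<2>(1),   // internal guards
                                    GuardLayers<2>(0));  // external guards
  UniformGridLayout<2> layout(vertDomain, partition, DistributedTag());
  Array<2, double, MultiPatch<UniformTag, Remote<Brick> > > a(layout);
  Array<2, double, MultiPatch<UniformTag, Remote<Brick> > > b(layout);

Everything else, including the stencil class and the loop that applies it, is unchanged.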
POOMA's distributed computing model separates container domain concepts from computer configuration concepts. See Figure 3-4. The statements in the program indicate how each container's domain will be partitioned; this process is represented in the upper left corner of the figure. A user-specified partition describes how to split the domain into pieces. For example, the illustrated partition splits the domain into three equal-sized pieces along the x-dimension and two equal-sized pieces along the y-dimension. Applying the partition to the domain creates patches. The partition also specifies external and internal guard layers. A guard layer is a domain surrounding a patch; a patch's computation reads, but never writes, these guard values. An external guard layer conceptually surrounds the entire container domain with boundary values whose presence permits all domain computations to be performed the same way, even for values along the domain's edge. An internal guard layer duplicates values from adjacent patches so communication need not occur during a patch's computation. Guard layers are an optimization: external guard layers ease programming, and internal guard layers reduce communication among processors. Their use is not required.
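The guard-layer choices appear only in the partition's constructor. The three-argument form below is the one used in Example 3-5; the other two are sketches that assume the same constructor accepts different GuardLayers values and, for the last one, that the GuardLayers arguments may be omitted.

  // One layer of internal guards, no external guards (Example 3-5):
  UniformGridPartition<2> p1(Loc<2>(nuProcessors, nuProcessors),
                             GuardLayers<2>(1),   // internal: copies of neighbors' edge values
                             GuardLayers<2>(0));  // external: none

  // Internal and external guards each one layer deep (assumed variant):
  // external guards would let the stencil be applied over the full
  // domain rather than only the interior domain.
  UniformGridPartition<2> p2(Loc<2>(nuProcessors, nuProcessors),
                             GuardLayers<2>(1),   // internal
                             GuardLayers<2>(1));  // external

  // No guard layers at all (assumed variant): still correct, but
  // stencil values near a patch edge may require communication with
  // the neighboring patch at evaluation time.
  UniformGridPartition<2> p3(Loc<2>(nuProcessors, nuProcessors));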
Figure 3-4. The POOMA Distributed Computation Model
The POOMA distributed computation model creates a layout by combining a partitioning of the containers' domains and the computer configuration.
The computer configuration of shared memory and processors is determined by the run-time system. See the upper right portion of Figure 3-4. A context is a collection of shared memory and processors that can execute a program or a portion of a program. For example, a two-processor desktop computer with memory accessible to both processors forms a single context. A supercomputer consisting of desktop computers networked together might have as many contexts as computers. The run-time system, e.g., the Message Passing Interface (MPI) Communications Library or the MM Shared Memory Library (http://www.engelschall.com/sw/mm/), communicates the available contexts to the executable. POOMA must be configured for the particular run-time system in use. See Section A.1.
A layout combines patches with contexts so the program can be executed. If DistributedTag is specified, the patches are distributed among the available contexts. If ReplicatedTag is specified, the set of patches is replicated on each context. Either way, the containers' domains are now mapped onto the contexts so the program can run. When a patch needs data from another patch, the POOMA Toolkit sends messages to the patch's context using the message-passing library. All such communication is performed automatically by the toolkit with no need for programmer or user input.
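In code, the choice between the two tags is made when the layout is constructed. The DistributedTag form is the one in Example 3-5; the ReplicatedTag form is a sketch that assumes the constructor accepts the other tag in the same position.

  // Patches are dealt out among the available contexts (Example 3-5):
  UniformGridLayout<2> distributedLayout(vertDomain, partition,
                                         DistributedTag());

  // Every context keeps its own copy of every patch (assumed analogous form):
  UniformGridLayout<2> replicatedLayout(vertDomain, partition,
                                        ReplicatedTag());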
Incorporating POOMA's distributed computation model into a program requires writing very few lines of code, as Example 3-5 illustrates. The partition declaration creates a UniformGridPartition that splits each dimension of a container's domain into nuProcessors equally-sized pieces. The first GuardLayers argument specifies that each patch keeps a copy of the adjacent patches' outermost values. This may speed computation because a patch need not synchronize with the processors computing other patches. Since computing each value requires knowing its surrounding neighbors, this internal guard layer is one layer deep. The second GuardLayers argument specifies no external guard layer. External guard layers simplify computing values along the edges of the domain; since our program already restricts computation to the interior domain, we do not use this feature.
The layout declaration creates a UniformGridLayout. As Example 3-5 illustrates, it needs to know a container's domain, a partition, the computer's contexts, and whether to use DistributedTag or ReplicatedTag. The domain, the partition, and the tag are layout's three constructor arguments; the contexts are supplied implicitly by the run-time system.
To create a distributed Array, we use a Layout object and a MultiPatch Engine rather than the Domain object and Brick Engine we used for the uniprocessor implementations. A Layout object conceptually combines a Domain object with a specification of how that domain is distributed throughout the computer. A MultiPatch Engine supports computation on multiple patches. The UniformTag indicates that the patches all have the same size. Since patches may reside on different contexts, the second template parameter is Remote; its Brick template parameter specifies the Engine used for a particular patch on a particular context. Most distributed programs use MultiPatch<UniformTag, Remote<Brick> > or MultiPatch<UniformTag, Remote<CompressibleBrick> > Engines.
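Switching to the second Engine type is a one-line change in the declaration. This sketch reuses the layout from Example 3-5 and assumes CompressibleBrick patches behave like Brick patches except that POOMA may store a patch compactly when all of its values are equal, which suits arrays such as a and b that start out almost entirely zero.

  // Distributed Array with ordinary Brick patches (Example 3-5):
  Array<2, double, MultiPatch<UniformTag, Remote<Brick> > > a(layout);

  // Assumed variant: CompressibleBrick patches, declared the same way.
  Array<2, double, MultiPatch<UniformTag, Remote<CompressibleBrick> > > c(layout);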
The computations for a distributed implementation are exactly the same as for a sequential implementation. The POOMA Toolkit and the message-passing library automatically handle all the necessary interprocessor communication.
Input and output for distributed programs differ from those for sequential programs. Although the same instructions run on each context, each context may have its own input and output streams. To avoid dealing with multiple input streams, we pass the input via command-line arguments, which are replicated on each context. Using an Inform stream prevents multiple contexts from printing duplicate output: any context can print to an Inform stream, but only text from context 0 is displayed. At the beginning of the program, we create an Inform object named output; throughout the rest of the program, we use it instead of std::cout and std::cerr.
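The following minimal sketch shows the output pattern in isolation. It assumes, as Example 3-5 does, that including Pooma/Arrays.h makes Pooma::initialize, Pooma::finalize, and Inform available.

  #include <iostream>         // has std::endl
  #include <stdlib.h>         // has EXIT_SUCCESS
  #include "Pooma/Arrays.h"   // assumed to provide Inform, as in Example 3-5

  int main(int argc, char *argv[])
  {
    Pooma::initialize(argc, argv);

    // Every context executes this statement, but only the text sent
    // from context 0 appears, so the message is printed exactly once.
    Inform output;
    output << "Hello from the Doof2d output stream" << std::endl;

    Pooma::finalize();
    return EXIT_SUCCESS;
  }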
The command to run the program depends on the run-time system. To use MPI under the Irix 6.5 operating system, one can use the mpirun command. For example, mpirun -np 4 Doof2d-Array-distributed -mpi 2 10 1000 invokes the MPI run-time system with four processors. The -mpi option tells the POOMA executable Doof2d-Array-distributed to use the MPI Library. The remaining arguments specify the number of processors, the number of averagings, and the array size; the first and last of these apply to each dimension. For example, if three processors are specified, the x-dimension and the y-dimension each use three processors, for a total of nine. The command Doof2d-Array-distributed -shmem -np 4 2 10 1000 uses the MM Shared Memory Library (-shmem) and four processors. As with MPI, the processor and size arguments are specified per dimension for this two-dimensional program.