POOMA: A C++ Toolkit for High-Performance Parallel Scientific Computing
A POOMA program using Fields can execute on one or more processors. In Section 3.6, we demonstrated how to modify a uniprocessor stencil Array implementation to run on multiple processors. In this section, we demonstrate that the uniprocessor data-parallel Field implementation of the previous section can be similarly converted. Only the container declarations change; the computations do not. Since the changes are exactly analogous to those in Section 3.6, our exposition here will be shorter.
Example 3-7. Distributed Data-Parallel Field Implementation of Doof2d
#include <stdlib.h>                     // has EXIT_SUCCESS
#include "Pooma/Fields.h"               // has POOMA's Field declarations

// Doof2d: POOMA Fields, data-parallel, multiple
// processor implementation

int main(int argc, char *argv[])
{
  // Prepare the POOMA library for execution.
  Pooma::initialize(argc,argv);

  // Since multiple copies of this program may
  // simultaneously run, we cannot use standard input
  // and output.  Instead we use command-line
  // arguments, which are replicated, for input, and we
  // use an Inform stream for output.
  Inform output;

  // Read the program input from the command-line arguments.
  if (argc != 4) {
    // Incorrect number of command-line arguments.
    output << argv[0] << ": number-of-processors number-of-averagings"
           << " number-of-values" << std::endl;
    return EXIT_FAILURE;
  }
  char *tail;

  // Determine the number of processors.
  long nuProcessors;
  nuProcessors = strtol(argv[1], &tail, 0);

  // Determine the number of averagings.
  long nuAveragings, nuIterations;
  nuAveragings = strtol(argv[2], &tail, 0);
  nuIterations = (nuAveragings+1)/2;    // Each iteration performs two averagings.

  // Ask the user for the number n of values along
  // one dimension of the grid.
  long n;
  n = strtol(argv[3], &tail, 0);
  // The dimension must be a multiple of the number of
  // processors since we are using a UniformGridLayout.
  n = ((n+nuProcessors-1) / nuProcessors) * nuProcessors;

  // Specify the fields' domains [0,n) x [0,n).
  Interval<1> N(0, n-1);
  Interval<2> vertDomain(N, N);

  // Set up interior domains [1,n-1) x [1,n-1) for
  // computation.
  Interval<1> I(1,n-2);
  Interval<1> J(1,n-2);

  // Partition the fields' domains uniformly, i.e.,
  // each patch has the same size.  The first parameter
  // tells how many patches for each dimension.  Guard
  // layers optimize communication between patches.
  // Internal guards surround each patch.  External
  // guards surround the entire field domain.
  UniformGridPartition<2> partition(Loc<2>(nuProcessors, nuProcessors),
                                    GuardLayers<2>(1),  // internal
                                    GuardLayers<2>(0)); // external
  UniformGridLayout<2> layout(vertDomain, partition, DistributedTag());

  // Specify the fields' mesh, i.e., its spatial
  // extent, and its centering type.
  UniformRectilinearMesh<2> mesh(layout, Vector<2>(0.0), Vector<2>(1.0, 1.0));
  Centering<2> cell = canonicalCentering<2>(CellType, Continuous, AllDim);

  // The Field template parameters indicate a mesh and
  // a 'double' value type.  MultiPatch indicates
  // multiple computation patches, i.e., distributed
  // computation.  The UniformTag indicates the patches
  // should have the same size.  Each patch has Brick
  // type.
  Field<UniformRectilinearMesh<2>, double,
        MultiPatch<UniformTag, Remote<Brick> > > a(cell, layout, mesh);
  Field<UniformRectilinearMesh<2>, double,
        MultiPatch<UniformTag, Remote<Brick> > > b(cell, layout, mesh);

  // Set up the initial conditions.
  // All grid values should be zero except for the
  // central value.
  a = b = 0.0;
  // Ensure all data-parallel computation finishes
  // before accessing a value.
  Pooma::blockAndEvaluate();
  b(n/2,n/2) = 1000.0;

  // In the average, weight elements with this value.
  const double weight = 1.0/9.0;

  // Perform the simulation.
  for (int k = 0; k < nuIterations; ++k) {
    // Read from b.  Write to a.
    a(I,J) = weight *
      (b(I+1,J+1) + b(I+1,J  ) + b(I+1,J-1) +
       b(I  ,J+1) + b(I  ,J  ) + b(I  ,J-1) +
       b(I-1,J+1) + b(I-1,J  ) + b(I-1,J-1));

    // Read from a.  Write to b.
    b(I,J) = weight *
      (a(I+1,J+1) + a(I+1,J  ) + a(I+1,J-1) +
       a(I  ,J+1) + a(I  ,J  ) + a(I  ,J-1) +
       a(I-1,J+1) + a(I-1,J  ) + a(I-1,J-1));
  }

  // Print out the final central value.
  Pooma::blockAndEvaluate();    // Ensure all computation has finished.
  output << (nuAveragings % 2 ? a(n/2,n/2) : b(n/2,n/2)) << std::endl;

  // The fields are automatically deallocated.

  // Tell the POOMA library execution has finished.
  Pooma::finalize();
  return EXIT_SUCCESS;
}
This program can be viewed as Example 3-6 combined with the same changes that converted the uniprocessor stencil-based Array program into its distributed counterpart.
Distributed programs may have multiple processes, each with its own input and output streams. To pass input to these processes, this program uses command-line arguments, which are replicated for each process. An Inform stream accepts data from any context but prints only data from context 0.
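As a minimal standalone sketch of this behavior, using only headers and calls that appear in Example 3-7 (the message text is arbitrary), every context executes the statements below, yet the line is printed only once, by context 0:

#include <stdlib.h>        // has EXIT_SUCCESS
#include "Pooma/Fields.h"  // declares Inform, as in Example 3-7

int main(int argc, char *argv[])
{
  Pooma::initialize(argc, argv);   // every context starts the library
  Inform output;                   // accepts data on any context...
  output << "printed once, by context 0" << std::endl;
                                   // ...but only context 0's data appears
  Pooma::finalize();
  return EXIT_SUCCESS;
}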
A layout for a distributed program specifies a domain, a partition, and a context mapper. A DistributedTag context mapper tag indicates that pieces of the domain should be distributed among patches, while a ReplicatedTag context mapper tag indicates the entire domain should be replicated to each patch.
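For example, reusing the vertDomain and partition objects from Example 3-7, the two tags differ only in the layout constructor's final argument. The DistributedTag line is taken from the example; the ReplicatedTag line is shown for contrast and is an assumption based on the description above:

// DistributedTag: pieces of the domain are distributed among the patches.
// This is the layout used in Example 3-7.
UniformGridLayout<2> distributedLayout(vertDomain, partition, DistributedTag());

// ReplicatedTag: the entire domain is replicated to each patch
// (shown for contrast; assumed to take the same constructor arguments).
UniformGridLayout<2> replicatedLayout(vertDomain, partition, ReplicatedTag());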
A MultiPatch Engine supports the use of multiple patches, while a remote engine supports computation distributed among various contexts. Both are usually necessary for distributed computation.
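Concretely, the engine appears as the third Field template parameter. The fragment below rewrites the declaration of a from Example 3-7 with typedefs to make the engine tag visible; the MultiPatch<UniformTag, Brick> form is included only for contrast and is an assumption, since Example 3-7 uses only the Remote variant:

// Multiple same-sized patches, all stored on the local context
// (shown for contrast only; Example 3-7 does not use this form).
typedef MultiPatch<UniformTag, Brick>           LocalPatches;

// Multiple same-sized patches whose storage may live on other
// contexts; this is the engine used by the fields a and b.
typedef MultiPatch<UniformTag, Remote<Brick> >  RemotePatches;

Field<UniformRectilinearMesh<2>, double, RemotePatches> a(cell, layout, mesh);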
The computation is identical in the uniprocessor and distributed implementations. The POOMA Toolkit automatically performs all communication necessary to ensure up-to-date values are available when needed.
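The one synchronization the programmer writes explicitly, in both versions, is the call before accessing an individual element; the fragment below repeats the relevant lines from Example 3-7:

a = b = 0.0;                 // data-parallel assignment; may not have finished yet
Pooma::blockAndEvaluate();   // wait for all outstanding data-parallel work
b(n/2,n/2) = 1000.0;         // a single element can now be accessed safely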
The command to invoke a distributed program is system-dependent. For example, mpirun -np 4 Doof2d-Field-distributed -mpi 2 10 1000 might use MPI communication, while Doof2d-Field-distributed -shmem -np 4 2 10 1000 might use the MM Shared Memory Library.