3.8. Distributed Field Implementation

A POOMA program using Fields can execute on one or more processors. In Section 3.6, we demonstrated how to modify a uniprocessor stencil-based Array implementation to run on multiple processors. In this section, we show that the uniprocessor data-parallel Field implementation of the previous section can be converted in the same way. Only the container declarations change; the computations do not. Because the changes are exactly analogous to those in Section 3.6, our exposition here is shorter.

Example 3-7. Distributed Data-Parallel Field Implementation of Doof2d


#include <stdlib.h>		// has EXIT_SUCCESS
#include "Pooma/Fields.h"
    // has POOMA's Field declarations

// Doof2d: POOMA Fields, data-parallel, multiple
// processor implementation

int main(int argc, char *argv[])
{
  // Prepare the POOMA library for execution.
  Pooma::initialize(argc,argv);
  
  // Since multiple copies of this program may
  // simultaneously run, we cannot use standard input
  // and output.  Instead we use command-line
  // arguments, which are replicated, for input, and we
  // use an Inform stream for output.  (1)
  Inform output;

  // Read the program input from the command-line arguments.
  if (argc != 4) {
    // Incorrect number of command-line arguments.
    output << argv[0] <<
      ": number-of-processors number-of-averagings"
      << " number-of-values"
      << std::endl;
    return EXIT_FAILURE;
  }
  char *tail;

  // Determine the number of processors.
  long nuProcessors;
  nuProcessors = strtol(argv[1], &tail, 0);

  // Determine the number of averagings.
  long nuAveragings, nuIterations;
  nuAveragings = strtol(argv[2], &tail, 0);
  nuIterations = (nuAveragings+1)/2;
    // Each iteration performs two averagings.
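    // For example, five averagings require three iterations.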

  // Determine the number n of values along
  // one dimension of the grid.
  long n;
  n = strtol(argv[3], &tail, 0);
  // The dimension must be a multiple of the number of
  // processors since we are using a UniformGridLayout.
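  // For example, n = 10 with 4 processors is rounded up to 12.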
  n = ((n+nuProcessors-1) / nuProcessors) * nuProcessors;

  // Specify the fields' domains [0,n) x [0,n).
  Interval<1> N(0, n-1);
  Interval<2> vertDomain(N, N);

  // Set up interior domains [1,n-1) x [1,n-1) for
  // computation.
  Interval<1> I(1,n-2);
  Interval<1> J(1,n-2);

  // Partition the fields' domains uniformly, i.e.,
  // each patch has the same size.  The first parameter
  // tells how many patches for each dimension.  Guard
  // layers optimize communication between patches.
  // Internal guards surround each patch.  External
  // guards surround the entire field domain.  (2)
  UniformGridPartition<2>
    partition(Loc<2>(nuProcessors, nuProcessors),
              GuardLayers<2>(1),  // internal
              GuardLayers<2>(0)); // external
  UniformGridLayout<2>
    layout(vertDomain, partition, DistributedTag());

  // Specify the fields' mesh, i.e., its spatial
  // extent, and its centering type.  (3)
  UniformRectilinearMesh<2>
    mesh(layout, Vector<2>(0.0), Vector<2>(1.0, 1.0));
  Centering<2> cell =
    canonicalCentering<2>(CellType, Continuous, AllDim);

  // The Field template parameters indicate a mesh and
  // a 'double' value type.  MultiPatch indicates
  // multiple computation patches, i.e., distributed
  // computation.  The UniformTag indicates the patches
  // should have the same size.  Each patch has Brick
  // type.  (4)
  Field<UniformRectilinearMesh<2>, double,
        MultiPatch<UniformTag, Remote<Brick> > >
    a(cell, layout, mesh);
  Field<UniformRectilinearMesh<2>, double,
        MultiPatch<UniformTag, Remote<Brick> > >
    b(cell, layout, mesh);

  // Set up the initial conditions.
  // All grid values should be zero except for the
  // central value.
  a = b = 0.0;
  // Ensure all data-parallel computation finishes
  // before accessing a value.
  Pooma::blockAndEvaluate();
  b(n/2,n/2) = 1000.0;

  // In the average, weight elements with this value.
  const double weight = 1.0/9.0;

  // Perform the simulation.
  for (int k = 0; k < nuIterations; ++k) {
    // Read from b.  Write to a.
    a(I,J) = weight *
      (b(I+1,J+1) + b(I+1,J  ) + b(I+1,J-1) +
       b(I  ,J+1) + b(I  ,J  ) + b(I  ,J-1) +
       b(I-1,J+1) + b(I-1,J  ) + b(I-1,J-1));

    // Read from a.  Write to b.
    b(I,J) = weight *
      (a(I+1,J+1) + a(I+1,J  ) + a(I+1,J-1) +
       a(I  ,J+1) + a(I  ,J  ) + a(I  ,J-1) +
       a(I-1,J+1) + a(I-1,J  ) + a(I-1,J-1));
  }

  // Print out the final central value.
  Pooma::blockAndEvaluate();
    // Ensure all computation has finished.
  output <<
    (nuAveragings % 2 ? a(n/2,n/2) : b(n/2,n/2))
    << std::endl;

  // The fields are automatically deallocated.

  // Tell the POOMA library execution has finished.
  Pooma::finalize();
  return EXIT_SUCCESS;
}
(1)
Multiple copies of a distributed program may simultaneously run, perhaps each having its own input and output. Thus, we use command-line arguments to pass input to the program. Using an Inform stream ensures only one copy produces output.
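For illustration, an Inform object is used just like a standard output stream, but only one context actually emits the text. The fragment below is a minimal sketch rather than part of the example above; the prefix string passed to the constructor is an assumption about Inform's optional-prefix constructor, not something the example relies on.

Inform msg("doof2d");   // assumed optional prefix printed before each line
msg << "grid size: " << n << std::endl;
  // Every context executes this statement, but the text
  // appears only once in the program's output.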
(2)
The UniformGridPartition declaration specifies how a field's domain will be partitioned, or split, into patches. Guard layers are an optimization that can reduce data communication between patches. The UniformGridLayout declaration applies the partition to the given domain, distributing the resulting patches among the available processors.
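As a concrete illustration, the following sketch uses the same classes as the listing above (with hypothetical variable names) to split an 8 x 8 domain into 2 x 2 uniform patches; each patch owns a 4 x 4 block of the domain plus a one-cell-wide internal guard region holding copies of its neighbors' boundary values.

Interval<1> D(0, 7);                      // [0,8) in each dimension
Interval<2> domain(D, D);
UniformGridPartition<2>
  part(Loc<2>(2, 2),                      // 2 patches per dimension
       GuardLayers<2>(1),                 // one internal guard layer
       GuardLayers<2>(0));                // no external guard layers
UniformGridLayout<2>
  lay(domain, part, DistributedTag());    // spread patches across contexts

POOMA also accepts a ReplicatedTag() argument here, which keeps a copy of every patch on each context instead of distributing them.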
(3)
The mesh and centering declarations are the same for uniprocessor and multiprocessor implementations.
(4)
The MultiPatch Engine distributes requests for Field values to the associated patch. Since a patch may be associated with a different processor, its "remote" engine has type Remote<Brick>. POOMA automatically distributes the patches among the available memories and processors.
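The only textual difference from the previous section's container declarations is this engine tag. The sketch below contrasts the two forms; the uniprocessor line is recalled from the previous example, and uniLayout is a hypothetical name standing for whatever layout object that example constructed.

// Uniprocessor: values stored in one contiguous Brick engine.
Field<UniformRectilinearMesh<2>, double, Brick>
  u(cell, uniLayout, mesh);     // uniLayout: hypothetical uniprocessor layout

// Distributed: uniform-size patches, each possibly on another context.
Field<UniformRectilinearMesh<2>, double,
      MultiPatch<UniformTag, Remote<Brick> > >
  d(cell, layout, mesh);        // layout: the UniformGridLayout above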

This program can be viewed as the combination of Example 3-6 and the same changes that converted the uniprocessor stencil-based Array program into its distributed counterpart.