POOMA: A C++ Toolkit for High-Performance Parallel Scientific Computing
Chapter 7. Data-Parallel Expressions
Data-parallel statements involving containers occur frequently in the inner loops of scientific programs, so their efficient execution is important. A naïve implementation of these statements may create and destroy containers holding intermediate values, slowing execution considerably. In 1995, Todd Veldhuizen and David Vandevoorde each developed an expression-template technique to transform arithmetic expressions involving array-like containers into efficient loops without using temporaries. Despite its perceived complexity, POOMA incorporated the technology. The underlying framework, PETE (the Portable Expression Template Engine), is also available separately from POOMA at http://www.acl.lanl.gov/pete/.
In this chapter, we first describe how a naïve implementation may slow execution. Then, we describe PETE's faster implementation. Rather than evaluating a data-parallel statement immediately, PETE converts it into a parse tree. The parse tree has two representations: its run-time representation holds run-time values, and its compile-time representation records the types of the tree's values. After a parse tree for the entire statement is constructed, it is evaluated. Since the statement is data-parallel, this evaluation involves at least one loop. At run time, each loop iteration computes and assigns one container element's value. At compile time, when the code for the loop iteration is produced, the parse tree's types are traversed and code is produced without the need for any intermediate values. We present the implementation in Section 7.3.2, but first we explain the difficulties caused by the naïve implementation.
A conventional implementation to evaluate data-parallel expressions might overload arithmetic operator functions. Consider this program fragment:
    Interval<1> I(0,3);
    Array<1, double, Brick> A(I), B(I);
    A = 1.0;
    B = 2.0;
    A += -A + 2*B;
    std::cout << A << std::endl;

Our goal is to transform the data-parallel statement A += -A + 2*B into a single loop, preferably without using intermediary containers. To simplify notation, let Ar abbreviate the type Array<1, double, Brick>.
Using overloaded arithmetic operators would require using intermediate containers to evaluate the statement. For example, the sum's left operand -A would be computed by the overloaded unary operator Ar operator-(const Ar&), which would produce an intermediate Array. Ar operator*(double, const Ar&) would produce another intermediate Array holding 2*B. Yet another intermediate container would hold their sum, all before performing the assignment. Thus, three intermediate containers would be created and destroyed. Below, we show these are unnecessary.
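For concreteness, naively overloaded operators might look like the following sketch. This is illustrative only, not POOMA's actual code; the constructor and domain accessors are schematic.

    // Naive overloaded operators: each call builds and returns a temporary Array.
    // (Illustrative sketch; loop and accessor details are schematic.)
    Ar operator-(const Ar& a)
    {
        Ar result(a.domain());                 // first temporary container
        for (int i = a.domain()[0].first(); i <= a.domain()[0].last(); ++i)
            result(i) = -a(i);
        return result;
    }

    Ar operator*(double s, const Ar& b)
    {
        Ar result(b.domain());                 // second temporary container
        for (int i = b.domain()[0].first(); i <= b.domain()[0].last(); ++i)
            result(i) = s * b(i);
        return result;
    }

An overloaded operator+ would then combine these two temporaries into a third before operator+= finally updates A, so each statement pays for several extra allocations, loops, and deallocations.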
POOMA uses PETE, the Portable Expression Template Engine framework, to evaluate data-parallel statements using efficient loops without intermediate values. PETE uses expression-template technology. Instead of evaluating a data-parallel statement's subexpressions solely at run time, it processes the statement at both compile time and run time. At compile time, it builds a parse tree of the required computations. The parse tree's type records the types of each of its subtrees. The parse tree is then evaluated using an evaluator determined, at compile time, by the left-hand side's type. This container type determines how to loop through its domain. In each loop iteration of the resulting run-time code, the corresponding value of the right-hand side is evaluated. No intermediate loops or temporary values are needed.
Before explaining the implementation, let us illustrate using our example statement A += -A + 2*B. Evaluating the right-hand side creates a parse tree similar to the one in Figure 7-2. For example, the overloaded unary minus operator yields a tree node representing -A, having a unary-minus function object, and having type
    Expression<UnaryNode<OpUnaryMinus, Ar> >

The binary nodes continue the construction process, yielding a parse tree object for the entire right-hand side and having type
    Expression<
      BinaryNode<OpAdd,
                 UnaryNode<OpUnaryMinus, Ar>,
                 BinaryNode<OpMultiply, Scalar<int>, Ar> > >

Evaluating the left-hand side yields an object representing A.
Figure 7-2. Annotated Parse Tree for -A + 2*B
The parse tree for -A + 2*B with type annotations. The complete type of a node equals the concatenation of the preorder traversal of annotated types.
Finally, the assignment operator += calls the evaluate function corresponding to the left-hand side's type. At compile time, it produces the code for the computation. Since this templated function is specialized on the type of the left-hand side, it generates a loop iterating through the left-hand side's container. To produce the loop body, the forEach function produces code for the right-hand side expression at a specific position using a post-order parse-tree traversal. At a leaf, this evaluation queries the leaf's container for a specified value or extracts a scalar value. At an interior node, its children's results are combined using its function operator. One loop performs the entire assignment. It is important to note that the type of the entire right-hand side is known at compile time. Thus, all of these evaluate, forEach, and function operator function calls can be inlined at compile time to yield simple code without any temporary containers and hopefully as fast as hand-written loops!
To implement this scheme, we need POOMA (really PETE) code to both create the parse tree and to evaluate it. We describe parse tree creation first. Parse trees consist of leaves, UnaryNodes, BinaryNodes, and TrinaryNodes. Since TrinaryNodes are similar to BinaryNodes, we omit describing them. A BinaryNode's three template parameters correspond to the three things it must store:
the type of the node's operation. For example, the OpAdd type represents adding two operands together.
the type of the left child.
the type of the right child.
The node stores the left and right children's nodes.
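A minimal sketch of such a binary node might look like the following. The member and accessor names are illustrative assumptions, not POOMA's exact definitions.

    // Illustrative sketch of a binary parse-tree node.
    // Op is an empty operation tag; A and B are the children's types.
    template<class Op, class A, class B>
    class BinaryNode
    {
    public:
        BinaryNode(const A& a, const B& b) : left_m(a), right_m(b) {}
        const A& left()  const { return left_m; }    // left child accessor
        const B& right() const { return right_m; }   // right child accessor
    private:
        A left_m;    // left child, stored by value
        B right_m;   // right child, stored by value
    };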
BinaryNode does not need to store any representation of the node's operation. Instead, the Op type is an empty structure defining a function object. For example, OpAdd's function object is declared as
    template<class T1, class T2>
    inline typename BinaryReturn<T1, T2, OpAdd>::Type_t
    operator()(const T1 &a, const T2 &b) const
    {
        return (a + b);
    }

Since it has two template arguments, it can be applied to operands of any type. Because of C++ type conversions, the type of the result is determined using the BinaryReturn traits class. Consider adding an int and a double. BinaryReturn<int, double, OpAdd>::Type_t equals double. Inlining the function ensures all this syntax is eliminated, leaving behind just an addition.
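The BinaryReturn traits class could be organized along these lines; this sketch, including the default of using the left operand's type, is an assumption for illustration rather than PETE's actual definition.

    // Illustrative sketch of a return-type traits class.
    // The primary template picks some default...
    template<class T1, class T2, class Op>
    struct BinaryReturn
    {
        typedef T1 Type_t;
    };

    // ...and specializations record the usual C++ arithmetic conversions,
    // for example int + double yields double:
    template<>
    struct BinaryReturn<int, double, OpAdd>
    {
        typedef double Type_t;
    };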
UnaryNodes are similar but have only two template parameters and store only one child.
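A corresponding UnaryNode sketch, again with illustrative names, might be:

    // Illustrative sketch of a unary parse-tree node.
    // Op is the operation's tag type; A is the single child's type.
    template<class Op, class A>
    class UnaryNode
    {
    public:
        explicit UnaryNode(const A& a) : child_m(a) {}
        const A& child() const { return child_m; }   // accessor used during traversal
    private:
        A child_m;   // the child, stored by value
    };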
Parse tree leaves are created by the CreateLeaf class and its specializations. The default leaf is a scalar so it has the most general definition:
    template<class T>
    struct CreateLeaf
    {
        typedef Scalar<T> Leaf_t;
        inline static Leaf_t make(const T &a)
        {
            return Scalar<T>(a);
        }
    };

The Scalar class stores the scalar value. CreateLeaf's Leaf_t type indicates the leaf's type. The static make function is invoked by an overloaded operator function when creating its children.
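The Scalar wrapper itself can be very small; in a sketch such as the following, the value() accessor name is an assumption.

    // Illustrative sketch of a leaf wrapper for scalar values.
    template<class T>
    class Scalar
    {
    public:
        explicit Scalar(const T& t) : scalar_m(t) {}
        const T& value() const { return scalar_m; }   // accessor name is illustrative
    private:
        T scalar_m;   // the scalar, stored by value
    };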
The CreateLeaf class is specialized for Arrays:
    template<int Dim, class T, class EngineTag>
    struct CreateLeaf<Array<Dim, T, EngineTag> >
    {
        typedef Array<Dim, T, EngineTag> Input_t;
        typedef Reference<Input_t> Leaf_t;
        typedef Leaf_t Return_t;
        inline static Return_t make(const Input_t &a)
        {
            return Leaf_t(a);
        }
    };

The Array object is stored as a Reference, rather than directly as for scalars.
To simplify the next step of overloading arithmetic operators, a parse tree's topmost type is an Expression.
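A minimal Expression wrapper might simply hold the tree and expose it; the following is an illustrative sketch, not POOMA's definition.

    // Illustrative sketch of the topmost wrapper around a parse tree.
    template<class Tree>
    class Expression
    {
    public:
        explicit Expression(const Tree& t) : tree_m(t) {}
        const Tree& expression() const { return tree_m; }   // access the wrapped tree
    private:
        Tree tree_m;   // the entire right-hand-side parse tree
    };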
Now that we have defined the node classes, the C++ arithmetic operators must be overloaded to return the appropriate parse tree. For example, the unary minus operator operator- is overloaded to accept an Array argument. It should create a UnaryNode having an Array as its child, which will be a leaf:
    template<int D1,class T1,class E1>
    inline typename MakeReturn<
      UnaryNode<OpUnaryMinus,
                typename CreateLeaf<Array<D1,T1,E1> >::Leaf_t> >::Expression_t
    operator-(const Array<D1,T1,E1> & l)
    {
        typedef UnaryNode<OpUnaryMinus,
                          typename CreateLeaf<Array<D1,T1,E1> >::Leaf_t> Tree_t;
        return MakeReturn<Tree_t>::make(
            Tree_t(CreateLeaf<Array<D1,T1,E1> >::make(l)));
    }

Tree_t specifies the node's unique type. Constructing the object first involves creating a leaf containing the Array reference through the call to
    CreateLeaf<Array<D1,T1,E1> >::make

The call to MakeReturn<Tree_t>::make permits programmers to store trees in different formats. The POOMA implementation stores them as Expressions. The function's return type mirrors the return statement, except that it extracts the type from MakeReturn's nested Expression_t type.
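Since POOMA stores trees as Expressions, MakeReturn presumably just wraps the completed subtree; a hedged sketch:

    // Illustrative sketch: wrap a completed subtree in an Expression.
    // The nested Expression_t typedef supplies the operator's return type.
    template<class Tree>
    struct MakeReturn
    {
        typedef Expression<Tree> Expression_t;
        inline static Expression_t make(const Tree& t)
        {
            return Expression_t(t);
        }
    };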
Specializing all the operators for Arrays using such complicated functions is likely to be error-prone so PETE provides a way to automate their creation. Using its MakeOperators command with this input:
    classes
    -----
    ARG = "int D[n],class T[n],class E[n]"
    CLASS = "Array<D[n],T[n],E[n]>"

automatically generates code for all the needed operators. The "[n]" strings are used to number arguments for binary and ternary operators.
Assignment operators must also be specialized for Array. Inside the Array class definition, each such operator just invokes the assign function with a corresponding function object. For example, operator+= invokes assign(*this, rhs, OpAddAssign()). rhs is the parse tree object for the right-hand side. Calling this function invokes evaluate, which begins the evaluation.
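Inside the Array class template, such an operator might look roughly like this hedged sketch; the exact signature and return type in POOMA may differ.

    // Illustrative sketch of a member assignment operator of Array.
    // rhs may be a scalar, another Array, or an Expression parse tree.
    template<class RHS>
    Array& operator+=(const RHS& rhs)
    {
        assign(*this, rhs, OpAddAssign());   // hand everything to the evaluator
        return *this;
    }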
Before we explain the evaluation, let us summarize the effect of the code described so far. If we are considering run-time evaluation, parse trees for the left-hand and right-hand sides have been constructed. If we are considering compile-time evaluation, the types of these parse trees are known. At compile time, the evaluate function described below will generate a loop iterating through the left-hand side container's domain. The loop's body will have code computing a container's value. At run time, this code will read values from containers, but the run-time parse tree object itself will not be traversed!
We now explore the evaluation, concentrating on compile time, not run time. evaluate is an overloaded function specialized on the type of the left-hand side. In our example, the left-hand side is a one-dimensional Array, so evaluate(const Ar& a, const Op& op, const RHS& rhs) is inlined into a loop like
    int end = a's domain[0].first() + a's domain[0].length();
    for (int i = a's domain[0].first(); i < end; ++i)
        op(a(i), rhs.read(i));

a is the array, op is a function object representing the assignment operation, and rhs is the right-hand side's parse tree.
Evaluating rhs.read(i) inlines into a call to the forEach function. This function performs a compile-time post-order parse-tree traversal. Its general form is
    forEach(const Expression& e, const LeafTag& f, const CombineTag& c)

That is, it traverses the nodes of the Expression object e. At leaves, it applies the operation specified by LeafTag f. At interior nodes, it combines the results using the CombineTag operator c. It inlines into a call to
    ForEach<Expression, LeafTag, CombineTag>::apply(e, f, c)

The apply function continues the traversal through the tree. For our example, LeafTag equals EvalLeaf<1>, and CombineTag equals OpCombine. The former indicates that, when reaching a leaf, the leaf should be a one-dimensional container which should be evaluated at the position stored in the EvalLeaf object. The OpCombine class applies an interior node's Op to the results of its children.
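For a binary interior node, this combination step might be expressed with a traits class along the following lines; the sketch illustrates the idea rather than reproducing PETE's exact code, and the analogous Combine1 traits class (used for unary nodes) appears in the specialization shown below.

    // Illustrative sketch: combining two children's results at a binary node.
    // With the OpCombine tag, the node's own operation Op is applied.
    template<class A, class B, class Op, class CTag>
    struct Combine2;                          // general template, specialized per tag

    template<class A, class B, class Op>
    struct Combine2<A, B, Op, OpCombine>
    {
        typedef typename BinaryReturn<A, B, Op>::Type_t Type_t;
        inline static Type_t combine(const A& a, const B& b, OpCombine)
        {
            return Op()(a, b);                // apply the node's function object
        }
    };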
ForEach structures are specialized for the various node types. For example, the specialization for UnaryNode is
    template<class Op, class A, class FTag, class CTag>
    struct ForEach<UnaryNode<Op, A>, FTag, CTag>
    {
        typedef typename ForEach<A,FTag,CTag>::Type_t TypeA_t;
        typedef typename Combine1<TypeA_t,Op,CTag>::Type_t Type_t;
        inline static Type_t
        apply(const UnaryNode<Op,A>& expr, const FTag& f, const CTag& c)
        {
            return Combine1<TypeA_t, Op, CTag>::
                combine(ForEach<A, FTag, CTag>::apply(expr.child(), f, c), c);
        }
    };

Since this structure is specialized for UnaryNodes, the first parameter of its static apply function is a UnaryNode. After recursively applying itself to the node's child, it invokes the combination function indicated by the Combine1 traits class. In our example, the c function object should be applied. Other combiners have different roles. For example, using the NullCombine tag indicates that the child's result should not be combined but is computed only for its side effects.
Leaf evaluation is the default behavior, so leaves do not require a specialization:
    template<class Expr, class FTag, class CTag>
    struct ForEach
    {
        typedef typename LeafFunctor<Expr, FTag>::Type_t Type_t;
        inline static Type_t
        apply(const Expr& expr, const FTag& f, const CTag&)
        {
            return LeafFunctor<Expr, FTag>::apply(expr, f);
        }
    };

Thus, LeafFunctor's apply member is called. Expr represents the expression type, e.g., an Array, and FTag is the LeafTag, e.g., EvalLeaf. The LeafFunctor specialization for Array passes the index stored by the EvalLeaf object to the Array's engine, which returns the corresponding value.
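That specialization might look roughly like the following sketch; the accessor names index() and engine().read() are assumptions chosen to match the earlier pseudocode, not necessarily POOMA's exact interface.

    // Illustrative sketch of the leaf functor for a one-dimensional Array,
    // evaluated at the position carried by the EvalLeaf<1> tag.
    template<class T, class EngineTag>
    struct LeafFunctor<Array<1, T, EngineTag>, EvalLeaf<1> >
    {
        typedef T Type_t;
        inline static Type_t
        apply(const Array<1, T, EngineTag>& a, const EvalLeaf<1>& f)
        {
            return a.engine().read(f.index());   // ask the engine for the value
        }
    };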
If one uses an aggressive optimizing compiler, code resulting from the evaluate function corresponds to this pseudocode:
    int end = A.domain[0].first() + A.domain[0].length();
    for (int i = A.domain[0].first(); i < end; ++i)
        A.engine(i) += -A.engine.read(i) + 2*B.engine.read(i);

The loop iterates through A's domain, using Array's engines to obtain and assign values. Notice there is no use of the run-time parse tree, so the optimizer can eliminate the code that constructs it. All the work to construct the parse tree by overloading operators is unimportant at run time, but it certainly helped the compiler produce improved code.
PETE's expression template technology may be complicated, using parse trees and their types, but the produced code is not. Using the technology is also easy. All data-parallel statements are automatically converted. In the next chapter, we explore views of containers, permitting use of container subsets and making data-parallel expressions even more useful.