**Notice:** The documents distributed by this server have been provided by the contributing authors as a means to ensure timely dissemination of scholarity and technical work on a noncommercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.

BY PARALLELIZING COMPILATION

Reiner W. Hartenstein, Karin Schmidt

University of Kaiserslautern, Dept. of Computer Science

Keywords: data-parallel machine, field-programmable logic, structural programming, procedural programming

Emanating from the technology of field-programmable logic (FPL) the new paradigm of structural programming has evolved ([6]). Instead of loading the program code as a sequence of instructions into memory (procedural programming), hardware structures are configured to fulfill the application needs (structural programming). Originally field-programmable logic has been used to accelerate the design of specific hardware. FPL technology is now available in densities that allow the (re-) configuration of complex algorithms on a small set of FPL devices within milliseconds.

A new kind of supercomputer combining the advantages of both structural programming and traditional von Neumann style procedural programming has been introduced by the architectural class of Xputers ([7]), a data-parallel machine with shared memory. One major difference from other data-parallel machines is that Xputers provide a run time customizable instruction set for data manipulations ([8]). Field-programmable logic is used to offer configurable instructions which allow a fully parallel execution in contrast to e.g. vector and other parallel computers which are mostly working in a pipelined manner. The data to be manipulated by the reconfigurable ALU (rALU) is addressed by a special data sequencer hardware containing generic address generators. These address generators provide a rich hardware-based repertory of patterns to access the data from the memory, consequently eliminating most addressing and control overhead. This results in high performance gains ([1]).

From the above several requirements arise for the development of a new Xputer compilation method. As input language the well-known imperative language C has been taken. To achieve the necessary Xputer fine grained parallelism at statement and expression level knowledge out of the supercompiler scene has been adapted. This fine grained parallelism enables the exploitation of the different rALU subnets of the Xputer. A second major issue in Xputer program compilation is the extraction of the program's data and its mapping and distribution in a regular way over the Xputer memory space. This data arrangement together with the extracted data dependences and data accesses determine the required data sequencing and thus substantially contribute to the efficiency and performance of the program execution. The proposed compilation method is working without further user interaction and is flexible in order to be driven by the hardware constraints of the actual prototype hardware target.

Before the parallelizing compilation method is explained, the Xputer target hardware is briefly sketched by introducing the Xputer prototype "MoM 3" (Map-oriented Machine 3).

The MoM 3 consists of three major parts: (1) the data sequencer with seven generic address generators (GAGs), (2) the reconfigurable ALU (rALU) with seven scan windows and several subnets, and (3) the data memory (see figure 1).

.In contrast to the one-dimensional von Neumann memory space, the Xputer's data memory is primarily organized two-dimensional by splitting the memory address into an x- and y-part like coordinates in a two-dimensional map. It holds the data of the user's programs. The data is distributed in a regular fashion over the data memory given by a generic mapping scheme (data map). Each scan window out of the rALU serves as a window to the data memory being the processor-to-memory interface of Xputers. A scan window holds data from a local neighbourhood as a copy out of data memory. It efficiently supports the exploitation of parallelism within an algorithm. Scan windows are adjustable in size during run time. The data sequencer hardware provides accessing sequences for a controlled scan window movement over the memory space. Thus the data sequencer represents the main control part of an Xputer. It consists of seven generic address generators (GAGs) operating in parallel and a control logic, which reconfigures GAGs and rALU when necessary. A generic address generator is able to compute a long sequence of addresses, so-called basic scan patterns, for the data in the data map from a relatively small parameter set (figure 2).

All data manipulations are done by the reconfigurable ALU applied to the data in the scan windows. For the MoM 3 a special reconfigurable datapath architecture (rDPA) supporting word level has been developed ([8]) for the evaluation of any arithmetic and logic expression. Each basic cell of the rDPA serves as an operator and is configured with a fixed ALU and a microprogrammed control. During the compilation the data operations are coded as structural description for further synthesis towards the rDPA.

For Xputer compilation all these problems have to be solved during compile time. First a theory is needed for the program partitioning and restructuring (parallelization). Result of this step is the determination of a partial execution sequence. Secondly the program's data has to be mapped in a regular way onto the 2-dimensionally organized Xputer data map, followed by a computation of the right address accesses (data sequencing) for each variable. Thus far all steps are target-hardware independent. Code generation for the MoM 3 results (1) in a hardware image file containing structural information for the configuration of field-programmable logic, and (2) in a software image file containing the parameter sets for the data sequencer hardware especially the generic address generators.

Partitioning of the Control-Flow. Given is the program graph G = (N, E). The node set N has to be partitioned, resulting in a number of subgraphs G*k*, 1²k²n, and an arc set E*, defining a partial execution order. The partitioning function can be given as

$pi\; :\; N\; RIGHTARROW\; INDEXES\; (not\; implemented)$ , 1²i²n,

with
$INDEXES\; (not\; implemented)\; CAP\; INDEXES\; (not\; implemented)\; =\; emptyset$
, for i j. This function computes a number of subgraphs G*k* = (
$INDEXES\; (not\; implemented)pi,INDEXES\; (not\; implemented)pi$
), 1²k²n, with the node set

$INDEXES\; (not\; implemented)pi=\; \{(n\; IN\; N)|(pi(n)\; RIGHTARROW\; INDEXES\; (not\; implemented))\}$

and the arc set

$INDEXES\; (not\; implemented)pi=\; \{((INDEXES\; (not\; implemented),\; INDEXES\; (not\; implemented))\; IN\; E)|(pi(INDEXES\; (not\; implemented))\; RIGHTARROW\; INDEXES\; (not\; implemented)^pi(INDEXES\; (not\; implemented))\; RIGHTARROW\; INDEXES\; (not\; implemented))\}$

The arcs of the cut set E* given by the original set E and all sets $INDEXES\; (not\; implemented)pi$ , and defined by

$INDEXES\; (not\; implemented)\; =\; E\; CAP\; (\; BIGCUP\; INDEXES\; (not\; implemented)pi)$

then give the partial execution order "²". For the structure of the subgraphs an additional criterion has to be formulated, namely that each subgraph has to be convex.The question is now, how the partitioning of a graph into convex subgraphs can be achieved. This is performed by specifying an equivalence relation on the node set N and building the corresponding equivalence classes which represent convex subgraphs. The definition of a partial order relation for the equivalence classes then gives the execution sequence.

Equivalence Relation Connect. Connect specifies an equivalence relation in the set of nodes N by

$aINDEXES\; (not\; implemented)b\; LRARROW\; \xb9(a)\; =\; \xb9(b)$ , $a,b\; IN\; N$

Connect is reflexive, symmetric and transitive. Connect* *partitions the node set N into disjunct, non-empty equivalence classes.

$INDEXES\; (not\; implemented)==\{b|(b\; IN\; N^aINDEXES\; (not\; implemented)b)\}$

The set of all equivalence classes, the* *quotient of N by connect is

$N/connect==\{INDEXES\; (not\; implemented)|(a\; IN\; N)\}$

In each equivalence class
$INDEXES\; (not\; implemented)$
are the nodes of N which are assigned later to one Xputer execution block B*k*. The node sets given by
$INDEXES\; (not\; implemented)pi$
, 1²k²n, correspond to the associated equivalence classes of connect and thus represent the contents of the execution blocks. A partial order relation defines the partial execution order of the equivalence classes.

The partial order relation sequence over N / connect is defined by

$INDEXES\; (not\; implemented)INDEXES\; (not\; implemented)INDEXES\; (not\; implemented)\; LRARROW\; path(a,\; b)$

It corresponds to the set of arcs E***. The presented method for control flow partitioning results in a coarse grained block sequence with a partial execution order. The blocks are still target-hardware independent.

Partitioning of the Data-Flow. The goal of the next compilation step is to maximally parallelize each of the determined blocks in the sequence. This gives the basis for a good exploitation of the available hardware resources. First the level-1-dependence graph ([3], [11]) for each of the blocks has to be built. Since all kinds of index expressions in array variables are allowed (zero index variable, e.g. A[5]= A[2], single index variable, e.g. A[i]= A[i+2], and random index variable, e.g. A[i]= A[k+j]) a hierarchical framework of according tests is needed to determine flow, anti, or output data dependences ([11]) together with the level where the dependences exist. The level-1-dependence graph is subdivided into convex subgraphs using the same method described earlier. Each of the subgraphs is then maximally vectorized by using the Allen-Kennedy Vectorization Algorithm ([3]). The result is a new sequence of maximally parallelized blocks containing a partial execution order. Vectorization then generates a maximum degree of parallelism in a statement block of a loop nest for statements having no dependences or being not part of a recurrence or member of a cycle ([3]). But this kind of parallelization is performed independent of any resource constraints. This would surely violate the Xputer hardware constraints, since the execution cannot be realized in a pipelined mode like in a vector computer. Therefore a hardware-dependent vectorization factor VF is introduced. This factor is responsible to subdivide the generated vectors into smaller parts such that an optimal target hardware exploitation can be achieved.

Vectorization Factor VF*j*. Given is a normalized loop J with lower limit l*j* and upper limit u*j* and a statement S with a variable a [f (J)], f(J) is an index function. The loop has been vectorized by replacing the index function [f (J)] by [f(l*j*) : f(u*j*)]. This vector has to be partitioned by the introduction of a hardware-dependent vectorization factor VF*j*. The factor has to be chosen such that

(1) $INDEXES\; (not\; implemented)\; IN\; N$ and

(2) (f(u*j*) - f(l*j*) + 1) mod VF*j* = 0

and the vectorized loop has to be rewritten by:

Do J = l*j* to u*j *by *VFj*

S: a [f (J) *: VFj*] ......

The introduction of a vectorization factor VF*j* has the advantage that a vectorized statement can be adapted to optimally exploit the hardware resources. The factor has to exactly divide the upper limit of a loop index function f (u*j*) minus its lower limit f (l*j*). Introducing a vectorization factor means that (1) the step widths of the according loops have to be adapted ("Do J = l*j* to u*j *by* VFj*") and (2) the index function in the variable has to be changed by the number of accessed variables at one time step ("a [f(J)*:* *VFj*]"), e.g. A [i+1, j+1:10] = C [i, j-2:10] / 2 + C [i-1, j:10]. Each vectorized loop L*j*, 1²j²n, may have its own vectorization factor VF*j*. For all variables which are contained in the same vectorized loop the same vectorization factor has to be used. Partial vectorization together with the concept of the vectorization factor transforms each of the former coarse-grained blocks B*k*, 1²k²m, into a new sequence of parallelized blocks. Each of the blocks provides the special Xputer granularity and optimally exploits the target hardware resources. This is the key for achieving a high performance during Xputer program execution.

Planarization. The two-dimensional data map DM is in contrast to the defined arrays which have higher dimensions. This leads to the mapping problem resulting in the definition of a data allocation scheme. The target hardware parameters and constraints (e.g. seven GAGs are available for the MoM 3) have to be fulfilled. This leads to the data alignment problem. Unrolling the dimensions I of a variable A defined to be d-dimensional, d>2, and $d\; IN\; N$ , means to determine a function dmap from the index domain of the associated data object to the two-dimensional index domain of an Xputer data map DM, by

dmap: $INDEXES\; (not\; implemented)ARIGHTARROW\; \{INDEXES\; (not\; implemented),INDEXES\; (not\; implemented)\}$ , with 1² i ² n.

The question is what realization strategy may be chosen for dmap. First the dimensions in the array definition are numbered from the right to the left and are then mapped. Even numbers are mapped onto the x-coordinate (DM*x*), odd numbers onto the y-coordinate (DM*y*) of the Xputers data map DM. Xputer dimension mapping is a kind of planarization.

Function dmap. Given is an n-dimensional array A with its according ranges V*j*, one for each index I*j*, 0²j²n-1. The mapping of array A can be performed dimension after dimension by the following recursive function:

dmap ( $INDEXES\; (not\; implemented)A$ ) = dmap ( $INDEXES\; (not\; implemented)A$ ) * $INDEXES\; (not\; implemented)A$

dmap ( $INDEXES\; (not\; implemented)A$ ) = $INDEXES\; (not\; implemented)A$

dmap ( $INDEXES\; (not\; implemented)A$ ) = $INDEXES\; (not\; implemented)A$

with j is defined

for the computation of M*x* by:
$j\; =\; \{n+-1,\; if\; n\; odd\; ATOP\; n+-2,\; if\; n\; even$

for the computation of M*y *by:
$j\; =\; \{n+-2,\; if\; n\; odd\; ATOP\; n+-1,\; if\; n\; even$

A consequence of the just described dimension mapping is that each variable is treated as a separate data field. This means that one GAG has to be used for each variable. However, the Xputer prototype MoM 3 provides only a limited number of GAGs (seven) which can be used simultaneously. This produces a hard constraint for the mapping. The solution of this problem is performed in phase 2 of the mapping: the alignment phase of arrays. This phase combines the data fields of suitable arrays in some way for a joint application of only one scan pattern and therefore decreases the number of GAGs needed.

Array Alignment. Let A and B1,..., Bn denote arrays. Alignment of array A with the arrays Bk, 1²k²n, means, that the mapped data space of A, called DM*A*, is related with all data spaces of Bk, DMBk. This relation is formulated as:

align: $INDEXES\; (not\; implemented)\times INDEXES\; (not\; implemented)\; \times \xc9\times \; INDEXES\; (not\; implemented)\; RIGHTARROW\; INDEXES\; (not\; implemented)$

The defined relation raises two main questions: (1) What are the possibilities for the realization of this relation? (2) Under what constraints is alignment of different data objects allowed for Xputers? The Xputer suitable alignment possibilities are restricted to block alignment and mesh alignment, where mesh can be further subdivided into column- and row-cyclic. Both issues are not treated here for lack of space.

Alignment is guided by a heuristic to support the selection of the 'best' choice of variables. Hence the Xputer data allocation scheme has been defined resulting in a data map description file. Bytewise initialization of the Xputer two-dimensional data memory is then performed by a loader.

For reasons of the two-dimensional Xputer data map the hardware of a GAG is implemented such that it generates simultaneously an address part for the y-address and an address part for the x-address. Hence the inner part of an access sequence consists of two values for the lower limit (y_lower, x_lower), two values for the upper limit (y_upper, x_upper), and two for the step widths (step_y, step_x). The computed values directly correspond to a fast scan pattern which is called videoscan ([7]). Since a loop nest for a variable a can be of an n-dimensional form "for i*m* = l*m* to u*m*", these basis parameter set has to be changed after each completion of a videoscan. The values are given by f*m*_lower, f*m*_upper, and step_f*m*, l²k²n-1, 2²l²4, and handled by the control part of the data sequencer. In the best case four nested loops can be directly performed without any further control interaction by one scan pattern. All values of an access sequence for a variable a providing random index expressions have to be computed according to the following equations.

Access Sequences. Given is an n-dimensional loop nest with each loop according to "for i*k* := l*k* to u*k* do", 0²k²n-1, with i*n-1* the outermost loop and i*0* the innermost loop, and the n-dimensional array variable a [f*n-1*(i*n-1*),..., f*1*(i*1*), f*0*(i*0*)]. The index function f*k* is of random or linear form. The ranges of variable "a" are defined by a [V*n-1*] [V*n-2*] ... [V*1*] [V*0*]. The parameters for the concept of an access sequence AS can then be computed by:

upper_x (val = u*0*), lower_x (val = l*0*):

$SUM\; ((INDEXES\; (not\; implemented)(INDEXES\; (not\; implemented),...,\; INDEXES\; (not\; implemented),\; val)+-c)$ *

$dmap\; (INDEXES\; (not\; implemented)a)\; /\; INDEXES\; (not\; implemented)\; )$ + $SUM\; INDEXES\; (not\; implemented)$

for $INDEXES\; (not\; implemented)\; IN\; INDEXES\; (not\; implemented)$ , $INDEXES\; (not\; implemented)\; NOTIN\; INDEXES\; (not\; implemented)$ , if $k\; IN\; \{0,1\}$ then c= 0 else 1

step_x:

$SUM\; (OPPARTIAL\; INDEXES\; (not\; implemented)(INDEXES\; (not\; implemented)(INDEXES\; (not\; implemented),...,\; INDEXES\; (not\; implemented),\; INDEXES\; (not\; implemented)))$ * $dmap\; (INDEXES\; (not\; implemented)a)\; /\; INDEXES\; (not\; implemented)\; )$

step_y:

$SUM\; (OPPARTIAL\; INDEXES\; (not\; implemented)(INDEXES\; (not\; implemented)(INDEXES\; (not\; implemented),...,\; INDEXES\; (not\; implemented)))$ * $dmap\; (INDEXES\; (not\; implemented)a)\; /\; INDEXES\; (not\; implemented)\; )$

upper_y (val = u*0*), lower_y (val = l*0*):

$SUM\; ((INDEXES\; (not\; implemented)(INDEXES\; (not\; implemented),...,\; INDEXES\; (not\; implemented),\; val)+-c)$ *

$dmap\; (INDEXES\; (not\; implemented)a)\; /\; INDEXES\; (not\; implemented)\; )$ + $SUM\; INDEXES\; (not\; implemented)$

for $INDEXES\; (not\; implemented)\; IN\; INDEXES\; (not\; implemented)$ , $INDEXES\; (not\; implemented)\; NOTIN\; INDEXES\; (not\; implemented)$ , if $k\; IN\; \{0,1\}$ then c = 0 else 1

f*m*_upper (val = u*k*), f*m*_lower (val = l*k*):

$(INDEXES\; (not\; implemented)(INDEXES\; (not\; implemented),...,\; val,\; ...,\; INDEXES\; (not\; implemented))-\; c\; )$ * $dmap\; (INDEXES\; (not\; implemented)a)\; /\; INDEXES\; (not\; implemented)$

for $INDEXES\; (not\; implemented)\; IN\; INDEXES\; (not\; implemented)$ , if $m\; IN\; \{0,1\}$ then c = 0 else 1

step f*m*:

$OPPARTIAL\; INDEXES\; (not\; implemented)(INDEXES\; (not\; implemented)(INDEXES\; (not\; implemented),...,\; INDEXES\; (not\; implemented),\; \xc9,\; INDEXES\; (not\; implemented)))$ * $dmap\; (INDEXES\; (not\; implemented)a)\; /\; INDEXES\; (not\; implemented)$ for $INDEXES\; (not\; implemented)\; IN\; INDEXES\; (not\; implemented)$

These equations are computed for only one special word (handle) in one scan window. All other words have to be addressed by their offsets to the handle, remaining unchanged during a scan pattern. The equations have to be adapted for the case of aligned variables or variables being vectorized.

For further informations, please contact: R.W. Hartenstein, University of Kaiserslautern, Department of Computer Science, P.O Box 3049, 67653 Kaiserslautern, Germany, email: abakus@informatik.uni-kl.de.

A. Ast, R.W. Hartenstein, H. Reinig, K. Schmidt, M. Weber: "A General Purpose Xputer Architecture Derived from DSP and Image Processing"; in M.A. Bayoumi (ed.): "VLSI Design Methodologies for Digital Signal Processing Architectures"; Kluwer Academic Publishers, pp. 365-394, 1994.

D.I. Moldovan: "Parallel Processing - From Applications to Systems"; Morgan Kaufmann Publishers, 1993.

J.R. Allen, K. Kennedy: "Automatic Translation of FORTRAN Programs to Vector Form"; ACM Transactions on Programming Languages and Systems, Vol. 9, No. 4, pp. 491-542, October 1987.

A.V. Aho, R.Sehti, J.D. Ullman: "Compilers - Principles, Techniques and Tools"; Addison-Wesley Publishing Company, 1986.

U. Banerjee: "Dependence Analysis for Super-computing"; Kluwer, 1988.

S.D. Brown, R.J. Francis, J. Rose, Z.G. Vranesic: "Field-Programmable Gate Arrays"; Kluwer Academic Publishers, 1992.

R.W. Hartenstein, A.G. Hirschbiel, K. Schmidt, M. Weber: "A Novel Paradigm of Parallel Computation and its Use to Implement Simple High-Performance-Hardware"; Future Generation Computer Systems 7, 91/92, pp. 181-198, North Holland, 1992.

R.W. Hartenstein, R. Kress, H. Reinig: "A Reconfigurable Data-Driven ALU for Xputers"; IEEE Workshop on FPGAs for Custom Computing Machines, FCCM«94, Napa, CA., April 1994.

K. Hwang: "Advanced Computer Architecture: Parallelism, Scalability, Programmability"; McGraw-Hill, 1993.

Z. Shen, Z. Li, P.-C. Yew: "An Empirical Study of Fortran Programs for Parallelizing Compiler"; IEEE Transactions on Parallel and Distributed Systems, Vol.1, No.3, pp. 356-364, July 1990.

H.P. Zima, B. Chapman: "Supercompilers for Parallel and Vector Computers"; ACM Press Frontier Series, Addison-Wesley, 1990.

For a printed version, please contact abakus@informatik.uni-kl.de