

COMBINING STRUCTURAL AND PROCEDURAL PROGRAMMING
BY PARALLELIZING COMPILATION


Reiner W. Hartenstein, Karin Schmidt

University of Kaiserslautern, Dept. of Computer Science

Keywords: data-parallel machine, field-programmable logic, structural programming, procedural programming

Abstract

A new architectural class of high-performance data-parallel machines, called Xputers, is presented which combines structural programming with traditional von Neumann control-flow (procedural) programming. From this combination arises a new programming paradigm that is unfamiliar to the typical software developer. To counteract this deficiency, an automatic parallelization and compilation method for Xputers has been developed for the input language C. Sources are restructured and partitioned into an Xputer-suitable execution sequence providing parallelism at expression and statement level. Data is mapped in a regular form onto the Xputer memory space so that it is accessible to the Xputer's data sequencer hardware, which provides a generic set of fast address sequences. The data operations within each part of the derived execution sequence are coded as a structural description for further synthesis towards the reconfigurable ALU, which is based on field-programmable logic. Additionally, assembly code is produced to control program execution through the data sequencer hardware. The entire method performing the paradigm shift works without further user interaction, and all steps are driven by parameters describing the actual target hardware configuration.

Introduction

Today we are facing increasingly complex tasks to be performed by computers. Many of these tasks are computation-intensive, requiring high data throughput and high performance. From empirical studies ([10]) it can be concluded that the major part of computation time is spent in rather simple loop constructs. Since these loop constructs additionally operate on indexed array data structures, ordinary von Neumann style computers are burdened mainly with addressing computations rather than actual data manipulations. First efforts to reduce addressing overhead and to introduce parallelism have been undertaken with the development of pipelined and vector supercomputers ([2], [9]). Together with the achievements in supercomputer technology, parallelizing compilers have been developed, where compilation is based on data dependence analysis ([5], [11]). They have to be able to extract the parallelism from a program source and to produce executable code for different parallel target machines. Unfortunately, the hardware structures do not reflect the structure of the algorithms very well. Therefore, the compiler's task of directing the algorithm to the machine resources restricts the exploitation of the algorithm's inherent parallelism to a large extent.

Emanating from the technology of field-programmable logic (FPL), the new paradigm of structural programming has evolved ([6]). Instead of loading the program code as a sequence of instructions into memory (procedural programming), hardware structures are configured to fulfill the application needs (structural programming). Originally, field-programmable logic was used to accelerate the design of specific hardware. FPL technology is now available in densities that allow the (re-)configuration of complex algorithms on a small set of FPL devices within milliseconds.

A new kind of supercomputer combining the advantages of both structural programming and traditional von Neumann style procedural programming has been introduced by the architectural class of Xputers ([7]), a data-parallel machine with shared memory. One major difference from other data-parallel machines is that Xputers provide a run-time customizable instruction set for data manipulations ([8]). Field-programmable logic is used to offer configurable instructions which allow fully parallel execution, in contrast to e.g. vector and other parallel computers, which mostly work in a pipelined manner. The data to be manipulated by the reconfigurable ALU (rALU) is addressed by a special data sequencer hardware containing generic address generators. These address generators provide a rich hardware-based repertoire of patterns for accessing the data in memory, consequently eliminating most addressing and control overhead. This results in high performance gains ([1]).

From the above, several requirements arise for the development of a new Xputer compilation method. As input language, the well-known imperative language C has been chosen. To achieve the necessary Xputer fine-grained parallelism at statement and expression level, techniques from the supercompiler field have been adapted. This fine-grained parallelism enables the exploitation of the different rALU subnets of the Xputer. A second major issue in Xputer program compilation is the extraction of the program's data and its mapping and distribution in a regular way over the Xputer memory space. This data arrangement, together with the extracted data dependences and data accesses, determines the required data sequencing and thus substantially contributes to the efficiency and performance of the program execution. The proposed compilation method works without further user interaction and is flexible enough to be driven by the hardware constraints of the actual prototype hardware target.

Before the parallelizing compilation method is explained, the Xputer target hardware is briefly sketched by introducing the Xputer prototype "MoM 3" (Map-oriented Machine 3).

The Xputer Prototype "MoM 3"

Many applications from the areas of digital signal processing, image processing, electronic design automation, and others require intensively iterative data manipulations on large amounts of data, e.g. statement blocks in nested loops. The new architectural class of Xputers ([7], [1]) with its third prototype "MoM 3" (Map-oriented Machine 3) is especially designed to reduce the von Neumann bottleneck of repetitive decoding and address interpretation. This bottleneck contributes a significant amount to the run time of algorithms from these areas (90% in image processing, 58% in DSP [1]). Although the MoM 3 may serve as a stand-alone machine, it is currently embedded as a general-purpose co-processor in a VMEbus-based workstation. After setup, the MoM 3 runs independently of the host computer until the complete application is processed. Setup in this case means that the host software has to load the application data into the MoM 3 data memory, load the GAG parameter sets, the rALU configuration code, and the program for the instruction sequencer into the MoM 3 control memory, and initiate execution.

The MoM 3 consists of three major parts: (1) the data sequencer with seven generic address generators (GAGs), (2) the reconfigurable ALU (rALU) with seven scan windows and several subnets, and (3) the data memory (see figure 1).

In contrast to the one-dimensional von Neumann memory space, the Xputer's data memory is primarily organized two-dimensionally by splitting the memory address into an x- and a y-part, like coordinates on a two-dimensional map. It holds the data of the user's programs. The data is distributed in a regular fashion over the data memory, given by a generic mapping scheme (data map). Each scan window of the rALU serves as a window onto the data memory and forms the processor-to-memory interface of the Xputer. A scan window holds a copy of the data from a local neighbourhood of the data memory. It efficiently supports the exploitation of parallelism within an algorithm. Scan windows are adjustable in size during run time. The data sequencer hardware provides access sequences for a controlled scan window movement over the memory space. Thus the data sequencer represents the main control part of an Xputer. It consists of seven generic address generators (GAGs) operating in parallel and a control logic which reconfigures the GAGs and the rALU when necessary. A generic address generator is able to compute a long sequence of addresses, so-called basic scan patterns, for the data in the data map from a relatively small parameter set (figure 2).
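To make the scan window concept concrete, consider the following hedged illustration in C, the compiler's input language. The kernel, its array names, and its sizes are hypothetical examples rather than material from the paper, but they show the typical pattern: each access to img[y+dy][x+dx] falls into a small local neighbourhood that fits into one scan window, and the doubly nested loop over (y, x) corresponds to one regular scan pattern generated by a GAG.

/* Hypothetical 3x3 mean filter kernel of the kind the MoM 3 targets. */
#define H 512
#define W 512

void mean_filter(const int img[H][W], int out[H][W])
{
    for (int y = 1; y < H - 1; y++)        /* scan window moves row by row  */
        for (int x = 1; x < W - 1; x++) {  /* ... and column by column      */
            int sum = 0;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++)
                    sum += img[y + dy][x + dx];  /* 3x3 neighbourhood access */
            out[y][x] = sum / 9;
        }
}

On a von Neumann machine, a large share of the run time of such a kernel goes into computing the nine addresses per pixel; on the MoM 3, the data sequencer produces these addresses in hardware.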

All data manipulations are performed by the reconfigurable ALU on the data in the scan windows. For the MoM 3 a special reconfigurable datapath architecture (rDPA) operating at word level has been developed ([8]) for the evaluation of any arithmetic and logic expression. Each basic cell of the rDPA serves as an operator and is configured with a fixed ALU and a microprogrammed control. During compilation, the data operations are coded as a structural description for further synthesis towards the rDPA.

The Compilation Method

A partitioning, restructuring, and mapping method is needed to translate a sequential C program into code which can be executed on an Xputer. This paradigm switch shall be performed without further user interaction. The method deals with fundamental problems similar to those of compiling a program for parallel execution on a multiprocessor system. These problems are: (1) identify and extract potential parallelism, (2) partition the program into a sequence of execution units according to the granularity of the architecture and the hardware constraints, (3) compute an efficient allocation scheme for the data in the Xputer data map, and (4) generate efficient and fast code.

For Xputer compilation, all these problems have to be solved at compile time. First, a theory is needed for the program partitioning and restructuring (parallelization). The result of this step is a partial execution sequence. Secondly, the program's data has to be mapped in a regular way onto the two-dimensionally organized Xputer data map, followed by the computation of the appropriate address accesses (data sequencing) for each variable. Thus far all steps are target-hardware independent. Code generation for the MoM 3 results (1) in a hardware image file containing structural information for the configuration of field-programmable logic, and (2) in a software image file containing the parameter sets for the data sequencer hardware, especially the generic address generators.
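As a hedged illustration of the kind of fine-grained parallelism the following steps extract (the fragment and its variable names are hypothetical, not taken from the paper): the two statements S1 and S2 below carry no mutual data dependence, so they can be mapped to different rALU subnets, while within each statement the expression tree itself is evaluated in parallel by the rDPA.

/* Two independent statements inside one loop body. */
void example(int n, const int a[], int b[], const int c[],
             int d[], const int e[])
{
    for (int i = 0; i < n; i++) {
        b[i] = a[i] * 2 + c[i];   /* S1: depends only on a and c */
        d[i] = a[i] - e[i];       /* S2: depends only on a and e */
    }
}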

Determination of a Program Execution Sequence

The result of parsing the program is a graph representation G = (N, E) with node set N and arc set E ([4]). The control flow of the program has to be partitioned first in order to find parallelizable subgraphs. This step is followed by a data partitioning through partial vectorization based on the level-k dependence graph ([11]).

Partitioning of the Control Flow. Given is the program graph G = (N, E). The node set N has to be partitioned, resulting in a number of subgraphs Gk, 1 ≤ k ≤ n, and an arc set E* defining a partial execution order. The partitioning function can be given as

π : N → Ni , 1 ≤ i ≤ n,

with Ni ∩ Nj = ∅ for i ≠ j. This function computes a number of subgraphs Gk = (Nk, Ek), 1 ≤ k ≤ n, with the node set

Nk = { n ∈ N | π(n) = Nk }

and the arc set

Ek = { (ni, nj) ∈ E | π(ni) = Nk ∧ π(nj) = Nk }

The arcs of the cut set E*, given by the original set E and all sets Ek, and defined by

E* = E \ ( E1 ∪ E2 ∪ … ∪ En )

then give the partial execution order "⪯". For the structure of the subgraphs an additional criterion has to be formulated, namely that each subgraph has to be convex. The question is now how the partitioning of a graph into convex subgraphs can be achieved. This is performed by specifying an equivalence relation on the node set N and building the corresponding equivalence classes, which represent convex subgraphs. The definition of a partial order relation for the equivalence classes then gives the execution sequence.

Equivalence Relation Connect. Connect specifies an equivalence relation on the set of nodes N by

a connect b ⟺ π(a) = π(b) , a, b ∈ N

Connect is reflexive, symmetric, and transitive. Connect partitions the node set N into disjoint, non-empty equivalence classes

[a] = { b ∈ N | a connect b }

The set of all equivalence classes, the quotient of N by connect, is

N/connect = { [a] | a ∈ N }

Each equivalence class [a] contains the nodes of N which are later assigned to one Xputer execution block Bk. The node sets Nk, 1 ≤ k ≤ n, correspond to the associated equivalence classes of connect and thus represent the contents of the execution blocks. A partial order relation defines the partial execution order of the equivalence classes.

The partial order relation sequence over N/connect is defined by

[a] sequence [b] ⟺ path(a, b)

It corresponds to the set of arcs E*. The presented method for control-flow partitioning results in a coarse-grained block sequence with a partial execution order. The blocks are still target-hardware independent.
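A minimal sketch in C of this partitioning step (my reconstruction for illustration; the node count, the arcs, and the assignment π are hypothetical): nodes with equal π(n) form one equivalence class / execution block Bk, and every arc of E whose endpoints lie in different classes becomes a member of the cut set E*, i.e. an arc of the partial execution order.

#include <stdio.h>

#define NODES 6
#define ARCS  5

int main(void)
{
    /* pi[n]: block index assigned to node n (assumed to be given). */
    int pi[NODES] = { 0, 0, 1, 1, 2, 2 };
    /* arcs (u, v) of the program graph G = (N, E) */
    int arc[ARCS][2] = { {0,1}, {1,2}, {2,3}, {1,4}, {4,5} };

    for (int k = 0; k < ARCS; k++) {
        int bu = pi[arc[k][0]], bv = pi[arc[k][1]];
        if (bu != bv)   /* arc crosses two classes: member of E* */
            printf("B%d precedes B%d\n", bu, bv);
    }
    return 0;
}

Running the sketch prints the partial order B0 precedes B1 and B0 precedes B2; the blocks B1 and B2 remain unordered with respect to each other.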

Partitioning of the Data Flow. The goal of the next compilation step is to maximally parallelize each of the determined blocks in the sequence. This provides the basis for a good exploitation of the available hardware resources. First, the level-1 dependence graph ([3], [11]) has to be built for each of the blocks. Since all kinds of index expressions in array variables are allowed (zero index variables, e.g. A[5] = A[2], single index variables, e.g. A[i] = A[i+2], and random index variables, e.g. A[i] = A[k+j]), a hierarchical framework of corresponding tests is needed to determine flow, anti, and output data dependences ([11]), together with the level at which the dependences exist (an illustration follows below). The level-1 dependence graph is subdivided into convex subgraphs using the same method described earlier. Each of the subgraphs is then maximally vectorized using the Allen-Kennedy vectorization algorithm ([3]). The result is a new sequence of maximally parallelized blocks with a partial execution order. Vectorization generates a maximum degree of parallelism in a statement block of a loop nest for statements which have no dependences and are not part of a recurrence or member of a cycle ([3]). But this kind of parallelization is performed independently of any resource constraints. This would violate the Xputer hardware constraints, since the execution cannot be realized in a pipelined mode as in a vector computer. Therefore a hardware-dependent vectorization factor VF is introduced. This factor subdivides the generated vectors into smaller parts such that an optimal target hardware exploitation can be achieved.
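As a hedged illustration of the three dependence kinds (the fragment is hypothetical, not from the paper):

void deps(int n, int a[], int b[], int c[])
{
    for (int i = 1; i < n; i++) {
        a[i] = b[i] + 1;     /* S1 */
        c[i] = a[i - 1];     /* S2: flow dependence on S1 at level 1
                                (a[i-1] was written one iteration earlier) */
        b[i - 1] = c[i];     /* S3: anti dependence on S1 at level 1
                                (b[i-1] was read by S1 one iteration earlier) */
        a[i] = c[i] * 2;     /* S4: output dependence on S1, loop-independent
                                (the same a[i] is written twice) */
    }
}

Only statements free of such dependences, or at least not caught in a recurrence, survive vectorization unchanged.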

Vectorization Factor VFj. Given is a normalized loop J with lower limit lj and upper limit uj, and a statement S with a variable a[f(J)], where f(J) is an index function. The loop has been vectorized by replacing the index function [f(J)] by [f(lj) : f(uj)]. This vector has to be partitioned by the introduction of a hardware-dependent vectorization factor VFj. The factor has to be chosen such that

(1) VFj ∈ ℕ and

(2) (f(uj) − f(lj) + 1) mod VFj = 0

and the vectorized loop has to be rewritten as:

Do J = lj to uj by VFj

S: a[f(J) : VFj] ...

The introduction of a vectorization factor VFj has the advantage that a vectorized statement can be adapted to optimally exploit the hardware resources. The factor has to exactly divide the vector length f(uj) − f(lj) + 1. Introducing a vectorization factor means that (1) the step widths of the corresponding loops have to be adapted ("Do J = lj to uj by VFj"), and (2) the index function in the variable has to be extended by the number of elements accessed in one time step ("a[f(J) : VFj]"), e.g. A[i+1, j+1:10] = C[i, j-2:10] / 2 + C[i-1, j:10]. Each vectorized loop Lj, 1 ≤ j ≤ n, may have its own vectorization factor VFj. For all variables contained in the same vectorized loop, the same vectorization factor has to be used. Partial vectorization together with the concept of the vectorization factor transforms each of the former coarse-grained blocks Bk, 1 ≤ k ≤ m, into a new sequence of parallelized blocks. Each of the blocks provides the special Xputer granularity and optimally exploits the target hardware resources. This is the key to achieving high performance during Xputer program execution.
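Expressed in sequential C, the effect of the vectorization factor corresponds to strip-mining, sketched below under the assumption that VF divides the trip count exactly (the array names and sizes are hypothetical):

/* Each chunk of VF elements matches the width of the configured
 * rALU resources and is executed in parallel on the rDPA. */
#define N  100
#define VF 10              /* hardware-dependent; N mod VF == 0 must hold */

void scale(int a[N], const int c[N])
{
    for (int j = 0; j < N; j += VF)    /* step width adapted to VF   */
        for (int v = 0; v < VF; v++)   /* VF elements per time step  */
            a[j + v] = c[j + v] / 2;
}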

Data Mapping and Data Aligning

The next step in compilation is to decide how the program data (variables, arrays, ...) can be mapped onto the two-dimensionally organized Xputer data map DM = {DMx, DMy} in a regular fashion, and how the data fields of differently mapped data variables can be aligned to a combined data field in order to use only one scan pattern.

Planarization. The two-dimensional data map DM contrasts with defined arrays of higher dimension. This leads to the mapping problem, resulting in the definition of a data allocation scheme. The target hardware parameters and constraints (e.g. seven GAGs are available on the MoM 3) have to be fulfilled; this leads to the data alignment problem. Unrolling the dimensions I of a variable A defined to be d-dimensional, d > 2, d ∈ ℕ, means determining a function dmap from the index domain of the associated data object to the two-dimensional index domain of an Xputer data map DM, by

dmap : (Ii)A → {DMx, DMy} , 0 ≤ i ≤ n−1

The question is which realization strategy may be chosen for dmap. First, the dimensions in the array definition are numbered from right to left and are then mapped. Even numbers are mapped onto the x-coordinate (DMx), odd numbers onto the y-coordinate (DMy) of the Xputer's data map DM. Xputer dimension mapping is thus a kind of planarization.

Function dmap. Given is an n-dimensional array A with its corresponding ranges Vj, one for each index Ij, 0 ≤ j ≤ n−1. The mapping of array A can be performed dimension after dimension by the following recursive function:

dmap( (Ij, Ij-2, …)A ) = dmap( (Ij-2, …)A ) · Vj + Ij , j ≥ 2

dmap( (I1)A ) = I1

dmap( (I0)A ) = I0

where j is defined

for the computation of Mx by: j = n−1 if n is odd, j = n−2 if n is even

for the computation of My by: j = n−2 if n is odd, j = n−1 if n is even
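A minimal sketch in C of this planarization (my reconstruction of the recursion above, unrolled iteratively; the example array is hypothetical):

#include <stdio.h>

/* dmap((Ij, Ij-2, ...)A) = dmap((Ij-2, ...)A) * Vj + Ij,
   with base cases dmap((I0)A) = I0 and dmap((I1)A) = I1.
   V[k] is the range of dimension k, I[k] the index value;
   dimensions are numbered 0..n-1 from the rightmost index. */
void dmap(int n, const int V[], const int I[], int *x, int *y)
{
    *x = I[0];                                            /* even dims -> DMx */
    for (int k = 2; k < n; k += 2) *x = *x * V[k] + I[k];
    *y = (n > 1) ? I[1] : 0;                              /* odd dims  -> DMy */
    for (int k = 3; k < n; k += 2) *y = *y * V[k] + I[k];
}

int main(void)
{
    /* 3-dimensional array A[4][3][5]: numbering the dimensions from
       right to left gives V[0] = 5, V[1] = 3, V[2] = 4. */
    int V[3] = { 5, 3, 4 };
    int I[3] = { 2, 1, 3 };             /* element A[3][1][2] */
    int x, y;
    dmap(3, V, I, &x, &y);
    printf("x = %d, y = %d\n", x, y);   /* x = 2*4 + 3 = 11, y = 1 */
    return 0;
}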

A consequence of the dimension mapping just described is that each variable is treated as a separate data field. This means that one GAG has to be used for each variable. However, the Xputer prototype MoM 3 provides only a limited number of GAGs (seven) which can be used simultaneously. This imposes a hard constraint on the mapping. This problem is solved in phase 2 of the mapping, the array alignment phase, which combines the data fields of suitable arrays so that a single scan pattern can be applied jointly, thereby decreasing the number of GAGs needed.

Array Alignment. Let A and B1, …, Bn denote arrays. Alignment of array A with the arrays Bk, 1 ≤ k ≤ n, means that the mapped data space of A, called DMA, is related to all data spaces DMBk of the Bk. This relation is formulated as:

align : DMA × DMB1 × … × DMBn → DMA,B1,…,Bn

The defined relation raises two main questions: (1) What are the possibilities for realizing this relation? (2) Under what constraints is the alignment of different data objects allowed for Xputers? The Xputer-suitable alignment possibilities are restricted to block alignment and mesh alignment, where mesh alignment can be further subdivided into column-cyclic and row-cyclic. Both issues are not treated here for lack of space.

Alignment is guided by a heuristic which supports the selection of the 'best' choice of variables. Hereby the Xputer data allocation scheme is fully defined, resulting in a data map description file. Bytewise initialization of the two-dimensional Xputer data memory is then performed by a loader.

Determination of the Data Sequencing

The accesses to the program's data variables through their indices are needed for the generation of scan patterns, from which the parameter sets for the data sequencer are derived. For each variable, an access sequence is determined according to its indices and its data field, which has been mapped into two-dimensional form. The access sequences can then be used to compute corresponding scan patterns and parameter sets. The computation of an access sequence is influenced by the mapping, the alignment, the index expressions, and the corresponding loop limits (upper, lower, step width).

Because of the two-dimensional Xputer data map, the hardware of a GAG is implemented such that it simultaneously generates an address part for the y-address and an address part for the x-address. Hence the inner part of an access sequence consists of two values for the lower limit (y_lower, x_lower), two values for the upper limit (y_upper, x_upper), and two for the step widths (step_y, step_x). The computed values directly correspond to a fast scan pattern which is called videoscan ([7]). Since a loop nest for a variable a can have an n-dimensional form "for im = lm to um", this basic parameter set has to be changed after each completion of a videoscan. The values are given by fm_lower, fm_upper, and step_fm, 2 ≤ m ≤ n−1, and are handled by the control part of the data sequencer. In the best case, four nested loops can be performed directly by one scan pattern without any further control interaction. All values of an access sequence for a variable a with random index expressions have to be computed according to the following equations.

Access Sequences. Given is an n-dimensional loop nest with each loop of the form "for ik := lk to uk do", 0 ≤ k ≤ n−1, with in-1 the outermost loop and i0 the innermost loop, and the n-dimensional array variable a[fn-1(in-1), …, f1(i1), f0(i0)]. The index functions fk are of random or linear form. The ranges of variable a are defined by a[Vn-1][Vn-2]…[V1][V0]. The parameters of an access sequence AS can then be computed by:

upper_x (val = u0), lower_x (val = l0):

Σ ( (fk(in-1, …, i1, val) − c) · dmap((Ik)a) / Ik ) + Σ ( fm(in-1, …, i1, i0) · dmap((Im)a) / Im )

for Ik ∈ DMx with i0 ∈ fk, Im ∈ DMx with i0 ∉ fm; c = 0 if k ∈ {0, 1}, else c = 1

step_x:

Σ ( ∂fk(in-1, …, i1, i0)/∂i0 · dmap((Ik)a) / Ik ) for Ik ∈ DMx

step_y:

Σ ( ∂fk(in-1, …, i1, i0)/∂i1 · dmap((Ik)a) / Ik ) for Ik ∈ DMy

upper_y (val = u1), lower_y (val = l1):

Σ ( (fk(in-1, …, val, i0) − c) · dmap((Ik)a) / Ik ) + Σ ( fm(in-1, …, i1, i0) · dmap((Im)a) / Im )

for Ik ∈ DMy with i1 ∈ fk, Im ∈ DMy with i1 ∉ fm; c = 0 if k ∈ {0, 1}, else c = 1

fm_upper (val = um), fm_lower (val = lm):

( fm(in-1, …, val, …, i0) − c ) · dmap((Im)a) / Im for Im ∈ DM; c = 0 if m ∈ {0, 1}, else c = 1

step_fm:

∂fm(in-1, …, im, …, i0)/∂im · dmap((Im)a) / Im for Im ∈ DM

These equations are computed for only one special word (the handle) in each scan window. All other words are addressed by their offsets to the handle, which remain unchanged during a scan pattern. The equations have to be adapted for variables which are aligned or vectorized.
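For the common special case of linear index functions, the parameters reduce to evaluating each index function at the loop limits and taking its slope as the step width. A hedged sketch (my reconstruction; the index functions and loop limits are hypothetical) for a 2-deep loop nest "for i1 = l1 to u1 / for i0 = l0 to u0" accessing a[f1(i1)][f0(i0)], with dimension 0 mapped to x, dimension 1 mapped to y, and map coefficients of 1:

#include <stdio.h>

typedef struct { int s, o; } lin;     /* linear index function f(i) = s*i + o */

int main(void)
{
    lin f0 = { 2, 1 };                /* column index: a[...][2*i0 + 1] */
    lin f1 = { 1, 0 };                /* row index:    a[i1][...]       */
    int l0 = 0, u0 = 9, l1 = 0, u1 = 4;

    int lower_x = f0.s * l0 + f0.o;   /* f0(l0)      */
    int upper_x = f0.s * u0 + f0.o;   /* f0(u0)      */
    int step_x  = f0.s;               /* d f0 / d i0 */
    int lower_y = f1.s * l1 + f1.o;   /* f1(l1)      */
    int upper_y = f1.s * u1 + f1.o;   /* f1(u1)      */
    int step_y  = f1.s;               /* d f1 / d i1 */

    printf("x: %d..%d step %d, y: %d..%d step %d\n",
           lower_x, upper_x, step_x, lower_y, upper_y, step_y);
    return 0;
}

These six values parameterize one videoscan; outer loops, if present, would modify them through the fm parameters after each completed scan.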

Code Generation

For each block of the generated sequence, corresponding scan windows and access sequences have been generated, which can now be transformed into structural and procedural code for the MoM 3 hardware. The hardware image file, containing the structural information for the configuration of the rDPA ([8]), consists first of a set of scan window definitions together with their types (e.g. integer, float); the second part consists solely of a sequence of assignments which can be computed in parallel. The software image file for the data sequencer hardware comprises an initialization part (e.g. memory size and segments, number and names of the rALU subnets), a load and write strategy for the data of every scan window in use, and the parameter sets for the data sequencing of every GAG.

Conclusions

The paper has sketched a new data-parallel machine, the Xputer, and proposed a parallelizing compilation method for this machine. The method combines structural and procedural programming, according to the Xputer paradigm of a data sequencer hardware and an FPGA-based reconfigurable ALU. Because the data sequencer avoids repetitive address computations, high performance factors can be achieved ([1]). The parallelizing compilation method realizes the paradigm shift from the von Neumann paradigm (imposed by the choice of C as input language) to the Xputer computing principles without further user interaction. This allows the programmer to use the advantages of a new machine paradigm without learning a new programming language. The method compiles such that the special Xputer fine granularity is achieved together with a high exploitation of the available hardware resources. This yields high acceleration gains. The implementation of the proposed compilation method is complete, and the performance factors are currently being evaluated.

For further information, please contact: R.W. Hartenstein, University of Kaiserslautern, Department of Computer Science, P.O. Box 3049, 67653 Kaiserslautern, Germany, email: abakus@informatik.uni-kl.de.

References

[1]

A. Ast, R.W. Hartenstein, H. Reinig, K. Schmidt, M. Weber: "A General Purpose Xputer Architecture Derived from DSP and Image Processing"; in M.A. Bayoumi (ed.): "VLSI Design Methodologies for Digital Signal Processing Architectures"; Kluwer Academic Publishers, pp. 365-394, 1994.

[2]

D.I. Moldovan: "Parallel Processing - From Applications to Systems"; Morgan Kaufmann Publishers, 1993.

[3]

J.R. Allen, K. Kennedy: "Automatic Translation of FORTRAN Programs to Vector Form"; ACM Transactions on Programming Languages and Systems, Vol. 9, No. 4, pp. 491-542, October 1987.

[4]

A.V. Aho, R. Sethi, J.D. Ullman: "Compilers - Principles, Techniques and Tools"; Addison-Wesley Publishing Company, 1986.

[5]

U. Banerjee: "Dependence Analysis for Super-computing"; Kluwer, 1988.

[6]

S.D. Brown, R.J. Francis, J. Rose, Z.G. Vranesic: "Field-Programmable Gate Arrays"; Kluwer Academic Publishers, 1992.

[7]

R.W. Hartenstein, A.G. Hirschbiel, K. Schmidt, M. Weber: "A Novel Paradigm of Parallel Computation and its Use to Implement Simple High-Performance-Hardware"; Future Generation Computer Systems 7, 91/92, pp. 181-198, North Holland, 1992.

[8]

R.W. Hartenstein, R. Kress, H. Reinig: "A Reconfigurable Data-Driven ALU for Xputers"; IEEE Workshop on FPGAs for Custom Computing Machines, FCCM94, Napa, CA., April 1994.

[9]

K. Hwang: "Advanced Computer Architecture: Parallelism, Scalability, Programmability"; McGraw-Hill, 1993.

[10]

Z. Shen, Z. Li, P.-C. Yew: "An Empirical Study of Fortran Programs for Parallelizing Compilers"; IEEE Transactions on Parallel and Distributed Systems, Vol. 1, No. 3, pp. 356-364, July 1990.

[11]

H.P. Zima, B. Chapman: "Supercompilers for Parallel and Vector Computers"; ACM Press Frontier Series, Addison-Wesley, 1990.

