Reiner W. Hartenstein, Jürgen Becker, Rainer Kress, Helmut Reinig, Karin Schmidt
University of Kaiserslautern
Erwin-Schrödinger-Straße, D-67663 Kaiserslautern, Germany
Fax: ++49 631 205 2640, email: firstname.lastname@example.org
The architectural class of Xputers, data-parallel machines with shared memory, introduces a new machine combining the advantages of both structural programming and traditional von Neumann style procedural programming. Field-programmable logic is used to offer configurable instructions which allow fully parallel execution, in contrast to e.g. vector and other parallel computers, which mostly work in a pipelined manner.
Thus several requirements arise for the development of a new Xputer compilation method. The imperative language C has been chosen as the input language. A fine-grained parallelism at statement and expression level has been achieved in order to enable the exploitation of the reconfigurable ALU (rALU). A second major issue in Xputer program compilation is the extraction of the program's data and its mapping in a regular way onto the Xputer memory space. This data arrangement, together with the extracted data dependencies, determines the required data sequencing and thus contributes substantially to the efficiency and performance of the program execution. A third issue is the synthesis of the structural code onto the rALU.
Before the parallelizing compilation method is explained, the target hardware is briefly sketched by introducing the Map-oriented Machine 3 (MoM-3).
The MoM-3 supports up to seven generic address generators (GAGs), each with its own segment of data memory and a rALU subnet on the same board, called a computational module (C-module). The rALU subnets of all C-modules are connected to allow propagation of interim results to the next board (figure 1). That way complex operations, which require more resources than a single rALU subnet can provide, can be spread over multiple modules. As long as the data required by a rALU subnet resides in the memory segment on the same C-module, data accesses can be done in parallel to data accesses on the other modules. Otherwise, data is transferred on the common MoMbus, which enforces a sequentialization of non-local data accesses. The rALU is based on field-programmable logic (FPL). In the MoM-3 the reconfigurable datapath architecture (rDPA) is used, an FPL architecture with better throughput and higher area efficiency than commercially available FPL. The (re-)configuration of GAGs and rALU subnets is done by a special MoM-3 controller circuit (M3C). It holds all configuration data for a complete application in its memory and is able to switch between configurations by downloading them to the GAGs or rALU subnets. The MoM-3 operates as a kind of configurable co-processor to a host computer. All I/O operations, as well as the memory management, are done by the host. The MoM-3 is purely a computational device to accelerate time-critical parts of algorithms. The host has direct access to all memory on the MoM-3; conversely, the MoM-3 can access data in the host's memory, though only sequentially due to the single bus.
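The access behaviour described above can be illustrated with a simple cost model: local accesses on different C-modules overlap in time, while every non-local access occupies the shared MoMbus and is serialized. The unit-cost model and all names below are illustrative assumptions, not MoM-3 specifics.

```python
# Illustrative sketch (assumed unit costs): local accesses on distinct
# C-modules run in parallel; non-local accesses over the MoMbus serialize.

def access_cycles(accesses):
    """accesses: list of (module_id, is_local) pairs, one per data access."""
    local_per_module = {}
    bus_accesses = 0
    for module, is_local in accesses:
        if is_local:
            local_per_module[module] = local_per_module.get(module, 0) + 1
        else:
            bus_accesses += 1  # each non-local access occupies the MoMbus
    # Local accesses overlap across modules; bus accesses add up serially.
    return max(local_per_module.values(), default=0) + bus_accesses
```

Under this model, three local accesses spread over two modules cost only two cycles, while each non-local access adds a full bus cycle on top.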
For Xputer compilation all these problems have to be solved at compile time. First, a theory is needed for the program partitioning and restructuring (parallelization). The result of this step is the determination of a partial execution sequence. Secondly, the program's data has to be mapped in a regular way onto the two-dimensionally organized Xputer data map, followed by a computation of the correct address accesses (data sequencing) for each variable. Thus far all steps are target-hardware independent. Code generation for the MoM-3 results (1) in a hardware image file containing structural information for the configuration of the rALU, and (2) in a software image file containing the parameter sets for the data sequencer hardware, especially the generic address generators.
Partitioning of the Control-Flow. Given is the program graph G = (N, E). The node set N has to be partitioned, resulting in a number of subgraphs Gk, 1 ≤ k ≤ n, and an arc set E*, defining a partial execution order. For the structure of the subgraphs an additional criterion has to be fulfilled, namely that each subgraph has to be convex. The question is how the partitioning of a graph into convex subgraphs can be achieved. This is performed by specifying an equivalence relation on the node set N and building the corresponding equivalence classes, which represent convex subgraphs. The method for control flow partitioning results in a coarse-grained block sequence with a partial execution order. These blocks are still target-hardware independent.
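The convexity criterion can be checked directly: a block of the program graph is convex iff no path leaves the block and re-enters it, i.e. no outside node lies on a path between two block nodes. The following sketch assumes an edge-set graph representation; the function names are illustrative, not from the compiler.

```python
# Illustrative convexity check for a candidate block of a program graph.

def reachable(start, edges):
    """All nodes reachable from the node set `start` via directed edges."""
    seen, stack = set(start), list(start)
    while stack:
        n = stack.pop()
        for a, b in edges:
            if a == n and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen

def is_convex(block, edges):
    # A block is convex iff no outside node is both reachable from the
    # block and able to reach the block again.
    outside_after = reachable(block, edges) - block           # paths leaving
    rev = {(b, a) for a, b in edges}
    outside_before = reachable(block, rev) - block            # paths entering
    return not (outside_after & outside_before)
```

For example, in the chain a → b → c the block {a, b} is convex, but {a, c} is not, because b lies on the path between its two members.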
Partitioning of the Data-Flow. The goal of the next compilation step is to maximally parallelize each of the blocks determined in the sequence. The subgraphs are therefore maximally vectorized using the Allen-Kennedy vectorization algorithm. The result is a new sequence of maximally parallelized blocks with a partial execution order. Vectorization generates a maximum degree of parallelism in a statement block of a loop nest for statements having no dependencies and being neither part of a recurrence nor member of a cycle. Here the special Xputer granularity is taken into consideration and the hardware resources are exploited optimally. This is the key to achieving high performance during Xputer program execution.
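The vectorization criterion above can be sketched on the statement dependence graph: a statement may be turned into a vector operation iff it is not part of a dependence cycle (recurrence), i.e. it cannot reach itself in the graph. The encoding below is an illustrative assumption, not the Allen-Kennedy implementation itself.

```python
# Illustrative test for the recurrence criterion used in vectorization.

def in_recurrence(stmt, deps):
    """True iff `stmt` can reach itself in the dependence graph `deps`."""
    stack = [b for a, b in deps if a == stmt]
    seen = set()
    while stack:
        n = stack.pop()
        if n == stmt:
            return True          # found a dependence cycle through stmt
        if n not in seen:
            seen.add(n)
            stack += [b for a, b in deps if a == n]
    return False

# S2 depends on its own previous iteration (a recurrence), S1 does not:
deps = {('S1', 'S2'), ('S2', 'S2')}
vectorizable = [s for s in ('S1', 'S2') if not in_recurrence(s, deps)]
```

Here only S1 is vectorizable; S2 must remain inside a sequential loop, which is exactly the distinction the algorithm exploits when distributing loops.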
Planarization. The two-dimensional data map DM contrasts with defined arrays of higher dimension. This leads to the mapping problem, resulting in the definition of a data allocation scheme. The target hardware parameters (e.g. seven GAGs are available for the MoM-3) have to be fulfilled. This leads to the data alignment problem. Unrolling the dimensions I_i of a variable A defined to be d-dimensional, d > 2, means determining a function dmap from the index domain of the data object to the two-dimensional index domain of an Xputer data map DM, by
(1) dmap: I_d × … × I_1 → DM_x × DM_y, with 1 ≤ i ≤ d.
The dimensions in the array definition are numbered from right to left and are then mapped: even-numbered dimensions are mapped onto the x-coordinate (DMx), odd-numbered ones onto the y-coordinate (DMy) of the Xputer data map DM. Xputer dimension mapping is thus a planarization.
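The even/odd folding rule just described can be sketched as a small index transformation. The row-major folding order within each coordinate is an assumption here; only the even-to-x, odd-to-y numbering is taken from the text.

```python
# Illustrative sketch of planarization: array dimensions are numbered
# right to left; even-numbered dimensions fold into the x-coordinate,
# odd-numbered ones into the y-coordinate of the 2-D data map.

def dmap(index, shape):
    """Map a d-dimensional index (given left to right, as in a C array
    definition) to a 2-D data-map coordinate (x, y)."""
    x = y = 0
    x_extent = y_extent = 1
    # Reverse so that dimension number 1 is the rightmost dimension.
    for num, (i, n) in enumerate(zip(index[::-1], shape[::-1]), start=1):
        if num % 2 == 0:            # even dimension number -> x-coordinate
            x += i * x_extent
            x_extent *= n
        else:                       # odd dimension number -> y-coordinate
            y += i * y_extent
            y_extent *= n
    return x, y
```

For a hypothetical 3-D array A[2][3][4], dimensions of size 4 and 2 (numbers 1 and 3, odd) fold into y, while the size-3 dimension (number 2, even) becomes x, so the whole array occupies a 3 × 8 region of the data map.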
The task of configuring the rDPA is carried out in the following four phases: logic optimization and technology mapping, placement and routing, I/O scheduling, and finally code generation. Partitioning of the statements onto the different rDPA chips is not necessary, since the array of rDPA chips appears as one array of DPUs with transparent chip boundaries.
 D.I. Moldovan: Parallel Processing - From Applications to Systems; Morgan Kaufmann Publishers, 1993.
 J.R. Allen, K. Kennedy: Automatic Translation of FORTRAN Programs to Vector Form; ACM Transactions on Programming Languages and Systems, Vol. 9, No. 4, pp. 491-542, October 1987.
 U. Banerjee: Dependence Analysis for Supercomputing; Kluwer, 1988.
 Reiner W. Hartenstein, Rainer Kress, Helmut Reinig: A Reconfigurable Accelerator for 32-Bit Arithmetic; International Parallel Processing Symposium, Santa Barbara, USA, April 1995.
 R.W. Hartenstein, R. Kress, H. Reinig: A Reconfigurable Data-Driven ALU for Xputers; IEEE Workshop on FPGAs for Custom Computing Machines, FCCM'94, Napa, CA, April 1994.
 K. Hwang: Advanced Computer Architecture: Parallelism, Scalability, Programmability; McGraw-Hill, 1993.
 K. Schmidt: A Program Partitioning, Restructuring, and Mapping Method for Xputers; Ph.D. Thesis, University of Kaiserslautern, 1994.
 Z. Shen, Z. Li, P.-C. Yew: An Empirical Study of Fortran Programs for Parallelizing Compiler; IEEE Transactions on Parallel and Distributed Systems, Vol.1, No.3, pp. 356-364, July 1990.
 H.P. Zima, B. Chapman: Supercompilers for Parallel and Vector Computers; ACM Press Frontier Series, Addison-Wesley, 1990.