Notice: The documents distributed by this server have been provided by the contributing authors as a means to ensure timely dissemination of scholarity and technical work on a noncommercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.
Reiner W. Hartenstein, Jürgen Becker, Rainer Kress, Helmut Reinig
University of Kaiserslautern
Erwin-Schrödinger-Straße, D-67663 Kaiserslautern, Germany
Fax: ++49 631 205 2640, email: firstname.lastname@example.org
For such a new class of hardware platforms a new class of compilers is needed, which generate both, sequential and structural code: partitioning compilers, which partition a source into cooperating structural and sequential code segments. In such an environment hardware/software co-design efforts require two levels of partitioning: host/accelerator partitioning (first level) for optimizing performance and a structural/sequential partitioning (second level) for optimizing the hardware/software trade-off of the Xputer resources. This paper presents a framework based on this paradigm. The hardware/software co-design framework CoDe-X targets the partitioning and compilation of conventional C programs onto a host using the Xputer as universal configurable accelerator. In contrast to other hardware/software co-design approaches no additional programming extensions are required and the framework use is not limited to hardware experts. The partitioning is based on a simulated annealing algorithm. The fundamentally new feature of the CoDe-X framework is the two-level co-design approach.
First this paper describes the underlying target hardware. Section 3 presents the hardware/software co-design framework CoDe-X and its strategies. Section 4 discusses the partitioning of an computation-intensive example in the area of fractal images.
The large amount of input data typically requires an algorithm to be organized in (nested) loops, where different array elements are referenced as operands to the computations of the current iteration. The resulting sequence of data accesses shows a regularity, which allows to describe this sequence by a few parameters. An Xputer data sequencer provides seven hardwired generic address generators (GAGs), which are capable of interpreting such parameter sets to compute generic address sequences for the scan windows. This address computation does not require memory cycles. Thatęs why it is an accelerating factor.
To be run on an Xputer, an algorithm has to be transformed into parameters sets for the generic address generators and a configuration of the rALU, which implements the data manipulations within a loop body. Each time, the generic address generators have accessed all input data of a loop iteration, the complex operations configured into the rALU are applied to the input data and the outputs can be written to the data memory by the generic address generators. Since a compiler for an Xputer may rearrange the data in memory to optimize the data access sequences, a data map is required to describe the distribution of input and output data in the data memory.
The current prototype of the Xputer operates as a kind of configurable co-processor to a host computer. All I/O operations are done by the host as well as the memory management. The host has direct access to all memory on the Xputer, and on the other hand the Xputer can access data in the host's memory. The reconfigurable datapath architecture (rDPA) which serves as rALU has been developed especially for evaluating statement blocks in loops and other arithmetic and logic computations .
The partitioning into parts for the host and parts for the Xputer is based on the performance analysis of a profiler (Figure 2). For each basic block and also for the compound BBs the time for evaluation on the Xputer as well as the evaluation on the host is approximately computed (cost function of an BB). The performance data for every Xputer task is received from the datapath synthesis system (DPSS) and the X-C compiler. The X-C compiler is able to evaluate the number of iterations for the statement blocks. Knowing the evaluation time for the statement block from the datapath synthesis system, the complete evaluation time can be approximated. The time consuming placement of the DPSS does not have to be performed for approximating the execution time of the statement blocks. The performance of the host is approximated by examining the code using branch prediction techniques . For each processor type and workstation another model is required. The behaviour of the underlying hardware and operating system has to be deterministic and known. This implies the timing behaviour of all hardware components, the effects of caching, pipelining, etc. The operating system must provide static memory management, system calls should have a calculable timing behaviour, and no asynchronous interrupts should be allowed .
The iterative partitioning phase in the first level of CoDe-X is based on a simulated annealing algorithm , which is controlled by the profiler-computed cost functions of the BBs. This algorithm uses iterative improvement and it belongs to the probabilistic algorithms. The basic blocks of a chosen partitioning are swapped randomly. If the new partitioning results in a lower overall cost (due to the data dependencies the profiler takes concurrent evaluations into consideration), it is accepted. If not, there is still a finite probability to accept which depends on the temperature of the annealing process. This avoids that the algorithm gets stuck in a local minimum. The temperature is very high at the beginning and is gradually lowered.
The actual X-C compiler  takes an almost complete subset of ANSI C as input. Only constructs, which would require a dynamic memory management to be run on the Xputer are excluded. These are pointers, operating system calls and recursive functions. The X-C compiler is analyzing, restructuring and parallelizing the program part for the Xputer. It computes the parameter sets for the data sequencer (software part) and a rALU description for further synthesis in the datapath synthesis system (hardware part), without user interaction. This realizes in the second level of hardware/software co-design a constructive and resource-driven partitioning. First, the compiler performs a data and control flow analysis. The data structure obtained allows restructuring to perform parallelizations like those done by compilers for supercomputers (e.g. vectorization of statements). The next step performs a re-ordering of data accesses to obtain access sequences, which can be mapped well to the parameters of the data sequencer. Therefore, the compiler generates a so-called data map, which describes the way the input data has to be distributed in the data memory to obtain optimized hardware generated address sequences.
The datapath synthesis system (DPSS)  allows to map statements from a high level language description onto the rDPA. The statements may contain arithmetic or logic expressions, conditions, and loops, that evaluate iterative computations on a small number of input data.
At each time the Xputer is accessed, a synchronization point is integrated by the iterative partitioner into code-segments to be executed on the host. The code at these points consists of operating-system calls for synchronizing parallel processes. The tasks for the host are compiled with the GNU C compiler.
The function that is iterated to give the Mandelbrot set is very simple (equation (1)):
where z is a complex number, which is initialized to a starting value and then is allowed to alter as the function is iterated, and c is a complex number, that stays constant for the sequence of iterations.
The C-program of our small example was divided into four basic blocks. Two of them were filtered in the first step of the iterative partitioner for being executed on the host in every case (I/O-routines). The remaining two basic blocks are two nested for-loops (within a while-loop inside the inner for-loop for computing the complex Mandelbrot set). They are potential candidates for mapping onto the Xputer, because the preprocessor could transform them in equivalent X-C code. The profiler has computed the host- and Xputer-cost functions for them. The data dependencies require a sequential execution of the four basic blocks, which will influence the overall execution time of this application. In the following simulated annealing process different partitions have been analyzed. The final decision was to merge the two nested for loops to a single compound basic block and to shift this block to the Xputer. The partitioning overhead for synchronization and data transfers was to inefficient for computing one loop on the host and executing the other on the Xputer. The X-C compiler in the second level of hardware/software co-design was then mapping the compound basic block onto the Xputer. In this final solution the Xputer has to be configured only one time. The address generation of the outer for-loop is handled by one generic address generator (GAG, see Figure 1) and the inner for-loop by another GAG. During execution time a controler within the data sequencer will switch from one GAG to the second GAG without reconfiguration. The while-loop within the inner for-loop is mapped directly onto the rALU. In simulation examples with this application we achieved acceleration factors of up to 6 in comparison to a SUN SPARC 10/51.
 R. Ernst, J. Henkel, T. Benner: Hardware-Software Cosynthesis for Microcontrollers; IEEE Design & Test, pp. 64-75, Dec. 1993
 Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines, FCCM'94, Napa, CA, April 1994
 R. K. Gupta, C. N. Coelho, G. De Micheli: Program Implementation Schemes for Hardware-Software Systems: IEEE Computer, pp. 48-55, Jan. 1994
 L. Lin: High-Level Synthesis, Introd. to Chip and Syst. Design; Kluwer Acad. Pub., Boston, London, 1992
 R. W. Hartenstein, J. Becker, R. Kress, H. Reinig, K. Schmidt: A Reconfigurable Machine for Applications in Image and Video Compression; Conf. on Compression Techn. & Standards for Image & Video Compression, Amsterdam, Netherlands, March 1995
 R. W. Hartenstein, R. Kress: A Datapath Synthesis System for the Reconfigurable Datapath Architecture; Asia and South Pacific Design Automation Conference, ASP-DAC'95, Nippon Convention Center, Makuhari, Chiba, Japan, Aug. 29 - Sept. 1, 1995
 W. Luk, T. Lu, I. Page: Hardware-Software codesign of Multidimensional Programs; Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines, FCCM'94, Napa, CA, April 1994
 Heinz-Otto Peitgen, Dietmar Saupe: The Science of Fractal Images; Springer-Press, New-York, 1988
 Joe Pritchard: The Chaos Cookbook; Butterworth-Heinemann, UK, 1992
 P. Puschner, Ch. Koza: Calculating the Maximum Execution Time of Real-Time Programs; Journal of Real-Time Systems p. 159-176, Kluwer Academic Publishers 1989
 K. Schmidt: A Program Partitioning, Restructuring, and Mapping Method for Xputers; Ph.D. Thesis, University of Kaiserslautern, 1994
 M. Wolfe, C.-W. Tseng: The Power Test for Data Dependence; IEEE Transactions on Parallel and Distributed Systems, Vol. 3, No. 5, Sept. 1992