University of Kaiserslautern
Erwin-Schrödinger-Straße, D-67663 Kaiserslautern, Germany
Fax: ++49 631 205 2640, email: firstname.lastname@example.org
The scene of hardware/software co-design has introduced a number of approaches to speed up performance, to optimize the hardware/software trade-off, and to reduce total design time. Further co-design approaches are advocated by the FCCM scene, where a von Neumann host implementation is extended by adding external operators configured on field-programmable circuits. Usually this configuration requires hardware experts. The von Neumann paradigm does not support "soft" hardware efficiently, because of the extremely tight coupling between its instruction sequencer and the ALU. So a new paradigm is required, like that of Xputers, which conveniently supports "soft" ALUs such as the rALU concept (reconfigurable ALU). Such a new class of hardware platforms needs a new class of compilers, which generate both sequential and structural code: partitioning compilers, which partition a source into cooperating structural and sequential code segments. In such an environment, hardware/software co-design requires two levels of partitioning: host/accelerator partitioning (first level) for optimizing performance, and structural/sequential partitioning (second level) for optimizing the hardware/software trade-off of the Xputer resources. This paper presents a framework based on this paradigm. The hardware/software co-design framework CoDe-X targets the partitioning and compilation of conventional C programs onto a host that uses the Xputer as a universal configurable accelerator. In contrast to other hardware/software co-design approaches, no additional programming extensions are required and the framework is not limited to hardware experts. The partitioning is based on a simulated annealing algorithm. The fundamentally new feature of the CoDe-X framework is its two-level co-design approach.
This paper first describes the underlying target hardware. Section 3 presents the hardware/software co-design framework CoDe-X, showing the different partitioning strategies and sketching the tools used for the compilation and synthesis tasks. Section 4 discusses the partitioning of a computation-intensive example from the area of fractal images.
The current prototype of the Xputer operates as a kind of configurable co-processor to a host computer. All I/O operations, as well as the memory management, are done by the host. The host has direct access to all memory on the Xputer, and conversely the Xputer can access data in the host's memory. The reconfigurable datapath architecture (rDPA), which serves as rALU, has been developed especially for evaluating statement blocks in loops and other arithmetic and logic computations.
The optimizations provided by the X-C compiler and the datapath synthesis system make it possible to realize several implementations of the reconfigurable datapath architecture with different performance/area trade-offs in the second level of the CoDe-X partitioning. This trade-off space is called the design space. Figure 4 marks the regions that can be reached with the different optimization techniques. Pipelining in vector statements can be used for all implementations where vector statements occur; it reduces the area of the implementation drastically. When the realized implementation is bound by I/O operations, this can be done without losing performance; in other cases the speed decreases slightly. Loop folding and loop unrolling can be used when the statement block occurs only in inner loops. Loop folding uses pipelining across loop iterations or loop boundaries. This technique always increases performance without requiring additional area; it is therefore very attractive and is used by default by the DPSS. Loop unrolling always yields the highest speed, but depending on the number of unrolled iterations, the area increase may be significant. For data dependencies between the iterations of the loop (overlapping scan windows), memory cycle optimization can be applied, resulting in a further increase in speed. Of course, the proposed methods can be combined to reach further positions in the design space.
The performance of the host is approximated by examining the code using branch prediction techniques. A separate model is required for each processor type and workstation. The behaviour of the underlying hardware and operating system has to be deterministic and known; this concerns the timing behaviour of all hardware components, the effects of caching, pipelining, etc. The operating system must provide static memory management, system calls should have a calculable timing behaviour, and no asynchronous interrupts should be allowed. Figure 4 shows a grey shaded line giving the performance of the host. For each basic block several acceleration factors can be computed. For each point in the design space that can be reached with the Xputer implementation, the acceleration factor is computed as:
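The formula itself is missing from this copy of the text. From the surrounding description (host time estimated per basic block, Xputer time per reachable design-space point), a plausible reconstruction is the usual ratio form:

```latex
a_{\mathrm{BB}} \;=\; \frac{t_{\mathrm{host}}(\mathrm{BB})}{t_{\mathrm{Xputer}}(\mathrm{BB})}
```

where $t_{\mathrm{host}}$ and $t_{\mathrm{Xputer}}$ denote the estimated execution times of the basic block on the host and on the chosen Xputer implementation, respectively.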
Partitioner. The partitioning phase in the first level is based on a simulated annealing algorithm, controlled by the profiler-computed cost functions of the basic blocks. This algorithm uses iterative improvement and belongs to the class of probabilistic algorithms. Basic blocks of a chosen partitioning are swapped. If the new partitioning results in a lower overall cost, it is accepted. If not, there is still a finite probability of acceptance, which depends on the temperature of the annealing process. This prevents the algorithm from getting stuck in a local minimum. The temperature is very high at the beginning and is gradually lowered.
The actual X-C compiler takes an almost complete subset of ANSI C as input. Only constructs that would require dynamic memory management at run time on the Xputer are excluded: pointers, operating system calls, and recursive functions. Since the host's operating system takes care of memory management and I/O, the software parts executed on the Xputer do not need these constructs. Especially for scientific computing, the restrictions of the C subset are not that important, since FORTRAN 77 lacks the same features and is most popular in scientific computing. The X-C compiler analyzes, restructures, and parallelizes the program part for the Xputer. Without user interaction, it computes the parameter sets for the data sequencer (software part) and a rALU description for further synthesis in the datapath synthesis system (hardware part). This realizes a constructive, resource-driven partitioning in the second level of hardware/software co-design. First, the compiler performs a data and control flow analysis. The resulting data structure allows restructuring to perform parallelizations like those done by compilers for supercomputers (e.g. vectorization of statements). The next step re-orders data accesses to obtain access sequences that can be mapped well to the parameters of the data sequencer. To this end, the compiler generates a so-called data map, which describes how the input data has to be distributed in the data memory to obtain optimized hardware-generated address sequences.
The datapath synthesis system (DPSS) maps statements from a high-level language description onto the rDPA. The statements may contain arithmetic or logic expressions, conditions, and loops that evaluate iterative computations on a small number of input data.
Each time the Xputer is accessed, the partitioner integrates a synchronization point into the code segments to be executed on the host. The code at these points consists of operating-system calls for synchronizing parallel processes. The tasks for the host are compiled with the GNU C compiler.
The function that is iterated to give the Mandelbrot set is very simple (equation (2)):

z = z^2 + c (2)
where z is a complex number, initialised to a starting value and then allowed to change as the function is iterated, and c is a complex number that stays constant for the sequence of iterations. To compute the (black/white) Mandelbrot set, the computer screen is first split into a grid (the size of the fractal image), where real numbers are represented on the x-axis and imaginary numbers on the y-axis. Then a point on this grid is selected, representing a particular complex number, which we call c: the real part is represented by the x position and the imaginary part by the y position. Now another complex number, called z, is set to 0, and equation (2) is iterated until the absolute value of z goes off to infinity (behaving chaotically) or, after a pre-specified number of iterations, is still fairly small (tending towards a specific attractor). If the function iterates off to infinity after a number of steps, the point on the screen corresponding to the real and imaginary values of c is plotted in "white". If the value is still small after a pre-determined number of iteration steps, we assume that this value of c will not iterate to infinity, and the corresponding point is coloured "black". This process generates the Mandelbrot set.
Our small example was divided into four basic blocks. Two of them were filtered out in the first step of the iterative partitioner to be executed on the host in any case (H-BB1, H-BB4: I/O routines). The remaining two basic blocks (BB2, BB3), two nested for-loops (with a while-loop inside the inner for-loop computing the complex Mandelbrot iteration), are potential candidates for mapping onto the Xputer, because the preprocessor could transform them into equivalent X-C code. The profiler computes the host and Xputer cost functions for them. The data dependencies require a sequential execution of the four basic blocks, which influences the overall execution time of this application. In the following simulated annealing process, different partitionings were analyzed. The final decision was to merge BB2 and BB3 into a single compound basic block and to shift this block to the Xputer: the partitioning overhead for synchronization and data transfers made it too inefficient to compute, for example, BB2 on the host while executing BB3 on the Xputer. The X-C compiler on the second level of hardware/software co-design then maps the compound basic block onto the Xputer. In this final solution the Xputer must be configured only once. The address generation of the outer for-loop is handled by one generic address generator (GAG, see figure 1) and that of the inner for-loop by another GAG. At execution time a controller within the data sequencer switches from one GAG to the other without reconfiguration. In different simulation runs with this application we achieved acceleration factors of up to 6 in comparison with a SUN SPARC 10/51. The while-loop within the inner for-loop is mapped directly onto the rALU.
The hardware/software co-design framework CoDe-X is completely specified. The X-C compiler and the datapath synthesis system (DPSS) have been implemented. The partitioner and the profiler are currently being implemented.
K. Buchenrieder, C. Veith: CODES: A Practical Concurrent Design Environment. Workshop handouts 1st Int'l Workshop on Hardware/Software Co-Design, Estes Park, Colorado, 1992.
Reiner W. Hartenstein, Jürgen Becker, Rainer Kress, Helmut Reinig, Karin Schmidt: A Reconfigurable Machine for Applications in Image and Video Compression. Conference on Compression Technologies and Standards for Image and Video Compression, Amsterdam, The Netherlands, March 1995.
R. W. Hartenstein, R. Kress: A Datapath Synthesis System for the Reconfigurable Datapath Architecture. Asia and South Pacific Design Automation Conference, ASP-DAC'95, Nippon Convention Center, Makuhari, Chiba, Japan, Aug. 29 - Sept. 1, 1995.
R. K. Gupta, C. N. Coelho, G. De Micheli: Program Implementation Schemes for Hardware-Software Systems. IEEE Computer, pp. 48-55, Jan. 1994.
W. Luk, T. Lu, I. Page: Hardware-Software Codesign of Multidimensional Programs. Proc. of the IEEE Workshop on FPGAs for Custom Computing Machines, FCCM'94, Napa, CA, April 1994.
Schmidt: A Program Partitioning, Restructuring, and Mapping Method for Xputers. Ph.D. Thesis, University of Kaiserslautern, 1994.
N. A. Sherwani: Algorithms for Physical Design Automation. Kluwer Academic Publishers, Boston 1993.
Thomas Ball, James R. Larus: Branch Prediction for Free. Proc. of the ACM SIGPLAN '93 Conference on Programming Language Design and Implementation, pp. 300-313, ACM Press, 1993.
M. Wolfe, C.-W. Tseng: The Power Test for Data Dependence. IEEE Transactions on Parallel and Distributed Systems, Vol. 3, No. 5, Sept. 1992.
Joe Pritchard: The Chaos Cookbook. Butterworth-Heinemann, UK, 1992.
P. Puschner, Ch. Koza: Calculating the Maximum Execution Time of Real-Time Programs. The Journal of Real-Time Systems, pp. 159-176, Kluwer Academic Publishers, 1989.
Heinz-Otto Peitgen, Dietmar Saupe: The Science of Fractal Images. Springer-Verlag, New York, 1988.