University of Kaiserslautern
Erwin-Schrödinger-Straße, D-67663 Kaiserslautern, Germany
Fax: ++49 631 205 2640, email: firstname.lastname@example.org
In contrast to other related work in this field [EHB93], our solution requires no programming-language extensions, which supports the portability of applications. Our partitioning is also based on a simulated annealing algorithm. The host and the Xputer can run in parallel, depending on the analysis of the data dependencies of the program. Another difference is the possibility of partly reconfiguring the accelerator while host and Xputer are running. In contrast to Gupta and De Micheli [GCM94], who use C with additional hardware features, we place no such restrictions on the input of our framework. While their partitioning is at instruction level, we perform a coarse-grained partitioning at basic-block level, which reduces the communication overhead. In comparison to Custom Computing Machines, our partitioning into software and hardware parts, including synchronization and communication, is fully automatic. In the CoDe-X framework, complete applications can be moved and compiled onto the Xputer without using commercial hardware-synthesis tools. The divide-and-conquer strategies used by Luk, Lu and Page [LWP94] are good examples of an efficient partitioning into parts for the host and the accelerator; the approach described in this paper is not limited to such algorithms. The fundamentally new feature of the CoDe-X framework is its two-level co-design approach. The paper is organized as follows: Section 3 presents the hardware/software co-design framework CoDe-X. The different algorithms for the partitioning are shown, and the tools used for evaluating the compilation and synthesis tasks are sketched. Section 4 discusses the partitioning of a computation-intensive example from the area of image processing.
The large amount of input data typically requires an algorithm to be organized in (nested) loops, where different array elements are referenced as operands to the computations of the current iteration. The resulting sequence of data accesses shows a regularity that allows this sequence to be described by a small number of parameters. An Xputer data sequencer provides up to seven hardwired generic address generators (GAGs), which are capable of interpreting such parameter sets to compute generic address sequences for the scan windows. These seven GAGs can operate in parallel [HBK95]. To be run on an Xputer, an algorithm has to be transformed into parameter sets for the generic address generators and a configuration of the rALU, which implements the data manipulations within a loop body. Each time the generic address generators have accessed all input data of a loop iteration, the complex operations configured into the rALU are applied to the input data, and the outputs can be written to the data memory. Since a compiler for an Xputer may rearrange the data in memory to optimize the data access sequences, a data map is required to describe the distribution of input and output data in the data memory.
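The idea of describing a regular address sequence by a parameter set can be illustrated with a small sketch in plain C. The structure and field names below are illustrative assumptions, not the actual GAG parameter format (which is hardware-specific, see [HBK95]):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical GAG parameter set: a 2D scan over a row-major array,
 * described by a base address, strides, and iteration counts. */
typedef struct {
    int base;     /* start address              */
    int stride_x; /* step within a row          */
    int stride_y; /* step from one row to next  */
    int count_x;  /* accesses per row           */
    int count_y;  /* number of rows             */
} gag_params;

/* Expand the parameter set into the generic address sequence it
 * describes; returns the number of addresses written to out. */
size_t gag_sequence(const gag_params *p, int *out, size_t max)
{
    size_t n = 0;
    for (int y = 0; y < p->count_y; y++)
        for (int x = 0; x < p->count_x; x++) {
            if (n == max)
                return n;
            out[n++] = p->base + y * p->stride_y + x * p->stride_x;
        }
    return n;
}
```

A hardwired GAG interprets such a set directly, so the whole access pattern of a nested loop is captured by a handful of registers instead of instruction streams.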
Our most recent Xputer prototype operates as a kind of configurable co-processor to a host computer. All I/O operations, as well as the memory management, are done by the host. The host has direct access to all memory on the Xputer; conversely, the Xputer can access data in the host's memory. The reconfigurable datapath architecture (rDPA), which serves as rALU, has been developed especially for evaluating statement blocks in loops and other arithmetic and logic computations [HaKr95]. The rDPA consists of a regular array of identical processing elements called datapath units (DPUs, figure 2). Each DPU has two input and two output registers. The dataflow direction is from west and/or north to east and/or south. The operation of the DPUs is data-driven: an operation is performed as soon as its required operands are available. The communication between neighbouring DPUs is synchronized by a handshake. A global I/O bus has been integrated into the rDPA, permitting the DPUs to write from their output registers directly to the outside of the array and to read directly from the outside. This means that input data to expressions mapped into the rDPA need not be routed through the DPUs. The communication between an external controller, or host, and the DPUs is synchronized by a handshake, like the internal communications. An extensible set of operators for each DPU is provided by a library; the set includes the operators of the programming language C. The operators of the DPUs are configurable. A DPU is implemented using a fixed ALU and a microprogrammed control. The configuration is data-driven, and therefore no special timing has to be considered. Together with the rALU controller, the rDPA forms the data-driven rALU, as shown in figure 2. The control chip consists of a control unit, a register file, and an address generation unit for addressing the DPUs. The register file is useful for optimizing memory cycles.
The rDPA control unit holds a program to control the different parts of the data-driven rALU. The current version of the rDPA is described in [HaKr95].
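The data-driven firing rule of a DPU can be sketched in a few lines of C. All names here are illustrative (this is not the real rDPA interface); the point is only that the configured operation executes as soon as both input registers hold valid operands, with no clocking or timing analysis involved:

```c
#include <assert.h>

/* Minimal model of one DPU: two input registers with valid flags,
 * one output register, and a configurable operator. */
typedef struct {
    int in_a, in_b;
    int a_valid, b_valid;
    int out;
    int out_valid;
    int (*op)(int, int); /* configured operator, e.g. a C operator */
} dpu;

/* Deliver one operand over the (abstracted) handshake; the DPU
 * fires as soon as both operands are available. */
void dpu_put(dpu *d, int port, int value)
{
    if (port == 0) { d->in_a = value; d->a_valid = 1; }
    else           { d->in_b = value; d->b_valid = 1; }
    if (d->a_valid && d->b_valid) {   /* data-driven: fire now */
        d->out = d->op(d->in_a, d->in_b);
        d->out_valid = 1;
        d->a_valid = d->b_valid = 0;  /* operands are consumed */
    }
}

static int add_op(int a, int b) { return a + b; }
```

In the real array the result would then be offered to the east/south neighbour or to the global I/O bus, again via handshake.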
The optimizations provided by the X-C compiler and the datapath synthesis system make it possible to realize several implementations of the reconfigurable datapath architecture with different performance/area trade-offs in the second level of partitioning of CoDe-X. This trade-off space is called the design space. Figure 5 marks the regions that can be reached with the different optimization techniques. Pipelining in vectorized statements can be used for all implementations in which vector statements occur. It reduces the area of the implementation drastically. When the realized implementation is bound by I/O operations, this may be done without losing performance; in other cases the speed will decrease slightly. Loop folding and loop unrolling can be used when the statement block occurs in an inner loop. Loop folding uses pipelining across loop iterations or loop boundaries. This technique always increases performance without requiring additional area. Loop unrolling results in most cases in the highest speed; depending on the number of unrolled iterations, the area increase may be significant. For data dependencies between the iterations of the loop (overlapping scan windows), memory cycle optimization can be applied, resulting in a further increase of speed. Of course, the proposed methods can be combined to reach further positions in the design space.
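The synthesis system applies loop unrolling to datapaths, not to C source, but the transformation itself is easy to illustrate in plain C (function names and the factor 2 are chosen only for this sketch):

```c
#include <assert.h>

/* Original loop: one statement-block instance per iteration. */
void scale(const int *in, int *out, int n, int k)
{
    for (int i = 0; i < n; i++)
        out[i] = in[i] * k;
}

/* Unrolled by 2: two copies of the loop body per iteration, so two
 * statement-block instances (roughly twice the DPUs in hardware)
 * can work concurrently. An epilogue handles an odd trip count. */
void scale_unrolled(const int *in, int *out, int n, int k)
{
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        out[i]     = in[i]     * k;  /* instance 1 */
        out[i + 1] = in[i + 1] * k;  /* instance 2 */
    }
    if (i < n)
        out[i] = in[i] * k;          /* epilogue */
}
```

This is the performance/area trade-off in miniature: the unrolled version exposes more parallel work per iteration at the cost of duplicating the operator hardware.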
The performance of the host is approximated by examining the code. A separate model is required for each processor type and workstation. The behaviour of the underlying hardware and operating system has to be deterministic and known. This concerns the timing behaviour of all hardware components, the effects of caching, pipelining, etc. The operating system must provide static memory management, system calls should have a calculable timing behaviour, and no asynchronous interrupts should be allowed [PuKo89]. Branch prediction techniques [BaLa93] are taken into account when computing the execution time of a code segment. Figure 5 shows a grey shaded line giving the performance of the host. For each basic block, several acceleration factors can be computed. The initial point in the design space requires n operators, i.e. DPUs, for implementing the function of this basic block. If loop folding (pipelining across loop boundaries) is possible, the performance (acceleration factor) increases up to max. fold without requiring more DPUs. If pipelining in vectorized statements can be performed, the performance is the same (max. pipe) in the best case, or the acceleration factor decreases; however, fewer DPUs are used than before. The highest increase of the acceleration factor is gained by loop unrolling (max. unroll), which requires the most hardware resources of the proposed optimization techniques. If two iterations of a loop body are unrolled, then 2n DPUs are needed (see Figure 5). The values of max. fold, max. pipe and max. unroll depend on the particular basic block under consideration.
For each point in the design space that can be reached with the Xputer implementation, the acceleration factor is computed as the ratio of the host execution time of the basic block to its execution time on the Xputer:

acceleration factor = t_host(BB) / t_Xputer(BB)   (1)
First, an initial partitioning for the simulated annealing process is computed: the basic blocks that were identified by the preprocessor in the first level of partitioning as possible candidates for execution on the Xputer are placed in two partitions based on their acceleration factors (see equation (1)). One partition contains the basic blocks to be executed on the host, the other the basic blocks to be executed on the Xputer (figure 7). Basic blocks containing I/O routines or dynamic structures have been filtered out beforehand and are always executed on the host (white basic blocks H-BBs in Figure 7).
The PERTURB function changes the partitioning randomly in one of several ways:
In general, the costs are determined by the delay times of the basic blocks plus the penalty costs for synchronization and possible reconfiguration of the Xputer during run time. The COST function gives the complete cost by determining the overall execution time of the algorithm under the current partitioning. The cost of a partitioning is determined by the profiler according to equation (2):

cost = t_host + t_Xputer + t_reconf + t_sync - t_overlap   (2)

with
t_host: sum of the execution times of the basic blocks executed on the host
t_Xputer: sum of the execution times of the basic blocks executed on the Xputer
t_reconf: sum of the delay times for reconfiguring the Xputer during run time
t_sync: sum of the delay times for synchronizing host and Xputer (operating-system calls)
t_overlap: sum of the overlapping execution times of basic blocks executed simultaneously on host and Xputer
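The combination of these delay-time sums into an overall cost can be sketched as follows. The symbol names are illustrative, and the sketch assumes the overlapping execution time is subtracted, since work done by host and Xputer in parallel does not add to the total execution time:

```c
#include <assert.h>

/* Overall execution time of a partitioning, combining the delay-time
 * sums described above (all times in the same unit, e.g. seconds). */
double partition_cost(double t_host, double t_xputer,
                      double t_reconf, double t_sync,
                      double t_overlap)
{
    return t_host + t_xputer + t_reconf + t_sync - t_overlap;
}
```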
Due to the data dependencies, the profiler takes concurrent evaluations into consideration. Moreover, successive data-dependent basic blocks can be fused into a compound basic block, depending on the available Xputer hardware resources. This may result in a lower delay time than the sum of the delay times of the individual basic blocks; such delay times are considered as well. The profiler also considers possible delay times for reconfiguring the Xputer during run time. This becomes necessary if a partitioning moves more basic blocks to the Xputer than parallel GAG or rALU units are available. If possible, the reconfiguration is performed in parallel to running host tasks. For faster access, all delay times are stored in a lookup table. To speed up the simulated annealing algorithm, only the cost increase or decrease due to the exchange is computed; the remaining basic blocks are not considered. For controlling the simulated annealing process, the COST function can additionally be multiplied by a specific weight function (e.g. an exponential function), which leads to a faster move towards optimized solutions. The cooling schedule is controlled by the linear function f(temp) = temp * 0.95. The RANDOM() function returns a random number in the range between its first and second argument.
The task of configuring the rDPA is carried out in the following four phases: logic optimization and technology mapping, placement and routing, I/O scheduling, and finally the code generation. Partitioning of the statements onto the different rDPA chips is not necessary since the array of rDPA chips appears as one array of DPUs with transparent chip boundaries.
For further details about the four phases of the DPSS and examples of the mapping onto the rDPA from a high level language description see [HaKr95].
Each time the Xputer is accessed, a synchronization point is inserted by the first-level partitioner into the program to be executed on the host. The code at these points consists of operating-system calls for synchronizing parallel processes (fork/join etc.). The tasks for the host are then compiled with the GNU C compiler.
g(x, y) = (1/M) * sum over (n,m) in S of f(n, m)

for x, y = 0, 1, ..., N-1. S is the set of coordinates of points in the neighborhood of (but not including) the point (x,y), and M is a pre-computed normalization factor. The structure of the corresponding C program is illustrated in Figure 8. This small example for illustrating the methods of CoDe-X is divided into four basic blocks. Two of them were filtered out in the first step of the iterative partitioner to be executed on the host in any case (H-BB1, H-BB4: I/O routines). The remaining two basic blocks (BB2, BB3) are potential candidates for mapping onto the Xputer. The profiler computes the host and Xputer cost functions for them. The data dependencies require a sequential execution of the four basic blocks, which influences the overall execution time of this application. The final decision was to shift BB2 and BB3 to the Xputer. The X-C compiler on the second level of hardware/software co-design then maps these basic blocks onto the Xputer. In this final solution, the Xputer has to be configured only once. The address generation for the data accesses in the fully nested loop of BB2 is handled by one generic address generator (GAG, see Figure 1) and that of BB3 by another GAG. During run time, a controller within the data sequencer switches from one GAG to the other without reconfiguration. The performance/area trade-off of BB3 is illustrated in Figure 9; it was derived by implementing this block on a SUN SPARC 10/51 (287.0 seconds) and by analyzing its implementation on the current Xputer prototype (unrolled version in 19.6 seconds, assuming 120 ns memory access time). This results in an acceleration factor of 14.6. By additionally dividing the image into stripes, which are manipulated by different independent GAGs and rALUs in parallel, this acceleration factor can be multiplied by a factor of 5, resulting in the final acceleration factor of max. unroll = 73.0 (see Figure 9).
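The loop nest of such a neighborhood smoothing can be sketched in plain C. The choice of the 4-neighborhood for S (so M = 4) and the border handling are assumptions for this sketch only; the actual S and M are parameters of the application:

```c
#include <assert.h>

#define N 4  /* tiny image size for the sketch */

/* Neighborhood smoothing: each interior output pixel is the sum of
 * its neighbors in S (here the 4-neighborhood, excluding the point
 * itself), divided by the normalization factor M = |S|. Border
 * pixels are copied unchanged to keep the sketch short. */
void smooth(const int in[N][N], int out[N][N])
{
    const int M = 4;
    for (int y = 0; y < N; y++)
        for (int x = 0; x < N; x++) {
            if (x == 0 || y == 0 || x == N - 1 || y == N - 1) {
                out[y][x] = in[y][x];
                continue;
            }
            int sum = in[y - 1][x] + in[y + 1][x]
                    + in[y][x - 1] + in[y][x + 1];
            out[y][x] = sum / M;
        }
}
```

The regular scan over (x, y) and the fixed relative offsets forming S are exactly the kind of access pattern a GAG parameter set can express, with the loop body mapped into the rALU.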
The code optimizations and transformations necessary for this (e.g. strip mining) are performed by the iterative partitioner in the first level of CoDe-X. Here, a large fully nested loop over an image is transformed into a number of smaller independent loops, which can exploit the parallel Xputer hardware resources in an optimized way.
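Strip mining of such a loop can be sketched as follows (names, the stand-in loop body, and the assumption that the stripe count divides the image size are all illustrative):

```c
#include <assert.h>

/* One large loop over the whole image. */
void process_all(int *img, int n)
{
    for (int i = 0; i < n; i++)
        img[i] += 1;                    /* stand-in loop body */
}

/* The same work strip-mined into independent stripes; each stripe
 * could be handled by its own GAG/rALU pair in parallel, since no
 * data flows between stripes. */
void process_striped(int *img, int n, int stripes)
{
    int len = n / stripes;              /* assume stripes divides n */
    for (int s = 0; s < stripes; s++) {
        int *stripe = img + s * len;
        for (int i = 0; i < len; i++)
            stripe[i] += 1;
    }
}
```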
In this example, the communication overhead between host and Xputer is very small, because the image data resides in the common memory, where it can be accessed by both hardware platforms, and does not have to be transferred. For this reason, applications from image processing are well suited for our accelerator.
The hardware/software co-design framework CoDe-X is completely specified. The X-C compiler and the datapath synthesis system (DPSS) have been implemented. The partitioner and the profiler are currently being implemented.
Reiner W. Hartenstein, Jürgen Becker, Rainer Kress, Helmut Reinig, Karin Schmidt: A Reconfigurable Machine for Applications in Image and Video Compression; Conference on Compression Technologies and Standards for Image and Video Compression, Amsterdam, The Netherlands, March 1995
A. Ast, R. W. Hartenstein, H. Reinig, K. Schmidt, M. Weber: A General Purpose Xputer Architecture Derived from DSP and Image Processing; in M. A. Bayoumi (Ed.): VLSI Design Methodologies for Digital Signal Processing Architectures; Kluwer Academic Publishers, Boston, London, Dordrecht, pp. 365-394, 1994
Thomas Ball, James R. Larus: Branch Prediction for Free; Proc. of the ACM SIGPLAN '93 Conference on Programming Language Design and Implementation, pp. 300-313; ACM Press, 1993
R. Ernst, J. Henkel, T. Benner: Hardware-Software Cosynthesis for Microcontrollers; IEEE Design & Test, pp. 64-75, Dec. 1993
Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines, FCCM'94, Napa, CA, April 1994
R. K. Gupta, C. N. Coelho, G. De Micheli: Program Implementation Schemes for Hardware-Software Systems; IEEE Computer, pp. 48-55, Jan. 1994
D. D. Gajski, N. D. Dutt, A. C.-H. Wu, S. Y.-L. Lin: High-Level Synthesis, Introduction to Chip and System Design; Kluwer Academic Publishers, Boston, Dordrecht, London, 1992
R. W. Hartenstein, A. G. Hirschbiel, M. Weber: A Novel Paradigm of Parallel Computation and its Use to Implement Simple High Performance Hardware; InfoJapan'90 - International Conference Commemorating the 30th Anniversary of the Computer Society of Japan, Tokyo, Japan, 1990
R. W. Hartenstein, R. Kress: A Datapath Synthesis System for the Reconfigurable Datapath Architecture; Asia and South Pacific Design Automation Conference, ASP-DAC'95, Nippon Convention Center, Makuhari, Chiba, Japan, Aug. 29 - Sept. 1, 1995
W. Luk, T. Lu, I. Page: Hardware-Software codesign of Multidimensional Programs; Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines, FCCM'94, Napa, CA, April 1994
P. Puschner, Ch. Koza: Calculating the Maximum Execution Time of Real-Time Programs; The Journal of Real-Time Systems, pp. 159-176, Kluwer Academic Publishers, 1989
N. A. Sherwani: Algorithms for Physical Design Automation; Kluwer Academic Publishers, Boston 1993
K. Schmidt: A Program Partitioning, Restructuring, and Mapping Method for Xputers; Ph.D. Thesis, University of Kaiserslautern, 1994
M. Wolfe, C.-W. Tseng: The Power Test for Data Dependence; IEEE Transactions on Parallel and Distributed Systems, Vol. 3, No. 5, Sept. 1992
R.C. Gonzalez, P. Wintz: Digital Image Processing; Addison-Wesley Publishing Company, USA 1977