The paper presents the two-level hardware/software partitioning strategy of the co-design framework CoDe-X, which uses an Xputer with a reconfigurable datapath as accelerator. CoDe-X accepts C programs and carries out both profiling-driven host/accelerator partitioning (first level) and resource-parameter-driven sequential/structural partitioning of the accelerator source code (second level) to optimize the utilization of its reconfigurable datapath resources.
For such a new class of hardware platforms a new class of compilers is needed, which generates both sequential and structural code: partitioning compilers, which partition a source into cooperating structural and sequential code segments. In such an environment hardware/software co-design requires two levels of partitioning: host/accelerator partitioning (first level) for optimizing performance, and structural/sequential partitioning (second level) for optimizing the hardware/software trade-off of the Xputer resources. This paper presents a framework based on this paradigm. The hardware/software co-design framework CoDe-X targets the partitioning and compilation of conventional C programs onto a host that uses the Xputer as a universal configurable accelerator. In contrast to other hardware/software co-design approaches, no additional programming extensions are required, and use of the framework is not limited to hardware experts. The partitioning is based on a simulated annealing algorithm. The fundamentally new feature of the CoDe-X framework is its two-level co-design approach.
First, this paper describes the underlying target hardware. Section 3 presents the co-design framework CoDe-X and its strategies, and Section 3.4 discusses a computation-intensive application example from image processing.
The large amount of input data typically requires an algorithm to be organized in (nested) loops, where different array elements are referenced as operands of the computations of the current iteration. The resulting sequence of data accesses shows a regularity which allows such sequences to be described generically by a few parameters. An Xputer data sequencer (DS) provides seven hardwired generic address generators (GAGs), which are capable of interpreting such parameter sets to compute a generic data address sequence (SP: scan pattern). This address computation does not require memory cycles, which yields substantial speed-up.
To be run on an Xputer, an algorithm has to be transformed into parameter sets controlling the generic address generators, and into a configuration of the rALU, which implements the data manipulations within a loop body. Data manipulations are transport-triggered, i.e., each time a GAG moves the scan window (SW) to a new memory location, the complex operations configured into the rALU are invoked automatically. Since locality of data is essential for finding a generic SP, a compiler for an Xputer first has to create a good and regular storage scheme (DM: data map, see Figure 2) to obtain good results in performance optimization.
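The parameter-driven address generation can be sketched in software. The structure and field names below are assumptions for illustration only, not the actual data-sequencer parameter format:

```c
#include <stddef.h>

/* Hypothetical sketch of a GAG parameter set for a 2-D scan pattern;
 * the field names are illustrative, not the real Xputer format. */
typedef struct {
    size_t base;      /* start address within the data map      */
    size_t stride_x;  /* address step within the inner loop     */
    size_t stride_y;  /* address step between outer iterations  */
    size_t count_x;   /* inner loop trip count                  */
    size_t count_y;   /* outer loop trip count                  */
} gag_params;

/* Expand the parameter set into the generic address sequence (scan
 * pattern) it describes; the hardware steps through the same sequence
 * without spending memory cycles on address arithmetic. */
size_t gag_sequence(const gag_params *p, size_t *out, size_t max) {
    size_t n = 0;
    for (size_t y = 0; y < p->count_y; y++)
        for (size_t x = 0; x < p->count_x; x++) {
            if (n >= max) return n;
            out[n++] = p->base + y * p->stride_y + x * p->stride_x;
        }
    return n;
}
```

A row-major walk over a 3-by-2 tile at base address 100 would thus yield the sequence 100, 101, 102, 110, 111, 112.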
All I/O operations, as well as memory management, are done by the host, taking advantage of the functionality of the host's operating system. The host has direct access to all memory on the Xputer via the bus interface, and conversely the Xputer can access data in the host's memory, so that global communication is available. The reconfigurable datapath architecture (rDPA), which serves as rALU, has been developed especially for computing statement blocks from loops as well as other arithmetic and logic computations.
First the host/Xputer partitioning is explained (next section). Then the programming of the Xputer is shown, and finally the options for experienced users are discussed.
At the beginning, a filtered input program for the Xputer is represented as a graph, called flow graph. This graph is divided into several Xputer tasks. An Xputer task can be computed on a single Xputer module and is defined as follows:
For an initial partitioning, the host tasks are performed on the host, while the Xputer tasks and the Xputer library functions are performed on the Xputer modules. If there are several alternative paths in the task graph, the largest task which fits onto an Xputer module is chosen. This results in a good initial partition, since it can be assumed that an Xputer task can be evaluated faster on the accelerator than on the host. But two problems may arise: the accelerator may need reconfiguration at run time, and in a few cases the data map may also have to be reorganized at run time.
Run-Time Reconfiguration. In the case of an Xputer-based accelerator, the data sequencer requires a new parameter set, and the rALU requires a reconfiguration. The parameter set of the data sequencer can be neglected since there are only a few parameters to configure. The configuration of the rDPA requires 147456 bits in the worst case and 23040 bits in an average case. Such a number cannot be neglected. In any case it is useful to have a local reconfiguration memory on each Xputer board. There are three possibilities to reduce the overhead for reconfiguration at run-time:
The performance of the host is approximated by examining the code. For each processor type and workstation, a separate model is required. The behavior of the underlying hardware and operating system has to be deterministic and known. This implies knowing the timing behavior of all hardware components, the effects of caching, pipelining, etc. The operating system must provide static memory management, system calls should have a calculable timing behavior, and no asynchronous interrupts should be allowed. Branch prediction techniques are taken into account when computing the execution time of a code segment. The techniques used are similar to those of Malik et al.
The profiler also computes the overall execution time when the Xputer is used as accelerator. This time includes delay times for synchronization, possible reconfigurations, and memory re-mappings of the Xputer during run time (see equation (1)). The iterative partitioning process tries to minimize this time and is based on simulated annealing, which is described later in this chapter.
The components of equation (1) are:
- the sum of execution times of tasks executed on the host (HT)
- the sum of execution times of tasks executed on the Xputer (XT)
- the sum of delay times for reconfiguring the Xputer during run time
- the sum of delay times for re-mapping the two-dimensionally organized Xputer data map during run time
- the sum of delay times for synchronizing host/Xputer (operating system calls)
- the sum of overlapping execution times between tasks executed simultaneously on host/Xputer
The overall execution time is determined by the delay times of the basic blocks (host and Xputer), plus the above-mentioned penalty times (see equation (1)). Memory re-mappings are necessary when two basic blocks use the same data on the Xputer but need a different distribution of the data within the two-dimensionally organized memory. Then a re-ordering of the data items must be performed before the execution of the second basic block starts, which must also be considered in equation (1). This overall execution time is used as the COST function in the simulated annealing process (see figure 2) and has to be minimized to optimize the acceleration factor of an application. The COST function is determined each time by the profiler according to equation (1), using the partitioning currently considered by the simulated annealing process:
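A minimal sketch of this COST function, with the listed time components as fields; the names are assumptions, since equation (1) itself is not reproduced here:

```c
/* Hypothetical time components of equation (1); all penalty terms add
 * to the task times, while overlapped execution is subtracted. */
typedef struct {
    double t_host;     /* execution of tasks on the host (HT)         */
    double t_xputer;   /* execution of tasks on the Xputer (XT)       */
    double t_reconf;   /* run-time reconfigurations of the Xputer     */
    double t_remap;    /* run-time re-mappings of the 2-D data map    */
    double t_sync;     /* host/Xputer synchronization (OS calls)      */
    double t_overlap;  /* host/Xputer execution overlapped in time    */
} exec_times;

/* Overall execution time, used as the COST function to be minimized. */
double cost(const exec_times *t) {
    return t->t_host + t->t_xputer + t->t_reconf
         + t->t_remap + t->t_sync - t->t_overlap;
}
```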
Due to the data dependencies, the profiler takes concurrent evaluations into consideration. Delay times for reconfiguring the Xputer during run time become necessary if a partition moves more basic blocks to the Xputer than parallel GAG or rALU units are available, or if a fast on-chip reconfiguration is necessary for an Xputer task. If possible, the reconfiguration is performed in parallel to running host tasks. For faster access, all delay times are stored in a lookup table. To speed up the simulated annealing algorithm, only the cost increase or decrease due to the exchange is computed; the remaining basic blocks are not considered.
The PERTURB function randomly chooses and applies one of these two possibilities, since simulated annealing is a probabilistic algorithm for optimizing combinatorial problems in which the exchange of elements is completely arbitrary.
To control the simulated annealing process, the COST function can additionally be multiplied by a specific weight function (e.g. an exponential function), which results in a faster move toward optimized solutions. The cooling schedule is controlled by a linear function f(temp) = temp * 0.95. The RANDOM() function returns a random number in the range between its first and second argument.
The partitioner at the second level (see Figure 2) performs a data and control flow analysis. First, the control flow of the program graph is partitioned according to the algorithm of Tarjan, resulting in a partial execution order. This algorithm partitions the program graph into subgraphs called strongly connected components. These components correspond to connected statement sequences, such as fully nested loops, which are possibly parallelizable. The Xputer hardware resources have not been considered up to this point. So, second, the obtained coarse-grained components have to be transformed such that:
The access sequences of up to four fully nested loops can be computed by this tool. A configuration file for the X-C compiler gives the current restrictions of the hardware, e.g. the number of GAGs available, the size of rALU subnets, etc. This allows flexible handling of the compiler for future upgrades of the Xputer hardware. The datapath synthesis system (DPSS) is responsible for the synthesis towards the rALU. The code optimizations of loop folding (pipelining across iterations of for loops) and loop unrolling are performed by the DPSS whenever possible. In the current version the rDPA serves as rALU. Due to a general interface between the rest of the X-C compiler and the synthesis system, it can easily be replaced by another synthesis tool if required.
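Loop unrolling, one of the optimizations named above, can be illustrated on a simple C loop; the unroll factor of four is arbitrary, and the example does not reflect the DPSS heuristics themselves:

```c
/* Illustration of loop unrolling: four iterations per trip reduce
 * loop overhead and expose four independent operations that could be
 * mapped onto a compound operator. */
void saxpy(float *y, const float *x, float a, int n) {
    int i = 0;
    for (; i + 3 < n; i += 4) {   /* unrolled body */
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }
    for (; i < n; i++)            /* remainder loop */
        y[i] += a * x[i];
}
```

Loop folding would additionally overlap iterations in a pipeline, which is not shown here.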
The task of configuring the rDPA is carried out in the following four phases: logic optimization and technology mapping, placement and routing, data scheduling, and finally the code generation. Partitioning of the statements onto the different rDPA chips is not necessary since the array of rDPA chips appears as one array with transparent chip boundaries. The DPSS also gives the execution time of the compound operator in the rDPA. Since the worst case delay times of the individual operators are known in advance, the data scheduling delivers the complete execution time. Further details about the X-C compiler and the four phases of the DPSS, as well as examples of the mapping onto the rDPA can be found in  and .
For each Xputer task, the X-C compiler computes a trade-off between area (that is, the amount of rDPA resources) and performance, which is displayed in a graph.
Now, for each Xputer task a single point in the trade-off graph is chosen. Since the size of the rDPA array is fixed, the point with the maximum performance which fits onto the rDPA of the Xputer module is taken. For this task the configuration set, the required configuration time, and the execution time are stored. Usually the following optimizations are performed by the X-C compiler:
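The selection of a trade-off point can be sketched as a search for the fastest design point that still fits the fixed rDPA size; the data structure below is hypothetical, not X-C compiler output:

```c
#include <stddef.h>

/* Hypothetical design point from the area/performance trade-off graph. */
typedef struct {
    int    area;       /* rDPA resources used at this point    */
    double exec_time;  /* execution time at this design point  */
} tradeoff_point;

/* Pick the fastest point that fits the fixed rDPA array size,
 * mirroring the selection rule described above; returns -1 if no
 * point fits. */
int pick_point(const tradeoff_point *pts, size_t n, int rdpa_size) {
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (pts[i].area > rdpa_size) continue;   /* does not fit */
        if (best < 0 || pts[i].exec_time < pts[best].exec_time)
            best = (int)i;
    }
    return best;
}
```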
g(x,y) = (1/M) · Σ_{(n,m)∈S} f(n,m), for x,y = 0, 1, ..., N-1, where S is the set of coordinates of points in the neighborhood of (but not including) the point (x,y), and M is a pre-computed normalization factor. This small example, used to illustrate the methods of CoDe-X, was divided into four tasks. Two of them were filtered out in the first step to be executed on the host in any case. These were tasks containing I/O routines for reading input parameters and routines for plotting the image, which cannot be executed on the Xputer. The remaining two tasks were potential candidates for mapping onto the Xputer. In one of these two tasks, strip mining was applied by the first-level partitioner, because one for-loop could be transformed according to figure 7.
The profiler computed the host and Xputer cost functions for them. The data dependencies require a sequential execution of the four tasks, which influenced the overall execution time of this application. The final decision was to shift the two candidates for the Xputer to the accelerator. The X-C compiler on the second level of hardware/software co-design then mapped these basic blocks onto the Xputer. In this final solution the Xputer must be configured only once. The address generation for the data accesses in the fully nested loops of the two Xputer tasks was each handled by one generic address generator. The final acceleration factor in comparison with a SUN SPARC 10/51 was 73.
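The smoothing operator of this example can be sketched in C as follows, assuming an 8-neighborhood as the set S and M = 8; border handling is simplified to copying the input pixel, which is an assumption, not the paper's scheme:

```c
/* Neighborhood-averaging (smoothing) filter on an n-by-n image:
 * each inner pixel becomes the mean of its 8 neighbors, excluding
 * the pixel (x,y) itself; M = |S| = 8 is the normalization factor. */
void smooth(const unsigned char *in, unsigned char *out, int n) {
    for (int y = 0; y < n; y++)
        for (int x = 0; x < n; x++) {
            if (x == 0 || y == 0 || x == n - 1 || y == n - 1) {
                out[y * n + x] = in[y * n + x];  /* border: copy */
                continue;
            }
            int sum = 0;
            for (int dy = -1; dy <= 1; dy++)     /* 8-neighborhood S */
                for (int dx = -1; dx <= 1; dx++)
                    if (dx != 0 || dy != 0)
                        sum += in[(y + dy) * n + (x + dx)];
            out[y * n + x] = (unsigned char)(sum / 8);  /* M = 8 */
        }
}
```

The two fully nested loops over x and y are exactly the kind of regular access sequence a generic address generator can produce from a parameter set.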
 N. N.: Configurable Logic, Design and Application Book; Atmel Corporation, San Jose, CA, 1994
 T. Ball, J. R. Larus: Branch Prediction for Free; Proc. of the ACM SIGPLAN '93 Conf. on Programming Language Design & Implementation, pp. 300-313; ACM Press, 1993
 S. Churcher, T. Kean, B. Wilkie: The XC6200 FastMap(TM) Processor Interface; 5th Int. Workshop on Field-Programmable Logic and Applications, FPL'95, Oxford, UK, Aug./Sept. 1995
 A. Despain, I.-J. Huang: Synthesis of Application Specific Instruction Sets; IEEE Trans. on CAD of Integrated Circuits and Systems, Vol. 14, No. 6, June 1995
 R. Ernst, J. Henkel, T. Benner: Hardware-Software Cosynthesis for Microcontrollers; IEEE Design & Test, pp. 64-75, Dec. 1993
 K. L. Pocek, D. Buell: Proc. IEEE Workshop on FPGAs for Custom Computing Machines, FCCM'94, Napa, CA, April 1993
 see , but 1994, 1995, and 1996
 Gonzalez R. C., Wintz P.: Digital Image Processing; Addison-Wesley Publishing Company, USA 1977
 R. K. Gupta, C. N. Coelho, G. De Micheli: Program Implementation Schemes for Hardware-Software Systems; IEEE Computer, pp. 48-55, Jan. 1994
 R. W. Hartenstein, J. Becker, R. Kress, H. Reinig, K. Schmidt: A Reconfigurable Machine for Applications in Image and Video Compression; Conf. on Compression Techniques & Standards for Image & Video Compression, Amsterdam, Netherlands, March 1995
 R. W. Hartenstein, R. Kress: A Datapath Synthesis System for the Reconfigurable Datapath Architecture; Asia and South Pacific Design Automation Conference, ASP-DAC'95, Nippon Convention Center, Makuhari, Chiba, Japan, Aug. 29 - Sept. 1, 1995
 R.W. Hartenstein, A.G. Hirschbiel, M. Weber: "A Novel Paradigm of Parallel Computation and its Use to Implement Simple High Performance Hardware"; InfoJapan'90- International Conference memorizing the 30th Anniversary of the Computer Society of Japan, Tokyo, Japan, 1990;
 see : also in: Future Generation Computer Systems no. 7, pp. 181-198 (North-Holland, 1991/92)
 R. Hartenstein, (opening key note): Custom Computing Machines - An Overview; Workshop on Design Methodologies for Microelectronics, Smolenice Castle, Smolenice, Slovakia, Sept. 1995
 R. Hartenstein, J. Becker, R. Kress: Customized Computers: a generalized survey; submitted for FPL'96, Darmstadt, 1996
 T. A. Kean: Configurable Logic: A Dynamically Programmable Cellular Architecture and its VLSI Implementation; Ph.D. Thesis, University of Edinburgh, Dept. of Computer Science, 1989
 K. Kennedy: Automatic Translation of Fortran Programs to Vector Form; Rice Technical Report 476-029-4, Rice University, Houston, October 1980
 R. Kress: A fast reconfigurable ALU for Xputers; Ph. D. dissertation (submitted), Kaiserslautern University, 1996
 R. Kress: Computing with reconfigurable ALU arrays; ITpress (planned for 1996)
 L. Lin: High-Level Synthesis, Introduction to Chip and System Design; Kluwer Acad. Publ., Boston, London, 1992
 D. B. Loveman: Program Improvement by Source-to-Source Transformation; Journal of the Association for Computing Machinery, Vol. 24,No. 1, pp.121-145, January 1977
 W. Luk, T. Lu, I. Page: Hardware-Software codesign of Multidimensional Programs; Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines, FCCM'94, Napa, CA, April 1994
 S. Malik, W. Wolf, A. Wolfe, Y.-T. Li, T.-Y. Yen: Performance Analysis of Embedded Systems; NATO/ASI, Tremezzo, Italy, 1995, Kluwer Acad. Publishers, 1995
 P. Marwedel, G. Goossens (ed.): "Code Generation for Embedded Processors", Kluwer Acad. Publishers, 1995
 D. Monjau: Proc. GI/ITG Workshop Custom Computing, Dagstuhl, Germany, June 1996
 N. N.: Configurable Logic Array (CLAy(TM)); Preliminary Datasheet, National Semiconductors, Santa Clara, CA, Dec. 1993
 P. Puschner, Ch. Koza: Calculating the Maximum Execution Time of Real-Time Programs; Journal of Real-Time Systems, pp. 159-176, Kluwer Academic Publishers, 1989
 Sato et al.: An Integrated Design Environment for Application Specific Integrated Processors, Proc. ICCD 1991
 K. Schmidt: A Program Partitioning, Restructuring, and Mapping Method for Xputers; Ph.D. Thesis, University of Kaiserslautern, 1994
 K. Schmidt: Xputers: a high performance computing paradigm; ITpress (planned for 1996)
 N. A. Sherwani: Algorithms for Physical Design Automation; Kluwer Academic Publishers, Boston 1993
 R. E. Tarjan: Testing Flow Graph Reducibility; Journal of Computer and System Sciences 9, pp. 355-365, 1974
 M. Wolfe, C.-W. Tseng: The Power Test for Data Dependence; IEEE Transactions on Parallel and Distributed Systems, Vol. 3, No. 5, Sept. 1992