This paper presents CoDe-X, a parallelizing programming environment that introduces hardware/software co-design strategies on two levels for Xputer-based accelerators. Xputers are based on field-programmable logic, which provides a high degree of parallelism for data computation, and on hardwired data sequencers, which provide access sequences for fast address computation. CoDe-X accepts a C dialect that includes a data-procedural extension and carries out both the profiling-driven host/accelerator partitioning (first level) and the resource-parameter-driven sequential/structural partitioning of the accelerator source code (second level), which optimizes the utilization of its reconfigurable datapath resources.
Due to the availability of field-programmable logic (FPL), a new programming paradigm has arisen: the structural programming paradigm. Algorithms or parts of algorithms are 'compiled' onto hardware rather than evaluated in software. Accelerators based on such reconfigurable hardware are called custom computing machines (CCMs). The hardware/software co-design scene has introduced a number of approaches to speed up performance, to optimize the hardware/software trade-off, and to reduce total design time. Further co-design approaches are advocated explicitly or implicitly in the development of custom computing machines, such as within the R&D scenes of ASIPs and FCCMs. With the FCCM approach, a von Neumann host implementation is accelerated by external field-programmable circuits; here, application development usually requires hardware experts. The von Neumann paradigm does not efficiently support "soft" hardware because of the extremely tight coupling between its instruction sequencer and ALU: such architectures fall apart as soon as the datapath is changed. So a new paradigm is desirable, like that of Xputers, which conveniently supports "soft" ALUs such as the rALU concept (reconfigurable ALU) or the rDPA approach (reconfigurable datapath array).
For such a new class of hardware platforms, a new class of compilers is needed, including partitioning and parallelizing compilers, to link Xputers as universal embedded accelerators to workstations or other hosts. This paper presents a compilation framework based on this paradigm. The hardware/software co-design framework CoDe-X targets the partitioning and compilation of a C dialect, called X-C, onto a host that uses the Xputer as a universal configurable accelerator. X-C (Xputer-C) is a C dialect including a data-procedural extension, with which experienced users may experiment with the Xputer data sequencing primitives to learn how to obtain optimum acceleration from this paradigm. At the first level, performance-critical parts of an algorithm are localized and migrated to the Xputer without manual interaction. Profiling-driven strategies for synchronization overhead reduction, run-time reconfiguration, and memory organization are discussed. The second partitioning level performs a resource-parameter-driven procedural/structural partitioning of the (Xputer) accelerator source code, generating structural code for configuring the address generators and the reconfigurable datapath resources, as well as sequential code for the data sequencer(s). The goal is to optimize the utilization of the Xputer hardware resources, which provide parallelism at instruction level. In contrast to hardware/software co-design systems like Cosyma or Vulcan, the problem discussed here is that of accelerators based on a new machine paradigm. The fundamentally new feature of the CoDe-X framework is its two-level co-design approach.
This paper first describes the underlying target hardware. Section 3 presents the co-design framework CoDe-X and its strategies. Section 4 discusses a computation-intensive application example from the area of image processing.
Xputer-based accelerators consist of several (up to seven) Xputer modules. The modules are connected to a host computer. Making use of the host simplifies disk access and all other I/O operations. After setup, each Xputer module runs independently of the others and of the host computer until a complete task is processed. Each module generates an interrupt to the host when its task is finished. The host is thus free to execute other tasks concurrently in-between. This allows the Xputer modules to be used as a general-purpose acceleration board for time-critical parts of an application.
The basic structure of an Xputer module consists of three major parts (figure 1):
All data manipulations are done in the reconfigurable ALU (rALU). The rALU consists of several rALU subnets, each distinguished by its rALU subnet number. Each Xputer module can use three different rALU subnets on its board simultaneously. The rALU subnets perform the computations on the data provided by the scan windows by applying a compiler-configured complex operator, called a compound operator, to that data. Each rALU subnet has four output flags to control the generic address generators (GAGs), e.g. for data-dependent scan patterns. The operations in the rALU are done in parallel to the address computations of the GAGs. To provide a general-purpose interface between the rALU and the GAGs, these components are transport-triggered, which means that the availability of data triggers the operation. This permits great flexibility in implementing different rALU subnets, as the subnets need not have the same computation times. Furthermore, it allows a pipelined rALU concept to be implemented as well as a combinatorial net. After finishing, the rALU signals the end of the computation to the GAGs.
For the evaluation of standard C programs, word-oriented operators such as addition, subtraction, or multiplication are required. Realizing a reconfigurable architecture for word-oriented operators requires a more coarse-grained approach to the logic block architecture of the rALU. The granularity is at operator level rather than at gate level as in commercially available FPGAs. Complete arithmetic and logic operators can be implemented in a single logic block. Thus, in contrast to FPGAs, such a block is called a datapath unit (DPU) to reflect its primary functionality. The datapath unit can be optimized to implement operators faster and more area-efficiently than FPGAs can.
A word-oriented structure for the datapath unit is no drawback, since random logic or control logic need not be supported by this architecture. Furthermore, such a structure requires fewer configuration bits, which saves additional area. In addition, the timing is more predictable, since there are fewer signal transits through programming switches. This greatly simplifies the synthesis system, since it eliminates the back-annotation step otherwise needed to determine the exact delay times for simulation. In the current prototype, the reconfigurable datapath architecture (rDPA) serves as rALU. It consists of 72 identical word-oriented DPUs. For more details about the rDPA, see the references.
The Xputer paradigm provides several advantages. Combining several operations into one compound operator allows pipelining and fine-grain parallelism to be introduced to a larger extent than is possible in fixed-instruction-set processors. Intermediate results can be passed along in the pipeline instead of being written back into the register file after every instruction. Xputers support application-specific instruction sets with their reconfigurable ALUs. Furthermore, most addressing and control overhead is eliminated by the data sequencer, which gives hardware support for a rich repertory of data access sequences. One major advantage of the rDPA is its flexibility, e.g. the datapaths of the GAGs can also be synthesized into an rDPA device (see figure 1).
The Xputer tasks are the candidates for the partitioning process at the first level. The user may advise the partitioner of parts to be Xputer-executed in any case. For library functions, a list of their performance data is available. First, program parts using MoPL extensions or containing loops that can be transformed by the X-C compiler are routed for Xputer execution. Such an Xputer job is divided into the following Xputer tasks, which can be computed on a single Xputer module:
This technique can be used, e.g., in image processing applications by dividing an image into stripes of equal size in order to manipulate these stripes concurrently. The optimization techniques of loop distribution (splitting one loop into two loops computing the same result), loop fusion (transforming two adjacent loops into a single loop), and loop interchange (switching inner and outer loop) are also performed by the first-level partitioner in order to exploit potential parallelism and to optimize the utilization of the Xputer hardware resources.
For an initial partitioning, the host tasks are performed on the host, and the Xputer tasks and Xputer-library functions are performed on the Xputer modules. If there are several alternative paths in the task graph, the largest task in the critical path that fits onto an Xputer module is chosen next to be scheduled. This method performs a resource-constrained list-based scheduling of the task graph and results in a good initial partition, since it can be assumed that an Xputer task can be evaluated faster on the accelerator than on the host. But there are two problems: the accelerator needs reconfiguration at run time, and in a few cases the datamap also has to be reorganized at run time.
The performance of the host is approximated by examining the code. For each processor type and workstation, a separate model is required. The behavior of the underlying hardware and operating system has to be deterministic and known. This concerns the timing behavior of all hardware components, the effects of caching, pipelining, etc. The operating system must provide static memory management, system calls should have a calculable timing behavior, and no asynchronous interrupts should be allowed. Branch prediction techniques are taken into account when computing the execution time of a code segment. The techniques used are similar to those of Malik et al.
The profiler also computes the overall execution time when the Xputer is used as accelerator. This time includes delays for synchronization, possible reconfigurations, and memory re-mappings of the Xputer during run time (see equation (1)). The iterative partitioning process tries to optimize this time and is based on simulated annealing, which is described later in this chapter. Equation (1) comprises the following terms:
- sum of execution times of tasks executed on the host (HT)
- sum of execution times of tasks executed on the Xputer (XT)
- sum of delay times for reconfiguring the Xputer during run time
- sum of delay times for re-mapping the two-dimensionally organized Xputer data map during run time
- sum of delay times for synchronizing host and Xputer (operating system calls)
- sum of overlapping execution times between tasks executed simultaneously on host and Xputer
The overall execution time is determined by the delay times of the tasks (host and Xputer) plus the above-mentioned penalty times (see equation (1)). Memory re-mappings are necessary when two Xputer tasks use the same data but need a different distribution of the data within the two-dimensionally organized memory. A re-ordering of the data items must then be performed before the execution of the second task starts, which must also be considered in equation (1). The resulting overall execution time is used as the COST function in the simulated annealing process (see figure 5) and has to be minimized to optimize the acceleration factor of an application. The COST function is determined each time by the profiler according to equation (1), using the partitioning currently under consideration in the simulated annealing process.
Due to the data dependencies, the profiler takes concurrent evaluations into consideration. Delay times for reconfiguring the Xputer during run time become necessary if a partition moves more basic blocks to the Xputer than parallel GAG or rALU units are available, or if a fast on-chip reconfiguration is necessary for an Xputer task. If possible, the reconfiguration is performed in parallel to running host tasks. For faster access, all delay times are stored in a lookup table. To speed up the simulated annealing algorithm, only the cost increase or decrease due to an exchange is computed; the rest of the tasks are not considered.
For controlling the simulated annealing process, the COST function can additionally be multiplied by a specific weight function (e.g. an exponential function), which results in a faster move to optimized solutions. The cooling schedule is controlled by a linear function f(temp) = temp * 0.95. The RANDOM() function returns a random number in the range between its first and second argument.
The X-C compiler translates input Xputer tasks into code that can be executed on the Xputer without further user interaction. It comprises four major parts: the data sequencer parameter generator (DSPG), the datapath synthesis system (DPSS), and the analysis/profiling tool (A/P) together with the partitioner. First, the compiler performs a data and control flow analysis. The data structure obtained allows restructuring to exploit parallelism, like the restructuring done by compilers for supercomputers (e.g. vectorization of statements), as well as other optimization techniques such as loop folding or loop unrolling. The data sequencer parameter generator maps the program's data in a regular way onto the two-dimensionally organized Xputer datamap, followed by a computation of the proper address accesses (data sequencing) for each variable. A parameter file for the X-C compiler gives the current restrictions of the hardware, e.g. the number of GAGs available, the size of the rALU subnets, etc. This allows flexible handling of the compiler for future upgrades of the accelerator hardware. The datapath synthesis system is responsible for the synthesis of the GAGs and the rALU onto the rDPA. Due to a general interface between the rest of the X-C compiler and the synthesis system, it can easily be replaced by another synthesis tool if required. The configuration of the rDPA is carried out in four phases: logic optimization and technology mapping, placement and routing, data scheduling, and finally code generation. Partitioning of the operations onto the different rDPA chips is not necessary, since the array of rDPA chips appears as one array of DPUs with transparent chip boundaries. For each Xputer task, the X-C compiler computes a trade-off between area (that is, the number of DPUs in the rALU) and performance, which is displayed in a graph. A small example is shown in figure 6.
for x, y = 0, 1, ..., N-1, where S is the set of coordinates of points in the neighborhood of (but not including) the point (x,y), and M is a pre-computed normalization factor. This small example illustrating the methods of CoDe-X was divided into four tasks. Two of them were filtered out in the first step to be executed on the host in any case: these tasks contained I/O routines for reading input parameters and routines for plotting the image, which cannot be executed on the Xputer. The remaining two tasks were potential candidates for mapping onto the Xputer. In one of these two tasks, strip mining was applied by the first-level partitioner, because one for-loop could be transformed according to figure 7.
The resulting smaller independent loops can be executed in parallel on different Xputer modules (see figure 8). The X-C compiler additionally performed loop unrolling up to the limit of the available hardware resources (see section 3.3).
The profiler computed the host and Xputer cost functions for them. The data dependencies required a sequential execution of the four tasks, which influenced the overall execution time of this application. The final decision was to shift the two candidates to the Xputer accelerator. The X-C compiler on the second level of hardware/software co-design then mapped these basic blocks onto the Xputer. In this final solution, the Xputer had to be configured only once. The address generation for the data accesses in the fully nested loops of the two Xputer tasks was handled by one generic address generator each. The final acceleration factor in comparison with a SUN SPARC 10/51 was 73.
 T. Ball, J. R. Larus: Branch Prediction for Free; Proc. of the ACM SIGPLAN '93 Conf. on Programming Language Design and Implementation, pp. 300-313; ACM Press, 1993
 Jürgen Becker, Reiner W. Hartenstein, Rainer Kress, Helmut Reinig: High-Performance Computing Using a Reconfigurable Accelerator; Proc. of Workshop on High Performance Computing, Montreal, Canada, July 1995
 A. Despain, I.-J. Huang: Synthesis of Application Specific Instruction Sets; IEEE Trans. on CAD of Integrated Circuits and Systems, Vol. 14, No. 6, June 1995
 R. Ernst, J. Henkel, T. Benner: Hardware-Software Cosynthesis for Microcontrollers; IEEE Design & Test, pp. 64-75, Dec. 1993
 D. D. Gajski, N. D. Dutt, A. C.-H. Wu, S. Y.-L. Lin: High-Level Synthesis, Introduction to Chip and System Design; Kluwer Academic Publishers, Boston, Dordrecht, London, 1992
 Gonzalez R. C., Wintz P.: Digital Image Processing; Addison-Wesley Publishing Company, USA 1977
 R. K. Gupta, C. N. Coelho, G. De Micheli: Program Implementation Schemes for Hardware-Software Systems; IEEE Computer, pp. 48-55, Jan. 1994
 R. W. Hartenstein, J. Becker, R. Kress, H. Reinig, K. Schmidt: A Reconfigurable Machine for Applications in Image and Video Compression; Conf. on Compression Techniques & Standards for Image & Video Compression, Amsterdam, Netherlands, March 1995
 R. W. Hartenstein, R. Kress: A Datapath Synthesis System for the Reconfigurable Datapath Architecture; Asia and South Pacific Design Automation Conference, ASP-DAC'95, Nippon Convention Center, Makuhari, Chiba, Japan, Aug. 29 - Sept. 1, 1995
 R.W. Hartenstein, A.G. Hirschbiel, M. Weber: "A Novel Paradigm of Parallel Computation and its Use to Implement Simple High Performance Hardware"; InfoJapan'90- International Conference memorizing the 30th Anniversary of the Computer Society of Japan, Tokyo, Japan, 1990;
 see the preceding entry; also in: Future Generation Computer Systems No. 7, pp. 181-198 (North-Holland, 1991/92)
 R. Hartenstein, (opening key note): Custom Computing Machines - An Overview; Workshop on Design Methodologies for Microelectronics, Smolenice Castle, Smolenice, Slovakia, Sept. 1995
 R. W. Hartenstein, H. Reinig: Novel Sequencer Hardware for High-Speed Signal Processing; Workshop on Design Methodologies for Microelectronics, Smolenice Castle, Slovakia, Sept. 11-13, 1995
 Reiner W. Hartenstein, Jürgen Becker, Michael Herz, Rainer Kress, Ulrich Nageldinger: A Partitioning Programming Environment for a Novel Parallel Architecture; 10th International Parallel Processing Symposium (IPPS), Honolulu, Hawaii, April 1996
 R. Hartenstein, J. Becker, R. Kress: Custom Computing Machines vs. Hardware/Software Co-Design: from a globalized point of view; submitted for FPL'96, Darmstadt, 1996
 T. A. Kean: Configurable Logic: A Dynamically Programmable Cellular Architecture and its VLSI Implementation; Ph.D. Thesis, University of Edinburgh, Dept. of Computer Science, 1989
 K. Kennedy: Automatic Translation of Fortran Programs to Vector Form; Rice Technical Report 476-029-4, Rice University, Houston, October 1980
 R. Kress: A fast reconfigurable ALU for Xputers; Ph. D. dissertation (submitted), Kaiserslautern University, 1996
 R. Kress: Computing with reconfigurable ALU arrays; ITpress (planned for 1996)
 D. B. Loveman: Program Improvement by Source-to-Source Transformation; Journal of the Association for Computing Machinery, Vol. 24, No. 1, pp. 121-145, January 1977
 W. Luk, T. Lu, I. Page: Hardware-Software codesign of Multidimensional Programs; Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines, FCCM'94, Napa, CA, April 1994
 S. Malik, W. Wolf, A. Wolfe, Y.-T. Li, T.-Y. Yen: Performance Analysis of Embedded Systems; NATO/ASI, Tremezzo, Italy, 1995, Kluwer Acad. Publishers, 1995
 P. Marwedel, G. Goossens (ed.): "Code Generation for Embedded Processors", Kluwer Acad. Publishers, 1995
 D. Monjau: Proc. GI/ITG Workshop Custom Computing, Dagstuhl, Germany, June 1996
 K. L. Pocek, D. Buell: Proc. IEEE Workshop on FPGAs for Custom Computing Machines, FCCM'93, Napa, CA, April 1993
 see the preceding entry, but for 1994, 1995, and 1996
 P. Puschner, Ch. Koza: Calculating the Maximum Execution Time of Real-Time Programs; Journal of Real-Time Systems, pp. 159-176, Kluwer Academic Publishers, 1989
 Sato et al.: An Integrated Design Environment for Application Specific Integrated Processors, Proc. ICCD 1991
 K. Schmidt: A Program Partitioning, Restructuring, and Mapping Method for Xputers; Ph.D. Thesis, University of Kaiserslautern, 1994
 K. Schmidt: Xputers: a high performance computing paradigm; ITpress (planned for 1996)
 Z. Shen, Z. Li, P.-C. Yew: "An Empirical Study of Fortran Programs for Parallelizing Compiler"; IEEE Transactions on Parallel and Distributed Systems, Vol.1, No.3, pp. 356-364, July 1990.
 N. A. Sherwani: Algorithms for Physical Design Automation; Kluwer Academic Publishers, Boston 1993
 R. E. Tarjan: Testing Flow Graph Reducibility; Journal of Computer and System Sciences 9, pp. 355-365, 1974
 M. Wolfe, C.-W. Tseng: The Power Test for Data Dependence; IEEE Transactions on Parallel and Distributed Systems, Vol. 3, No. 5, Sept. 1992
 H.P. Zima, B. Chapman: "Supercompilers for Parallel and Vector Computers"; ACM Press Frontier Series, Addison-Wesley, 1990.