The paper presents a partitioning and parallelizing programming environment for a novel parallel architecture. This universal embedded accelerator is based on reconfigurable datapath hardware. The programming environment accepts C programs and carries out two steps: first, a profiling-driven host/accelerator partitioning for performance optimization, and second, a resource-driven sequential/structural partitioning of the accelerator source code to optimize the utilization of its reconfigurable resources.
Custom Computing Machines (CCMs , ) use application-specific hardware for speed-up. FPGA-based CCMs (FCCMs ) use field-programmable hardware as a general-purpose accelerator co-processor. But application development here usually requires hardware experts, due to the lack of a suitable general paradigm. The von Neumann paradigm does not efficiently support "soft" hardware because of its extremely tight coupling between instruction sequencer and ALU: processor architectures fall apart as soon as the data path is changed. So another paradigm is desirable, like that of Xputers, which supports "soft" ALUs such as the rALU concept (reconfigurable ALU)  or the rDPA approach (reconfigurable data path array) by Rainer Kress .
For this new class of hardware platforms a new class of compilers is needed, which generate both sequential and structural code: i.e. partitioning compilers. This paper introduces a parallelizing compilation method with two levels of partitioning: a host/accelerator partitioning (first level) and a structural/sequential partitioning (second level).
First, the target hardware is briefly summarized. Section 3 explains the partitioning and parallelizing programming environment, which maps conventional C programs onto a host using an Xputer as a universal configurable accelerator.
The scan pattern generator of the GAG consists of two identical one-dimensional address generators for x- and y- addresses. Each one is realized by three identical address steppers: a Base Stepper, an Address Stepper and a Limit Stepper (figure 3).
The address stepper steps the address register A from the initial value, which is copied from B or L, by ΔA increments until a final value defined by a maximum step count or a threshold given by B, L, F, or C. Most non-trivial scan patterns are nested. That's why B or L may be used to step through a master scan pattern, whereas A is used to step through a slave scan pattern. Each dimension of a scan pattern is described by seven parameters shown on top of figure 3. A master stepper (B or L) computes within the outer loop the current base address, from which the inner loop steps A by ΔA. Two-dimensional scan patterns are composed by coordinated parallelism of the two one-dimensional stepper parts, one for the x address Ax, and one for the y address Ay.
Table 1: GAG parameters (values in example for GAG 1 and GAG 2)

Name | Parameter     | Meaning                          | GAG 1 | GAG 2
-----+---------------+----------------------------------+-------+------
F    | Floor         | Absolute minimum for base        |   1   |   1
ΔB   | Base step     | Step width for base stepper      |   0   |   1
B0   | Initial base  | Initial value for base stepper   |   1   |   1
ΔA   | Address step  | Step width for address stepper   |   1   |   3
L0   | Initial limit | Initial value for limit stepper  |   9   |   9
ΔL   | Limit step    | Step width for limit stepper     |   0   |   0
C    | Ceiling       | Absolute maximum for limit       |   9   |   9
figure 5 illustrates GAG address generation by a shuffle exchange data transfer (size C-F is 9, destination step width ΔA is 3, see table 1) using two 1-by-1 scan windows: SW1 as read-only source and SW2 as write-only data sink. GAG1 steps SW1 through memory space. The shuffle, the scan pattern of SW2, is carried out by GAG2. This shuffle is an example of a nested scan pattern, where the master scan pattern stepping B has B0 as initial location and ΔB as step width (see figure 5). The slave scan pattern uses the current B (from the master) as initial value and ΔA as step width. figure 6 shows a few more 2-D scan patterns using a single GAG only. For more details on scan patterns see .
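The stepper behaviour described above can be sketched in software. The following C model is a hypothetical reconstruction, not the actual GAG hardware; in particular, it assumes the generator stops after emitting C - F + 1 addresses in total, which matches the shuffle example of table 1.

```c
#include <assert.h>
#include <stddef.h>

/* Software sketch of one 1-D GAG dimension (hypothetical model, not the
 * actual hardware): the base stepper B starts at B0 and advances by dB,
 * the limit stepper L starts at L0 and advances by dL, and for each base
 * the address stepper A runs from B up to L in steps of dA.  We assume
 * the generator stops after emitting C - F + 1 addresses in total. */
static size_t gag_scan(int F, int dB, int B0, int dA,
                       int L0, int dL, int C, int out[])
{
    size_t n = 0, total = (size_t)(C - F + 1);
    int B = B0, L = L0;
    while (n < total) {
        for (int A = B; A <= L && n < total; A += dA)
            out[n++] = A;          /* address stepper: inner (slave) loop */
        B += dB;                   /* base stepper: outer (master) loop   */
        L += dL;                   /* limit stepper tracks the limit      */
    }
    return n;
}
```

With the GAG2 parameters of table 1 (F=1, ΔB=1, B0=1, ΔA=3, L0=9, ΔL=0, C=9) this model emits the shuffle write order 1, 4, 7, 2, 5, 8, 3, 6, 9, while the GAG1 parameters yield the sequential read order 1..9.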
This technique can be used, e.g., in image processing applications by dividing an image into stripes of equal size in order to manipulate these stripes concurrently . The optimization techniques  of loop distribution (splitting one loop into two loops that together compute the same result), loop fusion (merging two adjacent loops into a single loop), and loop interchange (switching inner and outer loop) are also performed by the 1st-level partitioner in order to exploit potential parallelism and to optimize the utilization of the Xputer hardware resources.
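Two of these transformations can be illustrated on a small C fragment; the arrays, bounds, and computations below are made up purely for illustration.

```c
#include <assert.h>

#define N 4
#define M 3

/* Loop interchange (illustrative): swapping inner and outer loop
 * preserves the result as long as no loop-carried dependence is
 * violated.  Both functions compute the same sum. */
static long sum_row_major(int a[N][M])
{
    long s = 0;
    for (int i = 0; i < N; i++)        /* outer: rows    */
        for (int j = 0; j < M; j++)    /* inner: columns */
            s += a[i][j];
    return s;
}

static long sum_col_major(int a[N][M])
{
    long s = 0;
    for (int j = 0; j < M; j++)        /* interchanged: columns outer */
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

/* Loop fusion: two adjacent loops over the same range merged into one
 * (loop distribution is the inverse transformation). */
static void init_separate(int b[N], int c[N])
{
    for (int i = 0; i < N; i++) b[i] = i;
    for (int i = 0; i < N; i++) c[i] = 2 * i;
}

static void init_fused(int b[N], int c[N])
{
    for (int i = 0; i < N; i++) {      /* fused loop body */
        b[i] = i;
        c[i] = 2 * i;
    }
}
```

The legality checks for such transformations rest on the data-dependence tests cited above; the sketch only shows the shape of the rewritten loops.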
Profiling. For each Xputer task candidate, the worst-case execution time on the accelerator as well as on the host is computed. The performance data for the accelerator is obtained from the X-C compiler by analyzing the execution time of the rALU operators and their iterations , .
Host performance is approximated by code examination. For each host type, a separate model is required. The timing behavior of the underlying hardware and operating system has to be deterministic and known: the operating system must provide static memory management, system calls should have a computable timing behavior, and no asynchronous interrupts should be allowed . The techniques used are similar to those of Malik et al. .
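Such a static timing analysis can be sketched as a small recursive cost calculus in the style of the cited work: the bound of a sequence is the sum of the bounds of its parts, a branch is bounded by its condition plus the more expensive alternative, and a loop by its known iteration bound times its body. The cycle counts in the usage below are invented for illustration.

```c
#include <assert.h>

/* Hypothetical worst-case timing calculus for code examination: every
 * construct gets an upper bound from the bounds of its parts.  All
 * times are abstract cycle counts. */
static long wcet_seq(long a, long b)            /* statement sequence  */
{
    return a + b;
}

static long wcet_if(long cond, long t, long e)  /* if-then-else        */
{
    return cond + (t > e ? t : e);
}

static long wcet_loop(long bound, long body)    /* loop, known bound   */
{
    return bound * body;
}
```

For example, a loop of 10 iterations whose body is a branch with a 2-cycle condition and 5- vs. 8-cycle alternatives is bounded by 10 * (2 + 8) = 100 cycles.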
The profiler also computes the overall execution time when the Xputer is used as an accelerator. This time includes delay times for synchronization, possible reconfigurations and memory re-mappings of the Xputer during run time (see equation (1)):

t_acc = t_HT + t_XT + t_conf + t_map + t_sync - t_overlap        (1)

t_HT: sum of execution times of tasks executed on the host (HT)
t_XT: sum of execution times of tasks executed on the Xputer (XT)
t_conf: sum of delay times for reconfiguring the Xputer during run time
t_map: sum of delay times for re-mapping the two-dimensionally organized Xputer data map during run time
t_sync: sum of delay times for synchronizing host/Xputer (operating system calls)
t_overlap: sum of overlapping execution times between tasks executed simultaneously on host/Xputer
The overall execution time is determined by the delay times of the tasks (host and Xputer), plus the above-mentioned penalty times (see , equation (1)). Memory re-mappings are necessary when two tasks use the same data on the Xputer, but need a different distribution of the data within the two-dimensionally organized memory. Then a re-ordering of the data items must be performed before the execution of the second task starts, which must also be considered in equation (1). The resulting overall execution time is used as cost function in the simulated annealing process ,  and is optimized by computing different partitionings of Xputer tasks, varying their task allocation between host and Xputer.
The access sequences of up to 4 fully nested loops containing variables with linear index functions can be handled by transforming them into parameters for configuring the GAGs (see figure 1). A configuration file for the X-C compiler gives the current restrictions of the hardware, e.g. the number of GAGs available, the size of rALU subnets, etc., which allows a flexible adaptation of the compiler to future upgrades of the Xputer hardware. Further details and examples about the X-C compiler can be found in , , .
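For a single loop with a linear index function, the mapping to GAG-style parameters is straightforward. The sketch below is a hypothetical one-dimensional illustration (parameter names follow table 1, but the derivation is ours): for an access of the form a[c1*i + c0] with i = 0..n-1, the initial address, step width, and limit follow directly from the coefficients.

```c
#include <assert.h>

/* Hypothetical 1-D sketch: a loop
 *     for (i = 0; i < n; i++)  ... a[c1*i + c0] ...
 * touches addresses c0, c0 + c1, ..., c0 + c1*(n-1).  A compiler can
 * turn this into GAG-style parameters: initial address B0 = c0,
 * address step dA = c1, and limit L = c0 + c1*(n-1). */
struct gag_params {
    int B0;   /* initial base     */
    int dA;   /* address step     */
    int L;    /* limit of the run */
};

static struct gag_params map_linear_index(int c1, int c0, int n)
{
    struct gag_params p = { c0, c1, c0 + c1 * (n - 1) };
    return p;
}
```

For example, the access a[3*i + 1] for i = 0..2 maps to B0 = 1, dA = 3, L = 7, which is exactly one inner (slave) run of the shuffle example above.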
 A. Ast, J. Becker, R. W. Hartenstein, R. Kress, H. Reinig, K. Schmidt: Data-procedural Languages for FPL-based Machines; 4th Int. Workshop on Field Programmable Logic and Appl., FPL'94, Prague, Sept. 7-10, 1994, Lecture Notes in Computer Science, Springer, 1994
 U. Banerjee: Dependence Analysis for Supercomputing; Kluwer, 1988.
 K. L. Pocek, D. Buell: Proc. IEEE Workshop on FPGAs for Custom Computing Machines, FCCM'94, Napa, CA, April 1994
 see , but for the years 1994, 1995, and 1996
 R. W. Hartenstein, J. Becker, R. Kress, H. Reinig, K. Schmidt: A Reconfigurable Machine for Applications in Image and Video Compression; Conf. on Compression Techniques & Standards for Image & Video Compression, Amsterdam, Netherlands, March 1995
 R. W. Hartenstein, A. G. Hirschbiel, M. Weber: A Novel Paradigm of Parallel Computation and its Use to Implement Simple High Performance Hardware; InfoJapan'90 - International Conference memorizing the 30th Anniversary of the Computer Society of Japan, Tokyo, Japan, 1990
 see : also in: Future Generation Computer Systems no. 7, pp. 181-198 (North-Holland, 1991/92)
 R. Hartenstein, (opening key note): Custom Computing Machines - An Overview; Workshop on Design Methodologies for Microelectronics, Smolenice Castle, Smolenice, Slovakia, Sept. 1995
 R. W. Hartenstein, R. Kress: A Datapath Synthesis System for the Reconfigurable Datapath Architecture; ASP-DAC'95, Nippon Convention Center, Makuhari, Chiba, Japan, Aug. 29 - Sept. 1, 1995
 R. Hartenstein, J. Becker, R. Kress: Customized Computers: a generalized survey; submitted for FPL'96, Darmstadt, 1996
 R. Hartenstein, J. Becker, R. Kress: Two-level Hardware/Software Partitioning Using CoDe-X; Proc. of ECBS 96, Friedrichshafen, Germany, March 1996
 K. Kennedy: Automatic Translation of Fortran Programs to Vector Form; Rice Technical Report 476-029-4, Rice University, Houston, October 1980
 R. Kress: A fast reconfigurable ALU for Xputers; Ph.D. dissertation (submitted), Kaiserslautern University, 1996
 R. Kress: Computing with reconfigurable ALU arrays; ITpress (planned for 1996)
 D. B. Loveman: Program Improvement by Source-to-Source Transformation; Journal of the Association for Computing Machinery, Vol. 24, No. 1, pp. 121-145, January 1977
 S. Malik, W. Wolf, A. Wolfe, Y.-T. Li, T.-Y. Yen: Performance Analysis of Embedded Systems; NATO/ASI, Tremezzo, Italy, 1995, Kluwer Acad. Publishers, 1995
 D. Monjau: Proc. GI/ITG Workshop Custom Computing, Dagstuhl, Germany, June 1996
 P. Puschner, Ch. Koza: Calculating the Maximum Execution Time of Real-Time Programs; Journal of Real-Time Systems, pp. 159-176, Kluwer Academic Publishers, 1989
 K. Schmidt: A Program Partitioning, Restructuring, and Mapping Method for Xputers; Ph.D. Thesis, University of Kaiserslautern, 1994
 K. Schmidt: Xputers: a high performance computing paradigm; ITpress (planned for 1996)
 N. A. Sherwani: Algorithms for Physical Design Automation; Kluwer Academic Publishers, Boston 1993
 Z. Shen, Z. Li, P.-C. Yew: An Empirical Study of Fortran Programs for Parallelizing Compilers; IEEE Trans. on Parallel and Distrib. Syst., Vol. 1, No. 3, pp. 356-364, July 1990
 R. E. Tarjan: Testing Flow Graph Reducibility; Journal of Computer and System Sciences 9, pp. 355-365, 1974
 M. Wolfe, C.-W. Tseng: The Power Test for Data Dependence; IEEE Transactions on Parallel and Distributed Systems, Vol. 3, No. 5, Sept. 1992
 H.P. Zima, B. Chapman: Supercompilers for Parallel and Vector Computers; ACM Press Frontier Series, Addison-Wesley, 1990