

A Novel Hardware/Software Co-Design Framework



Reiner W. Hartenstein, Jürgen Becker, Rainer Kress, Helmut Reinig, Karin Schmidt

University of Kaiserslautern

Erwin-Schrödinger-Straße, D-67663 Kaiserslautern, Germany

Fax: ++49 631 205 2640, email: abakus@informatik.uni-kl.de

Abstract

A novel hardware/software co-design framework (CoDe-X) is presented in this paper, in which an Xputer is used as a universal accelerator based on reconfigurable datapath hardware. CoDe-X accepts C programs and carries out both the profiling-driven host/accelerator partitioning for performance optimization and the resource-driven sequential/structural partitioning of the accelerator source code to optimize the utilization of its reconfigurable datapath resources.

1. A Comparative Introduction

The hardware/software co-design scene has introduced a number of approaches to speed up performance, to improve the hardware/software trade-off and to reduce total design time [EHB93], [GCM94], [LWP94]. Further co-design research advocates the so-called Custom Computing Machines (CCMs) [FCCM94], where a von Neumann host implementation is extended by adding external operators, which are configured on field-programmable circuits. Usually this configuration requires hardware experts, and it is already difficult to add even one or two extra instructions to a given instruction set. The von Neumann paradigm does not efficiently support "soft" hardware because of the extremely tight coupling between its instruction sequencer and the ALU. So a new paradigm is required, like that of Xputers [HHW90], which conveniently supports "soft" ALUs such as the rALU concept (reconfigurable ALU). An Xputer is a reconfigurable, parallel and data-driven architecture. The hardware structure of this novel architectural paradigm and its execution mechanism are described in section 2. This new class of hardware platforms needs a new class of compilers, which generate both sequential and structural code: partitioning compilers, which partition a source into cooperating structural and sequential code segments. In such an environment hardware/software co-design requires two levels of partitioning: host/accelerator partitioning for optimizing performance, and structural/sequential partitioning for optimizing the hardware/software trade-off of the Xputer resources. This paper presents a framework based on this paradigm. The hardware/software co-design framework CoDe-X targets the partitioning and compilation of conventional C programs onto a host using the Xputer as a universal configurable accelerator.

In contrast to other related work in this field [EHB93], our solution requires no programming language extensions, which supports the portability of applications. Our partitioning is also based on a simulated annealing algorithm. The host and the Xputer can run in parallel, depending on the data dependencies of the program. Another difference is the possibility of partially reconfiguring the accelerator while host and Xputer are running. Unlike Gupta and De Micheli [GCM94], who use C extended with hardware features, we place no such restrictions on the input of our framework. While their partitioning works at the instruction level, we perform a coarse-grained partitioning at the basic block level, which reduces the communication overhead. In comparison to Custom Computing Machines, our partitioning into software and hardware parts, including synchronization and communication, is fully automatic. In the CoDe-X framework complete applications can be moved and compiled onto the Xputer without using commercial hardware-synthesis tools. The divide-and-conquer strategies used by Luk, Lu and Page [LWP94] are good examples of an efficient partitioning into parts for the host and the accelerator; the approach described in this paper is not limited to such algorithms. The fundamentally new feature of the CoDe-X framework is its two-level co-design approach. The paper is organized as follows: section 3 presents the hardware/software co-design framework CoDe-X, showing the partitioning algorithms and sketching the tools used to evaluate the compilation and synthesis tasks; section 4 discusses the partitioning of a computation-intensive example from the area of image processing.

2. The Underlying Hardware

The underlying hardware consists of a host workstation that uses the Xputer as a hardware accelerator. The key concept of an Xputer [AHR94], [HHW90] is to map the structure of performance-critical algorithms onto the hardware architecture. Performance-critical algorithms typically apply the same set of operations to a large amount of data. The main applications are therefore numeric and scientific computations, whose regular data dependencies allow the Xputer hardware features to be exploited efficiently. Since the same operations are applied over and over, an Xputer has a reconfigurable arithmetic-logic unit (rALU), which can implement complex operations on multiple input words and compute multiple results (figure 1). All input and output data of the complex rALU operations is held in a so-called scan window (SW). A scan window is the programming model of a sliding window, which moves across the data memory under the control of a data sequencer. All data in the scan windows can be accessed in parallel by the rALU.

The large amount of input data typically requires an algorithm to be organized in (nested) loops, where different array elements are referenced as operands of the computations of the current iteration. The resulting sequence of data accesses shows a regularity which allows this sequence to be described by a small number of parameters. An Xputer data sequencer provides up to seven hardwired generic address generators (GAGs), which are capable of interpreting such parameter sets to compute generic address sequences for the scan windows. These seven GAGs can operate in parallel [HBK95]. To be run on an Xputer, an algorithm has to be transformed into parameter sets for the generic address generators and a configuration of the rALU, which implements the data manipulations within a loop body. Each time the generic address generators have accessed all input data of a loop iteration, the complex operations configured into the rALU are applied to the input data, and the outputs can be written to the data memory. Since a compiler for an Xputer may rearrange the data in memory to optimize the data access sequences, a data map is required to describe the distribution of input and output data in the data memory.
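To make the parameter-set idea concrete, here is a minimal C sketch of a two-level scan pattern; the field names (base, stride_x, etc.) are our own illustration, not the actual GAG register layout:

```c
#include <stdio.h>

/* Hypothetical parameter set for one generic address generator (GAG):
 * a two-level (nested-loop) scan over the data memory.  The field
 * names are illustrative only; the real GAG registers differ. */
typedef struct {
    unsigned base;      /* start address of the scan pattern */
    int      stride_x;  /* address step of the inner loop    */
    int      stride_y;  /* address step of the outer loop    */
    unsigned count_x;   /* number of inner-loop steps        */
    unsigned count_y;   /* number of outer-loop steps        */
} GagParams;

/* Enumerate the generic address sequence described by the parameters,
 * as a GAG would when moving a scan window across the data memory. */
static void gag_scan(const GagParams *p)
{
    for (unsigned y = 0; y < p->count_y; y++)
        for (unsigned x = 0; x < p->count_x; x++)
            printf("%u\n", p->base + y * p->stride_y + x * p->stride_x);
}

int main(void)
{
    GagParams p = { 0x1000, 1, 64, 8, 8 };  /* 8x8 window walk, row pitch 64 */
    gag_scan(&p);
    return 0;
}
```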

Our most recent Xputer prototype operates as a kind of configurable co-processor to a host computer. All I/O operations as well as the memory management are done by the host. The host has direct access to all memory on the Xputer, and the Xputer in turn can access data in the host's memory. The reconfigurable datapath architecture (rDPA), which serves as rALU, has been developed especially for evaluating statement blocks in loops and other arithmetic and logic computations [HaKr95]. The rDPA consists of a regular array of identical processing elements called datapath units (DPUs, figure 2). Each DPU has two input and two output registers. The dataflow direction is from west and/or north to east and/or south. The operation of the DPUs is data-driven: an operation is performed as soon as the required operands are available. The communication between neighbouring DPUs is synchronized by a handshake. A global I/O bus is integrated into the rDPA, permitting the DPUs to write from their output registers directly to the outside of the array and to read directly from outside. This means that input data to expressions mapped into the rDPA does not need to be routed through the DPUs. The communication between an external controller, or host, and the DPUs is synchronized by a handshake like the internal communications. An extensible set of operators for each DPU is provided by a library; the set includes the operators of the programming language C. The operators of the DPUs are configurable. A DPU is implemented using a fixed ALU and a microprogrammed control. The configuration is data-driven as well, and therefore no special timing has to be considered. Together with the rALU controller the rDPA forms the data-driven rALU, as shown in figure 2. The controller chip consists of a control unit, a register file, and an address generation unit for addressing the DPUs. The register file is useful for optimizing memory cycles. The rDPA control unit holds a program to control the different parts of the data-driven rALU. The current version of the rDPA is described in [HaKr95].
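The data-driven firing rule of a DPU can be pictured with a small behavioral model in C (ours, not the microprogrammed implementation): an operation executes as soon as both input registers hold valid operands, and the handshake marks the result as available:

```c
#include <stdbool.h>
#include <stdio.h>

/* Behavioral model of one datapath unit (DPU): two input registers,
 * one configured operation, data-driven firing.  Illustrative only. */
typedef struct {
    int  in[2];
    bool valid[2];          /* handshake: operand has arrived       */
    int  out;
    bool out_valid;
    int  (*op)(int, int);   /* operator configured from the library */
} Dpu;

static int add(int a, int b) { return a + b; }

/* Deliver an operand over the (simulated) handshake. */
static void dpu_receive(Dpu *d, int port, int value)
{
    d->in[port] = value;
    d->valid[port] = true;
    if (d->valid[0] && d->valid[1]) {      /* fire as soon as both arrive */
        d->out = d->op(d->in[0], d->in[1]);
        d->out_valid = true;
        d->valid[0] = d->valid[1] = false; /* consume the operands        */
    }
}

int main(void)
{
    Dpu d = { {0, 0}, {false, false}, 0, false, add };
    dpu_receive(&d, 0, 3);                    /* nothing happens yet       */
    dpu_receive(&d, 1, 4);                    /* second operand: DPU fires */
    if (d.out_valid) printf("%d\n", d.out);   /* prints 7                  */
    return 0;
}
```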

3. The Hardware/Software Co-Design Framework CoDe-X

This section gives an overview of the hardware/software co-design framework CoDe-X. The input language of the framework is the programming language C. In a first level, the input is partitioned into a part for execution on the host and a part for execution on the accelerator; here the Xputer is used as accelerator. The parts suitable for execution on the Xputer are described in Xputer-C (X-C). X-C is an almost complete subset of the programming language C, which lacks only dynamic structures. The X-C input is then partitioned in a second level into a sequential part for programming the data sequencer and a structural part for configuring the rALU (see figure 3). The output of the X-C compiler consists of three files: a rALU file for the configuration of the rALU, a parameter set for loading the data sequencer (data sequencer code), and a description of the storage scheme (datamap). The rALU file is used for further synthesis in the datapath synthesis system (DPSS). The description of the storage scheme gives the location of the variables in memory and thus describes part of the interface between the host and the Xputer. The part of the C program that is partitioned onto the host is compiled with the GNU C compiler. Figure 3 gives a detailed overview of the hardware/software co-design framework CoDe-X.

3.1 Partitioning based on Profiling

The first-level partitioner represents the input C program as a graph, called flow graph. This flow graph is first filtered for code segments without dynamic structures, operating system calls and I/O routines, because such segments are well suited for the Xputer. This is performed by a preprocessor within the partitioner of CoDe-X (see figure 3). These segments are then partitioned into basic blocks (BBs, grey shaded in figure 4), which can later be executed on the host or on the Xputer. A basic block is a sequence of consecutive statements in which the flow of control enters at the beginning and leaves at the end, without halt or possibility of branching except at the end. The remaining code segments are host-specific (for example I/O routines); they are also partitioned into basic blocks (H-BBs in figure 4), which will in every case be executed by the host. Figure 4a shows an example of a graph representation of basic blocks with their control flow. A data dependency analysis based on the GCD test [WoTs92] yields the data dependencies between the basic blocks (figure 4b). In this phase some heuristics can be introduced that reduce the number of basic blocks. Basic blocks in nested loops may be merged into a single compound basic block if the result fits onto the reconfigurable datapath architecture; the required area can be approximated by the number of operators in the statement blocks of these loops. This heuristic is based on the knowledge that nested loops achieve high performance factors on Xputers. Rearranging the basic blocks according to their data dependencies results in a data flow graph, where the basic blocks are represented by the nodes (figure 4c). Basic blocks that can be evaluated concurrently are displayed side by side.
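As a small illustration (our own fragment, not one of the figures), the following C code decomposes into basic blocks at the loop and branch boundaries:

```c
/* Illustrative fragment; the comments mark where basic blocks end,
 * i.e. wherever control flow can branch or join. */
void example(const int *a, int *b, int n)
{
    int i, s = 0;            /* BB1: straight-line entry code           */
    for (i = 0; i < n; i++)  /* the loop test forms its own block       */
        s += a[i];           /* BB2: loop body, one entry and one exit  */
    if (s > 0)               /* the branch closes the preceding block   */
        b[0] = s;            /* BB3: taken branch                       */
    else
        b[0] = -s;           /* BB4: not-taken branch                   */
}
```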

Profiler

The partitioning into parts for the host and parts for the Xputer is based on the performance analysis of a profiler. For each basic block, both the evaluation time on the Xputer and the evaluation time on the host are approximated. The performance data for the Xputer task is obtained from the datapath synthesis system (DPSS) and the X-C compiler. The X-C compiler is able to evaluate the number of iterations of the statement blocks. Knowing the evaluation time of a statement block from the datapath synthesis system, the complete evaluation time can be approximated. The time-consuming placement of operators in the DPSS does not have to be performed to approximate the execution time of the statement blocks.
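In our notation, this estimate reduces to a simple product:

$t_{Xputer}(BB) \approx n_{iter}(BB) \cdot t_{block}(BB)$

where $n_{iter}$ is the iteration count determined by the X-C compiler and $t_{block}$ the statement-block evaluation time reported by the DPSS.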

The optimizations provided by the X-C compiler and the datapath synthesis system make it possible to realize several implementations for the reconfigurable datapath architecture with different performance/area trade-offs in the second level of partitioning of CoDe-X. This trade-off space is called the design space. Figure 5 marks the regions that can be reached with the different optimization techniques. Pipelining of vectorized statements can be used for all implementations where vector statements occur. It reduces the area of the implementation drastically; when the implementation is bound by I/O operations, this can be done without losing performance, while in other cases the speed decreases slightly. Loop folding and loop unrolling can be used when the statement block occurs in an inner loop. Loop folding applies pipelining across loop iterations or loop boundaries; this technique always increases performance without requiring additional area. Loop unrolling in most cases yields the highest speed; depending on the number of unrolled iterations, the area increase may be significant. For data dependencies between the iterations of a loop (overlapping scan windows), memory cycle optimization can be applied, resulting in a further increase of speed. Of course the proposed methods can be combined to reach further positions in the design space.

The performance of the host is approximated by examining the code. For each processor type and workstation another model is required. The behaviour of the underlying hardware and operating system has to be deterministic and known. This concerns the timing behaviour of all hardware components, the effects of caching, pipelining, etc. The operating system must provide static memory management, system calls should have a calculable timing behaviour, and no asynchronous interrupts should be allowed [PuKo89]. Branch prediction techniques [BaLa93] are taken into account when computing the execution time of a code segment. In figure 5 a grey shaded line gives the performance of the host. For each basic block several acceleration factors can be computed. The initial point in the design space requires n operators, i.e. DPUs, for implementing the function of this basic block. If loop folding (pipelining across loop boundaries) is possible, the performance (acceleration factor) increases up to max. fold without requiring more DPUs. If pipelining of vectorized statements can be performed, the performance stays the same (max. pipe) in the best case, or the acceleration factor decreases, but the number of DPUs used will be smaller than before. The highest increase in acceleration factors is gained by loop unrolling (max. unroll), which requires the most hardware resources of the proposed optimization techniques: if two iterations of a loop body are unrolled, then 2n DPUs are needed (see figure 5). The values of max. fold, max. pipe and max. unroll depend on the basic block under consideration.

For each point in the design space that can be reached with an Xputer implementation, the acceleration factor is computed as:

(1) $\mathrm{acceleration\ factor} = \dfrac{\mathrm{performance}_{Xputer}}{\mathrm{performance}_{host}}$

Simulated Annealing

The partitioning phase is based on a simulated annealing algorithm [Sher93]. This algorithm uses iterative improvement and belongs to the probabilistic algorithms. The basic blocks of a chosen partitioning are swapped. If the new partitioning results in a lower overall cost, it is accepted. If not, there is still a finite probability of accepting it, which depends on the temperature of the annealing process. This prevents the algorithm from getting stuck in a local minimum. The temperature is very high at the beginning and is gradually lowered. Figure 6 shows the simulated annealing partitioning algorithm.

First, an initial partitioning for the simulated annealing process is computed: the basic blocks that were filtered out by the preprocessor in the first level of partitioning as possible candidates for execution on the Xputer are placed in two partitions based on their acceleration factors (see equation (1)). One partition contains the basic blocks to be executed on the host, the other the basic blocks to be executed on the Xputer (figure 7). The basic blocks containing I/O routines and dynamic structures have been filtered out before and are always executed on the host (white basic blocks H-BBs in figure 7).

The PERTURB-function uses two possibilities to change the partitioning randomly: moving a randomly chosen basic block from the host partition to the Xputer partition, or moving a randomly chosen basic block from the Xputer partition to the host partition. It randomly chooses one of these two possibilities, because simulated annealing is a probabilistic algorithm for optimizing combinatorial problems, in which the exchange of elements is completely arbitrary.

In general the cost is determined by the delay times of the basic blocks, plus the penalty cost for synchronization and possible run-time reconfiguration of the Xputer. The COST-function gives the complete cost by determining the overall execution time of the algorithm under the current partitioning. The cost of a partitioning is determined by the profiler according to equation (2):

(2) $cost = \sum t_{host} + \sum t_{Xputer} + \sum t_{reconf} + \sum t_{sync} - \sum t_{overlap}$

where:

$\sum t_{host}$ : sum of the execution times of the basic blocks executed on the host

$\sum t_{Xputer}$ : sum of the execution times of the basic blocks executed on the Xputer

$\sum t_{reconf}$ : sum of the delay times for reconfiguring the Xputer during run time

$\sum t_{sync}$ : sum of the delay times for synchronizing host and Xputer (operating system calls)

$\sum t_{overlap}$ : sum of the overlapping execution times of basic blocks executed simultaneously on host and Xputer

Due to the data dependencies, the profiler takes concurrent evaluations into consideration. Moreover, successive data-dependent basic blocks can be fused into a compound basic block, depending on the available Xputer hardware resources; this may result in a lower delay time than the sum of the delay times of the individual basic blocks. Such delay times are considered, too. The profiler also considers possible delay times for reconfiguring the Xputer during run time. This becomes necessary if a partitioning moves more basic blocks to the Xputer than parallel GAG and rALU units are available. If possible, the reconfiguration is performed in parallel to running host tasks. For faster access, all delay times are stored in a lookup table. To speed up the simulated annealing algorithm, only the cost change due to the exchange is computed; the remaining basic blocks are not considered. For controlling the simulated annealing process, the COST-function can additionally be multiplied by a specific weight function (e.g. an exponential function), which results in a faster move towards optimized solutions. The cooling schedule is controlled by the linear function f(temp) = temp * 0.95. The RANDOM()-function returns a random number in the range between its first and second argument.
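The following self-contained C sketch illustrates the annealing loop described above. The per-block times, the initial temperature and the stop criterion are invented for illustration; the real COST-function additionally accounts for the reconfiguration, synchronization and overlap terms of equation (2) via the lookup table, and computes only the cost change of the moved block.

```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_BB 8

/* Toy stand-ins for the profiler data: per-block execution times on
 * host and Xputer (made-up numbers, for illustration only). */
static const double t_host[NUM_BB]   = { 4, 9, 7, 3, 8, 6, 5, 2 };
static const double t_xputer[NUM_BB] = { 5, 2, 1, 4, 2, 3, 6, 3 };

typedef struct { int on_xputer[NUM_BB]; } Partitioning;

/* COST: overall execution time under the current partitioning.
 * Simplified to the first two terms of equation (2). */
static double COST(const Partitioning *p)
{
    double t = 0.0;
    for (int i = 0; i < NUM_BB; i++)
        t += p->on_xputer[i] ? t_xputer[i] : t_host[i];
    return t;
}

/* PERTURB: randomly move one basic block to the other partition. */
static void PERTURB(Partitioning *p)
{
    int i = rand() % NUM_BB;
    p->on_xputer[i] = !p->on_xputer[i];
}

static double RANDOM(double lo, double hi)
{
    return lo + (hi - lo) * rand() / ((double)RAND_MAX + 1.0);
}

int main(void)
{
    Partitioning p = { {0} };         /* start: everything on the host */
    double temp = 100.0;              /* initial temperature: assumed  */
    while (temp > 0.01) {             /* stop criterion: assumed       */
        Partitioning trial = p;
        PERTURB(&trial);
        double delta = COST(&trial) - COST(&p);
        /* accept improvements always; accept degradations with a finite,
         * temperature-dependent probability to escape local minima */
        if (delta < 0.0 || RANDOM(0.0, 1.0) < exp(-delta / temp))
            p = trial;
        temp *= 0.95;                 /* cooling schedule from the text */
    }
    printf("final cost: %g\n", COST(&p));
    return 0;
}
```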

3.2 The Xputer Part

As shown in figure 3, the program part for the Xputer passes through the X-C compiler and the datapath synthesis system (DPSS). The split into compiler and synthesis system allows a simple adaptation of the framework if a different reconfigurable ALU is used.

3.2.1 The X-C Compiler

The X-C compiler [Schm94] takes an almost complete subset of ANSI C as input. Only constructs that would require dynamic memory management to run on the Xputer are excluded: pointers, operating system calls and recursive functions. Since the host's operating system takes care of memory management and I/O, the software parts executed on the Xputer do not need such constructs. Especially for scientific computing, the restrictions of the C subset are not that important, since FORTRAN 77 lacks the same features and is still most popular in scientific computing. The X-C compiler analyzes, restructures and parallelizes the program part for the Xputer. It computes the parameter sets for the data sequencer and a rALU file for further synthesis in the datapath synthesis system (see figure 3), without user interaction. This realizes the resource-driven partitioning of the second level of hardware/software co-design in the CoDe-X framework. First, the compiler performs a data and control flow analysis. The resulting data structure allows restructuring and code optimizations that perform parallelizations like those done by compilers for supercomputers (e.g. pipelining of vectorized statements, loop unrolling etc.). These code transformations affect the performance of the basic blocks on the Xputer (see figure 5), whose execution times are computed by the profiler. These basic blocks are then partitioned in the first level of hardware/software co-design, within the simulated annealing process, onto the host or onto the Xputer (see section 3.1). In the next step the X-C compiler performs a re-ordering of data accesses to obtain access sequences that can be mapped well onto the parameters of the data sequencer. To this end, the compiler generates a so-called datamap, which describes how the input data has to be distributed in the data memory to obtain optimized hardware-generated address sequences.

3.2.2 The Datapath Synthesis System

The datapath synthesis system (DPSS) allows statements from a high-level language description to be mapped onto the rDPA. The statements may contain arithmetic or logic expressions, conditions, and loops that evaluate iterative computations on a small number of input data.

The task of configuring the rDPA is carried out in the following four phases: logic optimization and technology mapping, placement and routing, I/O scheduling, and finally code generation. Partitioning the statements onto the different rDPA chips is not necessary, since the array of rDPA chips appears as one array of DPUs with transparent chip boundaries.

Logic optimization and technology mapping

The condition statements are converted into single-assignment code, and the same is done for the loop conditions; the loop itself is controlled by the rALU controller. Loops and sequences of assignments are considered as basic blocks, from which directed acyclic graphs (DAGs) are constructed. In this process common subexpressions, identical assignments, local variables, and dead code are removed; furthermore, constant folding and strength reduction are applied. Unary operators are merged into the following operator if the operator library provides such a merged operator. This step reduces the number of required DPUs in the rDPA array. In addition, the parallelism of single expressions is increased by tree-height reduction, using a simple algorithm that exploits the commutativity and associativity of some operators.
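As a small example of tree-height reduction (our own), associativity lets the compiler rebalance a chain of additions so that two DPUs can work in parallel:

```c
/* Left-to-right evaluation needs three dependent steps: */
int chain(int a, int b, int c, int d)
{
    return ((a + b) + c) + d;   /* height 3: t1, t1+c, (t1+c)+d */
}

/* Rebalanced by associativity: the two inner sums are independent
 * and can be evaluated by two DPUs in parallel (height 2). */
int balanced(int a, int b, int c, int d)
{
    return (a + b) + (c + d);
}
```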

Placement and routing

A poor placement degrades the performance, since some internal variables then have to be routed via the internal I/O bus, which is blocked for other I/O operations during that time. A simulated annealing algorithm is chosen, which gives better results than a simple constructive cluster-growth algorithm, especially when the rDPA is connected as a torus. The simulated annealing algorithm belongs to the probabilistic algorithms which improve iteratively [Sher93]. The cost function considers the chip boundaries, the routing operators and, with a high cost, the connections via the internal I/O bus. A kind of maze router is then used for the final routing.

I/O scheduling

Due to the global I/O bus of the rDPA array, the loading and storing of data is restricted to one operation at a time, so an optimal sequence of these I/O operations has to be determined. Furthermore, if the statements of the algorithm belong to an inner loop and are executed several times in direct sequence, pipelining is used to improve speed. If, on the other hand, the statements belong to an outer loop and are executed once or several times but not in direct sequence, the vectorized expressions are reused for different variables, thus improving area utilization. This scheduling is done using a kind of list-based scheduling algorithm known from high-level synthesis [GDW92].
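A minimal C sketch of such a list-based schedule for the single global bus follows; the ready times and priorities are invented for illustration and do not reflect the actual DPSS heuristic:

```c
#include <stdio.h>

#define NUM_OPS 5

/* Toy model of I/O operations competing for the single global bus:
 * each transfer becomes ready at a given step and carries a priority.
 * All numbers are made up for illustration. */
typedef struct { int ready; int prio; int done; } IoOp;

int main(void)
{
    IoOp op[NUM_OPS] = {
        {0, 2, 0}, {0, 5, 0}, {1, 1, 0}, {2, 4, 0}, {2, 3, 0}
    };
    /* List-based scheduling: one bus transfer per step; among the
     * ready, unscheduled operations pick the highest priority. */
    for (int step = 0, left = NUM_OPS; left > 0; step++) {
        int best = -1;
        for (int i = 0; i < NUM_OPS; i++)
            if (!op[i].done && op[i].ready <= step &&
                (best < 0 || op[i].prio > op[best].prio))
                best = i;
        if (best >= 0) {
            printf("step %d: bus transfer %d\n", step, best);
            op[best].done = 1;
            left--;
        }
    }
    return 0;
}
```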

Code generation

The rDPA configuration file is computed from the mapping information of the processing elements and a library with the microprogram code of the operators. The configuration file for the rALU control unit is extracted from the final schedule of the I/O operations.

For further details about the four phases of the DPSS and examples of mapping a high-level language description onto the rDPA, see [HaKr95].

3.3 The Interface and the Host Part

Since the Xputer shares its memory with the host, the exchange of data is managed via the memory. Because the Xputer requires the data in a specific arrangement in memory, described by the datamap, the addresses of shared variables in the host program have to be updated accordingly. The host starts the Xputer with a special control signal, which allows the Xputer to run on its own. The reconfigurations of the Xputer for each task can be performed before the start signal, in parallel to running host tasks. An interrupt generated by the Xputer signals the end of the computation; until then, an algorithm on the host may run concurrently.

Each time the Xputer is accessed, a synchronization point is inserted by the first-level partitioner into the program to be executed on the host. The code at these points consists of operating-system calls for synchronizing parallel processes (fork/join etc.). The tasks for the host are then compiled with the GNU C compiler.
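The host-side code at such a point might look as follows. This is a sketch under our assumptions: the names xputer_configure, xputer_start and xputer_wait are hypothetical stand-ins for the control signal and the end-of-computation interrupt described above, not an actual runtime API.

```c
#include <stdio.h>

/* Hypothetical driver interface: these stubs stand for the control
 * signal and the end-of-computation interrupt described in the text. */
static void xputer_configure(const void *ralu_file, const void *ds_code)
{ (void)ralu_file; (void)ds_code; printf("configure rALU + data sequencer\n"); }
static void xputer_start(void) { printf("start signal -> Xputer runs\n"); }
static void xputer_wait(void)  { printf("wait for end-of-computation interrupt\n"); }
static void host_basic_blocks(void) { printf("data-independent host BBs\n"); }

int main(void)
{
    /* reconfiguration may overlap preceding host work */
    xputer_configure(0, 0);

    xputer_start();          /* Xputer computes on its own       */
    host_basic_blocks();     /* data-independent host blocks run */
                             /* concurrently in the meantime     */
    xputer_wait();           /* synchronization point (join)     */
    return 0;
}
```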

4. Example: Smoothing Algorithm for Edges

Smoothing operations are used primarily for diminishing spurious effects that may be present in a digital image as a result of a poor sampling system or transmission channel. Neighborhood averaging is a straightforward spatial-domain technique for image smoothing [GoWi77]. Given an N x N image f(x,y), the procedure is to generate a smoothed image g(x,y) whose gray level at each point (x,y) is obtained by averaging the gray-level values of the pixels of f contained in a predefined neighborhood (kernel) of (x,y). In other words, the smoothed image is obtained by using the equation:

(3) $g(x,y) = \dfrac{1}{M} \sum_{(n,m) \in S} f(n,m)$

for x,y = 0, 1, ..., N-1. S is the set of coordinates of points in the neighborhood of (but not including) the point (x,y), and M is a pre-computed normalization factor. The structure of the corresponding C program is illustrated in figure 8. This small example for illustrating the methods of CoDe-X is divided into four basic blocks. Two of them were filtered out in the first step of the iterative partitioner to be executed on the host in every case (H-BB1, H-BB4: I/O routines). The remaining two basic blocks (BB2, BB3) are potential candidates for mapping onto the Xputer; the profiler computes the host and Xputer cost functions for them. The data dependencies require a sequential execution of the four basic blocks, which influences the overall execution time of this application. The final decision was to shift BB2 and BB3 to the Xputer. The X-C compiler on the second level of hardware/software co-design then maps these basic blocks onto the Xputer. In this final solution the Xputer has to be configured only once. The address generation for the data accesses in the fully nested loop of BB2 is handled by one generic address generator (GAG, see figure 1), that of BB3 by another GAG. At run time a controller within the data sequencer switches from one GAG to the other without reconfiguration. The performance/area trade-off of BB3 is illustrated in figure 9; it was derived by implementing this block on a SUN SPARC 10/51 (287.0 seconds) and by analyzing its implementation on the current Xputer prototype (unrolled version in 19.6 seconds, assuming 120 ns memory access time). This results in an acceleration factor of 14.6. By additionally dividing the image into stripes, which are manipulated by different independent GAGs and rALUs in parallel, this acceleration factor can be multiplied by a factor of 5, resulting in the final acceleration factor of max. unroll = 73.0 (see figure 9). The code optimizations and transformations necessary for this (e.g. strip mining) are performed by the iterative partitioner in the first level of CoDe-X: a large fully nested loop over an image is transformed into a number of smaller independent loops, which can exploit the parallel Xputer hardware resources in an optimized way.
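For illustration, here is a minimal C version of the loop nest behind equation (3), assuming a 3 x 3 neighborhood (M = 8) and leaving the image border untouched; figure 8 shows the structure of the actual program:

```c
#define N 512   /* image size, chosen for illustration */

/* Neighborhood averaging with a 3x3 kernel, excluding the center pixel
 * (equation (3) with M = 8); border pixels are left unprocessed here. */
void smooth(const unsigned char f[N][N], unsigned char g[N][N])
{
    for (int y = 1; y < N - 1; y++)
        for (int x = 1; x < N - 1; x++) {
            int sum = 0;
            for (int m = -1; m <= 1; m++)
                for (int n = -1; n <= 1; n++)
                    if (n != 0 || m != 0)   /* S excludes (x,y) itself */
                        sum += f[y + m][x + n];
            g[y][x] = (unsigned char)(sum / 8);
        }
}
```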

In this example the communication overhead between host and Xputer is very small, because the image data can be accessed by both hardware platforms in the common memory and does not have to be transferred. Due to this fact, applications from image processing are well suited for our accelerator.

5. Conclusions

CoDe-X, a two-level hardware/software co-design framework for Xputers, has been presented. The Xputer is used as a universal accelerator based on reconfigurable datapath hardware. One of the new features is the two-level co-design approach: CoDe-X accepts C programs and carries out both the profiling-driven host/accelerator partitioning for performance optimization and the resource-driven sequential/structural partitioning. In the first level, performance-critical parts of an algorithm are mapped onto the Xputer without manual interaction. The second level uses a resource-driven compilation/synthesis of the accelerator source code to optimize the utilization of its reconfigurable datapath resources.

The hardware/software co-design framework CoDe-X is completely specified. The X-C compiler and the datapath synthesis system (DPSS) have been implemented. The partitioner and the profiler are currently being implemented.

References

[HBK95]

Reiner W. Hartenstein, Jürgen Becker, Rainer Kress, Helmut Reinig, Karin Schmidt: A Reconfigurable Machine for Applications in Image and Video Compression; Conference on Compression Technologies and Standards for Image and Video Compression, Amsterdam, The Netherlands, March 1995

[AHR94]

A. Ast, R. W. Hartenstein, H. Reinig, K. Schmidt, M. Weber: A General Purpose Xputer Architecture derived from DSP and Image Processing; in M. A. Bayoumi (Ed.): VLSI Design Methodologies for Digital Signal Processing Architectures; Kluwer Academic Publishers, Boston, London, Dordrecht, pp. 365-394, 1994

[BaLa93]

Thomas Ball, James R. Larus: Branch Prediction for Free; Proc. of the ACM SIGPLAN '93 Conference on Programming Language Design and Implementation, pp. 300-313; ACM Press, 1993

[EHB93]

R. Ernst, J. Henkel, T. Benner: Hardware-Software Cosynthesis for Microcontrollers; IEEE Design & Test, pp. 64-75, Dec. 1993

[FCCM94]

Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines, FCCM'94, Napa, CA, April 1994

[GCM94]

R. K. Gupta, C. N. Coelho, G. De Micheli: Program Implementation Schemes for Hardware-Software Systems; IEEE Computer, pp. 48-55, Jan. 1994

[GDW92]

D. D. Gajski, N. D. Dutt, A. C.-H. Wu, S. Y.-L. Lin: High-Level Synthesis, Introduction to Chip and System Design; Kluwer Academic Publishers, Boston, Dordrecht, London, 1992

[HHW90]

R. W. Hartenstein, A. G. Hirschbiel, M. Weber: A Novel Paradigm of Parallel Computation and its Use to Implement Simple High Performance Hardware; InfoJapan'90 - International Conference commemorating the 30th Anniversary of the Computer Society of Japan, Tokyo, Japan, 1990

[HaKr95]

R. W. Hartenstein, R. Kress: A Datapath Synthesis System for the Reconfigurable Datapath Architecture; Asia and South Pacific Design Automation Conference, ASP-DAC'95, Nippon Convention Center, Makuhari, Chiba, Japan, Aug. 29 - Sept. 1, 1995

[LWP94]

W. Luk, T. Lu, I. Page: Hardware-Software codesign of Multidimensional Programs; Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines, FCCM'94, Napa, CA, April 1994

[PuKo89]

P. Puschner, Ch. Koza: Calculating the Maximum Execution Time of Real-Time Programs; The Journal of Real-Time Systems, pp. 159-176, Kluwer Academic Publishers, 1989

[Sher93]

N. A. Sherwani: Algorithms for Physical Design Automation; Kluwer Academic Publishers, Boston 1993

[Schm94]

K. Schmidt: A Program Partitioning, Restructuring, and Mapping Method for Xputers; Ph.D. Thesis, University of Kaiserslautern, 1994

[WoTs92]

M. Wolfe, C.-W. Tseng: The Power Test for Data Dependence; IEEE Transactions on Parallel and Distributed Systems, Vol. 3, No. 5, Sept. 1992

[GoWi77]

R.C. Gonzalez, P. Wintz: Digital Image Processing; Addison-Wesley Publishing Company, USA 1977

