**Notice:** The documents distributed by this server have been provided by the contributing authors as a means to ensure timely dissemination of scholarity and technical work on a noncommercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.

Reiner W. Hartenstein, Jürgen Becker, Rainer Kress, Helmut Reinig, Karin Schmidt

University of Kaiserslautern

Erwin-Schrödinger-Straße, D-67663 Kaiserslautern, Germany

Fax: ++49 631 205 2640, email: abakus@informatik.uni-kl.de

Section 2 introduces the Xputer machine paradigm and our prototype map-oriented machine 3. The usage of this accelerator is shown on the example of the JPEG compression algorithm in section 3. Performance results and a comparison with a modern workstation ends this section. How to encode video images with the MPEG algorithm is outlined in section 4. Finally some conclusions end the paper.

The key concept of an Xputer [2], [7] is to map the structure of performance critical algorithms into the hardware architecture. Performance critical algorithms typically apply the same set of operations to a large amount of data. Since ever the same operations are applied, an Xputer has a reconfigurable arithmetic-logic unit (rALU), which can implement complex operations on multiple input words and compute multiple results (figure 1). The large amount of input data typically requires an algorithm to be organized in (nested) loops, where different array elements are referenced as operands to the computations of the current iteration. The resulting sequence of data accesses shows a regularity, which allows to describe this sequence by a number of parameters. A simple example would be two nested loops, where the inner loop increments the column index to a two-dimensional array, an the outer loop increments the row index, resulting in a video-scan like data access sequence. Such a sequence can be represented by seven parameters: the starting index, the stepwidth, and the maximum index, all three for both rows and columns, and a flag, which tells that columns have to be incremented before rows. An Xputer provides several hardwired Generic Address Generators (GAGs), which are capable of interpreting such parameter sets to compute generic address sequences to the data. Multiple Generic Address Generators simplify parallel access to different arrays of input data, which generally is the case for non-trivial algorithms. To be run on an Xputer, an algorithm has to be transformed into parameters sets for the Generic Address Generators and a configuration of the rALU, which implements the data manipulations within a loop body. Each time, the Generic Address Generators have accessed all input data of a loop iteration, the complex operations configured into the rALU are applied to the input data and the outputs can be written to the data memory by the Generic Address Generators. Since a compiler for an Xputer may rearrange the data in memory to optimize the data access sequences for the Generic Address Generators, a data map is required to describe the distribution of input and output data in the data memory.

The next section introduces an implementation of the Xputer paradigm, called MoM-3 (Map-oriented Machine) because it is the third prototype of a series of machines, which operate on a two-dimensional data memory.

Together with a programmable control chip the rDPA forms a data-driven reconfigurable ALU (rALU), as shown in figure 5. The control chip consists of a control unit, a register file, and an address generation unit for addressing the DPUs. The register file is useful for optimizing memory cycles. It makes it possible to use each DPU in the rDPA for operations by using the internal bus for routing. The address generation unit delivers the address for the DPU registers before each data is written into the rDPA over the bus. The rDPA control unit holds a program to control the different parts of the data-driven rALU. The instruction set consists of instructions for loading data into the rDPA array to a special DPU from the external units, for receiving data from a specific DPU, or branches on a special control signal from the GAG. The rDPA control unit supports context switches between three control programs which allows the use of three independent virtual rALU subnets. The control program is loaded during configuration time.

To increase the hardware usage the JPEG algorithm is implemented in four steps on our reconfigurable machine. This simplifies the explanation of the individual steps. The original picture coming from a host or a video interface is converted into a YUV image first. In this step additional functions such as gamma correction or filters easily can be implemented. Due to the reconfigurability of the Xputer it is simple to adjust the input interface to the different input formats such as 24-bit RGB, 16-bit YCb/YCr colour images or 12-bit greyscale images. The width of the communication buses in the rALU are always 32-bit. But operators can be adapted to compute with an accuracy of e.g. 8-bit or any other number. This improves speed for the compression algorithm since the image conversion is usually computed with 8 bit and the DCT with 11-bit. The full 32-bit I/O bandwidth is used at the rALU-memory interface. The dense packing of the data words is shown later in detail. An Y:U:V 4:2:2 sampling is included in the conversion. The second step is the two dimensional discrete cosine transform (2d-DCT) which is applied on an input of 8 by 8 blocks. The 2d-DCT can be computed by two 1d-DCTs:

(1) $F(u,v)\; =\; 1\; OVER\; 4C(u)C(v)SUM\; SUM\; f(x,y)cos(2x+1)upi\; OVER\; 16cos(2x+1)vpi\; OVER\; 16\; ATOP\; =1\; OVER\; 2C(u)SUM\; \{1\; OVER\; 2C(v)SUM\; f(x,y)cos(2x+1)vpi\; OVER\; 16\}cos(2x+1)upi\; OVER\; 16$

The input data words are shifted from unsigned integer with range [0, 255] to signed integer with range [-128, 127]. The 2d-DCT computes 64 DCT coefficients, one DC and 63 AC coefficients. Because sample values vary slowly from point to point in an image, the compression can concentrate on the coefficients with lower frequency. The higher frequency coefficients have zero or near zero amplitude and need not to be encoded. Step three in our implementation is the quantisation. The goal of this step is to discard all the information which is not necessary to achieve the desired image quality. The 64 element quantisation table is configured as constants in the rALU. The last step is the coding of the DC coefficients and the run length encoding with the Huffmann coding. The necessary zig-zag sequence is generated automatically in the GAGs. The run length and the Huffmann coding is done in one step in the rALU. Figure 6 gives an overview of the different processing steps of the JPEG algorithm.

(2) $MATRIX\; (not\; implemented)\; =\; MATRIX\; (not\; implemented)\xb7MATRIX\; (not\; implemented)$

The YUV conversion requires three Generic Address Generators (GAGs), see figure 8. The first reads the image in its original colour model from its local memory. The values are read in a simple video-scan pattern. In figure 8 this pattern is divided into 64-word linear segments only to visualize the relationship between such a 64-word segment and an eight-by-eight output block. The rALU performs the matrix multiplications to compute the YUV representation. The equations for an RGB to YUV conversion are shown in (2). Since the computations are quite simple, it would be possible to perform some filter operations on the input image before, resulting in a higher utilisation of the rALU resources. Furthermore, the rALU passes the computed Y, U and V values to the memory interfaces on the two following boards. On these boards, two other GAGs write the results back to their local memory. The first of these GAGs writes the Y values, using a block storage scheme. That is every 64 values are written to a block of eight lines of four 32-bit words each. Each 32-bit word contains two Y values of 16 bits, which is sufficient, since the following computations are done on 11-bit signed integers. The blocks are written line by line, then the next block follows with its first line. The third GAG performs a slightly modified scan pattern. Since the Y, U and V values are sampled at relative rates of 4:2:2, only half as many U values are produced compared to the number of Y values. The same is true for the V values. Therefore both the U and the V values are written to the local memory of the third GAG. The first value taken from the rALU is a U value, which is written according to the same block storage scheme as is applied to the Y values. The second value is a V value, which is stored with an offset of four 32-bit words to the right of the previously written U value, without continuing in the block storage scheme. That way two adjacent blocks of U and V values are written in an interleaved fashion while at the same time two blocks of Y values are written consecutively by the second GAG. After writing two blocks of U and V values, the third GAG skips the area of the block of V values, before starting with the next sequence of two U and V blocks.

At the end of the YUV conversion step, the whole image has been converted and distributed across two separately accessible memory areas. The Y, U and V values are stored in two-dimensional blocks of eight by eight values, which is exactly the data structure required for the two-dimensional DCT specified in the JPEG standard. The following computations can be accelerated further, if a total of seven boards with GAGs, memory and a rALU subnet are available. Three lines of the input image could be processed in parallel then. The first GAG would read three lines of the image in parallel. The rALU would distribute the first line to the second and the third board, like in the configuration above. The Y, U and V results corresponding to the second line of the input image would be stored on the fourth and fifth board, while the third line would go to the sixth and seventh board. Instead of computing the DCT only on the second and third board for the Y and the U/V values, respectively, a total of six boards could operate in parallel, resulting in a speed-up factor of three.

The memory address sequence in the Generic Address Generators corresponds to the rALU pipeline in a way that two rows of four packed words are read from memory before the first packed results are written back. After eight rows have been read, the corresponding results are taken form the pipeline before the next eight-by-eight block is read. The results are written back in a quite complicated pattern, because both the row and column exchange and the bit-reversal addressing mode are combined. Figure 9 gives an idea of the read and write sequence. The inputs are read from top to bottom and from right to left from the positions corresponding to the values on top of the packed words. To obtain the position in the row, the values have to be divided by two and taken modulo 4. The outputs are written from top to bottom and from right to left to the first column of the row which is indicated by the value on top of the packed words.

The configuration of the rALU to compute the 1d-DCT is shown in more detail (including pre-and post-processing) in figure 10. For simplicity only the used communication paths of the 8 by 16 rDPA array are shown here. The implementation is based on Lee's algorithm [10]. The original flowgraph and the corresponding sparse matrix factors can be found in [11]. The split operators (s) and the merge operators (m) are the same as in figure 9. Two merge operators in sequence are used as kind of FIFO for storing the results which flush out of the pipeline. The multiplications with the constants are marked with a constant operator (k). The values can be found in [11]. The temporary variables (t#) which are produced in the DPUs are routed via the local register file of the rALU control to other DPU input registers.

(3) $INDEXES\; (not\; implemented)(u,v)\; =\; \{[(F(u,v)+Q(u,v)\; OVER\; 2)DIV\; F(u,v)]\xb7F(u,v);\; for\; F(u,\; v)\; \xb3\; 0\; ATOP\; [(F(u,v)+-Q(u,v)\; OVER\; 2)DIV\; F(u,v)]\xb7F(u,v);\; for\; F(u,\; v)\; <\; 0$

where F(u, v) is the original DCT coefficient, Q(u, v) is the quantisation constant and F*Q*(u, v) is the quantized DCT coefficient.

In each 32-bit word of one local memory segment two values of the Y component are packed. In a second memory segment one value of the U and one of the V component are packed. For each packed value the quantisation is done in two DPUs. The first DPU reads the packed two data words, unpacks them and writes one word to the second DPU. The first one computes the quantisation as noted in equation (3) and writes the quantized value to the second DPU. The second DPU starts with the quantisation and finishes with packing the two quantized values. These values will be written back to the local memory. For the 64 coefficients an array of 8 by 8 DPUs is required per subnet. The quantisation coefficients are directly configured as constants into the rDPA array. In figure 11 the quantisation is shown for two packed input words. The handle address generators of the GAGs perform a video scan over the beginnings of the blocks and the memory address generators produce the addresses of the packed data words.

(4) $INDEXES\; (not\; implemented)(Deltai,Deltaj)=SUM\; SUM\; |INDEXES\; (not\; implemented)(k,l)+-INDEXES\; (not\; implemented)(k+Deltai,l+Deltaj)|$

where (k,l) are the pixel coordinates within a macroblock and Æi and Æj are the components of the vector corresponding to the compared macroblocks. The motion vector (Æi,Æj)*opt* is determined by searching for a global minimum over all these distance measures within the search area.

(5) $INDEXES\; (not\; implemented)\; =\; INDEXES\; (not\; implemented)(Deltai,Deltaj)\; ;\; all\; Deltai,Deltaj$

In order to increase the performance of this application, it is possible to divide the pictures into overlapping slices and to work with several GAGs and rALUs in parallel. So the Xputer can be well used as an off-line MPEG-encoder.

The prototype implementation Map-oriented Machine 3 has been completely specified with the hardware description language Verilog. The Generic Address Generators, the MoM controller and the programmable control chip for the rDPA have already been fabricated with standard cells during the European EUROCHIP project. An rDPA array will be submitted for fabrication soon. The programming environment has been specified and is currently being implemented on Sun SPARCstations.

2. A. Ast, R. W. Hartenstein, H. Reinig, K. Schmidt, M. Weber: A General purpose Xputer Architecture derived from DSP and Image Processing; in M. A. Bayoumi (Ed.): VLSI Design Methodologies for Digital Signal Processing Architectures; Kluwer Academic Publishers, Boston, London, Dortrecht, pp. 365-394, 1994

3. J. Buck: Hilfreicher Verlust; mc MicroComputer, pp. 94-101, Juni 1993

4. J. Buck: Komprimierte Bewegung; mc MicroComputer, pp. 114-123, April 1994

5. D. Bursky: Codec compresses images in real time; Electronic Design, vol. 41, no. 20, pp.123-124, Oct. 1993

6. P. Dessarte, B. Macq, D. T. Slock: Signal-Adapted Multiresolution Transfrom for Image Coding; IEEE Transactions on Information Technology, vol. 38, no. 2, pp. 897-904, March 1992

7. R. W. Hartenstein, A. G. Hirschbiel, M. Riedmüller, K. Schmidt, M. Weber: A Novel ASIC Design Approach Based on a New Machine Paradigm; IEEE Journal of Solid-State Circuits, Vol. 26, No. 7, July 1991

8. R. W. Hartenstein, A. G. Hirschbiel, K. Schmidt, M. Weber: A novel paradigm of parallel computation and its use to implement simple high-performance hardware; North Holland, Future Generation Computer Systems 7, pp. 181-198, 1991/92

9. R. W. Hartenstein, R. Kress, H. Reinig: A Reconfigurable Data-Driven ALU for Xputers; Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines, FCCM'94, Napa, CA, April 1994

10. B. G. Lee: A new algorithm for the discrete cosine transform; IEEE Transactins on Acoustic, Speech, and Signal Processing, vol. ASSP-32, pp. 1243-1245, Dec. 1984

11. K. R. Rao, P. Yip: Discrete Cosinus Transform: Algorithms, Advantages, Applications: Academic Press, Inc., Boston, 1990

12. A. Razavi, I. Shenberg, D. Seltz, D. Fronczak: A High Performance JPEG Image Compression Chip Set for Multimedia Applications; Proceedings of the SPIE - The International Society for Optical Engineering, vol.1903, p.165-74, 1993

13. K. Schmidt: A Restructuring Compiler for Xputers; Ph. D. Thesis, University of Kaiserslautern, 1994

14. G. K. Wallace: The JPEG Still Picture Compression Standard; IEEE Transactions on Consumer Electronics, Dec. 1991

For a printed version, please contact abakus@informatik.uni-kl.de