FAQ on Xputers - Xputer Lab Kaiserslautern - Reconfigurable Computing with KressArray
 

FQA about Xputers:

Frequently Questioned Answers about Xputers

In the late eighties we had problems publishing about Xputers. Although many submissions were rejected, in 1989 we succeeded in publishing our first speed-up data: our answers to questions about the usefulness of the Xputer paradigm. These answers have sometimes been questioned; some people did not believe our performance figures.

Usually a conference time slot is too short for such a highly innovative topic, since most of the audience does not have the background to understand the key issues. Our experience has been that about 90 minutes are needed to convince people that our performance figures are realistic. We have obtained these 90 minutes through colloquium talks followed by a long discussion. Even better is a tutorial about Xputers. These FAQ and FQA pages are intended to be a kind of fast interactive mini tutorial for the Web surfer.

Contents

Speed-up Factors by Xputers
Acceleration by multiple sequencers
Why Xputers are more powerful than Software to Hardware Migration
Optimizing Compilation Techniques for Xputers
Xputers and the Memory Communication Gap


Speed-up Factors by Xputers

In almost ten years we have implemented three generations of Xputer architectures: the MoM-1, the MoM-2, and the MoM-3 (also see the section about sequencers). Our first performance figures, obtained with the MoM-1, were published in 1989 and show the speed-up of the MoM-1 Xputer over a Motorola 68020 (see the following table). The MoM-1 is our first-generation Xputer architecture; MoM stands for "Map-oriented Machine".

Table 1. Speed-up by MoM-1:

Application (MoM-1 versus Motorola 68020)                          | speed-up factor | published
CMOS design rule check (grid-based), optimized 68020 version       | >2,000          | [1] [5] [6] [6b]
CMOS design rule check, equivalent algorithm,
  non-optimized 68020 version                                      | >15,000         |
grid-based electrical rule check                                   | >300            | [1] [5] [6] [6b]
Lee routing (grid-based)                                           | >160            | [1] [5] [6] [6b]
The grid-based design rule check is based on pattern matching. It is carried out by a single video scan over the entire layout pixel map, where a 4-by-4 scan window is moved in single-pixel steps. For a single-metal, single-poly CMOS process (Fraunhofer IMS, Duisburg, Germany) 800 reference patterns have been linked to this 4-by-4 scan window. For Mead-&-Conway nMOS design rules only 256 reference patterns were needed.
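To make the single-video-scan idea concrete, here is a minimal software sketch. It assumes a one-bit-per-pixel layout map and 16-bit reference patterns (real CMOS layouts carry several layers per pixel), and names such as design_rule_check and window_key are our own illustration, not the MoM-1 implementation. Note also that the MoM-1 matches all reference patterns in parallel in the rALU, whereas this C model checks them one after another.

/*
 * Minimal sketch of the grid-based design rule check described above,
 * assuming one bit per pixel and 16-bit reference patterns.  The MoM-1
 * evaluates all reference patterns in parallel in hardware; this software
 * model checks them sequentially.
 */
#include <stddef.h>
#include <stdint.h>

#define WIDTH  1024
#define HEIGHT 1024

typedef uint8_t PixelMap[HEIGHT][WIDTH];       /* 0 or 1 per pixel */

/* Pack the 4-by-4 window whose upper left corner is (x, y) into 16 bits. */
static uint16_t window_key(PixelMap map, size_t x, size_t y)
{
    uint16_t key = 0;
    for (size_t dy = 0; dy < 4; dy++)
        for (size_t dx = 0; dx < 4; dx++)
            key = (uint16_t)((key << 1) | (map[y + dy][x + dx] & 1u));
    return key;
}

/* Single video scan over the whole pixel map, moving the window in
 * single-pixel steps; count windows that match any reference pattern. */
size_t design_rule_check(PixelMap map, const uint16_t *ref, size_t n_ref)
{
    size_t violations = 0;
    for (size_t y = 0; y + 4 <= HEIGHT; y++)
        for (size_t x = 0; x + 4 <= WIDTH; x++) {
            uint16_t key = window_key(map, x, y);
            for (size_t i = 0; i < n_ref; i++)
                if (key == ref[i]) { violations++; break; }
        }
    return violations;
}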

Table 2. Speed-up by Multiple Scan Windows:

Application (MoM-1 vs Motorola 68020)   | number of scan windows | speed-up factor | published
10-by-10 matrix-vector multiplication   | 1                      | >9              | [2]
10-by-10 matrix-vector multiplication   | 2                      | 150             |

Acceleration by Multiple Sequencers

Incidentally, the complexity of this grid-based design rule check grows linearly with chip area. This is substantially better than conventional design rule check algorithms, which need complex divide-and-conquer strategies to obtain halfway reasonable computation speed. We have also investigated how the number of simultaneously used sequencers influences performance (see Table 2). For the 10-by-10 matrix-vector multiplication, using two sequencers brought a speed-up of more than an order of magnitude compared to the version with only a single sequencer.
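The following minimal sketch illustrates how the matrix-vector multiplication of Table 2 can be split between two scan windows, each driven by its own sequencer. In this software model the two "sequencers" run one after the other; on the Xputer they run concurrently, which is where the additional speed-up comes from. The names (scan_rows, matvec_two_scan_windows) are illustrative assumptions, not the MoM-1 programming interface.

/*
 * Sketch: splitting the 10-by-10 matrix-vector multiplication of Table 2
 * between two scan windows.  On the Xputer the two halves are processed
 * concurrently by two data sequencers; here they run sequentially.
 */
#define N 10

static void scan_rows(double a[N][N], const double x[N], double y[N],
                      int first_row, int last_row)
{
    for (int i = first_row; i < last_row; i++) {
        double acc = 0.0;
        for (int j = 0; j < N; j++)
            acc += a[i][j] * x[j];      /* the expression body mapped onto the rALU */
        y[i] = acc;
    }
}

void matvec_two_scan_windows(double a[N][N], const double x[N], double y[N])
{
    scan_rows(a, x, y, 0, N / 2);       /* scan window driven by sequencer 1 */
    scan_rows(a, x, y, N / 2, N);       /* scan window driven by sequencer 2 */
}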

Table 3. Speed-up of MoM-3 versus SPARC 10/51:

Application (MoM-3 vs SPARC 10/51) | comparison | no. of scan windows | speed-up factor | published
complete JPEG compression | MoM-3 execution times (without reconfiguration times) compared to user-process execution times of the Sun SPARC 10/51 | 3 | 3.5 | [8]
complete JPEG compression | same comparison as above | 7 | 10.2 (table 2) | [8]
complete JPEG compression | new present technology assumed for the MoM-3 hardware; MoM-3 execution time (without reconfiguration time) compared to user- plus system-process execution times of the Sun SPARC 10/51 | 7 | 24 (table 10) | [12]

Most of the acceleration has been achieved by computing 800 boolean equations in parallel within a single clock cycle. The optimized 68020 version of the algorithm skips certain boolean equation computations depending on the results of other boolean equations. Such an optimization does not make sense for the MoM version. That is why the comparison with the non-optimized 68020 version, where the acceleration factor is greater than fifteen thousand (see the table above), is fairer. Another source of speed-up is the avoidance of address computation overhead by the Xputer's data sequencer. In the design rule check example this contributes more than an order of magnitude (see the section on sequencers).
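The sketch below illustrates why the comparison against the non-optimized 68020 version is the fairer one: the "optimized" software version skips later equations when earlier results already decide the outcome, while in the rALU all equations form one combinational network that is evaluated within a single clock cycle anyway, so such skipping gains nothing there. The predicates eq0 to eq2 are placeholders, not the real rule equations.

/*
 * Sketch contrasting the two 68020 reference versions.  Both functions
 * compute the same result; only the evaluation strategy differs.
 */
#include <stdbool.h>

static bool eq0(unsigned w) { return (w & 0x000Fu) == 0x000Fu; }
static bool eq1(unsigned w) { return (w & 0x00F0u) != 0;       }
static bool eq2(unsigned w) { return (w & 0x0F00u) == 0;       }

/* Optimized software version: dependent skipping of equations. */
bool violation_optimized(unsigned window)
{
    if (!eq0(window)) return false;     /* eq1 and eq2 are never computed */
    if (!eq1(window)) return false;
    return eq2(window);
}

/* Non-optimized version: every equation is evaluated, as the rALU does. */
bool violation_all(unsigned window)
{
    bool a = eq0(window), b = eq1(window), c = eq2(window);
    return a && b && c;
}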

MoM-3 Performance versus SPARC 10/51

A couple of years later we experimented with two different versions of the MoM-3 architecture (the MoM-3 and the faster MoM-3NT, where NT stands for "newer technology"). Table 3 shows the results for the core of the JPEG algorithm (a multimedia application). Again, the performance can be improved by using more sequencers.

Table 4. MoM-3 speed-up in Multimedia and DSP:

Application (MoM-3 vs Sun SPARC 10/51, 50 MHz, 192 MB RAM) | different details | speed-up factor | published
complete JPEG compression | MoM-3 (with 7 scan windows) execution times (without reconfiguration times) compared to user-process execution times of the Sun SPARC 10/51 | 10.2 | [8]
complete JPEG compression | MoM-3 execution time (without reconfiguration time) compared to user- plus system-process execution times of the Sun SPARC 10/51 | 11.8 | [9]
complete JPEG compression | MoM-3 execution plus reconfiguration times compared to user- plus system-process execution times of the Sun SPARC 10/51 | 7.3 | [10] [14]
complete JPEG compression | new present technology assumed for the MoM-3 hardware; MoM-3 execution time (without reconfiguration time) compared to user- plus system-process execution times of the Sun SPARC 10/51 | 24 | [12]
2-dimensional FIR filter, 3-by-3 kernel, 1280x1024 8-bit pixels | MoM-3 execution plus reconfiguration times compared to user-process time of the Sun SPARC 10/51 | 39.4 | [11]
Ising, 128 lattice, 1000 iterations | MoM-3 execution time compared to user-process time of the Sun SPARC 10/51 | 53.8 | [14]

Why Xputers are more powerful than Software to Hardware Migration

The use of an Xputer as an accelerator co-processor also means (1) software-to-hardware migration, but not only that: the most frequently used loops and their expression bodies are moved onto the rALU. Only with Xputers, however, are further sources of speed-up available: (2) almost complete avoidance of addressing overhead by migration into the sequencer(s) (not into the rALU!), and (3) run-time to compile-time migration where such migration is not possible on parallel von Neumann platforms. Parallelism in Xputers does not suffer from the fine-granularity switching explosion of parallel (von Neumann) computers, since most of the communication in Xputer systems is defined by the compilation method. Table 5 shows the run-time addressing overhead of von Neumann implementations of several algorithms. For example, the grid-based design rule check [5] [6] has a high addressing overhead: it spends 92% of its computation time on addressing. Migrating this addressing into a data sequencer, as the Xputer does, yields a speed-up of more than ten (see the sketch after Table 5).

Table 5. Comparative Overhead Analysis:

algorithm (on VAX-11/750)          | data manipulation | addressing overhead | control-flow overhead
design rule check                  | 7%                | 93%                 | <1%
digital filter                     | 28%               | 58%                 | 14%
Lee router: seek starting point    | 14%               | 74%                 | 12%
Lee router: wavefront propagation  | 6%                | 92%                 | 6%
Lee router: backtracking           | 17%               | 67%                 | 17%
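The sketch below illustrates what migrating the addressing overhead into a data sequencer means. A generic video-scan address pattern is fully described by step/count parameters known at compile time, so the data path (the visit callback) never performs any address arithmetic itself. The ScanDim struct and the video_scan function are illustrative assumptions, not the MoM hardware or programming interface.

/*
 * Sketch of "migrating addressing overhead into the sequencer": the scan
 * pattern parameters are fixed at compile time and the sequencer model
 * emits a stream of ready data-word addresses in scan order.
 */
#include <stddef.h>

typedef struct {
    size_t step;    /* address increment per move, in elements   */
    size_t count;   /* number of data words along this dimension */
} ScanDim;

/* Walk a 2-D video scan over a row-major array and hand every visited
 * data word to visit(); no address arithmetic remains in the callback. */
void video_scan(char *base, size_t elem_size,
                ScanDim outer, ScanDim inner,
                void (*visit)(void *word))
{
    for (size_t y = 0; y < outer.count; y++) {
        char *p = base + y * outer.step * elem_size;
        for (size_t x = 0; x < inner.count; x++, p += inner.step * elem_size)
            visit(p);
    }
}

/* Example use: scan a 1024x1024 int map row by row:
 *   ScanDim rows = { 1024, 1024 }, cols = { 1, 1024 };
 *   video_scan((char *)map, sizeof(int), rows, cols, process_word);
 */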

Optimizing Compilation Techniques for Xputers

.... sorry: we are still working on this section

Xputers and the Memory Communication Gap

.... sorry: we are still working on this section

Literature

[1] R. Hartenstein, A. Hirschbiel, M. Weber: MoM - a partly custom-design architecture compared to standard hardware; IEEE CompEuro '89, Hamburg, Germany; IEEE Press, 1989, p. 5/7-9
[2] R. Hartenstein, A. Hirschbiel, M. Weber: Rekonfigurierbare ALU erlaubt Parallelisierung auf unterster Ebene (Reconfigurable ALU enables parallelization at the lowest level); VMEbus, 1990
[3] R. Hartenstein, A. Hirschbiel, M. Weber: Xputers - An Open Family of Non von Neumann Architectures; Proc. of 11th ITG/GI-Conf. Architektur von Rechensystemen, VDE-Verlag, 1990
[4] R. Hartenstein, A. Hirschbiel, M. Weber: The Machine Paradigm of Xputers and its Application to Digital Signal Processing Acceleration; 1990 Int'l Conference on Parallel Processing, St. Charles, Illinois, 1990
[5] R. Hartenstein, A. Hirschbiel, M. Riedmüller, K. Schmidt, M. Weber: Automatic Synthesis of Cheap Hardware Accelerators for Signal Processing and Image Preprocessing; 12. DAGM-Symposium Mustererkennung (Pattern Recognition), Oberkochen-Aalen, Germany, 1990
---- Best Paper and Best Presentation Award: DM 1000.-- Speaker: M. Weber ----
[6] R. Hartenstein, A. Hirschbiel, M. Weber: A Novel Paradigm of Parallel Computation and its Use to Implement Simple High Performance Hardware; InfoJapan'90 - Int'l Conf. commemorating the 30th Anniversary of the Computer Society of Japan, Tokyo, Japan, 1990
[6b] R. Hartenstein, A. Hirschbiel, M. Weber: A Novel Paradigm of Parallel Computation and its Use to Implement Simple High Performance Hardware; Future Generation Computer Systems, no. 7, pp. 181-198 (North-Holland Publishing Co., 1991/92)
[7] R. Hartenstein, R. Kress, H. Reinig: A Dynamically Reconfigurable Wavefront Array Architecture for Evaluation of Expressions; Proc. ASAP'94, Int'l Conf. on Application-Specific Array Processors, San Francisco, Aug. 1994, IEEE CS Press, 1994
[8] R. Hartenstein, J. Becker, R. Kress, H. Reinig, K. Schmidt: A Reconfigurable Machine for Applications in Image and Video Compression; Int'l Conf. on Compression Technologies and Standards for Image and Video Compression, Amsterdam, Holland, March 1995
[9] R. W. Hartenstein, R. Kress, H. Reinig: A Reconfigurable Accelerator for 32-Bit Arithmetic; International Parallel Processing Symposium, Santa Barbara, USA, April 1995
[10] J. Becker, R. Hartenstein, R. Kress, H. Reinig: High-Performance Computing Using a Reconfigurable Accelerator; Proceedings of Workshop on High Performance Computing, Montreal, Canada, July 1995
[11] R. Hartenstein, R. Kress: A Scalable, Parallel, and Reconfigurable Datapath Architecture; Sixth International Symposium on IC Technology, Systems & Applications, ISIC'95, Singapore, Sept. 6-8, 1995
[12] R. Hartenstein (opening keynote): Custom Computing Machines - An Overview; Int'l Workshop on Design Methodologies for Microelectronics, Smolenice Castle, Slovakia, September 1995
[13] J. Becker, R. Hartenstein, R. Kress, H. Reinig: A Reconfigurable Parallel Architecture to Accelerate Scientific Computation; Proc. of Int. Conf. on High Performance Computing, New Delhi, India, Dec. 1995
[14] R. Hartenstein, J. Becker, R. Kress, H. Reinig: High-Performance Computing Using a Reconfigurable Accelerator; CPE Journal, Special Issue of Concurrency: Practice and Experience, John Wiley & Sons Ltd., 1996
[15] R. Hartenstein, J. Becker, R. Kress: Custom Computing Machines versus Hardware/Software Co-Design - a globalized point of view; in [16]
[16] R. Hartenstein, M. Glesner: Field-Programmable Logic, Smart Applications, New Paradigms and Compilers; Springer Verlag, Lecture Notes in Computer Science 1142, 1996


Do you have more questions? Do you have better questions?

If yes, please inform our webmaster. Our goal is the steady improvement of this list of questions.


© Copyright 1996, Universitaet Kaiserslautern, Kaiserslautern, Germany ---- Webmaster