In the late eighties we had problems
in publishing about Xputers. Although many submissions have been
rejected,
in 1989 we succeeded in publishing our first speed-up data - our
answers
to questions about the usefulness of the Xputer paradigm. These answers
sometimes have been questioned. Some people did not believe in our
performance
figures.
Usually a conference time slot is too short for such a highly
innovative
topic, since most of the audience did not have the
background to understand the key issues. Our experience has been,
that
about 90 minutes have been needed to convince people, that our
performance
figures are realistic. These 90 minutes we have got by colloquium talks
followed by a long discussion. Even better is a tutorial about Xputers.
These FAQ und FQA pages intend to be a kind of interactive fast mini
tutorial
for the Web surfer.
Speed-up Factors by Xputers
Acceleration by multiple sequencersWhy Xputers are more powerful than Software to Hardware Migration
Optimizing Compilation Techniques for Xputers
Xputers and the Memory Communication Gap
Inalmost ten years we have implemented three different generations of Xputer architectures: the MoM-1, MoM-2, and the MoM3 (also see the section section about sequencers). Our first performance figures, from the MoM-1, have been published in 1989, showing the speed-up against Motorola 68020 by using a MoM-1 Xputer (see following table). MoM-1 is our first generation Xputer architecture. MoM stands for "Map-oriented Machine".
| |
|
|
|||
| |
|
|
|
||
| |
|
||||
| |
|
|
|||
| |
|
|
|||
| |
|
|
|
||
| |
|
|
|
||
|
|
|
||||
By the way, the complexity of this grid-based design rule check goes linear with chip area. This is substantially better than with conventional design rule check algorithms, which need complex divide and conquer strategies to obtain halfways reasonable computation speed. We have also investigated the influence of the number of sequencers used simulatiously on the performance (see table no. 2). In a 10 by 10 matrix vector multiplication the use of two sequencers has brought a speed-up by more than an order of magnitude, compared to the version with only a single sequencer.
| |
|
|
|
||
| |
|
|
|
|
|
|
|
|
|
|||
|
|
|
|
|
||
Most of the acceleration has been achieved due to computing 800 boolean equations in parallel within a single clock cycle. The optimized 68020 version of the algorithm skips certain boolean equation computations depending on results of other boolean equaltions. Such an optimization does not make sense for the MoM version. That's why comparing to the non-optimized 68020 algorithm version is more fair, where the acceleration factor is greater than fifteen thousand (se table above). Another source of speed-up is avoiding address computation overhead by the Xputer's data sequencer. In the design rule check example this contributes more than an order of magnitude (see section on sequencers).
A couple of years later we have experimented with two different versions of the MoM-3 architecture (the MoM-3, and the faster MoM-3NT, where NT stands for "newer technology"). Table 3 shows the results for the core of the JPEG algorithm (a multi media application). Also this time the performance can be improcved by using more sequencers.
| |
|
|
|
||
| |
|
|
|
||
| |
|
|
|||
| |
|
|
|||
| |
|
|
|||
| |
|
|
|
||
| |
|
|
|
||
The Ising model (chemistry, molecular
biology)
is used for the analysis of phase transitions by explaining how
short-range
interactions between components of a large structure (e.g. molecules in
a crystal) give rise to a large-range, correlating behaviour, for
predicting
the potential for a phase transition.
The use of an Xputer as an accelerator co-processor also means (1) software-to-hardware migration, but not only. Most frequently used loops and their expression body is moved onto the rALU. But only with Xputers more speed-up phenomena are available: (2) almost complete avoidance of addressing overhead by migration into the sequencer(s) (not into the rALU!), and, (3) run time to compile time migration, where such is not possible on parallel von Neumann platforms. Parallelism in Xputers does not suffer from the fine granularity switching explosion, like parallel (von Neumann) computers, since most of the communication in Xputer systems is defined by the compilation method. Table 5 shows the run time addressing overhead for von Neumann implementations of several algorithms. For example, the grid-based design rule check [5] [6] has a high addressing overhead. It uses 92% of somputation time for addressing. By migration into a data sequencer due to Xputer use this yields a speed-up of more than ten.
< > Table 5. Comparative Overhead Analysis:
algorithm (on VAX-11/750) data manipulation overhead addressing control flow design rule check 7% 93% <1% digital filter 28% 58% 14% Lee router seek starting point 14% 74% 12% wavefront propagation 6% 92% 6% backtracking 17% 67% 17%
....
sorry: we are still working at this section
....
sorry: we are still working at this section < ^If yes, please, inform our webmaster. Our goal is the steady improvement of this list of questions.
© Copyright 1996, Universitaet Kaiserslautern, Kaiserslautern, Germany ---- Webmaster