
Configurable Computing Systems

Please check the HICSS Architecture Web Page ASAP if you intend to attend the workshop!

HICSS 30 Architecture Track

Wednesday January 8, 1997

Bill Mangione-Smith

University of California at Los Angeles

billms@icsl.ucla.edu

Session Description

Configurable computing systems combine high density FPGAs with processors to achieve the best of both worlds: customized digital circuit accelerators that are responsive to dynamic events. A number of different models have been proposed for configurable computing, including off-board data pumps, co-processors, configurable function units on the processor datapath, and configurable datapaths. Each approach provides a different set of strengths and weaknesses, along with a different model of computation. The task force on Configurable Computing will be discussing the critical impediments which are currently limiting the use of these systems: computing models (architectural abstractions), runtime support for optimization and reconfiguration, driving applications, and FPGA technology (density, configuration time and clock rates). Further information is available from Bill Mangione-Smith (billms@ucla.edu) or at http://www.icsl.ucla.edu/~billms/hicss97.

Position Papers

Configurable Computing: Concepts and Issues

William Mangione-Smith

Electrical Engineering

University of California at Los Angeles

billms@ucla.edu

Introduction

Configurable computing is an area of active research that has sprung up over the last several years. By combining aspects of traditional computing, such as high performance microprocessors and commodity memory devices, with programmable hardware devices, configurable computing attempts to gain the benefits of both adaptive software and optimized hardware. Because the field is young, it remains ill-defined and often either misunderstood or misrepresented. Fortunately, there currently exist a small number of applications which show significant improvement through the use of configurable computing technology, and these have sparked much of the interest in the field.

This brief paper will touch on some of the different types of configurable computing systems that have been either proposed or developed, consider applications that seem to be well suited to this approach, and discuss the open issues in tool design.

Configuration Scenarios

Thus far, the research community seems to agree on exactly one characteristic which defines configurable computing: programmable hardware. In most cases field programmable gate arrays (FPGAs) are used to provide this capability, though in some cases programmable switches are used to configure interconnect hardware.

A number of different application scenarios have been proposed for configurable computing, and one way to characterize the differences is to consider the rate of configuration. The key characteristic is that some amount of information is semi-static, i.e., changing frequently enough to require programmable hardware but slowly enough to provide the opportunity to improve performance through customized hardware.

Situational

Situation-based configuration involves hardware changes at a relatively slow rate, for example on the scale of hours, days or weeks. This classification trivially subsumes the case where the new configuration provides a bug fix. The examples below illustrate two specific applications of situational configuration that have been proposed by the research community:

  1. Data encryption with pseudo-static keys, where the cipher circuitry is specialized to the current key and reloaded whenever the key changes.
  2. Template-based automatic target recognition, where correlation hardware is customized to the particular set of sparsely populated templates currently in use.

While the first scenario involves algorithmic specialization, the second deals with data-specific optimization. The author has been involved with an effort to produce a high performance system for template-based Automatic Target Recognition. This effort involves generating highly customized adder trees which are used for executing two-dimensional image correlation. The image source is a Synthetic Aperture Radar (SAR), which produces images representing radar reflection and absorption rather than a direct optical image. The SAR templates are approximately 25% populated, and so the customized adder trees are better than a general purpose correlator because they require less hardware and have shorter critical paths. By using configurable hardware to implement multiple data-specific adder trees in sequence, the system is faster and consumes less power than either a general purpose programmable part or a general purpose correlator. While it is true that the same technique could be used to produce a sequence of ASICs which are each customized to a set of templates, the resulting system would require far too many ASICs to be practical.
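
To make the idea concrete, the following sketch models in software what the customized hardware computes: with a sparsely populated template, the correlation sum touches only the populated positions, which is exactly what the specialized adder tree exploits. The template coordinates, image size and C code here are illustrative assumptions, not details of the actual ATR system.

    /* Software model of template-specific correlation (illustration only; the
     * actual ATR system synthesizes a customized adder tree in the FPGA fabric).
     * Template positions and image size below are hypothetical. */
    #include <stdio.h>

    #define IMG_W 8
    #define IMG_H 8
    #define N_TAPS 4   /* sparse template: only these positions are populated */

    static const int tap_row[N_TAPS] = {0, 1, 3, 6};
    static const int tap_col[N_TAPS] = {2, 5, 1, 4};

    /* Correlate one image window against the sparse template: the sum involves
     * only N_TAPS additions instead of IMG_W*IMG_H for a general correlator,
     * which is why the customized adder tree is smaller and faster. */
    static int correlate(const unsigned char img[IMG_H][IMG_W])
    {
        int sum = 0;
        for (int i = 0; i < N_TAPS; i++)
            sum += img[tap_row[i]][tap_col[i]];
        return sum;
    }

    int main(void)
    {
        unsigned char img[IMG_H][IMG_W] = {{0}};
        img[0][2] = 200;   /* a strong radar return at a populated template position */
        printf("correlation score = %d\n", correlate(img));
        return 0;
    }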

Time-Sharing

A second use for configurable computing is straightforward time-sharing of the programmable hardware. If a particular application lends itself to a pipelined hardware implementation, it may be possible to implement each phase on the programmable part in sequence. An FPGA from National Semiconductor was used at UCLA to achieve a four-fold reduction in hardware through round-robin reconfiguration for a video image transmission system. The device managed data acquisition, image quantization, wavelet coding and finally modem transfer. A more ambitious use of the same part has appeared in the DISC work at BYU. Hutchings and his students compile C code fragments to assembly language for the Sun SPARC as well as activations of an FPGA accelerator. The FPGA circuitry is able to tell whether the desired circuit is currently loaded, and executes a demand-driven fetch and reload in the case of a fault.
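
The demand-driven behavior described above can be pictured as a cache of circuits. The sketch below is a schematic model of that idea, not BYU's actual DISC implementation; the circuit names and the notion of a single resident circuit are simplifying assumptions.

    /* Schematic sketch of demand-driven circuit loading (inspired by, but not
     * taken from, the DISC work): the accelerator tracks which circuit is
     * resident and reconfigures only on a "circuit fault". Names are hypothetical. */
    #include <stdio.h>

    typedef enum { CIRC_NONE, CIRC_FILTER, CIRC_CORRELATE } circuit_id;

    static circuit_id resident = CIRC_NONE;   /* circuit currently loaded */
    static int reconfigurations = 0;

    static void load_circuit(circuit_id id)
    {
        /* In a real system this would stream a (partial) bitstream to the FPGA. */
        resident = id;
        reconfigurations++;
    }

    static void invoke(circuit_id id)
    {
        if (resident != id)      /* circuit fault: fetch and reload on demand */
            load_circuit(id);
        /* ... execute the operation on the now-resident circuit ... */
    }

    int main(void)
    {
        invoke(CIRC_FILTER);
        invoke(CIRC_FILTER);     /* hit: no reconfiguration */
        invoke(CIRC_CORRELATE);  /* miss: triggers a reload */
        printf("reconfigurations = %d\n", reconfigurations);
        return 0;
    }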

Dynamic Circuit Generation

The most ambitious approach suggested thus far involves dynamic generation of FPGA circuits. This technique has been proposed both for evolutionary systems that exhibit emergent behavior and for more traditional applications involving parameterized macro libraries and some dynamic placement. It seems likely that dynamic circuit generation will provide one of the most compelling uses of the full set of properties available in FPGAs, simply because it inherently excludes the use of even a large number of ASICs as an alternative. However, it has not yet been shown just how broadly this approach can be applied.
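
As a rough illustration of dynamic circuit generation from a parameterized macro, the sketch below emits a shift-and-add structure for multiplying by a constant that is known only at run time. The textual "netlist" format is invented for illustration; a real flow would also have to place and route the generated structure.

    /* Conceptual sketch of dynamic circuit generation: given a run-time
     * constant, emit a shift-and-add structure for multiplying by it.
     * The output format is invented for illustration only. */
    #include <stdio.h>

    static void emit_const_multiplier(unsigned constant)
    {
        printf("/* multiplier by %u: sum of shifted copies of x */\n", constant);
        int term = 0;
        for (unsigned bit = 0; bit < 32; bit++) {
            if (constant & (1u << bit))
                printf("wire t%d = x << %u;\n", term++, bit);
        }
        printf("out = ");
        for (int i = 0; i < term; i++)
            printf("%st%d", i ? " + " : "", i);
        printf(";\n");
    }

    int main(void)
    {
        emit_const_multiplier(10);  /* x*10 = (x<<1) + (x<<3) */
        return 0;
    }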

Configuration Units (Architecture)

FPGA

Thus far, the basic unit of configuration typically used for configurable computing systems has been the commercial FPGA offering. These devices provide simple processing elements, each of which can implement the equivalent of tens of logic gates. Configurable computing with these devices has been focused on CAD development tools, including schematic capture, VHDL and Verilog, and netlist generation and simulation.

Enhanced Elements

A number of researchers have proposed using more powerful elements in the FPGA array. In particular, the introduction of multipliers has attracted a great deal of attention, with the hope of developing high performance configurable computers for more traditional signal processing applications. Other approaches suggested include ALUs with wider data paths as well as larger memories.

"Smart" Processing Elements

Researchers at MIT, Virginia Tech and the University of Washington have suggested placing some of the control flow circuitry in the embedded processing elements. This paradigm represents a fundamental shift away from the circuit view of traditional FPGAs, and moves the devices closer to previous work in systolic array processors. One fundamental problem with this approach is that it cannot completely leverage either the compilation work found in traditional processors or the synthesis tools used for ASIC and FPGA development. On the other hand, it holds out the hope of being more efficient than either approach once sufficient models of computing and tools are developed.

Routing

The final class of resource which can be configured in a configurable computer involves the routing resources. At an important level, existing FPGA architectures already provide this capability: the interconnect between arrays of processing elements is switchable, typically under the control of SRAM cells. However, some researchers have proposed building highly capable reconfigurable interconnect which can adaptively route between fixed (though high performance) logic blocks. This reasoning fits well with the effort to increase support for aggressive DSP applications on these devices. One could imagine a linear array of processing elements which could be subdivided or rerouted according to time-varying system requirements.

Research projects are currently underway to develop effective device architectures supporting each of the forms of configuration mentioned above. Thus far, the vast majority of commercial effort has been invested towards improving traditional FPGA devices along one of the standard paths: density, speed, power and cost.

Applications

Configurable computing architectures will never prove good targets for general purpose computing, in part because of the huge investment in special purpose hardware which is designed to support general software structures. It is worthwhile considering what sort of applications do in fact map well to the existing models of configurable computing.

Embarrassingly Parallel

Because of the regular structure of simple processing elements, configurable computers lend themselves to applications that exhibit embarrassingly large amounts of parallelism. Examples include signal processing and matrix operations. FPGA-based systems are particularly amenable to bit-wide operations, though bit-serial datapaths have also proven effective. Coarser grained structures, such as some of the enhanced processing elements discussed above, lend themselves to wider data paths. These applications also map well to ASIC technology, and so other application characteristics are needed to justify the use of less efficient and more expensive configurable computing devices.
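
As an illustration of the bit-serial datapaths mentioned above, the following sketch models a bit-serial adder in software: one full adder processes one bit per cycle and carries state between cycles, which is the kind of narrow, regular structure that maps naturally onto fine-grained FPGA logic. The C model is illustrative only.

    /* Software model of a bit-serial adder: one full adder, one bit per "cycle". */
    #include <stdio.h>

    static unsigned bit_serial_add(unsigned a, unsigned b, int width)
    {
        unsigned sum = 0, carry = 0;
        for (int cycle = 0; cycle < width; cycle++) {
            unsigned ai = (a >> cycle) & 1u;
            unsigned bi = (b >> cycle) & 1u;
            unsigned s  = ai ^ bi ^ carry;                  /* full-adder sum bit */
            carry       = (ai & bi) | (carry & (ai ^ bi));  /* full-adder carry bit */
            sum |= s << cycle;
        }
        return sum;
    }

    int main(void)
    {
        printf("%u\n", bit_serial_add(13, 7, 8));  /* prints 20 */
        return 0;
    }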

Embarrassingly Customizable

On the other hand, some applications exhibit large opportunities for data-dependent optimization. Examples already mentioned include automatic target recognition for templates which are sparsely populated, and data encryption with pseudo-static encryption keys. These systems do exhibit large amounts of parallelism, but they are mostly interesting because of the work that can be avoided by leveraging runtime data information. This feature makes it impossible for an ASIC-based approach to ever provide comparable performance with comparable resources, even though the underlying problem formulation involves circuit design and a similar set of CAD tools is used for development.

Personal Speculation

Based on experience, I am comfortable drawing the following conclusions:

  1. Configurable computing will never be considered general purpose computing, though it is possible that a configurable computing module will be embedded in a commercial general purpose computing system.
  2. Hardware structures should not become coarser in an attempt to accelerate signal processing tasks. Programmable DSP vendors are striving to put as much performance as possible in their devices, and the market forces are not in our favor if we move too close to their domain. Configurable computing will work best when it attacks problems and uses technology which are clearly distinct from both general purpose computing and traditional DSP.
  3. One or more effective programming models must be developed before configurable computing emerges as an important computing paradigm. Current configurable computing systems are hand crafted, partly because of the tight dependence on low level CAD tools which are tied to vendor architectures, and partly because of the immaturity of the community.
  4. Configurable computing will show a commercial success within the next several years. While the success stories are currently few and far between, and development approaches tend to be ad hoc with little reusability, there are a small number of strongly encouraging successes.

Whither Configurable Computing?

Carl Ebeling

Department of Computer Science and Engineering

University of Washington

Configurable computing has captured the imagination of many architects who want the performance of application-specific hardware combined with the flexibility of general-purpose computers. Despite the efforts of many research groups over the past decade, successes have been rare: Configurable computers so far exhibit poor cost performance for most common applications. To make things worse, configurable computers are notoriously hard to program.

Commercial FPGAs are not well-suited to most applications. These FPGAs are necessarily very fine-grained so they can be used to implement arbitrary circuits, but the overhead of this generality exacts a very high price in density and performance. Compared to general purpose processors (including DSPs), which use very optimized function units that operate in bit-parallel fashion on long data words, FPGAs are very inefficient for performing ordinary arithmetic and logical operations. FPGA-based computing has the advantage only when it comes to complex bit-oriented computations like count-ones, find-first-one or complicated masking and filtering.
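
For concreteness, here are the kinds of bit-oriented operations cited above, written the way a processor would compute them (looping over bits); on an FPGA each reduces to a shallow tree of small logic functions. The code is an illustrative sketch, not a benchmark.

    /* Bit-oriented computations that favor FPGAs: count-ones and find-first-one. */
    #include <stdio.h>

    static int count_ones(unsigned x)
    {
        int n = 0;
        while (x) { n += x & 1u; x >>= 1; }
        return n;
    }

    static int find_first_one(unsigned x)   /* position of least-significant 1, or -1 */
    {
        for (int i = 0; i < 32; i++)
            if (x & (1u << i)) return i;
        return -1;
    }

    int main(void)
    {
        printf("count_ones(0xF0F0) = %d\n", count_ones(0xF0F0));      /* 8 */
        printf("find_first_one(0x48) = %d\n", find_first_one(0x48));  /* 3 */
        return 0;
    }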

Because FPGAs are so fine-grained and general purpose, programming an FPGA-based configurable computer is akin to designing an ASIC. The programmer either uses synthesis tools that deliver poor density and performance, or designs the circuit manually, which requires both intimate knowledge of the configurable architecture and substantial design time. Neither alternative is attractive, especially if the computation itself is relatively uncomplicated and can be described in a few lines of C code.

We are certainly willing to pay some price for configurability, but the question is how large a price we will put up with. The current cost-performance price of configurable computers is a factor of about 100x, and the programming price in terms of expertise and time is many orders of magnitude greater. This combined price is much too high for most applications and most users.

Conclusion 1 - FPGA-based configurable computers will be used only in niche applications where cost is of little concern or that require substantial bit-level data computation. New architectures like the Xilinx 6200 will provide some improvement for some applications, but expecting dramatic improvements is unrealistic. Progress in making configurable computers easier to program will be disappointing.

Conclusion 2 - New configurable computers will appear that are based on new coarse-grained architectures more suitable for conventional arithmetic-intensive tasks. Research examples include the Matrix (MIT) and RaPiD (University of Washington) configurable architectures.

Conclusion 3 - Progress in programming configurable computers will require a coordination between the architecture model of computation, the application domain, and the programming language. One need only look to the successful silicon compilers that have been developed for DSP applications such as Cathedral (IMEC) and Lager (Berkeley) to see the advantage of this approach. Although traditional FPGA-based architectures can benefit from this methodology, the newer coarse-grained architectures can take full advantage of it from the ground up.

Conclusion 4 - Systems will appear that incorporate dynamically programmable components in new and interesting ways that allow conventional computing to be blended with application-specific computing at a fine-grained level. Initial attempts include PRISC (Harvard), DISC (BYU) and Brass (Berkeley).

In summary, future progress in configurable computers will result not from continued research along the same FPGA-based path, but from a diversity of approaches including more coarse-grained configurable architectures and constrained programming models that allow more powerful compilation techniques.

End-to-end Solutions for Reconfigurable Systems: The Programming Gap and Challenges

Krishna V. Palem

Courant Institute

New York University

palem@cs.nyu.edu

Substantial effort has gone into the research and development of hardware "media" that present varying hardware images to the user, from fine-grained to coarse-grained levels. At their finest granularity, these media permit reconfiguration at the level of single bits, and the spectrum extends to systems that reconfigure at the level of individual physical or virtual processors. While the term itself is used pervasively and offers exciting opportunities in all of the above contexts, our focus in this discussion is on devices at the finer levels of granularity. In particular, we are concerned with devices that form a basis for

  1. Defining processors---perhaps dedicated to a single application---which can be reconfigured over time in terms of their instruction sets, word sizes and/or data-path configurations, and
  2. Forming an easily altered communication fabric connecting groups of specialized or COTS processors.

The hardware media that constitute the clay from which different configurations can be molded easily and dynamically range from DPGA technologies at the device level, to reconfigurable meshes at the system level.

Reconfigurable hardware of this sort has found a variety of interesting applications, typically in contexts requiring high throughput and, in particular, well-defined timing behavior. Applications such as ATR and multimedia applications using the MPEG standard all use reconfigurable "glue" at critical points in the computational path. While reconfigurable hardware remains a very desirable choice in all of these application domains, the potential rapid evolution (revolution) is yet to come. A primary barrier in this regard is the absence of programming tools and software support for compiling algorithms implemented in standard and widely used languages such as C onto these hardware platforms.

Current support through VHDL-based synthesis does not come close to providing the level of support that is eventually desirable; that is, the level of specification is too close to hardware concerns. In contrast, application development tends to be centered around algorithmic specifications at much higher levels. In fact, even acceptable "models" of the range of reconfigurable media that a compiler can target are currently lacking. The problem is further compounded by the fact that the compiler must also optimize to take advantage of the potential for reconfiguration, as well as the parallelism that these platforms have to offer.

There are some proposals aimed at targeting public-domain compilers such as GCC at restricted forms of reconfigurable architectures such as the transport-triggered approach. While transport-triggering raises some interesting architectural opportunities, approaches such as these raise two important concerns. First, transport-triggering offers a limited view of reconfigurability, constrained largely by the type of machines that canonical optimizing compiler technology, such as that embodied in GCC, can exploit. Second, it is not even clear that conventional compiler technologies and intermediate representations are adequate for dealing with this situation. Consequently, if we were to consider more ambitious forms of reconfigurability, the nature and concerns of optimizing compiler technology would need substantial research and innovation. This entire challenge is compounded substantially if we add the need to preserve timing expectations in the application, motivated by the rich range of embedded applications in which reconfigurable hardware can play an important role. The depth, breadth and need for possible innovations is very great. In this short position paper, we will highlight some of the more crucial aspects and issues while deferring a more detailed discussion to the workshop.

Optimizing Programming Tools and Compilation Support: Reconfigurable platforms offer several novel opportunities in terms of target hardware, including varying word size, degree and type of instruction-level parallelism, and communication topology. To design efficient optimizing compilers that target "processors" with these properties, we need to revisit and quite possibly rethink:

  1. Issues of intermediate representation (IR) design in the compiler. We perceive the needs of reconfigurable computing as being substantially different from those supported by current IR research and designs, which are aimed at benefiting superscalar and VLIW targets. Emphasizing a good design in this regard is crucial to enabling the optimizations that use it, and can make the difference between success and failure of the resulting compiler.
  2. Phrasing optimization phases in precise ways that capture the essence of improving performance on the target hardware by taking advantage of the reconfigurable parameters. There is substantial work to be done in identifying the types of optimizations, as well as in designing fast algorithms for performing them.
  3. The types of support provided during program development through profiling and debugging.

It will be very useful to try and leverage the wealth of knowledge available in the context of program development for conventional platforms, but it will be crucial to try and understand the different needs of reconfigurable hardware and their impact on eventual performance.

Real-time and Embedded System Support: This area is growing rapidly; while it has substantial potential for using reconfigurable hardware, it also needs programming support in conventional settings. This is especially true when targeting processors with ILP. Conventional compile-time optimizations restructure the program quite dramatically and are not geared to cope with timing constraints in the applications being compiled. Some crucial needs, which also arise in developing embedded applications for reconfigurable targets, are:

  1. Notations for expressing timing constraints that can be integrated into conventional front ends for widely used "host" languages such as C and C++ (a hypothetical sketch of such an add-on notation appears after this list). To really succeed, the notation must be simple and easy to use, and must integrate easily into a range of host languages. We envision the applications themselves being developed in the host language, with the timing constraints as an add-on feature. In particular, proposals that involve changing the host language with special constructs, or that propose new languages, are not likely to gain wide acceptance. Furthermore, the wealth of program development and related technology available for languages such as C can continue to be used under the previous approach.
  2. Innovating crucial global optimizations, such as instruction scheduling, so that they are sensitive to timing constraints. Current approaches do not concern themselves with this issue, and there is substantial room for research in this area.
  3. Understanding models of caching behavior for key applications so that we can have accurate estimates of the timing behavior of instructions. These estimates will prove central to designing the optimization algorithms mentioned above.
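
As an illustration of what such an add-on notation might look like, the sketch below carries a timing constraint in a pragma that an ordinary C compiler can ignore, leaving the host language and its tool chain untouched. The pragma name and syntax are invented here for illustration; they are not the notation developed by the ReaCT-ILP project.

    /* Hypothetical "add-on" timing notation for C: the constraint rides in a
     * pragma, so unaware compilers simply ignore it. Invented for illustration. */
    #include <stdio.h>

    void sample_and_filter(const short *in, short *out, int n)
    {
        /* Constraint: this loop must complete within 40 microseconds. */
        #pragma time_constraint(deadline_us = 40)
        for (int i = 0; i < n; i++)
            out[i] = (short)((in[i] + (i ? in[i - 1] : 0)) / 2);
    }

    int main(void)
    {
        short in[4] = {10, 20, 30, 40}, out[4];
        sample_and_filter(in, out, 4);
        printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);
        return 0;
    }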

The Real-Time Compilation Technologies and ILP (ReaCT-ILP) project at NYU, which this author directs, is addressing several of these issues. Algorithm-specific methods for mapping key applications onto reconfigurable platforms such as meshes have been developed by the USC group, with which we are interacting on developing models and IRs that our compilation technologies can target. They are presenting an independent position paper at this workshop describing these modeling aspects and related innovations.

CONFIGURABLE COMPUTING: HOW TO DELIVER THE PROMISE!

Kiran Bondalapati and Viktor K. Prasanna

Department of EE-Systems, EEB-244

University of Southern California

Los Angeles, CA 90089-2562

Contact Email: prasanna@usc.edu, Tel: +1-213-740-4483

http://www.usc.edu/dept/ceng/prasanna/home.html

Configurable computing ideas are being explored to design high performance systems for many applications. Devices which provide partial reconfigurability of combinational logic are now on the market. Future devices which provide dynamic reconfigurability of both combinational logic and the interconnection network based on intermediate results promise enormous computational power.

To realize the inherent potential of this technology we need algorithmic techniques and tools which exploit the hardware in a non-trivial manner. Characteristics of future devices also need to be explored. Current approaches to designing configurable solutions are largely based on logic synthesis, in which an HDL description is statically compiled onto hardware. Such an automated synthesis approach is not amenable to designing solutions which analyze the run-time behavior of applications and exploit dynamic reconfiguration.

Collapsing the numerous levels of abstraction in the automated synthesis approach will provide a new paradigm for designing configurable computing solutions. We propose to achieve this by using a computational model of configurable computing devices which facilitates an algorithm synthesis approach as opposed to the logic synthesis approach. In our approach the user is exposed to the underlying device characteristics, which allows the user to make use of the dynamic reconfiguration features. The computational model not only allows the user to implement algorithms in a natural manner but also permits analysis of their runtime behavior.

We will first illustrate some earlier models proposed by our group. These models of parallel computation, which permit dynamic reconfiguration of the interconnection network on a per-instruction basis, provide distributed control using local intermediate computational results. They provide interesting ideas as to the directions in which devices should evolve. We are also developing practical models which consider the cost of reconfiguration, partial reconfigurability, and performance in light of these issues. These will be discussed in the presentation. Variants of these models are also used for compilation by NYU researchers; those issues are discussed in a separate presentation at the workshop.

This work is supported by DARPA under contract DABT63-96-C-0049.

Directions in General-Purpose Computing Architectures

Andre DeHon

University of California at Berkeley

General-purpose computing devices and systems are commodity building blocks which can be adapted to solve any number of computational tasks. We adapt these general-purpose devices by feeding them a series of control bits according to our computational needs. We have traditionally called these bits instructions, as they instruct the programmable silicon on how to function.

While all general-purpose computing devices have instructions, distinct architectures treat them differently -- and it is precisely the management of device instructions which differentiates various general-purpose computer architectures. When architecting a general-purpose device, we must make decisions on issues such as the granularity (datapath width) of the processing elements, the number of instructions stored locally at each element, and the bandwidth devoted to distributing instructions to the elements.

Conventional programmable processors, such as microprocessors, have wide datapaths, deep on-chip instruction storage (caches), and the ability to issue a different instruction to the datapath on every cycle.

As a consequence these devices are efficient on wide word data and irregular tasks -- i.e. tasks which need to perform a large number of distinct operations on each datapath processing element. On tasks with small data, the active computing resources are underutilized, wasting computing potential. On very regular computational tasks, the on-chip space to hold a large sequence of instructions goes largely unused.

In contrast, conventional configurable devices, such as FPGAs, have single-bit datapath granularity, a single instruction (configuration) per array element, and a very large number of active computing elements, with the instructions changing only when the array is reconfigured.

As a consequence these devices are efficient on bit-level data and regular tasks -- i.e. tasks which need to repeatedly perform the same collection of operations on data from cycle to cycle. On tasks with large data elements, these fine-grain devices pay excessive area for interconnect and instruction storage versus a coarser-grain device. On very irregular computational tasks, active computing elements are underutilized -- either the array holds all sub-computations required by a task, but only a small subset of the array elements are used at any point in time, or the array holds only the sub-computation needed at each point in time, but must sit idle for long periods of time between computational sub-tasks while the next subtask's array instructions are being reloaded.

Unfortunately, most real computations are neither purely regular nor purely irregular, and real computations do not work on data elements of a single size. Typical computer programs spend most of their time in a very small portion of the code. In the kernel where most of the computational time is spent, the same computation is heavily repeated, making it very regular. The rest of the code is used infrequently, making it irregular. Further, in systems, a general-purpose computational device is typically called upon to run many applications with differing requirements for datapath size, regularity, and control streams. This broad range of requirements makes it difficult, if not impossible, to achieve robust and efficient performance across entire applications or application sets by selecting a single computational device from the extremes of today's conventional architectures.

Potential solutions to this dilemma reside in architectures which tightly couple elements of both extremes and which draw from the broad architectural space left open in the middle.

Multiple context FPGAs, such as MIT's DPGA, provide one such intermediate point in this architectural space. The DPGA retains the bit-level granularity of FPGAs, but instead of holding a single instruction per active array element, it stores several instructions per array element. The memory necessary to hold each instruction is small compared to the area of the array element and interconnect which the instruction controls. Consequently, adding a small number of on-chip instructions does not substantially increase die size. While the instructions are small, their size is not trivial -- supporting a large number of instructions per array element (e.g. tens to hundreds) would cause a substantial increase in die area, decreasing the device's efficiency on regular tasks.
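
The following toy model illustrates the multi-context idea: each array element stores a few small instructions (here, 4-input LUT truth tables) and a broadcast context identifier selects which one is active on a given cycle. The context count, encoding and C model are illustrative assumptions, not the DPGA's actual organization.

    /* Toy model of a multi-context array element: several stored LUT
     * configurations, one selected per cycle by a context id. */
    #include <stdio.h>

    #define N_CONTEXTS 4

    typedef struct {
        unsigned short lut[N_CONTEXTS];   /* one 16-bit truth table per context */
    } array_element;

    static unsigned eval(const array_element *e, int context, unsigned inputs4)
    {
        /* Look up the output bit for a 4-bit input pattern under the active context. */
        return (e->lut[context] >> (inputs4 & 0xF)) & 1u;
    }

    int main(void)
    {
        array_element e = {{
            0x8000,  /* context 0: 4-input AND (only pattern 1111 -> 1) */
            0xFFFE,  /* context 1: 4-input OR  (only pattern 0000 -> 0) */
            0x6996,  /* context 2: 4-input XOR (parity)                 */
            0x0001,  /* context 3: 4-input NOR (only pattern 0000 -> 1) */
        }};
        printf("AND(1111)=%u OR(0000)=%u XOR(0111)=%u\n",
               eval(&e, 0, 0xF), eval(&e, 1, 0x0), eval(&e, 2, 0x7));
        return 0;
    }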

Multiple context components with moderate datapaths also fall into this intermediate architectural space. Pilkington's VDSP has an 8-bit datapath and space for 4 instructions per datapath element. UC Berkeley's PADDI and PADDI-II have a 16-bit datapath and 8 instructions per datapath element. Both of these architectures were originally developed for signal processing applications and can handle semi-regular tasks on small datapaths very efficiently. Here, too, the instructions are small compared to the active datapath computing elements, so including 4-8 instructions per datapath substantially increases device efficiency on irregular applications with minimal impact on die area.

While intermediate architectures such as these are often superior to the conventional extremes of processors and FPGAs, any architecture with a fixed datapath width, on-chip instruction depth, and instruction distribution area will always be less efficient than the architecture whose datapath width, local instruction depth, and instruction distribution bandwidth exactly match the needs of a particular application. Unfortunately, since the space of allocations is large and the requirements change from application to application, it will never make sense to produce every such architecture. Flexible, post-fabrication assembly of datapaths and assignment of routing channels and memories to instruction distribution enables a single component to deploy its resources efficiently, allowing the device to realize the architecture best suited to each application. This is the approach taken by MIT's MATRIX component.

Since many tasks have a mix of irregular and regular computing tasks, a hybrid architecture which tightly couples arrays of mixed datapath sizes and instruction depths along with flexible control can often provide the most robust performance across the entire application. In the simplest case, such an architecture might couple an FPGA array to a conventional processor, allocating the regular, fine-grained tasks to the array and the irregular, coarse-grained tasks to the conventional processor. Such coupled architectures are now being studied by several groups.

In summary, we see that conventional, general-purpose device architectures, both microprocessors and FPGAs, live at extreme ends of a rich architectural space. As feature sizes shrink and the available computing die real-estate grows, microprocessors have traditionally gone to wider datapaths and deeper instruction and data caches, while FPGAs have maintained single-bit granularity and a single instruction per array element. This trend has widened the space between the two architectural extremes, and accentuated the realm where each is efficient. A more effective use of the silicon area now becoming available for the construction of general-purpose computing components lies in the space between these extremes. In this space, we see the emergence of intermediate architectures, architectures with flexible resource allocation, and architectures which mix components from multiple points in the space. Both processors and FPGAs stand to learn from each other's strengths. In processor design, we will learn that not all instructions need to change on every cycle, allowing us to increase the computational work done per cycle without correspondingly increasing on-chip instruction memory area or instruction distribution bandwidth. In reconfigurable device design, we will learn that a single instruction per datapath is limiting and that a few additional instructions are inexpensive, allowing the devices to cope with a wider range of computational tasks efficiently.

ASICs, Processors, and Configurable Computing

Brad L. Hutchings (hutch@ee.byu.edu)

Configurable Computing Lab

Dept. of Electrical and Computer Engineering

Brigham Young University

Provo, UT 84602


It is often suggested that configurable computing represents a new computational middle ground that fills the existing void between conventional microprocessors and ASICs. This point of view is based upon the observation that FPGAs share some similarities with both processors and ASICs. FPGAs are seen as similar to processors because they are customized in the field by the end-user by downloading configuration data into the device. They can also be seen as similar to ASICs because they can implement high-performance, application-specific circuits. It is hoped that if configurable computing can be shown to be similar to conventional processors, it will be possible to borrow microprocessor architecture and compilation techniques for use in the configurable-computing community.

However, configurable computing, as defined by current FPGA technology, does not fill the void between ASICs and processors. FPGAs have much more in common with ASICs than they do with processors. Indeed, if the spectrum of computing approaches were to be viewed as a family, ASICs and configurable computing would be siblings and processors would be distant relatives, at best. The distant relationship between FPGAs and processors can be seen best by studying the organization of successful FPGA applications, i.e., those applications that achieve at least order-of-magnitude performance gains over processor-based approaches. A quick review of these applications shows that they are highly concurrent, deeply pipelined and achieve performance gains primarily by exploiting massive amounts of data-level parallelism -- typically 100-1000 times that of a general-purpose microprocessor. Contrast this with typical microprocessor applications, which are described using sequential languages, implemented as sequential instructions, and executed on machines optimized for sequential execution.

The relationship between configurable computing, ASICs, and microprocessors has several important implications. First, sequential programming languages and related compilation approaches are not likely to be a good match for highly parallel configurable-computing applications. While it may be possible to achieve moderate speedup, significant speedup will only be achieved by directly exploiting massive amounts of parallelism. This is currently done using low-level circuit design tools. Second, the architectural organization (both at the device and system level) will be much more distributed than is commonly found in existing computer systems. For example, whereas typical computing systems consist of large global memories, configurable-computing platforms will be much better served by many smaller, distributed memories. Finally, because of the fundamental mismatch between the datapaths of processors and configurable-computing systems, hybrid systems of microprocessors and FPGAs are best coupled flexibly so that the best features of each device can be fully exploited.





Performance Metrics For Configurable Computing

Henk Spaanenburg

Honeywell Technology Center

3660 Technology Drive

Minneapolis, MN 55418

Metrics determine what kind of conclusions may be drawn from benchmark results, and also affect how benchmarks must be performed. The metrics in use in scientific computing benchmarks address mainly four questions:

  1. how can a given algorithm be characterized?
  2. how good is an algorithm at solving a problem on a given implementation?
  3. how good is a machine across multiple problems?
  4. how effectively does an implementation-algorithm combination scale?

When evaluating architectural and packaging options, one commonly encounters the problem of meeting performance requirements within constraints on weight, volume and power -- that is, of determining how much computational performance can be realized within a given physical envelope. This assessment process can be guided by a metric that we have found to be relatively consistent in past applications. It incorporates throughput in million operations per second (MOPS), weight (and implicitly volume) in kilograms, and power in watts. The MOPS/(kg.watt) ratio has been used to evaluate technology and packaging tradeoffs.

The claim is that, given a particular technology (pre-VHSIC, VHSIC phase 1 and 2) and a particular packaging approach (representing various die size per real estate area ratios), the selected combination will produce a system whose MOPS/(kg.watt) is known to be within a certain order of magnitude. A commercial supercomputer such as the Intel Paragon (Gamma), assuming 7.7 GFLOPS, 3000 lbs and 116 kW of power, would represent 0.0005 MOPS/(kg.watt). A Honeywell militarized Touchstone (Sigma) avionics supercomputer, assuming 7.7 GFLOPS, 82 lbs and 2.9 kW of power, would represent 0.71 MOPS/(kg.watt). For example, future avionics systems need a MOPS/(kg.watt) on the order of several hundred (209) for a 1.8 GFLOPS/20 GOPS Touchstone enhanced radar preprocessor realized on one double-sided (2 lbs), liquid-cooled (200 W) SEM-E form-factor board (1 GFLOPS ~ 10 GOPS).
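
The quoted figures for the Paragon and the militarized Touchstone can be checked directly from the definition of the metric, using the stated assumption that 1 GFLOPS is roughly 10 GOPS and converting pounds to kilograms; a minimal sketch:

    /* Worked check of the MOPS/(kg.watt) figures quoted above, assuming
     * 1 GFLOPS ~ 10 GOPS and 1 lb = 0.4536 kg. */
    #include <stdio.h>

    static double metric(double gflops, double pounds, double watts)
    {
        double mops = gflops * 10.0 * 1000.0;   /* GFLOPS -> GOPS -> MOPS */
        double kg   = pounds * 0.4536;
        return mops / (kg * watts);
    }

    int main(void)
    {
        /* Intel Paragon (Gamma): 7.7 GFLOPS, 3000 lbs, 116 kW  -> ~0.0005 */
        printf("Paragon:    %.4f MOPS/(kg.watt)\n", metric(7.7, 3000.0, 116000.0));
        /* Militarized Touchstone (Sigma): 7.7 GFLOPS, 82 lbs, 2.9 kW -> ~0.71 */
        printf("Touchstone: %.2f MOPS/(kg.watt)\n", metric(7.7, 82.0, 2900.0));
        return 0;
    }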

In our experience, an FPGA-based system typically achieves an order of magnitude better MOPS/(kg.watt) than a similar DSP-based implementation, but still an order of magnitude less than a full ASIC implementation. Our evaluation will confirm that observation and, in addition, will provide insight into the mechanism of reconfiguration and its related timing expense.

In addition to the above metric, others have been defined to measure the effectiveness of the reconfiguration aspect of configurable computing devices in particular. A direct function-for-function evaluation, especially relative to ASICs, is not the proper way to evaluate configurable computing. Of interest are some of the metrics proposed by DeHon at MIT as part of the Reinventing Computing program.