Notice: The documents distributed by this server have been provided by the contributing authors as a means to ensure timely dissemination of scholarity and technical work on a noncommercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.
Dept. of Computer Science, University of Kaiserslautern
P.O.Box 3049, 67653 Kaiserslautern, Germany
e-mail: email@example.com, FAX: ++49 631 205 2640
The following section introduces the Generic Address Generator hardware. An application example demonstrates its use in a high-speed image analysis environment for automotive applications. A final section provides some technical information on the device and concludes the paper.
The Generic Address Generator provides a logical view to the memory as a two-dimensional map. Higher dimensional data arrays can be mapped by slicing them into planes, which can be placed adjacently onto the two-dimensional memory organization. Large linear arrays (vectors) are constructed by concatenating multiple rows in the address computations. Physically, a linear memory is accessed, of course, for compatibility with conventional memory organizations. The Generic Address Generator can be programmed to several variations how to combine two address parts to a single linear memory address. To support efficient access to structured data, a Generic Address Generator operates in a two-stage pipeline. The first stage computes handle positions for so-called scan windows, which represent a neighbourhood of data elements around the handle position (Handle Position Generator). A handle position consists of two 16-bit values for the two dimensions of the data memory. The sequence of handle positions describes how the corresponding scan window moves across the data memory (figure 1). Such a sequence of handle positions is called scan pattern.
The second pipeline stage computes a sequence of offsets to the handle positions, to obtain the effective memory addresses for the computations. Therefore this stage is called Memory Address Generator. The offsets may be programmed to appear in arbitrary order, because the memory addresses are computed from simple RISC-like instructions (figure 1). With a proper instruction pipelining, addresses can be delivered one in each clock cycle, so that the introduction of instruction sequencing at this level does not introduce a bottleneck.
Multiple Generic Address Generators may be run in parallel on a single memory bus or multiple memory buses. They synchronize their scan patterns through the activations of the data processing device(s). All scan patterns proceed until either they detect an explicit synchronization point in the offset sequence of the Memory Address Generator, or they are blocked by a write operation to memory, waiting for the data processing device(s) to compute the desired result. In case of a single memory bus, explicit synchronization instructions in the Memory Address Generator program are mandatory to allow bus mastership to change from one Generic Address Generator to the next.
The following sections describe the pipeline stages of the Generic Address Generator in more detail. Afterwards, the capabilities of a Generic Address Generator are illustrated by examining the set of scan patterns that a Generic Address Generator can produce without requiring a reconfiguration.
The programming model of the memory is a two-dimensional map. Therefore the Handle Position Generator consists of two identical parts, one for each dimension (figure 2). This allows to use both dimensions at will, since the hardware structure induces no preference to assign specific characteristics only to one dimension. The so-called 1-D Address Generators operate in parallel and synchronize through a trigger logic, which preserves the symmetry of the design. The programmer decides, which (if any) 1-D Address Generator operates as master, triggering the other 1-D Address Generator to perform a step at the appropriate positions. The trigger logic simply routes the trigger signals according to the programmer's specification. Furthermore, it evaluates all conditions, which require a knowledge of the state of the complete system (e.g. when a scan pattern terminates).
The one-dimensional positions produced by each of the 1-D Address Generators are checked against segment limits, before they are presented to the Memory Address Generator as valid handle positions. This serves as a kind of memory protection scheme, which is especially useful, if the handle positions are computed under control of the data manipulating devices, dependent on the data processing results. The segment check units make sure, that all handle positions of a Generic Address Generator remain within a programmer-defined orthogonal bounding box in the two-dimensional memory map. By providing the maximum and minimum offsets that occur in the Memory Address Generator program, the segment checks can even evaluate whether the bounding-box would be left during the generation of memory addresses and invalidate such a handle position before it is passed on in the pipeline.The actual address computations are done by the 1-D Address Generators. Each 1-D Address Generator is capable of handling two nested loops, where the number of iterations is known at compile time.
The range of offsets may be -32 to +31 in both dimensions of the data memory. The memory to store the instructions (address instruction RAM, AIR) has 256 entries, so that at most about 250 references to the data memory can be made from a single handle position. At least three instructions are overhead, which have to be inserted to accept a new handle position from the Handle Position Generator, to signal the start of computations to the data processing devices, and to jump to the beginning at the end of an offset sequence.
After the instructions have been fetched and decoded, the offsets have to be added to the current handle position. A number of steps are performed with each of the resulting addresses, to transform them into physical memory addresses. These tasks are completely hardwired, because they have to be applied to each address. The first is an address modification to allow for cyclic addressing. For each dimension, a CycleMask word is used to keep selected bits of the address from changing, and a CyclePattern word provides default values for the masked bits. If the CycleMask is partitioned into a leading block of 16-k zeroes and a trailing block of k ones, for example, the resulting addresses automatically wrap around at the 2k boundary, because the values of the higher (16 - k) bits are masked. Input to the following step are still two 16-bit address words, one for each dimension. To be able to access a conventional linear memory, the address parts have to be combined to a linear memory address. The address parts of the two dimensions may be combined to a real linear memory address in four ways: one row of two-dimensional memory (x dimension) may consume an address space of 10, 12, 14, or 16 bits. This allows to adjust the "size" of the data memory to the size of the processed data, to reduce wastage of address space. After the concatenation of the two address parts to a linear address, a 32-bit base address is added, to obtain the effective memory address. The base address typically is the starting address of the data array referenced by this Generic Address Generator. Finally, the bus interface handles the protocol for the actual data transfers between the data memory and the data manipulating devices, using the effective memory address to access the data memory. A block diagram (figure 3) illustrates these tasks, as they are performed in the Memory Address Generator. The AIRport register has been introduced for two reasons. First, the instructions in the Address Instruction RAM (AIR) are only 16 bits wide, so that the AIRport register serves as an interface register, to transfer two instructions with one configuration data transfer. Second, to the processor which downloads the parameters, the Address Instruction RAM is hidden behind the AIRport register and consumes only a single address. The programming model corresponds to a shift register, allowing to write a complete instruction stream to the same AIRport address during configuration. This can be done efficiently using block transfers. This concept improves the scalability of the Generic Address Generator, because different sizes of the Address Instruction RAM do not require a different layout of configuration register addresses, but only a different length of the block transfers. This is true for the number of instructions that can be stored as well as for the format of the instructions (32-bit instructions would allow for a larger offset range, for example). The resulting changes to the download software can easily be parameterized, whereas the processor interface remains unchanged for all these variations of Memory Address Generators.
A chip photograph of the Generic Address Generator can be seen in figure 4. It has been fabricated in a 1.0 Ám CMOS standard cell process at European Silicon Structures (ES2). The circuit contains 6821 standard cells, being equivalent to 126702 transistors, excluding the RAM. The macro-block in the upper left corner is the 256 word Address Instruction RAM. The die size is 8.3 by 7.4 mm2, including the pads. It is housed in a 101 pin PGA package and operates at 20 MHz. Spending additional efforts in the place and route process and inserting some more stages to the pipeline, higher operation frequencies will be possible.
For all scan patterns, the stepwidths can be chosen arbitrarily in the range of 1 to 65536. They have been reduced to one for the more complex patterns only for compactness of the presentation. And since both dimensions are completely symmetric, all variations of the scan patterns, where x and y dimension are interchanged, are valid scan patterns as well. Even the interleaved video-scan in figure 5c, where the odd x positions are visited first and the even ones next before proceeding with the next line can be changed e.g. to a pattern, where first the odd and then the even lines are processed, like an interlaced video display does. The data dependent scan patterns in figure 5f and g can be handled with little overhead, because the data manipulating devices only have to signal the sign bits of the single step vectors, using the programmed stepwidths of the Generic Address Generator. This simplifies curve following and Lee routing applications as examples of algorithms requiring nearest neighbour data dependent processing. Furthermore, the arbitrary sequence of addresses programmable in the Memory Address Generator allows to produce any scan pattern made up of up to 250 references to addresses in a 64 by 64 words wide area. This includes all scan patterns required for the JPEG compression algorithm, which operates on blocks of 64 data words in an eight by eight array organization.
The first Generic Address Generator computes the address sequence for the image preprocessing hardware, which performs an edge detection (e.g. by applying a two-dimensional FIR filter with appropriate weights). The pixels are read directly from the sensor array in the order which is important to the selected algorithm. The second Generic Address Generator reads the resulting edge enhanced image from the preprocessing hardware and stores it into a dual-ported data memory. There a general-purpose CPU or a digital signal processor may perform the image analysis tasks, where the edges have to be combined to objects. These are identified to detect obstacles, like the approaching car, and the borders of the street, for example. Since these algorithms do not reveal address patterns known at compile time, they can be handled with less overhead by a von Neumann style processor than by an address generator.
The hardware for the image preprocessing might be a reconfigurable hardware like the rDPA , or the reconfigurable data-driven multiprocessor architecture proposed in . Both would be superior over special purpose hardware with regard to flexibility in the choice of algorithms to be run. The required data rates of 3.2 MBytes per second to process the contents of the complete sensor array of 128 by 256 pixels 100 times a second can easily be handled by both kinds of devices. This allows to process even larger sensor arrays built from multiple devices, or higher frame processing rates than 100 per second.
Using two independent Generic Address Generator allows to perform the image analysis in a pipelined fashion, where the image acquisition, the image preprocessing, and the image analysis are done in parallel in different stages of the pipeline. This does not even increase the latency from the image acquisition to the collision detection, because all these steps would have to be done sequentially as well in a non-pipelined fashion, only at lower overall data rates or higher bandwidth requirements. The versatility of the scan patterns in combination with the two-level hierarchy of address generation allows to run a wide variety of image preprocessing algorithms in an endless loop without requiring any download of parameters after the initial power-up configuration.
The Generic Address Generator has been successfully designed and fabricated in a student project. The 1.0 Ám CMOS standard cell process allows an operation at 20 MHz. The circuit fabrication and the design software have been made available by the Eurochip project of the CEC.
 HARTENSTEIN, R.W. - KRESS, R. - REINIG, H.: A Reconfigurable Data-Driven ALU for Xputers. Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines (FCCM'94), Napa, CA, April 1994.
 N.N.: ADSP-21020 User's Manual. Analog Devices 1991.
 N.N.: DSP96002 IEEE Floating-Point Dual-Port Processor User's Manual. Motorola 1989.
 N.N.: Harris Semiconductor Data Book. Harris Corporation 1992, pp. 6-3 - 6-15.
 N.N.: TMS320C3x User's Guide. Texas Instruments 1990.
 YEUNG, A.K.W. - RABAEY, J.M.: A Reconfigurable Data-Driven Multiprocessor Architecture for Rapid Prototyping of High Throughput DSP Algorithms. Proceedings 26th Hawaii Int'l. Conf. on System Sciences, Vol. 1, IEEE Computer Society Press 1993, pp. 169 - 178.