Increasingly, highly integrated consumer products such as cellular phones incorporating a still camera and video playback, or HDTV-quality DVD players, must execute complex algorithms and process voluminous data content. The very high performance requirement of these devices can be met by the deployment of multiple microprocessors and digital signal processors in system-on-chip (SoC) designs. Problematically, this multiprocessor approach can exceed the power and cost constraints of the application.
However, the SoC's performance, power and cost targets can be achieved by the use of an application engine. An application engine is a custom hardware/software system, typically consisting of a combination of a processor and dedicated hardware accelerators, and optimized to execute a specific algorithm or suite of algorithms. Consider, for example, a JPEG encoder or decoder. A lower performance implementation for a low resolution camera may require only a processor, while a higher performance implementation (for example, for high resolution, photo quality printing) would deploy dedicated hardware accelerators for most of the processing, and a processor for management and control.
The use of dedicated hardware accelerators in an application engine can reduce the power consumption by up to two orders of magnitude while delivering performance an order of magnitude greater than that of a purely programmable solution. However, historically, the manual design of such application engines has been both time-consuming and costly.
Application engine synthesis (AES) is an automated approach that reduces application engine design effort from engineer-years to engineer-weeks, or even engineer-days. This increase in engineering productivity not only accelerates the design of any given application engine, but also offers the option of implementing even more algorithms in an application engine than has historically been practically possible within the permitted chip design time. In other words, AES can fundamentally alter the up front hardware/software (HW/SW) partitioning decisions, greatly expanding the design team's options for meeting or beating increasingly stringent power constraints. Moreover, AES automatically incorporates standard RTL power - or, rather, energy - optimization approaches that in manual design constitute a significant design burden.
This article focuses upon one component of AES - the automated design of the dedicated hardware accelerators that are critical to achieving the greater speed and lower power consumption of an application engine.
In order to understand the role of AES, we must first examine the manual design flow for an application engine from an algorithm.
Manual Accelerator Design
The flow begins with the HW/SW partitioning of system functionality, taking into account - among other considerations - performance requirements, energy consumption and chip area constraints, re-usability needs, and the effort required to design the individual functional blocks within the specified market window. Compute-intensive algorithms such as those for imaging, video, audio and wireless applications receive the most analysis, because they require high performance and are the most energy-hungry functions. Such functions are prime candidates for an application engine implementation, and the balance of this flow is predicated on the assumption that an application engine implementation has been selected.
Application engine design starts with the manual development of a custom reference algorithm - which is proprietary IP that often constitutes a critical differentiator - or with the adoption of an established industry-standard reference algorithm, or both. For instance, a still camera would typically deploy custom image processing algorithms and standard JPEG compression.
The reference algorithm is a functional description written in a high level language such as C, and is independent of implementation (Figure 1). That is, it is applicable to a wide range of implementations, e.g. various resolutions, data formats and compression modes. It is often expressed in floating-point arithmetic, and its hardware implementation is undefined. Development of a custom reference algorithm takes engineer-months of effort, while understanding an industry standard algorithm may take an engineer-month, or so. Both require a thorough understanding of the application. Development of a custom reference algorithm also requires a high degree of innovative skill.
Figure 1. Manual Dedicated Hardware Accelerator Design Flow
In the second step, an implementation-specific algorithm must be derived manually from the reference algorithm. This algorithm is expressed in fixed-point arithmetic, and may define hardware resources. Thus, its development may include hardware architecture design and certainly includes determination of the appropriate memory architecture. For instance, the reference algorithm uses a single global memory space, while the implementation algorithm defines global memory and local memory storage, streaming data connections, and DMA accesses. Development of this algorithm generally takes several engineer-months, and requires expertise in system architecture and design.
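The step from floating-point reference code to fixed-point implementation code can be pictured with a minimal sketch. The function names, the Q8.8 format, and the scaling operation below are our own illustrative assumptions, not taken from any particular reference algorithm:

```c
#include <stdint.h>

/* Reference algorithm: floating point, implementation-independent. */
float scale_ref(float pixel, float gain)
{
    return pixel * gain;
}

/* Implementation-specific version in Q8.8 fixed point, as a hardware
 * block would compute it.  The Q8.8 format is an illustrative
 * assumption chosen for this sketch. */
uint16_t scale_fixed(uint16_t pixel_q8, uint16_t gain_q8)
{
    uint32_t product = (uint32_t)pixel_q8 * gain_q8; /* Q16.16 intermediate */
    return (uint16_t)(product >> 8);                 /* back to Q8.8        */
}
/* e.g. scale_fixed(100 << 8, (uint16_t)(1.5 * 256)) == 150 << 8,
 * matching scale_ref(100.0f, 1.5f) == 150.0f */
```

The fixed-point version fixes the bit widths of every operand and intermediate - exactly the class of decisions that must be made at this stage.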
The third step is the manual RTL design and verification of the hardware part of the implementation-specific algorithm, encompassing hardware-specific decisions such as the determination of interface bit widths, which must be large enough to deliver the desired output quality, but small enough to ensure a cost-effective implementation within the energy constraints. The choices made at this stage are primary determinants of the energy consumption of the design. For instance, throughput can be achieved by varying the clock frequency or the degree of pipelining, or a combination of both, with each configuration consuming a different amount of energy. Consequently, this step mandates a comprehensive design space exploration to determine the optimum configuration - a very time-consuming task when executed manually. Development of this implementation can take engineer-years of effort.
Clearly, automation of this third step would not only significantly reduce design time and effort, but also dramatically increase the extent and quality of design space exploration, producing a more optimal implementation than is practically possible in manual design.
Automated Accelerator Design
AES enables exploration of the algorithm at different levels of abstraction in "what if?" scenarios to devise the optimum implementation. It then automatically generates synthesizable RTL; logic synthesis scripts; a testbench derived from the C simulation; the device driver code to ease integration into the SoC design; a SystemC interface that facilitates system level simulation and validation; and a full SystemC hardware/software co-simulation flow.
AES deploys customizable and parameterized IP to automate the generation of efficient, synthesizable RTL from the implementation-specific C algorithm. AES deploys a Pipeline of Processing Arrays (PPA) - IP that enables it to map a sequence of C algorithm loop nests onto a custom pipeline of hardware blocks. As can be seen in Figure 2, there are three levels of hierarchy.
Figure 2. PPA Architecture
1. Processing Element (PE): A PE consists of functional units, such as adders and multipliers. Data is communicated between these units using ShiftQ, a novel, patent-pending storage structure that also optimizes area utilization.
2. Processing Array (PA): A PA comprises one or more PEs, connected to each other using nearest-neighbor interconnect. A PA incorporates local memory resources - both SRAM and synthesized register banks - thus reducing the demand for global memory capacity.
3. Pipeline of Processing Arrays (PPA): A PPA comprises a number of PAs that communicate streams of data via FIFOs or memories. PA operation is coordinated by a timing controller. The PPA communicates with the CPU under the management of a host controller. Configuration parameters for one or more tasks are stored in a task frame memory.
This architecture enables the exploitation of parallelism in the C algorithm at every level, as shown in Figure 3.
Figure 3. Exploitation of Parallelism At Every Level
- Inter-task parallelism: A task can be commenced before a prior task has been completed.
- Intra-task parallelism: The execution of the loop nests is pipelined. A PA is free to commence execution as soon as input data are available.
- Inter-iteration parallelism: Multiple iterations of a loop are pipelined on a single PE.
- Intra-iteration parallelism at the instruction level: Within an iteration, multiple operations can execute in parallel on multiple functional units in a PE.
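As a concrete, purely illustrative example, consider how these levels would apply to a simple pair of C loop nests; the filter and all names below are hypothetical, not drawn from the article:

```c
/* Illustrative loop nest: a 3-tap filter followed by a clipping pass.
 * Comments mark where each level of parallelism in Figure 3 applies
 * when AES maps the nests to a PPA. */
#define N 16

void filter_and_clip(const int in[N + 2], int out[N])
{
    int tmp[N];

    /* Loop nest 1 -> mapped to one PA.  Successive iterations are
     * pipelined on a PE (inter-iteration parallelism); the adds and
     * the shift within one iteration can execute on parallel
     * functional units (intra-iteration, instruction-level). */
    for (int i = 0; i < N; i++)
        tmp[i] = (in[i] + 2 * in[i + 1] + in[i + 2]) >> 2;

    /* Loop nest 2 -> mapped to a second PA downstream.  It may start
     * as soon as tmp[0] is available, before nest 1 finishes
     * (intra-task parallelism); a second image can enter nest 1 while
     * nest 2 still drains the first (inter-task parallelism). */
    for (int i = 0; i < N; i++)
        out[i] = tmp[i] > 255 ? 255 : tmp[i];
}
```

In software these nests run sequentially; on a PPA all four forms of overlap are available simultaneously.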
Automated Accelerator Design Flow
In this flow, code analysis, code optimizations and hardware allocations are performed to synthesize an RTL implementation optimized for performance, area and energy. A PPA template enables the use and re-use of pre-verified parameterized functional units such as adders, multipliers, shifters, etc.
Figure 4. Automated Hardware Accelerator Design Flow
This approach is critical to right-first-time timing closure and place-and-route. With reference to Figure 4, the flow proceeds in four distinct phases:
1. Implementation-specific C code is analyzed to identify the parallelism that enables the application engine to achieve the requisite performance. Each loop nest is mapped to a PA, and rate matching is used to determine the most efficient PA design to meet the performance requirement.
2. Sequential code is optimized with transformations such as dead code elimination, constant propagation, common sub-expression elimination, and strength reduction.
3. Instruction-level parallelism is identified, and code is mapped to a software pipeline by the allocation of an optimal number of functional units followed by the scheduling of instructions to these units.
4. An efficient hardware implementation is created with only the required components, such as functional units, interconnect, and registers.
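The sequential-code optimizations of step 2 can be sketched as a before/after pair. The example function is hypothetical, but both versions compute the same result:

```c
/* Illustrative only: the kinds of optimizations AES applies in step 2.
 * 'before' and 'after' are functionally identical. */

int before(int x)
{
    int unused = x * 99;        /* dead code: result never used        */
    int k = 4;                  /* constant: propagated below          */
    int a = (x + 1) * k;        /* common sub-expression: (x + 1)      */
    int b = (x + 1) * 2;        /* strength reduction: *2 -> << 1      */
    (void)unused;
    return a + b;
}

int after(int x)
{
    int t = x + 1;              /* CSE: compute (x + 1) once           */
    return (t << 2) + (t << 1); /* constant propagated; *4 and *2
                                   reduced to shifts (shown here for
                                   non-negative t); dead code removed  */
}
```

In hardware terms, 'after' needs no multiplier at all - the optimizations directly remove functional units, area and energy, not just instructions.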
AES automates the otherwise difficult and time-consuming manual analysis and optimization of candidate hardware configurations necessary to identify and implement the optimum energy-efficient candidate. It evaluates trade-offs with real RTL or synthesized netlists, on which accurate power analysis can be performed.
Example trade-offs - the consequences of which are not intuitively obvious and which are non-trivial to analyze manually - include:
- Increasing the clock rate with the same throughput per clock cycle. This would appear to increase energy consumption. However, at the higher clock rate, it may be possible to reduce memory porting and, therewith, area and power consumption. An analysis must be performed to derive the relationship between clock rate and energy consumption to determine the optimum configuration.
- Maintaining the clock rate while evaluating different throughput options. Lower throughput per cycle may well eliminate considerable resources, such as multipliers and memory ports, and consequently reduce energy consumption. By contrast, executing at a higher throughput rate would result in shorter activity periods, which may also reduce energy consumption. An analysis must be performed over a range of implementations in order to determine the energy/throughput relationship.
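A first-order sketch of the second trade-off follows; every number and the simplified energy model itself are loudly hypothetical assumptions (dynamic energy proportional to switched capacitance and active cycles, leakage energy proportional to active time), intended only to show why the result requires analysis:

```c
/* First-order energy model; all figures below are illustrative
 * assumptions, not measurements from the article.
 *   dynamic energy ~ switched capacitance/cycle * Vdd^2 * active cycles
 *   leakage energy ~ leakage power * active time                      */
double config_energy(double cap_per_cycle_nf, double vdd,
                     double active_cycles, double leak_mw,
                     double clock_mhz)
{
    double dyn   = cap_per_cycle_nf * 1e-9 * vdd * vdd * active_cycles;
    double t_sec = active_cycles / (clock_mhz * 1e6);
    double leak  = leak_mw * 1e-3 * t_sec;
    return dyn + leak;                      /* joules */
}
/* Comparing a hypothetical 1-pixel/cycle design (1.0 nF/cycle, 1e6
 * active cycles, 5 mW leakage) against a 2-pixel/cycle design
 * (1.8 nF/cycle, 5e5 cycles, 9 mW leakage) at the same 200 MHz clock:
 * the wider design switches more capacitance per cycle yet finishes
 * sooner, and with these numbers consumes less total energy. */
```

With different capacitance or leakage assumptions the comparison can easily flip - which is precisely why the relationship must be derived from real RTL rather than intuition.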
In addition, the PA architecture is a significant aid in energy reduction. For instance, the consistent instantiation of registers simplifies clock gating. The PA architecture also avoids unnecessary circuit activity when there is no useful data to be processed - often an unintended characteristic of a manual design that contains numerous independent state machines that produce complex time-dependent advance and stall behavior. By contrast, each PA stalls as a unit when no useful processing can be performed, ensuring that no element in its datapath can execute.
Multiple simulations and checks are executed to exhaustively verify the RTL, using the implementation-specific C description as the "golden reference" for the verification flow, as shown in Figure 5.
Figure 5. The Verification Flow
- "Linting" simulation to detect errors in the code that are often difficult to find in hardware, such as un-initialized variables, out of bounds array references, and overflows from customer-constrained bit-widths.
- Software simulation of transformed code with API calls to model HW/SW interfaces and transactions, such as the initialization of registers and memories internal to the accelerator. This simulation also enables extraction of stimulus/response for RTL simulation.
- RTL testbench creation and an "offline" RTL simulation that enables the design team to compare RTL behavior with that of the original C implementation algorithm. This simulation may be executed using a variety of popular RTL simulators, including Cadence NC-Verilog, Mentor ModelSim and Synopsys VCS.
- Co-simulation of PPA SystemC models with original driver code, using transaction-level interfaces. A bit-accurate SystemC model represents the same functionality as that used in the 'linting' simulation, while a thread-accurate SystemC model represents the parallel operation of the loop nests. This latter model encompasses more of the hardware's behavioral characteristics, but its simulation is nonetheless still very fast.
- Co-simulation of RTL with that portion of the original C driver code that remains in software. This co-simulation exercises features that cannot be tested with 'offline' RTL simulation. For instance, some accelerators require data to be retained in memory across tasks, while others allow task overlap in which different loops execute different tasks simultaneously.
- FIFO analysis with event-based simulations using schedule-accurate traces for each loop nest to analyze stalling behavior. This simulation identifies maximum useful FIFO lengths as well as performance/area trade-offs, and deadlock conditions.
- Pseudo-random perturbation simulation to exercise RTL behavior that has no equivalent in the C implementation algorithm. Such behavior includes the unavailability of data in an input stream; the inability of an output stream to accept new data; and the advent of an 'abort' signal in mid-process, followed by a 'restart'.
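The first check above can be illustrated with the kind of bit-width overflow a 'linting' simulation would flag; the 12-bit constraint and the function names are our own assumptions for this sketch:

```c
/* Illustrative: a customer-constrained 12-bit accumulator.  The C
 * reference silently computes in 32 bits, so the mismatch surfaces
 * only when the 12-bit wrap is modelled, as the linting step does. */
#include <stdint.h>

#define ACC_BITS 12
#define ACC_MASK ((1u << ACC_BITS) - 1u)   /* 0xFFF */

uint32_t acc_hw(uint32_t acc, uint32_t x)
{
    return (acc + x) & ACC_MASK;   /* wraps at 4096, as hardware would */
}

uint32_t acc_ref(uint32_t acc, uint32_t x)
{
    return acc + x;                /* reference C: no wrap             */
}
/* acc_ref(4000, 200) == 4200, but acc_hw(4000, 200) == 104 -
 * a silent quality defect unless the constrained width is checked. */
```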
A Design Project
We shall now discuss the use of application engine synthesis in the design of a dedicated hardware processing pipeline for still images, as shown in Figure 6. Typically, such pipelines include a number of proprietary processing functions such as de-mosaicing; color adjustments such as color correction and gamma correction; image enhancements such as edge enhancement and smoothing; and image scaling. The pipeline also includes a compression process such as JPEG.
Figure 6. JPEG Conversion For A Digital Still Camera
Although JPEG is a standard algorithm, it accommodates multiple picture and color formats, and different levels of compression. Thus the designer can implement different trade-offs depending upon desired image quality.
The pre-JPEG section of the image encoder pipeline performs the following functions on data in raster order:
- Identification of, and compensation for, image quality problems due to sensor imperfections - for instance, 'dead' sensor pixels, and the luminance variations that result from the sensor's varying analog behavior.
- 'De-mosaicing', or interpolation, of the image data captured by the color filter array in order to reconstruct the full RGB values for each pixel.
- Adjustment of color for factors such as the difference between intensity of illumination for various colors as detected by the sensor and that detected by the eye.
- Transformation of the sensor RGB data into a preferred luminance/chrominance color space such as YUV.
The JPEG section of the pipeline performs the following functions:
- Block management to transform the data from raster order to blocks of 8 x 8 pixels, simultaneously discarding unnecessary data. For instance, because the eye is less sensitive to color than to light intensity (luma), a number of color 'U' and 'V' pixels are eliminated, while 'Y' pixels are retained, reducing the total data by up to 50%.
- Discrete cosine transformation (DCT) on each 8 x 8 block to transform the signal from the spatial domain to the frequency domain. DCT separates the image into spectral sub-bands of differing importance with respect to the visual quality of the image.
- Quantization of each block to compress a range of values to a single quantum value. This increases the number of zero values in the stream, thus increasing the compression that may be achieved.
- Run/level coding, which relies upon strings of zeros to further reduce data volume.
- Huffman coding, which exploits the probabilities that particular run/level values will occur. This allows higher probability run/level pairs to be encoded with fewer bits. Huffman coding is a lossless transformation with no effect upon image quality.
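The quantization and run/level stages can be sketched in a few lines of C; the quantizer values are illustrative, and zig-zag reordering is omitted for brevity:

```c
/* Illustrative quantization and run/level coding of one 8x8 block of
 * DCT coefficients (zig-zag ordering omitted for brevity). */

typedef struct { int run; int level; } RunLevel;

/* Quantize: divide each coefficient by its quantizer step, truncating
 * toward zero; small coefficients collapse to 0, creating the zero
 * runs the next stage exploits. */
void quantize(const int coeff[64], const int q[64], int out[64])
{
    for (int i = 0; i < 64; i++)
        out[i] = coeff[i] / q[i];
}

/* Run/level coding: emit a (zero-run, nonzero-level) pair for each
 * nonzero coefficient; returns the number of pairs produced. */
int run_level(const int in[64], RunLevel out[64])
{
    int n = 0, run = 0;
    for (int i = 0; i < 64; i++) {
        if (in[i] == 0) { run++; continue; }
        out[n].run = run;
        out[n].level = in[i];
        run = 0;
        n++;
    }
    return n;
}
```

A typical post-DCT block quantizes to only a handful of nonzero levels, so the 64 coefficients shrink to a few run/level pairs before Huffman coding compresses them further.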
The design must meet the need for maximum throughput, varying processing rates, flow control, and resource sharing, as described in more detail below.
- Maximizing Throughput: The image pipeline typically receives data in raster scan order at some constant rate, perhaps one pixel every cycle or every two cycles. As each pixel is received only once, the application engine must store any data that must be re-used, for instance previous picture lines for de-mosaicing or vertical scaling. Each stage of the pipeline - including real time JPEG compression - must process data at a rate determined by the sensor output. The JPEG compression stage may operate at a lower rate only if it executes 'off line' after image data is stored in memory.
- Varying Processing Rates: The design's functional blocks operate at different rates. For instance, the Row/Column Transformation algorithms, e.g. DCT, are typically expressed as 8 operations on each of 8 data elements, while many of the image processing functions are expressed as operations on one element. With such variable factors, rate matching between functions is necessary to meet throughput. Row and column transforms may require one loop iteration to commence every 8 cycles, while quantization or scan ordering may require one loop iteration every cycle. AES automatically devises the appropriate rate-matching scheme after analyzing the number of loop iterations vs. the target number of cycles to execute the task, and the data stream bandwidth between the blocks.
- Flow Control: Inputs and outputs are not guaranteed to be available for reading/writing on each cycle, requiring the inclusion of flow-controlled interfaces in the design. All state machines in the design must interact correctly in the presence of any flow control conditions. AES inserts flow control circuitry automatically, enabling functional blocks to operate or stall independently of other functional blocks. Thus, provided any given PA has input data available and can write to its outputs, it can advance, and may do so independently of other PAs. Note that FIFOs inserted automatically during the design eliminate the need for each function to have exactly the same cycle-by-cycle behavior, greatly simplifying the design problem. AES can be used to determine FIFO sizing in order to optimize performance and to avoid deadlock.
- Resource Sharing and Design Flexibility: The range of images that must be processed by the pipeline mandates a degree of design flexibility. For instance, it may be necessary to process images of arbitrary sizes. AES enables the designer to allocate different types of tasks to the procedure, such as 'process a new image and write to memory', 'process a new image and compress to JPEG', and 'read an image from memory and compress to JPEG'.
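The rate-matching arithmetic described above can be sketched as follows; the image size and cycle budget are hypothetical:

```c
/* Illustrative rate-matching arithmetic: given a task's cycle budget
 * and a loop nest's iteration count, derive the initiation interval
 * (cycles between successive iteration starts) that the corresponding
 * PA must sustain. */
long initiation_interval(long budget_cycles, long iterations)
{
    long ii = budget_cycles / iterations;
    return ii < 1 ? 1 : ii;   /* cannot start faster than every cycle */
}
/* For a hypothetical 640x480 image with a budget of one pixel per
 * cycle (307200 cycles): quantization iterates once per pixel, so its
 * II is 1; an 8-point row transform iterates once per 8 pixels
 * (38400 iterations), so its II is 8 - matching the rates above. */
```

A PA with a larger initiation interval can time-share fewer functional units, so rate matching directly trades area and energy against throughput.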
The table below shows the results of using the automated design flow described above on various dedicated hardware accelerator designs.
It can be seen that AES produces performance results equal to or better than manual RTL design, with area results equal to or less than manual RTL design, and in considerably less time.
Application Engine Synthesis automates the design and verification of an application engine - a custom hardware/software system optimized to execute a specific algorithm or suite of algorithms - that is inherently faster and less energy-hungry than purely programmable approaches. When applied to the dedicated hardware accelerators within the application engine, it delivers performance and area characteristics comparable with those of manually designed RTL, while its underlying IP architecture enables the use of standard RTL energy optimization approaches.
By eliminating engineer-years of effort from the more mundane design and verification tasks, AES enables the design team to perform a comprehensive design space exploration to identify the optimum energy-efficient configuration.
By Vinod Kathail, Vice-president, Engineering and CTO, Synfora, Inc.
Go to the Synfora, Inc. website to learn more.