April 14 2006 -- Success is measured by delivering the defined functionality on time, within budget and in a way that conforms to all of the non-functional constraints, such as power, performance and reliability. To ensure that all of these metrics are met, the industry has traditionally turned to verification and, more specifically, simulation, as a way of predicting these before delivery of actual silicon. Simulators have always concentrated on functional verification and have ignored many of the other aspects of a complete verification environment. In addition, systems now contain more than just hardware, and it is no longer possible to verify functionality or the non-functional constraints without including the software. This is not just about the execution of the software on the hardware, but a complete merger of the concepts of verification of both the software and hardware components, such that the effect of one on the other can be explored.
Verification is also a continuum that parallels the design flow, rather than being a singular function. Each task in that continuum should be performing verification of a single facet of the system, such as functionality, architecture, performance, timing and implementation. While possible, it is not optimal to use the RTL model for all of these verification functions because it contains too much information and this impacts simulator performance. In addition, each of these facets of verification requires different information to be gathered, analyzed and displayed. System performance, for example, cannot be ascertained by looking at waveforms, and functional verification cannot be performed by looking at cache-miss statistics.
Consider the simplified design flow shown in Figure 1, which ignores many of the realities of life and omits all feedback loops, but still illustrates the point. At the top of the design flow, the system is designed and algorithms are developed. Most companies do this on paper today, or model pieces of it in UML, Simulink from The MathWorks, or similar languages capable of the necessary levels of abstraction. No distinction is made between hardware and software, and no consideration is given to the algorithms' implementation.
Figure 1. Facets of design and verification.
Assuming that models existed for everything at this level, you would want to verify that the algorithms were correct, and that the interaction between the blocks produces the desired functions on the primary outputs. This is the purpose of functional verification. System level functional simulators are not new, but functional verification is usually performed later in the flow today because of the unavailability of suitable models. Timing does not exist at this level of abstraction, and models are normally synchronized by the passage of data through the system or by some regular schedule of execution.
The next stage in the design process is setting the solution's basic architecture by deciding what functionality to implement in hardware and what in software. Much of the IP that will be used is selected, particularly any platforms and the essential operating systems for that architecture. This lets the system architect form a fairly accurate picture of system performance. The industry standard taxonomy defines this as an abstract-behavioral model, which describes the function and timing of a component without explaining its implementation. The models' interfaces may be token-passing in nature, but the tokens carry real data and accurate functionality is performed on them. The industry often calls these tokens "transactions," and they enable exploration of load factors, congestion, resource utilization and other system aspects. While abstract hardware/software co-simulation has been available for a while, these products attempt to act as functional simulators rather than concentrating on performance, and they do not collect the right kind of data to be useful in identifying performance problems. This issue will be discussed in more detail in the next section.
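As a rough illustration of the kind of data a transaction-level model can collect, the sketch below serves timed transactions on a single shared bus and reports bus utilization and the cycles each master spent waiting. The function name, the master names and the transaction list are all hypothetical, and the first-come-first-served arbitration is an assumption made for brevity, not a description of any particular tool.

```python
def simulate_bus(transactions):
    """Serve (issue_cycle, master, duration_cycles) transactions on one shared bus.

    Transactions are served in issue order (an assumed first-come-first-served
    policy). Returns (utilization, waits), where waits maps each master to the
    total cycles it spent stalled waiting for the bus.
    """
    bus_free_at = 0   # cycle at which the bus next becomes idle
    busy_cycles = 0   # total cycles the bus spent transferring data
    waits = {}
    for issue, master, duration in sorted(transactions):
        start = max(issue, bus_free_at)          # stall if the bus is occupied
        waits[master] = waits.get(master, 0) + (start - issue)
        bus_free_at = start + duration
        busy_cycles += duration
    utilization = busy_cycles / bus_free_at if bus_free_at else 0.0
    return utilization, waits


# Hypothetical traffic: a processor and a DMA engine competing for the bus.
traffic = [(0, "cpu", 4), (2, "dma", 8), (3, "cpu", 4), (20, "dma", 8)]
utilization, waits = simulate_bus(traffic)
print(f"bus utilization: {utilization:.2f}, waiting cycles: {waits}")
```

Even a model this coarse exposes congestion (the stalled cycles per master) and resource utilization, which waveforms from a detailed model do not show directly.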
In the hardware space, the design process considers the micro-architectural decisions, such as the amount of parallelism to use, pipelining and resource sharing. These decisions impact the area, power, latency and throughput of the solutions. Once the implementation detail is added to the refined behavioral model, an RTL model emerges. Here is where most companies start the verification process, including system-level functional verification, performance verification and implementation verification. These models contain superfluous detail for these types of verification, resulting in wasted effort and slow tools.
Performance Verification
Performance verification can help identify problems in the architecture or algorithms early on. Speed is not the only thing that can be investigated at this stage: the same analysis shows how well each component in the system is being utilized, and it helps establish correct component sizing, power estimates and many other aspects of the architecture. It can also help to look for problems in both the hardware and the software. In general, companies that do not do performance verification tend to oversize everything so that problems will not be found in the later stages of verification. While this works, it is expensive, especially if you are designing for a cost- or power-sensitive market.
The following experiments were conducted by Poseidon Design Systems, which has a tool called Triton Tuner. This tool enables performance analysis to be conducted on abstract models of the system and can be used early in the system development process. The design used for this optimization process was an MPEG4 decoder block that had already been implemented on an ARM9 core with 64kBytes each of instruction and data cache and 128MBytes of main memory. The designers of this block felt that they had made good use of the system components and had optimized their code. The performance evaluations were done over 8 frames of data. The system was modeled at the transaction level using an ISS for the processor and abstract behavioral models for the other components of the system.
The first stage in the optimization process was to locate where the time was being spent in the software. After running the simulator, it is possible to get a breakdown of which pieces of the code are taking the longest. This can be done at the function level or the loop level, and an example output is shown in Figure 2.
Figure 2. Locating hot spots in the source code.
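A function-level breakdown of this kind can be sketched as a simple aggregation over per-function cycle records. The record format, the function names and the cycle counts below are hypothetical and are not taken from the tool's actual output.

```python
from collections import defaultdict


def hot_spots(samples):
    """Sum cycles per function; return (name, cycles, share) rows, costliest first."""
    totals = defaultdict(int)
    for function, cycles in samples:
        totals[function] += cycles
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(name, cycles, cycles / grand_total) for name, cycles in ranked]


# Hypothetical (function, cycles) records, as a simulator might emit them.
samples = [
    ("idct_block", 400),
    ("motion_comp", 250),
    ("idct_block", 350),
    ("parse_header", 20),
    ("motion_comp", 180),
]

report = hot_spots(samples)
for name, cycles, share in report:
    print(f"{name:<14}{cycles:>6} cycles  {share:6.1%}")
```

The same aggregation works at loop granularity if each record is keyed by a loop identifier instead of a function name.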
When an interesting part of the code has been identified, the reasons why it takes so long can be investigated. Typical performance problems include repeated cache misses or pipeline stalls in the processor. In this specific MPEG4 example, a large amount of time was spent in a nested loop. By unrolling the identified loop, the results shown in Table 1 were obtained. Figures are given both for the simulator and for the application mapped onto a prototyping board, to show that the simulated performance results are accurately reflected in the real hardware.
| Compiler-optimized application | Without hand optimization | With hand optimization |
| Tuner | 299,710,340 cycles | 256,923,824 cycles |
| Board | 23179 msec | 19514 msec |
Table 1. Effect of loop unrolling
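The relative improvement implied by Table 1 can be checked with a few lines of arithmetic. The sketch assumes that "improvement" means the reduction relative to the unoptimized figure; the helper name is illustrative only.

```python
def improvement(before, after):
    """Fractional reduction relative to the unoptimized figure."""
    return (before - after) / before


tuner = improvement(299_710_340, 256_923_824)   # simulated cycle counts
board = improvement(23_179, 19_514)             # measured milliseconds
print(f"Tuner: {tuner:.1%}  Board: {board:.1%}")
```

Under that definition the simulated cycle counts give a reduction of roughly 14.3 percent, and the board timings roughly 15.8 percent.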
This single hand optimization resulted in a predicted 14.2% performance improvement in the code. Once implemented, the actual speed-up obtained was 15.8%. This demonstrated that this abstract level of modeling provides sufficiently accurate results to guide this type of optimization. A second experiment looked at the sizing of the cache.
Figure 3.
What would happen if the sizes of the cache were either increased or decreased? A simple experiment showed that both the instruction and data cache size could be cut in half, down to 32kBytes with only 0.6% degradation in performance. This provided a huge area, power and cost savings and only gave up a fraction of the performance gain already identified. Many other smaller optimizations and tradeoffs were eventually made in this example, but these two provided the bulk of the gains.
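The shape of such a cache-sizing experiment can be sketched with a toy direct-mapped cache model swept over several sizes. Everything here is an assumption made for illustration: the 32-byte line size, the direct-mapped organization and the synthetic address trace are hypothetical and do not model the ARM9 caches or the MPEG4 workload.

```python
def miss_rate(trace, cache_bytes, line_bytes=32):
    """Miss rate of a direct-mapped cache over a list of byte addresses."""
    num_lines = cache_bytes // line_bytes
    tags = [None] * num_lines          # block currently held by each cache line
    misses = 0
    for addr in trace:
        block = addr // line_bytes     # memory block containing the address
        index = block % num_lines      # cache line that block maps to
        if tags[index] != block:       # cold miss or conflict miss
            tags[index] = block
            misses += 1
    return misses / len(trace)


def synthetic_trace(working_set_bytes=24 * 1024, passes=8, stride=4):
    """Hypothetical workload: repeated word-sized sweeps over one buffer."""
    one_pass = list(range(0, working_set_bytes, stride))
    return one_pass * passes


trace = synthetic_trace()
results = {kb: miss_rate(trace, kb * 1024) for kb in (8, 16, 32, 64, 128)}
for kb, rate in results.items():
    print(f"{kb:4d} KB cache: miss rate {rate:.4f}")
```

In this toy case the 24 KB working set fits in any cache of 32 KB or more, so halving a 64 KB cache costs nothing, while going below 32 KB raises the miss rate sharply. A real sweep would replay the application's actual memory traffic.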
Running with RTL models would have been too slow to do this kind of analysis. In this example, the simulation was running at about 300kcycles/sec such that each run completed in about 1 second. This makes it possible to experiment with a number of solutions, or if required, to run much longer test cases which would make the tuning results even more accurate.
Another way to increase the performance of such a system is to migrate some of the very high activity parts of the software into dedicated hardware. The addition of a custom pipeline can enable very significant performance increases, and knowing the memory traffic patterns needed to get the required data into and out of these accelerators can enable the creation of very efficient memory transfer schemes. This can provide an additional performance boost because it can relieve congestion on the buses or memory subsystems. The key to these synthesis solutions is first to capture enough data about the system performance to know how and where to partition between the hardware and software components and how to connect the accelerator into the existing architecture, and then to automate the entire conversion process.
Other companies such as PowerEscape perform similar analysis on abstract models of the system to provide power estimation within the system. This tool assumes that a processor consumes approximately the same amount of power independent of the actual instruction it is executing. It is thus possible for this analysis to run even faster, as the code can be executed directly on the host machine rather than being interpreted in an instruction set simulator. The software is annotated, in a pre-compile step, with the additional code needed to keep track of information such as cache and memory operations that will affect the power being consumed. This information is subsequently analyzed and displayed so that optimizations in the architecture can be made. While this host execution technique enables the tool to run a lot faster, it is also less accurate for a number of reasons. Bus timing is no longer as accurate, as it is not possible to see exactly when bus or memory contention would happen. In addition, either the code cannot be run with compiler optimizations enabled, or the inserted code may change which optimizations are possible. The fidelity and accuracy of results will thus suffer. Clearly, there is a tradeoff to be made between the accuracy of results and the speed of simulation.
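The host-execution idea can be sketched as follows: the application runs natively, and inserted calls report instruction counts and memory accesses to a probe that applies a simple energy model. This is a minimal sketch under stated assumptions, not a description of how any commercial tool works. The `Probe` class, the per-event energy costs, the cache parameters and the three-instructions-per-iteration estimate are all hypothetical.

```python
from collections import Counter

# Hypothetical per-event energy costs in nanojoules (placeholders, not measured data).
ENERGY_NJ = {"instr": 0.5, "cache_hit": 1.0, "cache_miss": 20.0}


class Probe:
    """Counts events reported by annotated code running natively on the host."""

    def __init__(self, cache_bytes=1024, line_bytes=32):
        self.line_bytes = line_bytes
        self.tags = [None] * (cache_bytes // line_bytes)  # direct-mapped cache model
        self.events = Counter()

    def instr(self, count=1):
        # Every instruction is charged the same cost, whatever it does.
        self.events["instr"] += count

    def mem(self, addr):
        block = addr // self.line_bytes
        index = block % len(self.tags)
        if self.tags[index] == block:
            self.events["cache_hit"] += 1
        else:
            self.tags[index] = block
            self.events["cache_miss"] += 1

    def energy_nj(self):
        return sum(ENERGY_NJ[name] * count for name, count in self.events.items())


def sum_buffer(data, base_addr, probe):
    """Application code with the kind of calls an annotation pass might insert."""
    total = 0
    for i, value in enumerate(data):
        probe.mem(base_addr + 4 * i)   # inserted: one 32-bit load per element
        probe.instr(3)                 # inserted: assumed load + add + branch
        total += value
    return total


probe = Probe()
result = sum_buffer(list(range(64)), base_addr=0x1000, probe=probe)
print(result, dict(probe.events), probe.energy_nj())
```

Because the probe sees only event counts and not cycle-accurate timing, it cannot tell when two accesses would have contended for the bus, which is the loss of accuracy described above.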
This kind of analysis cannot be done with a traditional simulator, even if the models are available. Capturing the necessary data requires careful instrumentation of the models such that cause and effect can be seen and properly annotated back onto the software. Most hardware-centric simulators cannot even provide a view into the software world, let alone map the results of simulation into these views. Another problem is that once a model has been refined down to the implementation level, it is difficult to recover a transaction-level view of what is happening in the system. One company attempting to do this is Spiratech, Ltd. Its Cohesive product recovers the lost transaction data, which can then be displayed in debug solutions from companies such as Novas Software.
Performance analysis is a very valuable tool that can help you to get the most out of the hardware and software resources available. When conducted early in the development flow, it can find costly architectural mistakes so that they are corrected before significant effort has been expended on creating and implementing a solution. The earlier this analysis is performed, the easier it is to make these changes. It can also show you where you are overdesigning the system, so that significant cost savings can be made. If analysis is left until the RTL models are complete, simulation will be slow, and it will also be difficult to extract the type of information necessary to quickly discover the source of the problems. Why look for a needle in a haystack when tools already exist that make this a simple problem to solve?
By Brian Baily, Poseidon Design Systems, Inc.