March 30, 2012 -- The increasing prevalence of multi-core design and concurrent software execution makes it ever more critical to be able to validate hardware and software processes in concert under system-level scenarios. This level of verification cannot be conducted at the RTL design stage for several reasons. Simulation is too slow to execute any meaningful software and simulate realistic scenarios; RTL redesign is too costly; and finding system-level bugs at the RTL abstraction is too complicated.
Virtual prototyping gives software engineers the ability to influence the hardware design before the RTL is implemented, reducing the overall RTL and final integration-verification efforts. Beyond the RTL, when a hardware device fails to meet performance under a given scenario or consumes too much power in a certain mode, it is very difficult to isolate the root cause of the problem in the actual device. Behavioral bugs that escape RTL simulation are usually not trivial and stem from corner case usage under complex scenarios that embody the interaction between software and hardware processes. These scenarios must be addressed early in the design cycle to avoid the enormous cost and revenue hit of stumbling upon them in the field.
The best way to meet all of these challenges is hardware-aware virtual prototyping. Hardware-prototyping techniques for validating designs in the context of software usually involve running the software on real chips and boards, either at the back end of the hardware design process or on physical prototypes. Although emulation and programmable boards allow limited modifications to the hardware, the ability to influence performance or power at this stage is extremely limited.
Major benefits of virtual prototyping
Virtual prototyping techniques provide significant benefits over hardware prototyping by using abstracted functional models of the hardware. Virtual prototyping lets software engineers use their software debuggers of choice in conjunction with visibility at the hardware-transaction and register levels for debugging complex HW/SW interactions. It also provides a comprehensive set of performance- and power-analysis capabilities that let software engineers optimally control the hardware while optimizing the software.
Virtual prototyping can be conducted at a much earlier design stage than hardware prototyping, even before any RTL is designed, increasing productivity and shrinking design schedules. Early validation of software on a virtual prototype, before RTL specs are finalized and implemented, lets software teams readily change the hardware design topologies and influence the RTL design specification. At this design stage, it is still easy to add or remove compute resources, add a hardware accelerator block for a performance-critical function, and optimize the design for low power.
Because virtual platforms are highly abstract, the code representing the hardware is much smaller and simpler. As a result, virtual prototypes simulate orders of magnitude faster than RTL code, capturing bugs manifested in complex scenarios that are hard to simulate at the RTL and making debug easier than at the RTL. Furthermore, virtual platform models can be used as golden reference models, reducing the time to construct RTL self-checking environments.
Virtual prototypes allow the early hand-off of a chip as a pre-silicon reference platform, allowing concurrent system and application software design before the chip design is complete. Not only does this improve time-to-market, but it also allows all members in the design chain to create a better, more-optimized product.
Even after the device and chip are fabricated, virtual prototypes provide a post-silicon reference platform for simulating scenarios that are difficult to replicate and control on the final device. They also give visibility into internal performance, power, and design signals that is simply not available on the physical chip. Virtual platforms can be used for isolating field-reported problems and for exploring and fixing a problem through software patches or design revisions.
Key features of hardware-aware virtual prototyping
To deliver these benefits, a virtual-prototyping technology must possess specific attributes that enable it to address the design challenges that teams face today and will face in the future. Hardware-aware virtual prototypes, as proposed here, provide these attributes.
Industry-compliant SystemC TLM2.0 transaction-level models (TLMs) can simulate on any industry-compliant SystemC simulator without requiring proprietary extensions. In addition, TLM2.0 contains specific enhancements that enable very efficient communication for optimal simulation speed. A scalable transaction-level modeling methodology separates communication, functionality, and the architectural aspects of timing and power into distinct models. Such a model can run in Loosely Timed (LT) mode at a very high speed, or it can switch to Approximately Timed (AT) mode for more detailed granularity of the hardware model and for evaluating performance and power under software control.
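As a rough illustration of keeping functionality and timing in separate models, the Python sketch below reads the same functional memory in a loosely timed style (one blocking call with a single lumped delay) and in an approximately timed style (the transaction split into timestamped phases). This is not SystemC and not the Vista API: the class names, function names, and delay values are hypothetical, and only the four phase names follow the TLM2.0 base protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimingPolicy:
    """Hypothetical per-phase delays in ns (illustrative, not from the article)."""
    request_ns: int
    access_ns: int
    response_ns: int


class Memory:
    """Functional behaviour only: stores and returns data, knows nothing about time."""

    def __init__(self):
        self._cells = {}

    def write(self, addr, value):
        self._cells[addr] = value

    def read(self, addr):
        return self._cells.get(addr, 0)


def read_loosely_timed(mem, addr, timing):
    """LT style: one blocking call annotated with a single lumped delay."""
    total_ns = timing.request_ns + timing.access_ns + timing.response_ns
    return mem.read(addr), total_ns


def read_approximately_timed(mem, addr, timing, start_ns=0):
    """AT style: same function, but the transaction is split into timed phases."""
    t = start_ns
    phases = [("BEGIN_REQ", t)]
    t += timing.request_ns
    phases.append(("END_REQ", t))
    t += timing.access_ns
    value = mem.read(addr)
    phases.append(("BEGIN_RESP", t))
    t += timing.response_ns
    phases.append(("END_RESP", t))
    return value, phases
```

Both calls return the same data; only the timing detail differs, which is why the same model can trade speed for granularity by switching modes.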
At the heart of the virtual prototype are processor models that run the embedded software. The modeling and communication efficiency of these models can determine the overall simulation performance. Just-In-Time (JIT) modeling allows the embedded code running on the target processor TLM to be compiled into native instructions of the host processor. This is done while preserving thread safety and correctly supporting multiple instances of the same processor or different processors on the host.
Virtual platform creation transforms the individual industry-standard-based processors, peripherals, buses, and memory TLMs into a virtual platform suitable for executing the software. It is important to provide the right level of visibility into the hardware state along with the execution of the software. An integrated debug environment on a virtual platform lets software engineers use their software debuggers of choice to debug and optimize the software along with standard hardware visualization techniques.
Software must be optimized to meet the device's performance and low-power goals. This can only be accomplished by using performance- and power-analysis graphs, such as data throughput, latency, and dynamic and static power, for each software routine executing on the platform. The software designer can see the direct impact of software changes on the virtual platform design.
Finally, the virtual-prototyping solution must provide the capabilities to conduct what-if analysis for finding the best hardware configuration and software partitioning for an optimal, differentiated design.
The following example illustrates some of these key capabilities as executed using the Mentor Graphics Vista virtual prototyping solution.
ARM Cortex-A9 hardware-aware virtual platform
The example design consists of the system depicted partially in the block diagram shown in Figure 1. In this system, three external data sources exist: an Ethernet controller, a USB controller, and an external bus interface. The data sources save the data from their packets and exchange messages through memory with the software that runs on the ARM Cortex-A9 multi-core processor. The software processes the data and manages an array of peripherals connected to the AHB or APB buses (not shown in Figure 1). The software has stringent performance and power requirements.
Figure 1. ARM Cortex-A9 virtual platform block diagram.
The virtual platform has many parameters through which the performance of the system can be influenced and optimized against the actual software:
- The number of cores in the ARM Cortex-A9.
- The data and instruction cache sizes for the L1 caches.
- The size of the L2 cache.
- The line size and the number of lines in a set for each of the caches.
- The write and replacement policies for each of the caches.
- The size and speed of each memory.
- The priorities of the different masters for each bus.
- The maximal number of outstanding transactions for each master and slave of the AXI bus.
- Whether to add an SRAM that is connected to the AXI bus.
The hardware implementation of these attributes is automatic and does not involve user-written code. Accordingly, the attributes can be explored in simulation using parameter files without changing the code.
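The idea of exploring the platform through a parameter file can be sketched as follows. The article does not describe the actual parameter-file format, so the JSON layout, the field names, and the `load_params` helper here are all assumptions made for illustration. The defaults mirror the basic configuration used in this example (four cores, 32kByte L1 caches, a 256kByte L2 cache, one outstanding transaction, SRAM disabled).

```python
import json
from dataclasses import dataclass, fields


@dataclass
class PlatformParams:
    """Hypothetical subset of the platform parameters; names are illustrative."""
    num_cores: int = 4
    l1_icache_kb: int = 32
    l1_dcache_kb: int = 32
    l2_cache_kb: int = 256
    l2_max_outstanding: int = 1
    sram_enabled: bool = False


def load_params(text):
    """Build a configuration from JSON text, rejecting unknown parameter names."""
    raw = json.loads(text)
    known = {f.name for f in fields(PlatformParams)}
    unknown = set(raw) - known
    if unknown:
        raise ValueError(f"unknown parameter(s): {sorted(unknown)}")
    # Anything not listed in the file keeps its default value.
    return PlatformParams(**raw)


# A trial differs from the basic configuration only in what the file overrides.
trial = load_params('{"l2_cache_kb": 1024, "l2_max_outstanding": 4}')
```

Because each trial is just a different file, a sweep over configurations needs no change to the model code.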
Multiple scenarios were created with different rates of packets having different priorities arriving on the external data sources. The results were averaged over the scenarios, with each scenario weighted by its frequency. The worst case was also picked from all of the test cases.
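The frequency-weighted average and the worst case can be computed as in the minimal sketch below. The function name and the scenario weights and latencies in the example are hypothetical; they are not measurements from this study.

```python
def summarize(results):
    """Return (frequency-weighted average latency, worst-case latency).

    `results` is a list of (frequency_weight, average_latency_ns) pairs,
    one per scenario.
    """
    total_weight = sum(weight for weight, _ in results)
    if total_weight <= 0:
        raise ValueError("scenario weights must sum to a positive value")
    weighted = sum(weight * latency for weight, latency in results) / total_weight
    worst = max(latency for _, latency in results)
    return weighted, worst


# Hypothetical scenarios: (relative frequency, average latency in ns).
example = [(5, 60), (3, 80), (2, 100)]
weighted_ns, worst_ns = summarize(example)  # 74.0 and 100
```

Reporting both numbers matters because a configuration with a good weighted average can still miss a requirement in a rare but demanding scenario.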
The best way to measure the efficiency of the configuration is to view the latencies of all the ports of the A9 cores. The latencies have a close correlation to the software run-time; the lower the latencies, the faster the software runs. The latencies are affected by all of the parameters that were enumerated above. In order to look deeper into the cause for differences in latencies, the various cache hit ratios are viewed. The number of cores was set to four.
The ARM Cortex-A9 supports L1 cache sizes of 16kBytes, 32kBytes, and 64kBytes. The cache size does not have to be equal for each core. The L2 cache size ranges from 16kBytes to 8MBytes. The number of outstanding transactions going out of the L2 cache ranges from one to four. Hence, the number of different configurations that should be checked is overwhelming, even if only these parameters are evaluated. However, it helps that the virtual platform, running in AT mode, can simulate a critical time span of 8ms for each scenario in about 2 minutes.
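To see why an exhaustive search is impractical, the size of the configuration space can be estimated as below. Two assumptions go beyond the text: each core has separately sizable instruction and data L1 caches, and the L2 cache sizes are restricted to powers of two between 16kBytes and 8MBytes. The 2 minutes per run is the figure quoted above.

```python
L1_SIZES_KB = (16, 32, 64)
NUM_CORES = 4
L1_CACHES_PER_CORE = 2  # assumption: instruction and data caches sized independently
L2_SIZES_KB = tuple(16 << i for i in range(10))  # assumption: powers of two, 16 kB to 8 MB
OUTSTANDING = (1, 2, 3, 4)
MINUTES_PER_RUN = 2  # AT-mode simulation time per scenario, from the text


def count_configurations():
    """Number of distinct settings of L1 sizes, L2 size, and outstanding transactions."""
    l1_choices = len(L1_SIZES_KB) ** (NUM_CORES * L1_CACHES_PER_CORE)
    return l1_choices * len(L2_SIZES_KB) * len(OUTSTANDING)


def exhaustive_days(scenarios=1):
    """Days of simulation needed to run every configuration on `scenarios` scenarios."""
    minutes = count_configurations() * MINUTES_PER_RUN * scenarios
    return minutes / (60 * 24)


print(count_configurations())  # 262440
print(exhaustive_days())       # 364.5
```

Under these assumptions, a single scenario alone would take about a year of simulation, which motivates the narrowing-down approach described next.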
It was decided, therefore, to start with a basic configuration having equal caches for all the ports of the cores, evaluate roughly the effect of each of the parameters, and select the matrix of the values to be checked exhaustively. In this basic configuration, each L1 cache size was 32kBytes and the L2 cache size was 256kBytes. The number of outstanding transactions generated from the L2 cache was limited to one. The graph of the transaction latencies for this configuration is given in Figure 2.
Figure 2. Transaction latencies.
Figure 3 depicts the different cache-hit ratios. What catches the eye is the low hit ratio of the L2 cache (16%), shown as the black line in this figure.
Figure 3. Cache-hit ratios.
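For readers who want to see how a hit ratio depends on cache size, line size, and the number of lines in a set, here is a minimal set-associative cache model. It assumes an LRU replacement policy and ignores writes and timing. The class is a toy written for this illustration, not the cache model used in the virtual platform.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Toy set-associative cache with LRU replacement that only counts hits."""

    def __init__(self, size_bytes, line_bytes, ways):
        if size_bytes % (line_bytes * ways) != 0:
            raise ValueError("size must be a multiple of line_bytes * ways")
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = size_bytes // (line_bytes * ways)
        # One ordered dict per set: keys are tags, ordered oldest to newest.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.accesses = 0

    def access(self, addr):
        """Look up one byte address; return True on a hit, False on a miss."""
        line = addr // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        entries = self.sets[index]
        self.accesses += 1
        if tag in entries:
            entries.move_to_end(tag)  # mark as most recently used
            self.hits += 1
            return True
        if len(entries) >= self.ways:
            entries.popitem(last=False)  # evict the least recently used line
        entries[tag] = True
        return False

    @property
    def hit_ratio(self):
        return self.hits / self.accesses if self.accesses else 0.0


# Sequential word reads over two 32-byte lines: 2 misses, 14 hits.
cache = SetAssociativeCache(size_bytes=1024, line_bytes=32, ways=2)
for address in range(0, 64, 4):
    cache.access(address)
print(cache.hit_ratio)  # 0.875
```

Replaying an address trace from the real software through such a model with different sizes gives a quick first impression of how sensitive the hit ratio is to each parameter.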
The next step was to run three different trials. In the first one, the L2 cache was enlarged to 1MByte. The corresponding hit ratio rose to 25%; however, the average latency was reduced from 73ns to only 62ns.
In the second trial, the size of each L1 cache was increased to 64kBytes. In this scenario, the average latency was only reduced to 70ns. The third trial set the number of outstanding transactions generated from the L2 cache to four. The average latency was reduced to 50ns, even though the hit ratios of the caches dropped somewhat.
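The relative effect of the three trials is easier to compare as a percentage of the 73ns baseline. The short calculation below uses only the latencies reported above; the helper name is arbitrary.

```python
BASELINE_NS = 73

# Average latency after each single-parameter change, as reported in the text.
TRIALS_NS = {
    "L2 cache 256 kB -> 1 MB": 62,
    "L1 caches 32 kB -> 64 kB": 70,
    "L2 outstanding transactions 1 -> 4": 50,
}


def reduction_percent(baseline_ns, new_ns):
    """Latency reduction relative to the baseline, in percent."""
    return 100.0 * (baseline_ns - new_ns) / baseline_ns


for name, latency_ns in TRIALS_NS.items():
    print(f"{name}: {reduction_percent(BASELINE_NS, latency_ns):.1f}%")
# L2 cache 256 kB -> 1 MB: 15.1%
# L1 caches 32 kB -> 64 kB: 4.1%
# L2 outstanding transactions 1 -> 4: 31.5%
```

Allowing four outstanding transactions roughly doubles the gain obtained by quadrupling the L2 cache, while doubling the L1 caches barely helps.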
It was decided, therefore, to simulate the system with the following configurations: L1 cache size (16kBytes or 32kBytes), L2 cache size (256kBytes or 512kBytes), and the maximal number of the outstanding transactions (1 or 4). Table 1 depicts the effects of the parameters on the average transaction latency, the average L1 cache-hit ratio, and the L2 cache-hit ratio.
Table 1. Effect of parameters on average transaction latency, average L1 cache-hit ratio, and L2 cache-hit ratio.
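The reduced matrix of configurations is small enough to enumerate directly. The sketch below lists the eight combinations and the simulation time they imply at about 2 minutes per run; the dictionary keys are illustrative names rather than actual platform parameter names.

```python
from itertools import product

L1_KB = (16, 32)
L2_KB = (256, 512)
MAX_OUTSTANDING = (1, 4)
MINUTES_PER_RUN = 2  # AT-mode simulation time per scenario, from the text


def sweep():
    """Return every combination of the selected parameter values."""
    return [
        {"l1_kb": l1, "l2_kb": l2, "max_outstanding": outstanding}
        for l1, l2, outstanding in product(L1_KB, L2_KB, MAX_OUTSTANDING)
    ]


configs = sweep()
print(len(configs))                    # 8
print(len(configs) * MINUTES_PER_RUN)  # 16 minutes per scenario
```

Each entry would be run against every scenario, and the resulting latency and hit ratios recorded for comparison.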
What stands out in these results is the effect of the number of outstanding transactions on the latency of the memory accesses. It is greater than the effect of the cache-hit ratios. This significant effect could not have been measured without the detailed granularity provided when simulating the bus accesses, and it could not have been observed by running the simulation in LT mode.
The data shown in Table 1 lets a user evaluate the trade-offs between area and speed and, therefore, find the sweet spot that still matches the system requirements. The result is quite accurate because the actual software runs on the system. Since software development can be continued before the hardware is finalized, the same evaluation can be repeated until the best hardware configuration that matches the most advanced software version is determined. The same process can be done, for example, to evaluate the effect of adding an SRAM to the system (this appears in the Figure 1 block diagram but was disabled in the simulation). Power estimation can be added easily to the models so that power consumption can be considered as a part of the trade-offs.
In summary
With the amount of functionality implemented in software running on multi-core processors continuing to grow, how well software and hardware interact defines device performance, power consumption, and cost attributes. Hardware-aware virtual prototyping is the best way to optimize these important attributes and enable concurrent hardware/software development throughout the design flow.

By Shabtay Matalon and Yossi Veller
Shabtay Matalon is ESL Market Development Manager for Mentor Graphics Corp.'s Design Creation and Synthesis Division. He has been active in system-level design and verification tools and methodologies for over 20 years and has authored several publications in these areas.
Yossi Veller is the Chief Scientist in the Mentor Graphics ESL Division. During his long software career, Yossi has led ADA compiler, VHDL, and C simulation development groups. He was also the CTO of Summit Design.
Go to the Mentor Graphics Corp. website to learn more.