November 2, 2009 -- With the trend toward increasing IC design size and complexity showing no sign of slowing, the computing load required to complete projects on time is exploding. But the growth in computing power now comes from multiple cores rather than higher clock speeds, which means IC implementation software must run efficiently on the latest multi-core hardware.
Although some implementation tasks (e.g., placement and routing) have been parallelized to a certain extent, parallelizing the very core of the physical design system — the timing analysis and optimization engines — is the smart way to improve runtimes and maintain tight design schedules. Scalable, parallel timing analysis and optimization capabilities can dramatically cut overall design time. “Parallel timing” means that multiple timing calculations — including extraction, delay and slew calculation, power, and signal-integrity analysis — run concurrently via multiple threads across all available cores on all CPUs. This capability is not yet widely available in physical-implementation and timing-analysis tools, but it can make the critical difference in time-to-closure.
Causes of runaway implementation cycle time
There are a number of reasons that a given design at 45nm will take longer to complete than a similar design at 90nm. One is that the advanced-node design is bigger, packing in more functionality and complexity. It might have twice as many transistors, more IP, more power domains, operational modes, process corners, and manufacturing-variability models. The design also has tighter, often conflicting, constraints for timing and power, which require concurrent analysis and optimization to implement correctly. Design tools built for previous-generation designs simply lack the ability to manage the huge size and complexity of today’s designs.
For example, just the addition of more modes and corners significantly affects time to design closure. Figure 1 shows a profile of the runtimes on a recent 15-million-gate design with multiple modes, corners, and power domains. In the first run, the design was implemented for 24 mode/corner scenarios, with a total runtime of approximately 90 hours. When one additional power domain was added, represented by the second bar, runtime increased by about 50%. With twice the number of mode/corner scenarios (48), the runtime doubled.
Figure 1. Profile of the runtimes on a recent 45-nm, 15-million gate design with multiple modes, corners and power domains. The Y-axis is in hours.
Designs will not be getting smaller, or less complex, in the coming years. Without a new way to restrain the growth of runtimes, and the corresponding increase in overall time-to-closure, the situation will quickly become unmanageable.
A reasonable question to ask is why design cycle time is exploding now, when design sizes have consistently doubled every two years. The answer is that in the past, compute hardware kept up with demand because processors themselves grew steadily more powerful. However, power-density limits have brought the era of ever-increasing clock speeds to a halt, and the trend in CPUs is now toward multiple cores. Intel, for example, is already publicizing its quad-core processor for desktops and a six-core processor for larger servers.
But multi-core CPUs have not controlled time-to-closure the way previous generations of faster processors did, because most EDA software cannot effectively use all the available cores. Further, and importantly, the parts of the physical design flow that are easily parallelizable do not yield the greatest reduction in overall design time. The part of the design flow that typically takes longest — timing analysis and optimization — is arguably the most important to parallelize.
The promise of parallel timing analysis and optimization
The timing engine is the backbone of any place-and-route system. Timing is one of the fundamental "cost functions" for most place-and-route decisions, and virtually every change in a layout can impact timing. Timing-analysis calculations also tend to take more time in the overall design cycle than any other step of the design-implementation flow. Because of the constant use and computational intensity of timing functions, parallelizing the timing analysis and optimization can have the largest impact on the overall implementation time.
Figure 2 illustrates the different stages of the design flow and the percentage of computing time spent on timing analysis and optimization for each stage. Apart from the floorplanning stage, timing analysis and optimization consumes a major portion of the runtime for every step of the flow. Overall, timing analysis and optimization is responsible for up to 70% of the runtime in place-and-route.
Figure 2. Runtime for each place-and-route stage due to timing analysis and optimization (green) and to tasks performed by other engines, such as placement, CTS, or routing (red).
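A back-of-the-envelope Amdahl's-law calculation illustrates why that 70% figure matters. The sketch below (illustrative numbers only, not from the article's measurements) compares the overall speedup gained by parallelizing a stage that is 70% of runtime against one that is only 20%:

```python
# Amdahl's law: overall speedup when a fraction p of the work is
# parallelized across n cores while the remaining (1 - p) stays serial.
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# If timing analysis/optimization is 70% of place-and-route runtime,
# parallelizing only that portion on 8 cores gives roughly:
print(round(amdahl_speedup(0.70, 8), 2))   # 2.58x overall
# Parallelizing a stage that is only 20% of the runtime gains far less:
print(round(amdahl_speedup(0.20, 8), 2))   # 1.21x overall
```

Even perfect scaling of a minor stage barely moves overall runtime, which is why the timing engines are the highest-value target.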
Many place-and-route tools support limited multi-threading for the tasks that can be easily decomposed with a minimal amount of data dependency. For example, routing on one block of a design does not affect the routing in another block that is physically located in another area of the die. Parallelizing tasks that inherently have complex data and signal interdependencies, such as timing analysis and optimization, is far more difficult. Timing analysis involves many computational tasks, including:
- RC extraction
- Delay and slew calculation
- Arrival time propagation
- Signal integrity ‘bump’ calculation
- Required-time calculation
- Down-delay propagation (for clock trees)
Finding ways to efficiently parallelize the timing-related tasks, without creating a very complex synchronization scheme for timing analysis, is key to significantly reducing time-to-closure.
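To make the dependency problem concrete, the following sketch (a simplified toy model; real timing graphs also involve clock networks, loops, and constant propagation) groups the pins of a timing DAG into topological levels, so that pins within one level have no arrival-time dependency on one another:

```python
from collections import defaultdict, deque

def levelize(pins, fanin):
    """Group pins of a timing DAG into topological levels.

    `fanin` maps each pin to the pins that drive it. Pins within the
    same level do not depend on each other's arrival times, so their
    delay/slew/arrival calculations could run concurrently.
    """
    fanout = defaultdict(list)
    indeg = {p: len(fanin.get(p, [])) for p in pins}
    for p in pins:
        for driver in fanin.get(p, []):
            fanout[driver].append(p)

    frontier = deque(p for p in pins if indeg[p] == 0)  # primary inputs
    levels = []
    while frontier:
        level = list(frontier)
        levels.append(level)
        frontier = deque()
        for p in level:
            for sink in fanout[p]:   # release sinks whose drivers are done
                indeg[sink] -= 1
                if indeg[sink] == 0:
                    frontier.append(sink)
    return levels

# Tiny netlist: pins a and b drive c; c drives d.
fanin = {"c": ["a", "b"], "d": ["c"]}
print(levelize(["a", "b", "c", "d"], fanin))
# [['a', 'b'], ['c'], ['d']]  -- a and b can be processed in parallel
```

This is the ordering constraint that any parallel timer must honor: level k+1 cannot start until the drivers in level k are finished.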
Barriers to parallel timing
Parallelizing timing functions is difficult for two reasons. First, legacy code is designed for serial execution, and second, timing calculations are highly interdependent.
Traditional software techniques for multi-threading and parallelization can give poor results and lead to instability due to race conditions and data corruption. For example, a rudimentary multi-threading system applied to MCMM timing analysis would probably run each mode/corner scenario on a separate CPU, then combine the results at the end. This strategy limits the runtime gains and risks non-convergent results. In addition, the synchronization overhead limits the gains achievable by adding more CPUs or cores.
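That rudimentary scenario-per-CPU strategy can be sketched as follows. This is a hypothetical toy model, not any tool's actual implementation; `analyze_scenario` and its dummy slack values are invented for illustration (a real flow would run each scenario in a separate process or on a separate machine):

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_scenario(scenario):
    """Stand-in for one complete mode/corner timing run; here it just
    returns the scenario's (dummy) worst slack in nanoseconds."""
    mode, corner, worst_slack = scenario
    return worst_slack

def mcmm_naive(scenarios, cpus=8):
    # One whole scenario per worker: coarse-grained and easy, but the
    # speedup saturates at min(len(scenarios), cpus), is limited by the
    # slowest single scenario, and the merge below is a serial
    # synchronization point.
    with ThreadPoolExecutor(max_workers=cpus) as pool:
        slacks = list(pool.map(analyze_scenario, scenarios))
    return min(slacks)  # combine results only at the very end

scenarios = [("func", "ss_125c", -0.12),
             ("func", "ff_m40c", 0.30),
             ("test", "ss_125c", -0.05)]
print(mcmm_naive(scenarios))  # -0.12 (worst slack across scenarios)
```

With three scenarios, no more than three cores are ever busy, no matter how many are available; this is the scaling ceiling the article describes.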
The timing engines in traditional place-and-route platforms were not built for multiple cores, and it is next to impossible to re-architect these aging timing engines to take advantage of the multi-core trend. So, in general, traditional parallelization approaches applied to traditional place-and-route tool architectures simply don’t scale well on hardware platforms with many CPUs.
Even when starting from a new software architecture, running timing calculations in parallel is difficult because of the heavy dependencies among the nodes (pins) being analyzed. Breaking the sequential order of evaluation carelessly can lead to non-deterministic behavior, meaning that at any given step in a computation there could be more than one possible result — very bad news for a signoff-critical engine. If implemented correctly, however, even timing-related tasks can be parallelized with no hit to accuracy, no data corruption, and a nearly linear improvement in processing time as more CPUs and cores are added.
Solutions for effective parallelization of timing
The latest generation of place-and-route software uses new strategies to allow timing analysis and optimization to be successfully scaled to any number of cores.
One such strategy, node-level data-flow analysis, is used to determine which points in the circuit are independent with respect to parasitic extraction, delay calculation, MCMM-SI (multi-corner multi-mode signal integrity) analysis, and timing and power analysis. A topological ordering of pins based on their location in the data or clock networks offers a natural place for parallelization, provided the place-and-route software architecture and algorithms are designed at the core to handle parallelization.
Once non-dependent pins are identified, all the pin-level tasks for a given design can be decomposed into heterogeneous sets or "chunks" of tasks, as illustrated in Figure 3. This allows the analysis of independent pins at a given level to be computed concurrently across any number of cores.
Figure 3: Node-level data-flow analysis facilitates topological identification of tasks that can be parallelized.
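A minimal sketch of the chunking step (the function name and chunk size are illustrative assumptions, not the tool's actual scheme): once the independent pins of a level are known, they are simply split into fixed-size groups, each of which becomes one schedulable unit of work.

```python
def chunk(tasks, size):
    """Split one level's independent pin tasks into fixed-size chunks;
    each chunk is a self-contained unit of work for any free core."""
    return [tasks[i:i + size] for i in range(0, len(tasks), size)]

# Seven independent pin tasks split into chunks of three:
print(chunk(["p1", "p2", "p3", "p4", "p5", "p6", "p7"], 3))
# [['p1', 'p2', 'p3'], ['p4', 'p5', 'p6'], ['p7']]
```

Chunking amortizes per-task scheduling overhead while still leaving enough units of work to keep every core busy.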
Node-level analysis also allows for lock-less synchronization of data. Because the parallel runs are based on truly independent task lists, there is no need to synchronize results at the end of runs, eliminating a risky and expensive compute step. Calculations for independent pins are scheduled as separate computing tasks that can run in parallel on any processor core without blocking. The large number of heterogeneous, independent tasks can be dispatched dynamically to any available resource, which helps keep all the processing resources busy. Further, because the task decomposition is done at the node level, the parallelization is equally effective for a single timing mode/corner scenario as for MCMM scenarios.
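The lock-less execution of one level might look like the following toy sketch, assuming every pin task reads and writes only its own data; `delay_calc` is an invented stand-in, not a real delay model:

```python
from concurrent.futures import ThreadPoolExecutor

def delay_calc(pin):
    """Stand-in for the per-pin extraction/delay/slew work; returns a
    dummy (pin, result) pair instead of a real delay value."""
    return pin, len(pin)

def run_level(pins, workers=8):
    # Every pin in the level is independent, so each task touches only
    # its own data: no locks are taken and no merge pass is needed --
    # results are collected directly as each task completes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(delay_calc, pins))

print(run_level(["a", "u2/z"]))  # {'a': 1, 'u2/z': 4}
```

Because the workers never contend for shared state, adding cores adds throughput rather than lock contention, which is the property behind the near-linear scaling claimed above.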
This scheme for effective parallelization of timing analysis and optimization works in practice. Figure 4 shows the post-route timing analysis runtimes using Mentor’s Olympus-SoC place-and-route tool on 1, 2, 4, and 8 CPUs. Overall, on 8 CPUs, parallel timing capabilities allow for up to 7X speedup in timing analysis.
Figure 4. Performance scaling data for four representative designs using Mentor’s Olympus-SoC parallel timer capabilities. The X-axis shows four different designs on 1, 2, 4, and 8 CPUs. The Y-axis shows hours of runtime for a full timing analysis.
Maintaining expected time-to-closure requires a new approach to performance scaling as SoC designs explode in size and complexity. Designers can no longer count on increasing CPU speeds to provide the needed computing power. Processing power is now delivered in the form of multi-core CPUs, but there is a gap between the compute cycles available in new multi-core processors and EDA software’s ability to put them to use. The ability to fully use multi-core platforms will improve, as demonstrated by innovative new solutions that parallelize the most compute-intensive steps of the design flow — timing analysis and optimization. This technology has already been proven to cut overall time-to-closure by about 4X for leading SoC design teams.
By Sudhakar Jilla.
Sudhakar Jilla is the marketing director at Mentor Graphics. Over the past 15 years, he has held various application engineering, marketing, and management roles in the EDA industry. He holds a Bachelor's degree in Electronics and Communications from the University of Mysore, a Master's degree in Electrical Engineering from the University of Hawaii, and an MBA from the Leavey School of Business, Santa Clara University.
Go to the Mentor Graphics Corp. website to learn more.