December 12, 2008 -- Consumers judge battery-powered products by both their standby life – how long the device lasts between battery charges – and how responsive they are when actively used. Both the sub-threshold and gate-leakage power components are of concern at 65-nm geometries and below. The quest to control leakage current is leading design teams to isolate and power down functional blocks across their chips. But saving and restoring the block’s logic state takes time and costs energy.
The basic approach to reducing the leakage when a sub-system is inactive is to turn off its power rail externally to the system-on-chip (SoC), or to provide local on-die power gating in the form of series transistor switches that turn off the "virtual" power or ground rail.
Power gating reduces the leakage of both logic and registers, but typically introduces an energy cost associated with having to restart the sub-system; all the registers in the power-gated sub-system need to be reset from their unknown state at power-up once the power and clock networks have stabilized. The power-gating turn-on transients in particular need careful management to avoid power-supply "dips" that could impact or corrupt surrounding logic on the same primary power rail.
In re-booting the power-gated sub-system, there is also a real-time delay that must be taken into account. For example, a processing sub-system that services a time-critical interrupt will now have additional latency for being "woken-up" by the interrupt and restoring the processor context to service the interrupt.
Increased latency in power-gated systems can be addressed by adding deeper FIFO buffering. But this comes at a system-level cost in terms of increased area and overall system-level re-verification, making the power-gated sub-system potentially less reusable in derivative designs.
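As a rough illustration of this buffering cost, the extra FIFO depth can be estimated from the peak arrival rate and the wake-up latency. The sketch below is only a first-order model: the function name, its parameters and the example figures are hypothetical, and it assumes a constant peak rate with no draining while the sub-system wakes up.

```python
def extra_fifo_entries(peak_items_per_s: int, wake_latency_us: int) -> int:
    """Worst-case number of items that arrive while the sub-system wakes up."""
    # Integer ceiling division keeps the result exact (no floating-point rounding).
    return (peak_items_per_s * wake_latency_us + 999_999) // 1_000_000


# Hypothetical figures: 2 million items/s peak, 50 microsecond wake-up latency.
print(extra_fifo_entries(2_000_000, 50))
```

Every extra entry is area that leaks and must be re-verified, which is why shortening the wake-up latency is usually preferable to deepening the buffers.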
State retention approaches
Hardware approaches to IP sub-systems are attractive if the state save-and-restore functionality can be added near-transparently. If no changes are required to the operating system (OS) or firmware, then adding the hardware will not affect the overall project and product timescales.
Software approaches that specifically add support for saving and restoring system or task-level state may be required in more complex systems and may affect the OS kernel or device drivers.
An edge-clocked RTL design can be powered down and back up again between active clock edges, provided that the register state is precisely preserved. All the combinatorial logic between register stages simply re-evaluates state inputs and regenerates valid outputs. Power supplies must be allowed sufficient time to re-stabilize, and timing constraints must be honored.
Hardware schemes for state retention include:
- Distributed state retention provided within extra special-purpose latch cells.
- Sequenced state check-pointing to on- or off-chip memory (re-using scan chains).
- Ad-hoc schemes that "freeze" the clocks and maintain state at reduced voltage (sometimes referred to as "drowsy" state retention).
State-retention registers that include a third low-leakage latch in addition to the standard master/slave latch provide a useful abstraction for implementing state retention. The state retention latch (or "balloon") supports an independent back-up power supply and some form of sample-and-hold function with voltage isolation. The RTL is synthesized with such registers in a standard flow, and then additional control mechanisms are added to allow the sequencing of saving state (before power down of the logic and the master/slave portion of the basic register) and restoring state once primary power is stable.
Optimal RTL for synthesis is highly amenable to implementation with retention flops but there is a need to add an external retention signal control sequencer and provide some extra validation sequence testing for entry to and exit from retention. Such retention registers are very fast and efficient to save and restore state, but every register with a back-up power maintained latch structure incurs an area penalty.
Implementation flows that substitute standard flip-flops with scan-flops, which can subsequently be hooked up for efficient ATPG testing, provide efficient support for manufacturing test of synthesized designs.
With care, it is possible to reuse these manufacturing scan chains in a shift-register based approach where use of the scan-enable control enables saving of the functional state. The state must be shifted back in again before resuming normal circuit operation.
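The shift-out and shift-in sequence can be sketched with a simple list-based model of a single scan chain. This is an illustrative behavioral model only: the function names are hypothetical, index 0 is assumed to be the scan-in end of the chain, and the saved bits are assumed to go to some memory outside the power-gated region.

```python
def shift(chain, scan_in):
    """One scan clock with scan-enable asserted: returns (new_chain, scan_out)."""
    return [scan_in] + chain[:-1], chain[-1]


def save_state(chain):
    """Shift the whole chain out; bits are stored in the order they emerge."""
    saved = []
    for _ in range(len(chain)):
        chain, out = shift(chain, 0)
        saved.append(out)
    return saved


def restore_state(saved):
    """Shift the saved bits back in, in the same order they were captured."""
    chain = [0] * len(saved)   # contents after power-up are overwritten anyway
    for bit in saved:
        chain, _ = shift(chain, bit)
    return chain


functional_state = [1, 0, 1, 1, 0]
assert restore_state(save_state(functional_state)) == functional_state
```

Because the first bit to emerge is also the first bit shifted back in, it ends up at the far end of the chain again and no reordering is needed in memory.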
The area costs are minimal compared to the use of distributed retention registers, but there is a dynamic energy cost in shifting the register contents out and back, which is dependent on the state value. Care is required not to exceed simultaneous switching IR-drop requirements and there is also added system control complexity because the scan-enable control of scan chains typically overrides wait states and clock gating control.
The real-time impact is dependent on scan chain length – having more parallel, shorter scan chains reduces this impact. However, the minimal area cost may be compelling in some situations.
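The dependence on chain length can be made concrete with a back-of-envelope calculation. The sketch below assumes balanced chains, one bit shifted per chain per clock, and a memory interface that keeps up with all chains in parallel; the function names and the example figures (20,000 registers, a 10 ns scan clock) are hypothetical.

```python
def scan_save_restore_cycles(num_registers: int, num_chains: int) -> int:
    """Clock cycles to shift all state out and later shift it back in."""
    longest_chain = -(-num_registers // num_chains)  # ceiling division
    return 2 * longest_chain                          # save pass + restore pass


def scan_save_restore_ns(num_registers: int, num_chains: int,
                         clock_period_ns: int) -> int:
    """Total shift time in nanoseconds for one save plus one restore."""
    return scan_save_restore_cycles(num_registers, num_chains) * clock_period_ns


# Hypothetical block with 20,000 registers and a 10 ns scan clock.
for chains in (1, 8, 32):
    print(chains, scan_save_restore_ns(20_000, chains, 10), "ns")
```

Splitting the same registers across more chains cuts the latency proportionally, at the cost of a wider path to the checkpoint memory and more simultaneous switching per cycle.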
In a voltage-scaling environment there is a third approach that can provide reduced-leakage state retention. Providing that the clocks and resets can be parked in an inactive condition and the outputs are clamped when the sub-system is put to sleep, then the standard synthesized circuitry can have the voltage rail lowered to a point where the register latches still hold state. However, the registers used in the implementation must have low-retention supply voltage requirements. This approach can be deployed on an independently scalable power rail domain (typically this requires an independent power regulator, which has a cost implication) and will have the lowest area impact, but the real-time cost is dependent on the voltage-scaling ramp times. The leakage savings are harder to quantify compared to on-off power gating.
Full state retention is the easiest starting point in all of the above hardware approaches.
For IP-centric designs, an attractive alternative is to add a software Application Programming Interface (API) to support reading out a copy of the context state and saving to memory. The existing hardware reset mechanism is used to reinitialize the IP state after re-powering and then a software flag is used to indicate whether this is to be treated as a "warm restart," in which case the saved state is to be re-written back.
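A minimal sketch of this warm-restart flow is shown below. The class, the register names and the use of a dictionary as always-on memory are all hypothetical stand-ins; a real driver would read and write memory-mapped registers through whatever privileged interface the IP provides.

```python
class IpBlock:
    """Stand-in for a power-gated IP block with a small register set."""
    RESET_STATE = {"ctrl": 0, "baud": 0, "irq_mask": 0}  # hypothetical registers

    def __init__(self):
        self.regs = dict(self.RESET_STATE)

    def hardware_reset(self):
        """Existing reset mechanism: returns every register to its reset value."""
        self.regs = dict(self.RESET_STATE)


def save_context(ip, memory):
    """Call before power-down: copy context to memory and mark a warm restart."""
    memory["saved_regs"] = dict(ip.regs)
    memory["warm_restart"] = True


def power_up(ip, memory):
    """Call after re-powering: always reset, then restore only on a warm restart."""
    ip.hardware_reset()
    if memory.get("warm_restart"):
        ip.regs.update(memory["saved_regs"])
        memory["warm_restart"] = False


always_on_memory = {}
uart = IpBlock()
uart.regs.update(ctrl=3, baud=115200)
save_context(uart, always_on_memory)
uart.regs = {}                     # power gated: contents are lost
power_up(uart, always_on_memory)
assert uart.regs == {"ctrl": 3, "baud": 115200, "irq_mask": 0}
```

Clearing the flag after the restore means that a later cold power-up, with no preceding save, falls back to the plain reset state.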
The underlying IP design must support reading and writing of key architectural states under program control. For security reasons, this full-state access might only be made available via a specific privileged access region or protocol in order to prevent accidental or malicious reconfiguration of the sub-system.
The amount of state to be saved and restored can be tailored to the architecturally defined state, or a consistent subset of the defined programmer’s model register state. The save and restore API functions then become part of the IP validation model and deliverables. The need to read and copy state affects real-time cost – the energy costs are highly state dependent.
For a cached microprocessor there are additional complications. Saving and restoring state may pollute part of the cache (rebuilding the cache on resuming operation has an indirect energy cost) or have a more significant real-time impact if the code is run un-cached (to minimize displacement of cache contents).
The main issue, however, is the impact on project or product timescales if the operating system or device drivers need to be enhanced and verified with extra API support for retention.
System level impact
From a system design perspective, what really matters is the real-time wake-up response time and the energy cost of saving and restoring state. If the real-time wake-up latency is significant then the design team may have to increase FIFO depths to guarantee that interrupt buffers are never overrun. This is painful from a system architecture perspective because most design teams want a smooth roadmap to develop technology derivatives from pre-existing, proven, system architectures. The system will require re-validation of the buffering and real-time scenarios for slightly more advanced static or quiescent-mode leakage-control states.
The primary requirement is a set of realistic activity profiles which reflect the short and longer-term sleep state behavior. Then, given a technology-dependent set of area/energy/wake-latency cost functions, the architected power states and number of levels of sleep that actually have value at a particular technology node can be evaluated. An example activity profile, annotated by a "heat-scale," is shown in Figure 1. The energy consumed is the integral of power over time, and the hotter the silicon, the higher the active and power-gated leakage.
Figure 1. Example activity profile shows energy consumed over time.
Some form of scheduler extension is required to set the appropriate level of sleep to get most efficient use of the retention strategies and modes implemented.
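One way to picture such a scheduler extension is as a choice among sleep states based on the predicted idle time and the wake-up latency the system can tolerate. The sketch below is illustrative only: the three states, their leakage, overhead and latency figures, and the function names are all hypothetical, and temperature dependence is ignored.

```python
from typing import NamedTuple


class SleepState(NamedTuple):
    name: str
    leakage_uw: int        # leakage power while in this state, in microwatts
    overhead_pj: int       # energy to enter and exit (save + restore), in pJ
    wake_latency_us: int   # time to become responsive again, in microseconds


# Hypothetical cost figures for one technology node.
STATES = [
    SleepState("clock-gated", 1000, 0, 0),
    SleepState("hardware retention", 100, 50_000, 5),
    SleepState("power-off with software restore", 10, 2_000_000, 200),
]


def idle_energy_pj(state: SleepState, idle_us: int) -> int:
    """Total energy for one idle period: overhead plus leakage (uW x us = pJ)."""
    return state.overhead_pj + state.leakage_uw * idle_us


def pick_state(idle_us: int, max_wake_latency_us: int) -> SleepState:
    """Lowest-energy state whose wake-up latency fits the real-time budget."""
    allowed = [s for s in STATES if s.wake_latency_us <= max_wake_latency_us]
    return min(allowed, key=lambda s: idle_energy_pj(s, idle_us))


for idle in (10, 1_000, 100_000):
    print(idle, "us idle ->", pick_state(idle, max_wake_latency_us=1_000).name)
```

With these figures, short idle periods never repay the save and restore overhead, and a tight latency budget can rule out the deepest state even when it would save the most energy.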
There may be value in using hardware retention schemes for active profiles as well as software-based schemes that are invoked only for longer-term sleep patterns. For example, the cache memory of a CPU might well be maintained for the rapid-response ‘light sleep’ retention implemented in hardware, but a software routine would be more appropriate to clean the cache before it is switched off for deeper-sleep mode support.
State retention registers
Ideally, adding state retention to a design would be as easy as supporting clock gating has become, thanks to the wide-scale availability of standardized clock-gating latch elements in libraries. However, retention does require some basic control sequencing that is specific to the style of retention register and the exact form of power gating employed.
RTL designers need visibility of the underlying sub-system. To enable this, a primary requirement is that the retention control needs to have "priority" over the clock and set/reset functions of the underlying register. This is important because the high fan-out clock and initialization networks are typically leaky structures and so power gating them is a sensible strategy. In addition, the synthesizable RTL code needs to have the existing asynchronous and synchronous reset, clock and enable terms maintained exactly in terms of priority before an overriding retention entry/exit scheme is superimposed.
In summary, retention registers may have different system-control mechanisms, but the RTL design should be independent of the specific implementation of a targeted retention register.
Figure 2 shows a conceptual schematic for a retention register that has a single-port retention control. The retention latch is powered from the VDDRET/VSSRET global supplies (always on when in retention) and the conventional functional master/slave latches are powered from the power-gated VDD/VSS rails. In this case, the figure shows a single NRETAIN port. It is inactive high for normal functional operation, and asserted active low for retention mode.
Figure 2. Retention register with single-port retention control.
The pseudo-schematic is made somewhat complex by providing hardware interlocks to support clock-phase-independent save and restore (which allows master or slave latch to be forced on state restore). More typically, to minimize the transistor count, a library cell may have specific requirements on the clock level that must be set up before entering and exiting retention mode.
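The intended behavior of a single-port retention register can be sketched as a small behavioral model. This is not a description of any particular library cell: it assumes that state is saved on the falling edge of NRETAIN and restored on its rising edge, that NRETAIN overrides the clock, and it uses None to stand for the unknown value of a power-gated latch. The class and method names are hypothetical.

```python
class RetentionRegister:
    """Behavioral sketch of a single-bit retention flop (not a library cell)."""

    def __init__(self):
        self.q = None         # master/slave state; None models "unknown"
        self.balloon = None   # retention latch on the VDDRET/VSSRET supplies
        self.nretain = 1      # inactive high: normal functional operation
        self.powered = True   # state of the power-gated VDD/VSS rails

    def clock(self, d):
        """Active clock edge; ignored in retention mode or when powered down."""
        if self.powered and self.nretain == 1:
            self.q = d

    def set_nretain(self, level):
        """Drive NRETAIN; edges trigger the assumed save and restore actions."""
        if self.nretain == 1 and level == 0 and self.powered:
            self.balloon = self.q       # save on entry to retention
        elif self.nretain == 0 and level == 1 and self.powered:
            self.q = self.balloon       # restore on exit from retention
        self.nretain = level

    def set_power(self, on):
        """Switch the gated rails; the master/slave latches lose state when off."""
        self.powered = on
        if not on:
            self.q = None


reg = RetentionRegister()
reg.clock(1)              # functional write
reg.set_nretain(0)        # enter retention: balloon latch samples the state
reg.set_power(False)      # gate VDD/VSS
reg.clock(0)              # ignored: retention has priority over the clock
reg.set_power(True)       # restore primary power and let it stabilize
reg.set_nretain(1)        # exit retention: state is restored
assert reg.q == 1
```

The ordering of the calls mirrors the control sequence a retention sequencer has to enforce: save before the rails are gated, and restore only once primary power is back.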
The challenges of partial register state retention
Partial state retention sounds attractive since the area cost of retention can be scaled down in proportion to the state being retained. Every retention register contributes additional leakage power so partial state retention should result in lower standby power, and the high-fan-out buffering of retention controls should also be reduced.
But partial state retention is more complex to design and verify than full-state-retention design. Taking the case of a microprocessor core, there is typically a certain state (register banks, processor status flags and mode information, for example) that is visible to the programmer and must be preserved from the software perspective by any hardware state-retention scheme. There is also a specific micro-architecture state (such as pre-fetch buffering or branch predictors) that provides more efficient dynamic execution behavior at the expense of higher leakage when halted. If this "hardware accelerator" state is not saved then the micro-architected hardware reload costs in terms of energy and time can be critical in area and power-sensitive designs.
Any state that is power-gated but not retained will, of course, have an arbitrary value when re-powered and will typically have to be explicitly reinitialized. The verification state-space grows massively because of the interaction between retained and non-retained state. This affects the RTL view and implementation optimizations, such as clock gating, which are built on the premise that state is globally persistent.
By way of an example, all register state that is factored into the cone of logic of a further control state term needs to be analyzed to ensure that it is deadlock-free or non-state-corrupting. This is not typically viable for generic designs of any real complexity.
Verification state space explosion
Verification of systems with partially retained state is always going to be a challenge. RTL and gate-level representations assume that state is persistent, or they provide explicit support through cyclic redundancy coding schemes to regenerate consistent state, or they indicate error conditions and require system-level intervention or re-initialization. With partial-retention schemes, it is the design team’s responsibility to ensure that there are mechanisms to cleanly flush and restart functional units. This is particularly true where pipelined operation over a number of cycles is concerned. For microprocessors, such functionality may well be designed-in and simply needs to be harnessed by the state-retention control sequencer. However, without detailed knowledge of the design it is very hard to determine whether a part of the circuit can be independently reset without deadlocking or corrupting the retained state.
In summary, full state retention is attractive partly because it is easier to verify. Partial state retention requires detailed design knowledge, and extensive verification will be required to thoroughly validate that the retained state is not masked or corrupted for all legal retained state values and re-initialized control implementations.
Interaction of retention with clock gating
Clock gating aims to reduce clock power by gating a latched enable term with the clock waveform. Clock-gating terms have some form of transparent latch structure which is unlikely to have retention support. In a full-state-retention design all the state terms that are factored into a clock gate enable must re-evaluate to the same enable logic value. However, in a partial-state-retention design, the full cone of fan-in logic state values has to be guaranteed to result in the correct or safe value that enables or inhibits the first clock after retained state is restored.
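The check implied here can be illustrated by brute force on a toy example. The sketch below enumerates every possible value of the non-retained bits and tests whether a clock-gate enable term still evaluates to a single value. The helper name and the example enable function are hypothetical, and the approach is exhaustive, so it only shows why the state space grows as 2^n rather than offering a practical verification method.

```python
from itertools import product


def enable_is_deterministic(enable_fn, retained, non_retained_names):
    """True if enable_fn yields one value for every non-retained bit pattern."""
    seen = set()
    for bits in product((0, 1), repeat=len(non_retained_names)):
        state = {**retained, **dict(zip(non_retained_names, bits))}
        seen.add(enable_fn(state))
    return len(seen) == 1


def pipe_enable(state):
    """Toy clock-gate enable: advance the pipeline when busy and not stalled."""
    return state["busy"] & (1 - state["stall"])


# "busy" is retained, "stall" is not and comes back with an arbitrary value.
print(enable_is_deterministic(pipe_enable, {"busy": 0}, ["stall"]))
print(enable_is_deterministic(pipe_enable, {"busy": 1}, ["stall"]))
```

When the retained bit alone forces the enable, the arbitrary value of the non-retained bit is harmless; when it does not, the first clock after restore may or may not be gated, which is exactly the case that needs an explicit reset or design change.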
Clock gate insertion and optimization tools use static analysis of the clock-enable terms. If, for example, both rising and falling edges of a clock are used in a design, then it may be impossible to park the clock at a level that leaves the clock-gate latches transparent for both phases, so the clock-gating terms for one or other clock phase cannot be re-evaluated afresh after state is restored. In fact, this latter concern also applies to full-state-retention designs and is another good reason why single-edge clocking is strongly encouraged.
Power gating with local state retention, scan-based state retention and software state-retention approaches are all possible using standard library components and views as well as traditional low-power implementation flows. Voltage scaling has the potential to significantly reduce standby and leakage power for retention schemes, but this approach needs further work to provide design automation and control IP.
There is no simple, single answer as to which approach to power gating with state retention is the most effective. Designers must consider the following:
- The sleep/wake-up activity profile. A processor’s OS scheduler queues and active device drivers give a good indication of this.
- The technology node. Significant leakage power savings are possible with fast, leaky processes and low-leakage processes where gate leakage makes a larger contribution.
- The thermal profile. The most effective leakage reduction technique at any point in time depends on the temperature of the die, which in turn depends on the dynamic behavior of the on-chip functional sub-systems (including unrelated neighboring IP blocks).
- Most suitable corners for optimizing the implementation. Timing always has to be signed off to guarantee meeting worst case conditions, but sleep states require optimization at realistic temperatures for typical process silicon.
- The libraries and characterization views. Designers must understand that optimizing a design for the worst case leakage corner (highest voltage and temperature) will not necessarily give the best solution for consumers. They will judge how good the standby life of a product really is in relation to competitive products operating in normal conditions.
State retention is very important where (third-party) software or long-term persistent state is involved. The cost/benefit analysis of which state-retention approach to select is an issue for the system designer. Provided that suitable library IP and design views are available, the baseline RTL design should be amenable to overlaying state retention schemes so long as clean synthesis coding styles have been observed and total state retention is acceptable.
Partial state retention may appear attractive from an area overhead perspective, but will typically require more invasive work on the sub-system RTL in terms of hierarchical partitioning and explicit independent reset networks for architectural (retained) and re-initialized (non-retained) state, and a verification approach that can validate the state partitioning.
Finally, it is useful to understand where state retention is not sensible or worthwhile. In systems that are primarily data-flow driven, such as graphics or DSP pipelines, and where the processing engine primarily generates outputs from memory-based data and coefficients, then optimization of these for performance may be the best choice. These units can then simply be power gated to save leakage power when not in use. At power up, a standard power-on reset can be employed to return them to a functional state. From a system design perspective this is actually a specific case of partial state retention – some state has to be maintained at the system level to manage the power-gated sub-system.
ARM and Synopsys are actively collaborating on 45-nm and 32-nm technology demonstrator programs to address these challenges for our mutual customers.
By David Flynn and Alan Gibbons
David Flynn, a Fellow in R&D at ARM Ltd, has been with the company since 1991, specializing in System-on-Chip IP deployment and methodology. He is the original architect behind ARM’s synthesizable CPU family and the AMBA on-chip interconnect standard. His current research focus is low-power system-level design. He is currently Visiting Professor with the Electronics and Computer Science Department at Southampton University, UK.
Alan Gibbons is a Principal Engineer at Synopsys with a focus on advanced methodology for ARM based system design. His current focus is on the development of low power solutions for ARM platforms. He has been involved in IC engineering for over 20 years specializing in the design and development of processor based ICs for data and wireless communication applications.