January 8, 2007 -- The course of electronic systems design changed irreversibly on November 15, 1971, when Intel introduced the first commercial microprocessor, the 4004. Before that date, system design consisted of linking many hardwired blocks, some analog and some digital, with point-to-point connections. After the 4004 microprocessor’s public release, electronic system design began to change. The first and most obvious change was the injection of software and firmware into the system-design lexicon. With the advent of practical 8-bit microprocessors in the mid-1970s, microprocessors began to permeate system design, and over the next 30 years microprocessor-based design became the nearly universal approach to systems design. Ever since, microprocessor vendors have been under great pressure to constantly increase their products’ performance as system designers think of more tasks to execute on processors.
There are some obvious methods to increase a processor’s performance, and processor vendors have used three of them. The first and easiest performance-enhancing technique used was to increase the processor’s clock rate. Intel introduced the 8086 microprocessor in 1978. It ran at 10MHz, five times the clock rate of the 8080 microprocessor introduced in 1974. Ten years later, Intel introduced the 80386 microprocessor at 25MHz, faster by another factor of 2.5. In yet another ten years, Intel introduced the Pentium II processor at 266MHz, better than a 10x clock-rate increase yet again. Figure 1 shows how packaged microprocessor clock rates have risen over the years from less than 1MHz to several GHz.
Figure 1. In the quest for more performance, processor clock rates have risen from less than 1 MHz in 1971 to several GHz in this century.
At the same time, microprocessor data-word widths and buses widened so that processors could move more data during each clock period. Widening the processor’s bus is the second way to increase processing speed and I/O bandwidth. For packaged microprocessors, widening the processor bus adds pins and, therefore, cost. The third way to increase processor performance and bus bandwidth is to add more buses to the processor’s architecture. Intel did exactly this with the addition of a separate cache-memory bus to its Pentium II processor. The processor could simultaneously run separate bus cycles to its high-speed cache memory and to other system components attached to the processor’s main bus. Figure 2 shows that packaged processor pin counts, which started at 16 pins in 1971, have increased to several hundred in this century as vendors have widened buses and added buses to their designs.
Figure 2. Packaged processor pin counts, which started at 16 pins in 1971, have increased to several hundred in this century as vendors have widened buses and added buses to their designs.
Faster clock rates coupled with more and wider buses increase processor performance at a price. The penalty for increasing clock rate is higher power dissipation. In fact, power dissipation rises superlinearly with the clock-rate increase (due to the need to raise the processor core’s minimum operating voltage to allow the higher clock rates). Consequently, microprocessor power dissipation and energy density have been rising exponentially for three decades, as shown in Figure 3.
Figure 3. Power density in packaged microprocessors has risen exponentially for decades (Source: F Pollack, keynote speech, “New microarchitecture challenges in the coming generations of CMOS process technologies,” MICRO-32, Haifa, Israel, 1999.)
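The superlinear relationship between clock rate and power dissipation follows from the first-order dynamic-power equation for CMOS logic, P = C·V²·f. The C sketch below works through that equation; the capacitance and voltage figures are illustrative assumptions, not measurements of any particular processor.

```c
/* First-order dynamic CMOS power: P = C_eff * V^2 * f.
 * Back-of-the-envelope sketch; the capacitance and voltage figures
 * used below are illustrative assumptions, not data for any real chip. */
static double dynamic_power(double c_eff_farads, double volts, double freq_hz) {
    return c_eff_farads * volts * volts * freq_hz;
}

/* Example: at C_eff = 1 nF and 1.0 V, doubling the clock from 200 MHz to
 * 400 MHz merely doubles power (0.2 W -> 0.4 W). But if the higher clock
 * also requires raising the supply from 1.0 V to 1.2 V, power rises by
 * 2 * 1.2^2 = 2.88x (0.2 W -> 0.576 W) -- superlinear in frequency, as
 * the text describes. */
```

The voltage term is squared, which is why even a modest supply-voltage bump needed to reach a higher clock rate makes power grow much faster than the clock rate itself.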
Unfortunately, the fastest packaged processors today are bumping into the heat-dissipation limits of their packaging and cooling systems. Since their introduction in 1971, cooling design for packaged microprocessors progressed from no special cooling to:
- Careful system design to exploit convection cooling.
- Active air cooling without heat sinks.
- Active air cooling with aluminum and then copper heat sinks.
- Larger heat sinks.
- Even larger heat sinks.
- Dedicated fans directly attached to the processor’s heat sink.
- Heat pipes.
- Heat sinks incorporating active liquid cooling subsystems.
Each step up in microprocessor heat dissipation has increased the cost of cooling, increased the size of required power supplies and product enclosures, increased cooling noise (for fans), and decreased system reliability due to hotter chips and active cooling systems that have their own reliability issues.
SOCs cannot employ the same sort of cooling now used for PC processors. Systems that use SOCs generally lack a PC’s cooling budget. In addition, a processor core on an SOC is only a small part of the system. It cannot dominate the cost and support structure of the finished product the way a processor in a PC does. Simple economics dictate a different design approach.
Further, SOCs are developed using an ASIC design flow, which means that gates are not individually sized to optimize speed in critical paths the same way and to the same extent that transistors in critical paths are tweaked by the designers of packaged microprocessors. Consequently, clock rates for embedded processor cores used in SOC designs have climbed modestly over the past two decades to a few hundred MHz, but synthesized SOC processors cannot run at multi-GHz clock rates like their PC brethren, and probably never will.
Increasing processor performance in the micro world
Lacking access to the high clock rates available to PC processor designers, processor-core designers turn to alternative performance-enhancing strategies. Use of more buses and wider buses are both good strategies for SOC-centric processor design. In the macro world of packaged microprocessors, additional processor pins incur a real cost. Packages with higher pin counts are more expensive, they’re harder to test, and they require more costly sockets.
However, in the micro world of SOC design, additional pins for wider buses essentially cost nothing. They do incur some additional routing complexity, which may or may not increase design difficulty. However, once routed, additional pins on a microprocessor core do not add much to the cost of chip manufacture, except for a fractionally larger silicon footprint. In much the same way, additional microprocessor buses also incur very little cost penalty but provide a significant performance benefit.
Multitasking and processor core clock rate
Multitasking is another system-design choice that tends to increase processor clock rates. Processor multitasking predates the introduction of microprocessors by at least two decades. Because early computers of the 1940s, 1950s, and 1960s were very expensive, computer time was also very expensive. One way to distribute such high hardware costs was to give each computer user a share of the computer’s time – timesharing. Early experimental timeshare systems appeared in the 1950s and commercial timeshared operating systems started to appear on computers by 1961. Multitasking is timesharing, recast. Multitasking operating systems queue multiple tasks (rather than users) and give each task a time-multiplexed share of the computer. Multitasking makes one processor appear to be doing the work of several. When computers were big and expensive, multitasking made perfect sense.
Initially, microprocessors were also expensive like their mainframe predecessors. The first production units of the earliest processor chips cost several hundred dollars so there was significant financial incentive for the expensive processor to execute as many concurrent tasks as possible to amortize the processor’s cost across tasks rather than using many expensive processors to implement the multiple tasks. An entire industry has grown up around the development of real-time operating systems for the specific purpose of making microprocessors execute multiple concurrent tasks.
Microprocessor multitasking encourages clock-rate escalation. A faster clock rate allows a processor to execute more concurrent tasks and more complex tasks. As long as processors are expensive, the system-design scales tip towards multitasking because larger power supplies and cooling components (incurred when running a processor at a higher clock rate) are probably not as expensive as a second processor. However, when processors are cheap, the scales tip against multitasking.
In 1968, a dollar bought one packaged transistor and you needed thousands of transistors to build a computer. In the 21st century, a dollar buys several million transistors on an SOC. It takes roughly 100,000 transistors to build a 32-bit RISC processor, which now costs less than a penny. Moore’s Law has made transistors – and therefore processors – cheap, but conventional system-design techniques that conserve processors in exchange for increased clock rates are based on design habits and rules of thumb developed when processors cost many dollars instead of less than a penny.
System design evolution
Semiconductor advances achieved through the relentless application of Moore’s Law have significantly influenced the evolution of system design since microprocessors became ubiquitous in the 1980s. Figure 4 shows how minimum microprocessor feature size has tracked Moore’s Law since the introduction of the Intel 4004, which used 10-micron (10,000-nm) lithography. The figure also incorporates ITRS (International Technology Roadmap for Semiconductors) projections to the year 2020, when the minimum feature size is expected to be an incredibly tiny 16nm. Each reduction in feature size produces a corresponding increase in the number of transistors that will fit on a chip. Already in the early 21st century, SOCs routinely contain tens of millions to several hundred million transistors.
Figure 4. The relentless decrease in feature size that has slavishly followed Moore’s Law for decades fuels rapid complexity increases in SOC designs.
Consequently, present-day SOC design has started to break with the 1-processor system model that has dominated since 1971. Figure 5 shows a 2-processor SOC design with a control-plane processor and a data-plane processor. Each processor has its own bus and shares a peripheral device set by communicating over bus bridges to a separate peripheral bus.
The terms “control plane” and “data plane” came into use during the Internet boom of the late 1990s and early part of the 21st century. At first, these terms referred largely to the design of multiple-board networking systems. High-speed network data passed through high-performance processors and hardware accelerators on a high-speed circuit board – called the data plane. Overall system control did not require such high performance, so the control task was given to a general-purpose processor on a separate circuit board – called the control plane. These terms have now become universal because they suitably describe many processing systems such as video-encoding and video-decoding designs that must handle high-speed data and perform complex control.
A multi-processor design approach has also become very common in the design of voice-only mobile telephone handsets. A general-purpose processor (almost universally an ARM RISC processor due to legacy-software and type approval considerations) handles the handset’s operating system and user interface. A DSP handles the mobile phone’s baseband processing (essentially DSP functions such as FFTs and inverse FFTs, symbol coding and decoding, filtering, etc.). The two processors likely run at different clock rates to minimize power consumption. Processing bandwidth is finely tuned to be just enough for voice processing, which minimizes product cost—mobile-phone handset designs are sensitive to product-cost differentials measured in fractions of a penny because they sell in the hundreds of millions of units per year—and also minimizes power dissipation—which maximizes battery life, talk time, and standby time.
Heterogeneous- and homogeneous-processor system-design approaches
Figure 5 shows the use of two different microprocessor cores, one general-purpose processor and one DSP. Such a system is called a heterogeneous multiprocessor system. A heterogeneous-multiprocessor design approach has the advantage of matching processor cores with application-appropriate features to specific on-chip tasks.
Figure 5. On-chip tasks are now distributed among multiple processors assigned to the system’s control and data planes.
Selecting just the right processor core or tailoring the processor core to a specific task has many benefits. First, the processor need have no more abilities than required by its assigned task set. This characteristic of heterogeneous-multiprocessor system design minimizes processor gate counts by trimming unneeded features from each processor.
One of the key disadvantages of heterogeneous-multiprocessor design is the need to use a different software-development tool set (compiler, assembler, debugger, instruction-set simulator, real-time operating system, etc.) for each of the different processor cores used in the system design. Either the firmware team must become proficient in using all of the tool sets for the various processors or – more likely – the team must be split into groups, each assigned to different processor cores. However, this situation is not always the case for heterogeneous-processor system designs. When the processors are configured variants of a common base architecture, each instance of a configurable processor core can take on exactly the attributes needed for a specific set of tasks, and all of the variously configured processor cores, sharing a common base ISA, can use the same software-development tool suite so that software team members can use familiar tools to program all of the processors in an SOC.
Some SOC designs, called homogeneous multiprocessor systems, use multiple copies of the same processor core. To a first approximation, this design approach appears to simplify software development because all of the on-chip processors can be programmed with one common set of development tools. However, not all processor cores are equally able. For example, general-purpose processors are not generally good at DSP applications because they lack critical execution units and memory-access modes. A high-speed multiplier/accumulator (MAC) is essential to efficient execution of many DSP algorithms, but MACs require a relatively large number of gates, so few general-purpose processors have them. Similarly, the performance of many DSP algorithms can benefit from a processor’s ability to fetch two data words from memory simultaneously, a feature often called XY memory addressing. Few general-purpose processors have XY memory addressing because the feature requires the equivalent of two load/store units and most general-purpose processors have only one such unit.
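The MAC’s importance is easiest to see in an FIR filter’s inner loop, where every tap is exactly one multiply-accumulate. The C sketch below models that loop; the function name and data types are hypothetical, chosen only for illustration.

```c
/* Inner loop of an FIR filter: one multiply-accumulate (MAC) per tap.
 * A processor with a single-cycle MAC unit retires each iteration's
 * arithmetic in one cycle; one without it needs a multi-cycle multiply
 * followed by a separate add for every tap. Illustrative sketch;
 * the names here are hypothetical. */
static long long fir_sample(const int *coeff, const int *delay, int taps) {
    long long acc = 0;
    for (int i = 0; i < taps; i++)
        acc += (long long)coeff[i] * delay[i];   /* the MAC operation */
    return acc;
}
```

XY memory addressing attacks the other half of this loop: with coefficients in one memory and delay-line samples in another, coeff[i] and delay[i] can be fetched in the same cycle, letting the loop sustain one tap per clock.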
Although the early voice-only mobile-telephone handsets generally had only two processors, incorporation of multimedia features (music, still image, and video) has placed additional processing demands on handset system designs, and the finely tuned, cost-minimized 2-processor system designs for voice-only phones lacked processing bandwidth for these additional functions. Consequently, the most recent handset designs with new multimedia features are adding either hardware-acceleration blocks or “application processors” to handle the processing required by the additional features. The design of multiple-processor SOC systems is now sufficiently common to prompt a new term for the system-design style that employs more than one processor: multiple-processor SOCs are called MPSOCs. Statistics on licensees of Tensilica’s Xtensa microprocessor cores indicate that the average Tensilica customer incorporates between five and six configurable processor cores per SOC design.
In 1990, very few ASICs incorporated even one processor core. Back then, there simply weren’t enough gates available on economically sized chips to make on-chip processors practical. ASICs designed during the early 1990s primarily served as glue-logic chips that connected packaged processor ICs to memories and peripheral devices, as discussed previously. Ten years later, at the turn of the century, nearly every ASIC incorporated at least one on-chip processor core because the semiconductor technology permitted it and because it was more efficient and less expensive for the processor to reside on the chip with other system components.
By 1995, RISC microprocessor cores were just starting to emerge as ASIC building blocks because their compact nature (relatively low gate count) gave them a very attractive capability-per-gate ratio and their programmability provided useful design flexibility. As semiconductor technology and design styles evolved over the next 10 years, processor cores became sufficiently pervasive in ASIC design to give rise to a new name, the system on chip or SOC, which has essentially replaced the term ASIC in the system designer’s lexicon.
Of course, once systems designers started using one on-chip processor core, it was only a matter of time before they started using two, three, or more processor cores to achieve processing goals. As discussed above, the average number of processor cores per SOC design is already six processors for Tensilica’s customers and the high-water mark to date for an SOC incorporating Tensilica’s processor cores is 192 processors.
Veering away from processor multitasking in SOC design
The contemporary design trend towards increasing the number of on-chip processor cores is a significant counter trend to decades-old multitasking. Where multitasking loads multiple tasks onto one processor – which increases software complexity, forces processor clock rates up, and therefore increases processor power dissipation – use of multiple on-chip processors takes system design in a different direction by reducing the number of tasks each processor must execute. This design style simplifies the software by reducing the possibility of intertask interference, cutting software overhead (it takes extra processor cycles just to schedule and track multiple software tasks), and thus moderating the rise of processor clock rates.
Multitasking was developed when microprocessors were expensive; when every processor cycle was precious; and when the use of multiple processors was completely out of the question for reasons of engineering economics, circuit-board real estate, and power dissipation. However, Moore’s Law has now reduced the cost of silicon for an on-chip processor to mere pennies (or less) and these costs will further decline as the relentless application of Moore’s Law continues to advance the semiconductor industry’s ability to fit more transistors on a chip. Microprocessor cores on SOCs are no longer expensive – and they get cheaper every year. All system-design techniques based on the old assumptions about processor costs must be rethought just as system-design techniques had to be rethought when logic-synthesis tools started to appear. Decades-old, pre-SOC system-design techniques that conserve processors (which are now cheap) in the face of increasing power dissipation, software complexity, and development cost are clearly obsolete in the 21st century.
Processors: The original, reusable design block
Microprocessors became successful back in the 1970s because they were the first truly universal, reusable block of logic to become available. With firmware reprogramming, microprocessors could be made to perform a very wide range of tasks with no changes to the hardware. This characteristic allowed system designers to use fixed-ISA, packaged processor ICs in an ever expanding number of systems. As the popularity of these universal system building blocks grew, an entire software-development tool industry grew up around packaged microprocessors. Large numbers of compiler and RTOS (real-time operating system) vendors popped into existence in the 1980s, the decade when microprocessors became firmly established in the system designers’ lexicon.
When system design began to migrate from the board level to the chip level, it was a natural and logical step to continue using fixed-ISA processor cores in SOCs. Packaged processors had to employ fixed ISAs to achieve economies of scale in the fabrication process. System designers became well versed in the selection and use of fixed-ISA processors and the related tool sets for their system designs. Thus, when looking for a processor to use in an SOC design, system designers first turned to fixed-ISA processor cores. RISC microprocessor cores from ARM and MIPS Technologies, based on processors originally designed for personal computers and workstations, were early favorites due to their low gate counts.
However, when designing custom silicon, there’s no technical need to limit a design to fixed-ISA microprocessor cores as there was for board-level systems based on discrete, pre-packaged microprocessors. If there’s legacy software to reuse, there’s certainly a reason to retain a particular microprocessor ISA from one system design to the next. However, if there is no legacy code or if the legacy code is written in C, system designers have a freer hand in selecting a processor with a different ISA – if such a processor improves the system’s performance, power dissipation, or manufacturing cost.
What is a Configurable Processor?
The first practical configurable microprocessor cores started to appear in the late 1990s. A configurable processor core allows the system designer to custom tailor a microprocessor to more closely fit the intended application (or set of applications) on the SOC. A “closer fit” means that the processor’s register set is sized appropriately for the intended task and that the processor’s instructions also closely fit the intended task. For example, a processor tailored to efficiently execute digital audio applications may need a set of 24-bit registers for the audio data and a set of specialized instructions that operate on 24-bit audio data using a minimum number of clock cycles.
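To make the audio example concrete, the C function below models the semantics such a tailored instruction might have: a 24-bit add that saturates instead of wrapping around, which is what audio samples need at overload. The names, constants, and the instruction itself are assumptions for illustration, not any vendor’s actual instruction set.

```c
#include <stdint.h>

/* C model of what a tailored 24-bit audio instruction might compute:
 * adding two 24-bit samples with saturation instead of wraparound.
 * On a configurable core this would be one custom instruction operating
 * on 24-bit registers; this function is a hypothetical illustration. */
#define AUDIO24_MAX ((int32_t)0x7FFFFF)   /* largest signed 24-bit value  */
#define AUDIO24_MIN ((int32_t)-0x800000)  /* smallest signed 24-bit value */

static int32_t sat_add24(int32_t a, int32_t b) {
    int32_t sum = a + b;   /* operands assumed already in 24-bit range */
    if (sum > AUDIO24_MAX) return AUDIO24_MAX;
    if (sum < AUDIO24_MIN) return AUDIO24_MIN;
    return sum;
}
```

On a fixed-ISA 32-bit processor this takes an add, two compares, and two branches every sample; as a single tailored instruction it takes one clock cycle, which is exactly the kind of cycle-count reduction the text describes.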
A full-featured configurable processor toolkit consists of a pre-defined processor core and a design-tool environment that permits significant adaptation of that base processor design for specific application requirements. Typical forms of configurability include additions, deletions, and modifications to memories, external bus widths and handshake protocols, and commonly used processor peripherals. Extensible processors, an important superset of configurable processors, provide system designers with the ability to add instructions to the processor that may have never been considered or imagined by designers of the original architecture.
The addition of highly customized instructions matched to a specific application gives configurable processors the ability to deliver performance levels rivaling RTL while gaining the benefits of pre-verified IP (intellectual property). Configurable processors are delivered as RTL code that is synthesized into an FPGA or SOC design. The best configurable processors also come with matching software development tools that reflect the hardware instructions added through designer-defined architectural extensions.
A configurable processor can implement datapath operations that closely match those of RTL functions. For example, the Tensilica Instruction Extension (TIE) language, a simplified version of Verilog, allows system developers to extend Tensilica’s Xtensa 32-bit processor architecture for specific applications. TIE is optimized for high-level, functional specification of datapath functions in the form of instruction semantics and encoding. A TIE description is both simpler and much more concise than RTL because it omits all structural hardware and sequential-logic descriptions. Complex state-machine sequencing is thus relegated to firmware that controls the newly added function units. Unlike RTL hardware, firmware can be altered even after the silicon is fabricated, which reduces design risk. Tensilica’s automated tools generate the complex hardware for the new function units, which have been described in TIE, and automatically integrate these units into the processor’s pipeline.
Consequently, SOC designers need not be processor designers to tailor an Xtensa processor for specific applications. The automated design tools supply the processor-design expertise. Rather, the SOC designers need only describe the function of the new instructions, and the hardware is automatically generated. The new processor instructions and registers described in TIE are available to the firmware programmer via the same compiler and assembler that target the processor’s base instructions and register set. All operation sequencing within the processor’s datapaths is controlled by firmware, through the processor’s existing instruction-fetch, decode, and execution mechanisms. Finally, state-machine firmware can usually be written in a high-level language such as C or C++ because of the high performance provided by tailored microprocessor architectures.
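That division of labor can be sketched in C. Suppose a hypothetical byte-reversal instruction has been added to the processor: the hardware implements only the datapath operation (modeled here by bswap32), while ordinary firmware supplies all the sequencing. Everything below is an illustrative assumption, not actual TIE output or a Tensilica API.

```c
#include <stdint.h>

/* The custom function unit implements one simple datapath operation --
 * here a hypothetical byte-reversal instruction, modeled as a C function.
 * Names are illustrative, not Tensilica's actual TIE output. */
static uint32_t bswap32(uint32_t x) {   /* semantics of the custom operation */
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* The sequencing lives in firmware, so it can still change after silicon
 * is fabricated -- only the datapath operation is fixed in hardware. */
static void bswap_buffer(uint32_t *buf, int n) {
    for (int i = 0; i < n; i++)
        buf[i] = bswap32(buf[i]);
}
```

In a real design flow, the compiler would expose the new instruction to code like bswap_buffer directly, so the firmware programmer uses it exactly like a native operation.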
Processor tailoring offers several benefits. Tailored instructions perform assigned tasks in fewer clock cycles. For real-time applications such as audio processing, the reduction in clock cycles directly lowers operating clock rates, which in turn cuts power dissipation. Lower power dissipation extends battery life for portable systems and reduces the system costs associated with cooling in all systems. Lower processor clock rates also allow the SOC to be fabricated in slower IC-fabrication technologies that are both less expensive and dissipate less static power.
Even though the technological barriers to freer ISA selection have been torn down by the migration of systems to chip-level design, system-design habits are hard things to change. Many system designers who are well versed in comparing and evaluating fixed-ISA processors from various vendors elect to stay with the familiar, which is perceived as a conservative design approach. When faced with designing next-generation systems, these designers immediately start looking for processors with higher clock rates that are just fast enough to meet the new system’s performance requirements. Then they start to worry about finding batteries or power supplies with extra capacity to handle the higher power dissipation that accompanies operating these processors at higher frequencies. They also start to worry about finding ways to remove the extra waste heat from the system package. In short, the attachment to fixed-ISA processors is not nearly as conservative as it is perceived; it is merely old-fashioned.
In conclusion, SOC designers can greatly reduce the size and import of these system-design problems by using the full breadth of the technologies available instead of limiting themselves to older design techniques developed when on-chip transistors were not so plentiful. Moore’s Law has provided new and better ways to deal with the challenges of rising system complexity, market uncertainties, and escalating performance goals.
By Steve Leibson, Technology Evangelist for Tensilica, Inc.
Note: This article is based on Chapter 1 of the book “Designing SOCs with Configured Cores,” published by Morgan Kaufmann.
Go to the Tensilica, Inc. website to learn more.