FPGA-Based Hardware Acceleration of C/C++ Based Applications: Part 1

Publication: EE Times Programmable Logic Designline
Contributor: DRC Computer Corp.

July 25, 2007 -- Most software today is written so that instructions execute in sequence, and to speed up execution programmers have typically pushed hardware designers to build processors with ever-higher clock rates. That has given rise to heavily pipelined processors that operate at clock rates of 3 GHz and more. These processors also rely on architectural techniques such as large caches and out-of-order execution to get the most out of every clock cycle.

However, faster processors generate more heat, and clock speeds have for the most part leveled off because that heat limits how fast the circuits can run. To continue the march toward ever-faster execution, hardware designers have switched from a single processor on a chip to dual, quad, and even more CPU cores on a single chip. The operating system can then allocate the cores to different applications, all running in parallel. The next level down is to parallelize the code within each application and run that parallelized code on multiple engines, either within the CPU or in a companion coprocessor optimized to execute that particular segment of code.

The latest-generation field-programmable gate arrays (FPGAs), the new open-standard Torrenza coprocessor interface defined by Advanced Micro Devices for Opteron system platforms, and the Intel QuickAssist Technology Acceleration Abstraction Layer (AAL) together provide designers with the hardware portion of the parallelization goal. By downloading configurations into an FPGA tied to either a Torrenza or QuickAssist motherboard platform, designers can accelerate computationally complex algorithms such as encryption, compression, search, and sort by up to 1000X over a general-purpose processor. DSP algorithms that need billions of integer or floating-point operations per second, for image and audio processing and much more, can also readily be accelerated by an FPGA-based coprocessor.

The challenge now becomes how to find the code within your current application that can be parallelized, and then how to parallelize it so that it can execute on an array of computational elements configured in an FPGA or ASIC. Where does one start? A good starting point is to profile the code to find its computationally intensive portions, and then to isolate those portions so that they have few data dependencies. Once the code is isolated, you must optimize it to run on the resources available on the coprocessor. Fitting code to the architecture is difficult on architectures like graphics processing units and the Cell processor, where users are not given all the data for the device. FPGAs, on the other hand, allow you to define an optimized architecture on a case-by-case basis.
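To make the point about data dependencies concrete, here is a minimal C sketch. It is not taken from the article. The function names (prefix_sum, scale_offset, scale_offset_lanes) and the "lanes" parameter are hypothetical, chosen only for illustration. The first loop carries a value from one iteration to the next, so its iterations cannot simply be handed to separate computational elements. The second loop has fully independent iterations, and the third function shows that the same work can be split across several lanes, processed in any order, and still give the same result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Loop-carried dependency: out[i] needs the running total from
   iteration i-1, so the iterations must run in order as written. */
void prefix_sum(const int32_t *in, int32_t *out, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += in[i];
        out[i] = acc;
    }
}

/* Independent iterations: out[i] depends only on in[i], so every
   element could in principle be computed at the same time. */
void scale_offset(const int32_t *in, int32_t *out, size_t n,
                  int32_t gain, int32_t offset)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * gain + offset;
}

/* The same computation split into "lanes" (a hypothetical stand-in
   for parallel computational elements). Each lane handles every
   lanes-th element; here the lanes run one after another in software,
   but no lane reads anything another lane writes. */
void scale_offset_lanes(const int32_t *in, int32_t *out, size_t n,
                        int32_t gain, int32_t offset, size_t lanes)
{
    assert(lanes > 0);
    for (size_t lane = 0; lane < lanes; lane++)
        for (size_t i = lane; i < n; i += lanes)
            out[i] = in[i] * gain + offset;
}
```

A loop shaped like scale_offset is the kind of candidate a profile-then-isolate pass is looking for. A loop shaped like prefix_sum usually has to be restructured before it can be spread across parallel hardware.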

By Steve Casselman. (Casselman is CTO and co-founder of DRC Computer Corp.)


Reprinted from SOCcentral.com, your first stop for ASIC, FPGA, EDA, and IP news and design information.