June 1, 2005 -- A wide range of high-performance computing applications are migrating from microprocessors to FPGAs to take advantage of potentially huge gains in performance and I/O bandwidth, along with reductions in size, weight and power. But large and complex applications - the ones best suited to FPGAs - present the challenge of connecting together many algorithm blocks. Today, these challenges are typically addressed manually, with the result that system communications generally consume 80% of the design time of FPGA computing applications. This article outlines a new approach that implements communications networks inside and across FPGAs in order to create scalable applications that harness as many algorithms and computing resources as needed. System developers are using this approach to meet today's performance requirements and quickly scale up to meet the challenges of tomorrow.
The vast majority of today's high-performance computing applications are built around fixed-architecture, processor-based systems. This approach is highly sequential at its core: each processing unit delivers a single data path moving across a memory array mapped onto one or two ports. The limitations of this approach have led to a movement towards FPGA-based systems, in which many data paths operate at once and the memory architecture can be customized to the algorithm being implemented, yielding dramatic performance improvements. For example, the memory in a single FPGA could be configured to provide 74 streams of data operating in parallel, with six dedicated memory blocks per stream storing independent coefficients. To put the difference in perspective, a Pentium 4 processor operating at 2.4 GHz typically delivers 0.2 sustained Gigaflops in real applications against 4.8 theoretical peak Gigaflops, while an FPGA card can provide 18 sustained Gigaflops against 53.2 theoretical peak Gigaflops.
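The structural idea behind those numbers can be sketched in software terms: a bank of independent data streams, each with its own private coefficient memory, so no stream contends with another for a shared port. The stream and tap counts below are illustrative, not the FPGA figures quoted above, and this is a behavioural model only - on the device these streams would run concurrently in hardware.

```python
# Illustrative sketch (not FPGA code): several independent data streams,
# each with its own dedicated coefficient memory. On an FPGA all streams
# would compute simultaneously; here we only model the structure.

def dot(coeffs, window):
    """Multiply-accumulate of one stream's coefficients against its data."""
    return sum(c * x for c, x in zip(coeffs, window))

NUM_STREAMS = 4          # an FPGA might sustain dozens of such streams
TAPS = 6                 # one coefficient block per tap, per stream

# Independent coefficient memories: no shared-port contention.
coeff_banks = [[i + 1.0] * TAPS for i in range(NUM_STREAMS)]
windows = [[1.0] * TAPS for _ in range(NUM_STREAMS)]

results = [dot(c, w) for c, w in zip(coeff_banks, windows)]
print(results)   # -> [6.0, 12.0, 18.0, 24.0]
```

A fixed-architecture processor would have to walk these streams one after another through one or two memory ports; the FPGA's advantage is that each stream owns its memory.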
System design challenge of FPGA architectures
But while the performance benefits are huge, designing the computing engine around the problem presents major challenges. Developing the algorithm blocks themselves is not particularly difficult; high-level tools such as System Generator, DIMEtalk, Handel-C, Mitrion-C and Impulse-C make this a relatively straightforward process. The hardest task in larger applications usually boils down to connecting all these algorithm blocks together and partitioning them across multiple boards. A number of modular standards have been developed over the years to provide a bus communications system designed to assist in converting FPGAs into high-performance computing engines. But even when one of these standards is used, designing the system communications infrastructure is difficult, especially when multiple FPGAs are involved.
Suppose that we are implementing a single 10-tap FIR filter targeted at a single FPGA. Getting data from the host to the filter's register is not difficult. Now let's increase the challenge by adding a second filter: we need to create a path through to each of the filters, and hence control paths to get data to each filter individually. This is still not that difficult with VHDL. Now let's move up to a system with 10 filters spread across five FPGAs. To reach the third filter we have to create paths via the first FPGA, and so on for the others. This is a pretty complicated job using manual coding methods. Suppose that midway through the project it is determined that 20-tap filters are needed to meet performance requirements. Only 10 of the original filters fit on five FPGAs, so the larger filters will require additional FPGAs in the system - and the communications part of the design must be started all over again. This example helps explain why systems communications infrastructure design typically occupies 80% of total design time on high-performance computing FPGA applications.
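For reference, the behaviour of the algorithm block being replicated in this example - an N-tap FIR filter - can be stated in a few lines. This is a software model of the arithmetic only, not the VHDL implementation that would actually be placed on the FPGA:

```python
def fir_filter(coeffs, samples):
    """Direct-form FIR: y[n] = sum over k of coeffs[k] * x[n-k]."""
    out = []
    for n in range(len(samples)):
        acc = 0
        for k in range(len(coeffs)):
            if n - k >= 0:                      # skip taps before x[0]
                acc += coeffs[k] * samples[n - k]
        out.append(acc)
    return out

# A 10-tap filter with unit coefficients applied to a constant input:
# the output ramps up while the delay line fills, then holds steady.
print(fir_filter([1] * 10, [1] * 12))
# -> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10]
```

The point of the example in the text is that this block itself is easy; the hard part is routing data to ten copies of it spread across five devices.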
Embedding a data network within an FPGA
How can we simplify this process? Let's consider how similar challenges are addressed beyond the FPGA world. The traditional computing system-level architecture often uses an Intel microprocessor as the hardware platform, Windows or Linux as the system software and Ethernet as the communications platform. Data communications are carried out over flexible, scalable and easy-to-understand data networks. While it obviously would not make sense to map this bulky infrastructure onto an FPGA, an analogous approach is entirely possible. COTS hardware can serve as the host for multiple FPGA systems, and system software is available to provide hardware interfacing and control. The system architecture is designed specifically to abstract the benefits of FPGAs up to the system level; a highly parallel architecture provides high-bandwidth buses between FPGAs and supports up to nine FPGAs on a single host card.
As far as system communications go, the obvious choice is to embed a data network inside the FPGA. A high-level software tool like Nallatech's DIMEtalk enables designers to develop communications networks in FPGA systems easily. It is designed to speed deployment of system designs across multiple modules by providing a stable, proven set of high-level data transfer components. A simple software interface makes it easy to deliver data between the host and all hardware processes running on the FPGA system, so users can get data to and from their designs, and move it within them, with minimal effort. The user simply specifies the points in the hardware system from which data is to be sent and to which it is to be delivered, and the network design tool generates all the VHDL hardware files needed to get the job done. If the design challenge changes - for example, the 10-FPGA, 20-tap FIR filter design grows to a 20-FPGA, 50-tap design - the designer simply reconnects the data pipes to the new nodes and regenerates the VHDL code.
The new design flow
The basic network elements used in this approach are: 1) routers, which direct data around the network; 2) nodes, which connect to design elements such as registers; 3) bridges, which move data between physical devices, for example from one FPGA to another; and 4) edges, which move data into and out of the network using data transfer standards such as PCI, VME and USB. These basic blocks can be connected to form networks of unlimited complexity, and a tool provides automatic checking of basic network functions.
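A minimal software model can make the four element types concrete. The class names and routing scheme below are illustrative assumptions, not DIMEtalk's actual internals - the real tool generates VHDL for these roles in hardware:

```python
# Hypothetical model of the network element types described above.
# Real DIMEtalk generates VHDL; this only illustrates the topology idea.

class Node:
    """Endpoint attached to a design element (register, FIFO, BRAM)."""
    def __init__(self, name):
        self.name = name
        self.inbox = []

class Router:
    """Directs packets toward the node named in their address field."""
    def __init__(self):
        self.table = {}                  # address -> Node or next Router
    def attach(self, node):
        self.table[node.name] = node
    def send(self, address, payload):
        target = self.table[address]
        if isinstance(target, Router):
            target.send(address, payload)    # forward across a hop
        else:
            target.inbox.append(payload)     # delivered to the node

# A bridge between two FPGAs can be modelled as one router forwarding
# to another; an edge (PCI/VME/USB) is simply the host's way in.
fpga_a, fpga_b = Router(), Router()
filt = Node("fir0")
fpga_b.attach(filt)
fpga_a.table["fir0"] = fpga_b       # bridge: route via the second device

fpga_a.send("fir0", [1, 2, 3])      # host edge -> router A -> bridge -> node
print(filt.inbox)                   # -> [[1, 2, 3]]
```

The value of the approach is that the designer describes this topology graphically and never hand-codes the routing logic in VHDL.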
The user first develops the algorithms using standard FPGA design-entry flows such as HDL, System Generator, AccelChip and Handel-C. The design is then compiled to a Xilinx-compatible netlist, or used as-is in the case of an RTL HDL design flow. The user's VHDL code or Xilinx-compatible netlist is then imported into the network design. The designer assigns nodes to various design elements such as FIFOs, BRAMs and memory maps, then connects the dots that represent network elements. Once the network is constructed, the user hits the "generate VHDL" button and the application creates all the ISE-related files needed to generate the bitfiles. The generated code includes comments that make it easy to add or instantiate code within ISE rather than the network design tool. The whole design is then synthesized and implemented using standard synthesis and implementation tools. The application then operates at runtime, using API functions to communicate across the network.
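From the host side, the runtime step of this flow might look like the following. The function and class names here (open_network, write_node, read_node) are hypothetical stand-ins sketched for illustration, not the vendor's documented API, and the in-memory dictionary stands in for transfers over PCI, VME or USB:

```python
# Hypothetical host-side runtime sketch. A real application would call
# the vendor's API to move data over PCI/VME/USB to nodes on the FPGA
# network; these stand-in functions only show the usage pattern.

class NetworkHandle:
    """Pretend handle to a generated on-FPGA network."""
    def __init__(self):
        self._nodes = {}
    def write_node(self, node_id, data):
        # In hardware this would transfer data to the addressed node.
        self._nodes[node_id] = list(data)
    def read_node(self, node_id):
        # In hardware this would read results back from a node.
        return self._nodes.get(node_id, [])

def open_network():
    # Stand-in for opening the device and loading the bitfile.
    return NetworkHandle()

net = open_network()
net.write_node("fir0_input", [1, 2, 3, 4])   # stream samples to a filter node
print(net.read_node("fir0_input"))           # -> [1, 2, 3, 4]
```

The point is that once the network is generated, host software addresses nodes by name rather than managing device-specific data paths.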
This new approach to system communications makes it possible to take full advantage of the huge potential benefits offered by FPGA computing. Time-to-market is drastically reduced by using a simple plug-and-play application to provide system communications rather than writing code. The application can easily be scaled to designs of all sizes and distributed across multiple FPGAs simply by updating the network configuration, with little or no code development. This new approach provides the fastest way to move from an algorithm problem to an FPGA computing engine that can far outstrip the computing performance of the traditional microprocessor architecture.
By Dr. Malachy Devlin, Senior Vice-President and Chief Technology Officer, Nallatech.
Dr. Devlin joined Nallatech in 1996 as CTO and is recognized worldwide as an industry expert on FPGA technologies. He obtained his PhD in Signal Processing from Strathclyde University. He is a software specialist with several years of experience at various companies, including the National Engineering Laboratory, Telia in Sweden and Hughes Microelectronics (now part of Raytheon). He is part of the team that developed Nallatech's DIME modular technology based on FPGAs.
Go to the Nallatech, Ltd. website to learn more.