With its Stratix® 10 high-end and Arria® 10 mid-range FPGA and SoC FPGA products, Altera wants to surge ahead of Xilinx in critical infrastructure, such as wireless remote radio units (RRUs), 100G/400G wireline channel (line) cards and data centers, as well as in military, medical and broadcast scenarios. To do so it relies on ARM Cortex-A53 IP (Intellectual Property) and Intel Custom Foundry’s 14 nm Tri-Gate (FinFET) process services for Stratix 10, and on ARM Cortex-A9 IP and TSMC’s 20 nm 20SoC process for Arria 10, with OpenCL for FPGAs capability for both. It will also be possible to begin designs with the Arria 10 portfolio of 20 nm FPGA devices, and then take advantage of pin-for-pin design migration pathways from Arria 10 FPGA and SoC products to Stratix 10 FPGA and SoC products as they become available.
This was my conclusion when the news came out that Altera Announces Quad-Core 64-bit ARM Cortex-A53 for Stratix 10 SoCs [press release, Oct 29, 2013] and then I answered three questions for myself, followed by understanding a little bit more deeply two other issues as well:
- Why FPGAs? Why more FPGAs?
- Why SoC FPGAs?
- Why ARM with FPGA on the Intel Tri-Gate (FinFET) process, and why now?
- OpenCL for FPGAs
- Altera SoC FPGAs
For introduction here is Altera Stratix 10 SoC & ARM perspective – ARM TechCon ’13 [ARMflix YouTube channel, Oct 31, 2013]
To shed more light on the direction of breakthrough by Altera, here is additional introductory information from: Arria 10 Device Overview* [Altera, Sept 4, 2013]
*As there is no similar document yet for Stratix 10
Altera’s Arria® FPGAs and SoCs deliver optimal performance and power efficiency in the midrange. By using TSMC’s 20-nm process technology on a high-performance architecture, Arria 10 FPGAs and SoCs deliver higher performance than previous-generation high-end FPGAs while simultaneously reducing power by offering a comprehensive set of power-saving technologies. Altera’s Arria 10 family is reinventing the midrange.
Altera’s Arria 10 SoCs offer a second generation SoC product that both demonstrates a long-term commitment to the SoC product line and extends Altera’s leadership in programmable devices that feature the ARM-based hard processor system (HPS).
Important innovations in Arria 10 devices include:
– Enhanced core architecture delivering 60% higher performance than the previous generation midrange (15% higher performance than previous fastest high-end FPGAs)
– Integrated transceivers with short reach rates up to 28.05 Gbps and backplane capability up to 17.4 Gbps
– Hard PCI Express Gen3 intellectual property (IP) blocks
– Hard memory controllers and PHY up to 2666 Mbps
– Variable precision digital signal processing (DSP) blocks
– Fractional synthesis PLLs
– Up to 40% lower power compared to prior midrange FPGAs and up to 60% lower power compared to prior generation high-end FPGAs due to a comprehensive set of advanced power-saving features
– 2nd generation ARM® Cortex™-A9 hard processor system (HPS) for SoC variants
– Integrated 10GBASE-KR/40GBASE-KR4 Forward Error Correction (FEC)
Arria 10 devices are ideally suited for high performance, power-sensitive, midrange applications in such diverse markets as:
– Wireless—for channel and switch cards in remote radio heads and mobile backhaul
– Broadcast—for studio switches, servers and transport, videoconferencing, and pro audio/video
– Wireline—for 40G/100G muxponders and transponders, 100G line cards, bridging, and aggregation
– Compute and Storage—for flash cache, cloud computing servers, and server acceleration
– Medical—for diagnostic scanners and diagnostic imaging
– Military—for missile guidance and control, radar, electronic warfare, and secure communications
Target Markets for Arria 10 FPGAs and SoCs
Arria 10 devices meet the performance, power, and bandwidth requirements of next generation wireless infrastructure, broadcast, compute and storage, networking, and medical and military equipment.
By providing such a highly integrated device, Arria 10 FPGAs and SoCs significantly reduce BOM cost, form factor, and power consumption. Arria 10 devices allow you to differentiate your product through customization by implementing your intellectual property in both hardware and software.
For these applications, Arria 10 devices integrate both logic functions and processor functions in a highly integrated single device. The integrated ARM-based SoCs provide all the functionality of traditional FPGAs, eliminate the need for a local processor, and increase system performance by taking advantage of the tightly coupled high bandwidth interface between the core fabric and the hard processor system.
For Wireless infrastructure, particularly remote radio units, the industry has standardized on ARM-based ASSPs and SoCs for several generations. ARM is widely recognized as the industry leader in low power solutions. At 20 nm, the Dual ARM Cortex MPCore provides the best power efficiency of any GHz class of process. When combined with Altera’s industry leading programmable technology, this provides an ideal platform to address the performance, power, and form factor requirements of wireless remote radio unit and small cell base stations.
For Wireline communication equipment such as access, metro, core, and transmission equipment where the FPGA performs critical functions such as protocol bridging, packet framing, aggregation, and I/O expansion, SoCs now offer all this as well as integrated intelligent control and link management, sometimes referred to as Operations, Administration, and Maintenance (OAM). OAM typically is software that executes when a link is established or fails during operation. The integrated ARM processor can also be used for statistics and error monitoring and to minimize system downtime when a link is compromised or oversubscribed. Tight coupling of the processor and the data path (implemented in the core logic) saves time and results in significant savings in terms of operating expenses associated with system downtime and loss of quality of service.
For Compute and storage equipment, flash cache storage, the integrated ARM processor can be used to manage Flash sectors and improve overall life and reliability as well as offload the host processor and provide control for search and hardware acceleration functions for cloud storage equipment. The integrated ARM based HPS can configure the hard PCIe interfaces in PCIe root port configuration and also run link layers for SAS and SATA interfaces.
For Next generation Broadcast equipment, where “4K readiness” is the key technology driver, the integrated ARM processor subsystem eliminates the need for a local GHz class processor, which is commonly used for functions such as audio processing, video compression, video link management, and PCIe root port.
For Military applications, new security features such as Secure Boot, Encryption, and Authentication have been introduced for secure wireless and wireline communications, military radar, military intelligence equipment.
For Test and Medical applications, combining ARM HPS with support for high speed memory devices such as DDR4, and Hybrid Memory Cube (HMC) as well as high speed transceivers and embedded controllers such as PCIe Gen3, Arria 10 SoCs are ideal for next generation test and medical equipment.
Then you can also read The Next-Node Battle Begins – Altera Announces “Generation 10” [EE Journal, June 11, 2013], from which I will quote the following:
For the past three nodes or so, we’ve seen a back-and-forth battle between Altera and Xilinx. Most people think that Altera got the upper hand in 40/45nm products with their Stratix IV family. Two years later, Xilinx struck back hard at 28nm with Virtex-7. Now, it’s time for the “next” generation, and Altera is apparently ready to get the party started. The company has just announced their upcoming “Generation 10” FPGA families – and it looks like this node is gonna be a doozy!
as well as ARMing a New Generation – Altera Announces Processor Architecture for Gen X [EE Journal, Oct 29, 2013], from which it is worth quoting the following:
Altera is currently in a race with archrival Xilinx, whose first FinFET FPGAs will be riding in on TSMC’s 16nm FinFET process. Which horse is faster? Intel is widely believed to have superior process technology and has already been shipping 22nm FinFET-based devices. Those points go to Intel. TSMC, on the other hand, has vastly more experience as a merchant fab and has announced that they are working closely with Xilinx to accelerate their FinFET program, in a blitz whose marketing name is “FinFAST.”
At this point, therefore, it is unclear who will be shipping first, (and, except for bragging rights between the two companies, probably few people care.) It is likely that we will not see production devices from either company before 2015, so we are definitely in “future” mode here. It is also unclear how the performance attributes of the two companies’ offerings will stack up. Altera has shown more of their hand thus far, and their predictions are impressive – up to four million LUT-4 equivalent 1GHz programmable fabric, 56Gbps SerDes, better power efficiency, tons-o-RAM – and a high-powered processing subsystem in the SoC version. What’s the processing subsystem look like? That’s why we are gathered here today.
There was speculation that the architecture might be other-than-ARM since the manufacturer is none-other-than-Intel. As far as we know, Intel hasn’t historically been too keen on manufacturing competing processor architectures. However, two other, more important market forces are at work in this situation. First, Altera has made a huge commitment to the ARM architecture with their current-generation SoC FPGAs. Getting their customers committed to the ARM/FPGA architecture and then jumping ship and forcing them to migrate after only one generation would be a major inconvenience, and it would be a big black eye for Altera. It would have been very unlikely that Altera would have inked the Intel deal knowing that they couldn’t continue their ARM commitment.
Second, Intel is obviously trying to make a go at it in the merchant fab business. If the company had a hard-and-fast policy of never manufacturing a chip with an ARM architecture on board, they’d be severely limiting their market. While Intel has already been building FPGAs for both Tabula and Achronix, getting Altera in their stable is a whole ‘nuther deal. Putting aside petty concerns about processor architecture is a small price to pay for better street cred in the merchant fab business.
1. Why FPGAs? Why more FPGAs?
Since one of the greatest strengths of the FPGA is its ability to perform highly pipelined and complex algorithmic computations on data brought on-chip, Altera says that we can do better with explicit parallelism on FPGAs than on GPUs:
The spectrum of software-programmable devices is now evolving significantly. The emphasis is shifting from automatically extracting instruction-level parallelism at run time to explicitly identifying thread-level parallelism at coding time. Highly parallel multicore devices are beginning to emerge with a general trend of containing multiple simpler processors where more of the transistors are dedicated to computation rather than caching and extraction of parallelism. These devices range from multicore CPUs, which commonly have 2, 4, or 8 cores, to GPUs consisting of hundreds of simple cores optimized for data-parallel computation. To achieve high performance on these multicore devices, the programmer must explicitly code their applications in a parallel fashion. Each core must be assigned work in such a way that all cores can cooperate to execute a particular computation. This is also exactly what FPGA designers do to create their high-level system architectures.
(Source: Implementing FPGA Design with the OpenCL Standard [v. 2.0 Altera whitepaper, November 2012])
Field Programmable Gate Arrays
FPGAs are integrated circuits that can be configured repeatedly to perform an infinite number of functions. Low level operations such as bit masking, shifting, and addition are all configurable and can be assembled in any order. FPGAs achieve a high level of programmability by integrating combinations of lookup tables (LUTs), registers, on-chip memories, and arithmetic hardware (for example, digital signal processor (DSP) blocks) through a network of reconfigurable connections to implement computation pipelines. LUTs are responsible for implementing various logic functions. For example, reprogramming a LUT can change an operation from a bitwise AND logic function to a bit-wise XOR logic function.
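The LUT reprogramming described above can be sketched in a few lines of Python. This is a minimal illustration, not a model of any actual Altera logic element: it assumes a 2-input LUT (real devices use wider LUTs), and the names `lut2`, `AND_CONFIG` and `XOR_CONFIG` are hypothetical. The point is only that the "function" is a stored truth table, so changing the stored bits changes the logic operation.

```python
# A 2-input LUT modeled as a 4-entry truth table packed into an integer.
# Bit i of `config` is the output for inputs (a, b), where i = (a << 1) | b.
# Assumption: 2 inputs for readability; real FPGA LUTs have more inputs.

AND_CONFIG = 0b1000  # output is 1 only when a = 1 and b = 1
XOR_CONFIG = 0b0110  # output is 1 when exactly one input is 1


def lut2(config: int, a: int, b: int) -> int:
    """Return the output bit that the two input bits select from the table."""
    index = (a << 1) | b
    return (config >> index) & 1


def truth_table(config: int) -> list[int]:
    """Outputs for inputs (0,0), (0,1), (1,0), (1,1), in that order."""
    return [lut2(config, a, b) for a in (0, 1) for b in (0, 1)]
```

Calling `truth_table(AND_CONFIG)` and `truth_table(XOR_CONFIG)` on the same `lut2` function yields the AND and XOR truth tables respectively; only the configuration value differs.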
The key benefit in using FPGAs for algorithm acceleration is that they support wide and heterogeneous pipelines. Each pipeline implemented in the FPGA fabric can be wide and unique. This characteristic is in contrast to many different types of processing units such as symmetric multiprocessors (SMPs), DSPs, and graphics processing units (GPUs). In these types of devices, parallelism is achieved by replicating the same generic computation hardware multiple times. In FPGAs, however, parallelism can be achieved by duplicating only the logic that will be exercised by your algorithm.
A processor implements an instruction set that limits the amount of work that can be performed each clock cycle. For example, most processors do not have a dedicated instruction that can execute the following C code:
E = (((A + B) ^ C) & D) >> 2;
Without a dedicated instruction for this C code example, a CPU, DSP, or GPU must execute multiple instructions to perform the operation. You can configure an FPGA to perform a sequence of operations that implements the code above in a single clock cycle. An FPGA implementation connects specialized addition hardware with a LUT that performs the bit-wise XOR and AND operations. The device then leverages its programmable connections to perform a right shift by two bits without consuming any hardware resources. The result of this operation can be connected to subsequent operations to form complex pipelines. You may think of an FPGA as a hardware platform that can implement any instruction set that your software algorithm requires.
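To make the contrast concrete, here is a small Python sketch of the same expression evaluated the way a processor would, one operation per step. It assumes A through E are 32-bit unsigned values (the whitepaper does not state the width), and the function names are hypothetical. On an FPGA the four steps would instead be wired together as one combinational datapath, with the final shift costing only routing.

```python
MASK32 = 0xFFFFFFFF  # assumption: operands are 32-bit unsigned values


def as_instruction_sequence(a: int, b: int, c: int, d: int) -> int:
    """Evaluate the expression one operation at a time, as a processor would."""
    t = (a + b) & MASK32  # step 1: add, wrapping at 32 bits
    t = t ^ c             # step 2: bitwise XOR
    t = t & d             # step 3: bitwise AND
    return t >> 2         # step 4: right shift by two bits


def as_single_expression(a: int, b: int, c: int, d: int) -> int:
    """The same computation written as one expression, as in the C example."""
    return ((((a + b) & MASK32) ^ c) & d) >> 2
```

Both functions return the same value for any 32-bit inputs; the first simply makes the four separate operations visible.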
Altera SDK for OpenCL Pipeline Approach
The key difference between the pipeline generated by the Altera Offline Compiler (AOC) and a typical processor pipeline is that the FPGA pipeline is not limited to a statically defined set of pipeline stages or instruction set.
The custom pipeline structure provided by the AOC speeds up computation by allowing operations within a large number of threads to occur concurrently.
(Source: Altera SDK for OpenCL Optimization Guide
[for v. 13.0 SP1.0 by Altera, June 2013])
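Why a custom pipeline speeds up work across many threads can be seen from a simple cycle count. The sketch below is an idealized model, not a description of what the AOC actually generates: it assumes a fully pipelined datapath that accepts one new work-item per clock and never stalls, and the function names and the depth value used in the usage note are hypothetical.

```python
def pipelined_cycles(num_items: int, depth: int) -> int:
    """Cycles to finish num_items on a pipeline of `depth` stages
    that accepts one new item every clock cycle (no stalls assumed)."""
    if num_items == 0:
        return 0
    # The first item needs `depth` cycles; each later item completes
    # one cycle after the previous one.
    return depth + num_items - 1


def sequential_cycles(num_items: int, depth: int) -> int:
    """Cycles when each item must pass through all stages before the next starts."""
    return depth * num_items
```

For 1000 work-items and an assumed depth of 10 stages, the pipelined count is 1009 cycles against 10000 for the sequential case, so throughput approaches one result per clock once the pipeline is full.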
GPU and FPGA Design Methodology
GPUs are programmed using either Nvidia’s proprietary CUDA language, or an open standard OpenCL language. These languages are very similar in capability, with the biggest difference being that CUDA can only be used on Nvidia GPUs.
FPGAs are typically programmed using the HDL languages Verilog or VHDL. Neither of these languages is well suited to supporting floating-point designs, although the latest versions do incorporate definition, though not necessarily synthesis, of floating-point numbers. For example, in SystemVerilog, a shortreal variable is analogous to an IEEE single (float), and a real to an IEEE double.
OpenCL for FPGAs
OpenCL is familiar to GPU programmers. An OpenCL Compiler for FPGAs means that OpenCL code written for AMD or Nvidia GPUs can be compiled onto an FPGA. In addition, an OpenCL Compiler from Altera enables GPU programs to use FPGAs, without the necessity of developing the typical FPGA design skill set.
Using OpenCL with FPGAs offers several key advantages over GPUs. First, GPUs tend to be I/O limited. All input and output data must be passed by the host CPU through the PCI Express® (PCIe®) interface. The resulting delays can stall the GPU processing engines, resulting in lower performance.
OpenCL Extensions for FPGAs
FPGAs are well known for their wide variety of high-bandwidth I/O capabilities. These capabilities allow data to stream in and out of the FPGA over Gigabit Ethernet (GbE), Serial RapidIO® (SRIO), or directly from analog-to-digital converters (ADCs) and digital-to-analog converters (DACs). Altera has defined a vendor-specific extension of the OpenCL standard to support streaming operations. …
FPGAs can also offer a much lower processing latency than a GPU, even independent of I/O bottlenecks. It is well known that GPUs must operate on many thousands of threads to perform efficiently, due to the extremely long latencies to and from memory and even between the many processing cores of the GPU. In effect, the GPU must operate many, many tasks to keep the processing cores from stalling as they await data, which results in very long latency for any given task.
The FPGA uses a “coarse-grained parallelism” architecture instead. It creates multiple optimized and parallel datapaths, each of which outputs one result per clock cycle. The number of instances of the datapath depends upon the FPGA resources, but is typically much less than the number of GPU cores. However, each datapath instance has a much higher throughput than a GPU core. The primary benefit of this approach is low latency, a critical performance advantage in many applications.
Another advantage of FPGAs is their much lower power consumption, resulting in dramatically higher GFLOPs/W. FPGA power measurements using development boards show 5-6 GFLOPs/W for algorithms such as Cholesky and QRD, and about 10 GFLOPs/W for simpler algorithms such as FFTs. GPU energy efficiency measurements are much harder to find, but using the GPU performance of 50 GFLOPs for Cholesky and a typical power consumption of 200 W results in 0.25 GFLOPs/W, which is twenty times more power consumed per useful FLOP.
(Source: Radar Processing: FPGAs or GPUs? [v. 2.0 Altera whitepaper, May 2013])
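The efficiency comparison in the last quoted paragraph is plain arithmetic and can be checked directly. The sketch below uses only the figures quoted from the whitepaper (50 GFLOPs and 200 W for the GPU, 5 to 6 GFLOPs/W for the FPGA on Cholesky and QRD); the variable and function names are mine.

```python
def gflops_per_watt(gflops: float, watts: float) -> float:
    """Energy efficiency as sustained GFLOPs divided by power draw."""
    return gflops / watts


# Figures quoted from the whitepaper.
GPU_CHOLESKY_GFLOPS = 50.0
GPU_TYPICAL_WATTS = 200.0
FPGA_GFLOPS_PER_WATT_LOW = 5.0
FPGA_GFLOPS_PER_WATT_HIGH = 6.0

gpu_efficiency = gflops_per_watt(GPU_CHOLESKY_GFLOPS, GPU_TYPICAL_WATTS)
ratio_low = FPGA_GFLOPS_PER_WATT_LOW / gpu_efficiency
ratio_high = FPGA_GFLOPS_PER_WATT_HIGH / gpu_efficiency
```

With these inputs the GPU comes out at 0.25 GFLOPs/W, and the FPGA advantage is between 20 and 24 times, which matches the "twenty times" stated in the text for the lower bound.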
Altera also says that the need for ever-increasing bandwidth and flexibility drives the need for a breakthrough in capability:
The increased capabilities in smartphones and other portable devices are the reason for the dramatic leap in system performance that we will see in next-generation FPGAs. The explosion of mobility bandwidth requirements is putting a huge demand on the wireless, wired, and data center infrastructure capabilities. While the number of smartphones is growing at single digit percentage rates, the customers of these devices continue to drive more bandwidth with the ever-increasing smartphone capability. Much of this is due to the increased video content. In 2012, average smartphone data usage grew by 81 percent. Cisco expects mobile traffic to increase 66 percent per year through 2017 and two-thirds of all mobile traffic will be video content. At this time, mobile network speed is expected to increase by seven times and 4G networks to comprise 45 percent of all traffic (1) (see Figure 1).
The three infrastructure applications briefly reviewed below are examples of why hardware and software developers are looking to FPGAs to address their next-generation products’ bandwidth, performance, power, and cost goals.
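To get a feel for what 66 percent annual growth means, the compounding can be worked out in a short Python sketch. It assumes five years of compounding (2012 to 2017), which is my reading of "through 2017" and not something the quoted text states explicitly; the names are hypothetical.

```python
def compound_growth(annual_rate: float, years: int) -> float:
    """Total growth multiple after compounding `annual_rate` for `years` years."""
    return (1.0 + annual_rate) ** years


ANNUAL_GROWTH = 0.66  # 66 percent per year, as quoted
YEARS = 5             # assumption: 2012 through 2017

traffic_multiple = compound_growth(ANNUAL_GROWTH, YEARS)
```

Under that assumption mobile traffic in 2017 would be roughly 12.6 times the 2012 level.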
■ Wireless remote radio units
■ 400G wireline channel cards
■ Data centers
Wireless Remote Radio Units
In the capital-intensive wireless infrastructure market, telecommunications operators desire to provide more bandwidth faster and cheaper. The faster these operators can do cost reductions, the more deployments they can do, the more area they can cover, and the faster they can serve customers, which is a huge advantage. The product strategy of these companies is to keep the datapath width the same and increase the clock frequency for as many generations as they can. Upcoming remote radio units will look for FPGAs to push close to 500 MHz of core performance for complex functions, such as implementing digital pre-distortion algorithms. This will preserve their investment in their radio architecture and allow them to cover a broader spectrum of radio frequency (RF) bandwidth. In doing so they look to have a better return on investment because less work needs to be done re-architecting a solution. Furthermore, their time-to-market advantage improves by getting these new products out faster. They must also lower their operating costs to drive cost per bit down because revenues per mobile subscriber grow at a far lower rate than the data traffic per subscriber. Thus, not widening their datapath and creating power-efficient designs on smaller, more power-efficient FPGAs allows them to achieve this goal.
400G Channel Cards
Another driving force in improving FPGA performance is the need to upgrade the network communications infrastructure. Next-generation 400G versus existing 100G channel cards will dramatically push system capabilities. The bandwidth jump of four times in the next-generation systems is much greater than in previous iterations. Because the market for this is still new, companies cannot risk building ASICs or ASSPs to achieve this goal. Integration of multiple 56 gigabits per second (Gbps) and 28 Gbps transceiver solutions to accommodate this level of bandwidth is needed, but only a part of the solution. More and faster logic to accommodate this higher bandwidth is also required. However since the dimensions of the chassis do not change, the power envelope is limited. The network infrastructure cannot tolerate solutions where power increases at a linear rate with bandwidth capability. For packet processing and traffic management applications at 400G bandwidth at 600 million packets per second, scaling the data path width and frequency can relieve the data path processing function but cannot scale for control path processing such as scheduling. Therefore high performance in all aspects of device capability is required: processing, memory interfacing, IO interfaces, and others. FPGAs remain the most attractive solution, but companies will need investments in higher performance per watt architectures, transceivers, and process technology to address this large leap in capabilities and challenges.
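The figure of 600 million packets per second at 400G is consistent with minimum-size Ethernet frames, and the arithmetic shows why control-path processing is so tight. The sketch below assumes that is what the whitepaper has in mind (it does not say so), and uses the standard Ethernet overheads of an 8-byte preamble and a 12-byte inter-frame gap around a 64-byte minimum frame. The 500 MHz clock in the usage note is borrowed from the remote radio unit section purely for illustration.

```python
ETH_MIN_FRAME_BYTES = 64  # minimum Ethernet frame
ETH_PREAMBLE_BYTES = 8    # preamble plus start-of-frame delimiter
ETH_IFG_BYTES = 12        # minimum inter-frame gap


def max_packet_rate(line_rate_bps: float, frame_bytes: int = ETH_MIN_FRAME_BYTES) -> float:
    """Packets per second when the link is saturated with frames of one size."""
    wire_bits = (frame_bytes + ETH_PREAMBLE_BYTES + ETH_IFG_BYTES) * 8
    return line_rate_bps / wire_bits


def time_per_packet_ns(line_rate_bps: float, frame_bytes: int = ETH_MIN_FRAME_BYTES) -> float:
    """Time budget available for each packet, in nanoseconds."""
    return 1e9 / max_packet_rate(line_rate_bps, frame_bytes)


def packets_per_clock(line_rate_bps: float, clock_hz: float) -> float:
    """Minimum-size packets arriving per clock cycle of the processing logic."""
    return max_packet_rate(line_rate_bps) / clock_hz
```

At 400 Gbps this gives about 595 million packets per second and a budget of 1.68 ns per packet. With logic clocked at an assumed 500 MHz (a 2 ns period), more than one minimum-size packet arrives per cycle, so per-packet decisions such as scheduling cannot be handled by simply widening the datapath.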
Data Centers
All the data and video that are being pushed and downloaded from these new wireless deployments and transported through the new 400G packet processing infrastructure also need to be stored and processed. Computations per watt and computations per dollar are key metrics in data centers. FPGAs are increasingly used in the data center for data access, algorithm, and networking acceleration. Data center servers are bottlenecked getting access to data. The latest processors have more and more cores, but the bandwidth to external memory and data is not keeping pace with the increase in computing power. Many of these servers are running at average utilization rates and are well under peak processing power. These servers are good candidates for FPGA acceleration. Hardware acceleration through FPGAs becomes an attractive alternative to replacing these processors by focusing on the performance bottlenecks that software on processors cannot overcome.
Other applications are also looking to FPGAs to support their increased bandwidth requirements, such as video content providers moving to 4K video, cloud computing, and intelligence applications in defense. These applications face similar issues. (Source: Expect a Breakthrough Advantage in Next-Generation FPGAs [v. 1.0 Altera whitepaper, June 2013])
2. Why SoC FPGAs?
Altera’s Vision of Silicon Convergence: system solutions by merging coarse and fine grained programmable hardware [IEEE Computer Society Santa Clara Valley YouTube channel, recorded on Sept 10, 2012, published on June 10, 2013]
What Is a PLD?
- A programmable logic device (PLD) is a type of semiconductor
- Most semiconductors can be programmed only once to perform a specific function
- PLDs are reprogrammable—functions can be changed or enhanced during development or after manufacturing
Flexibility Makes PLDs Lower Risk and Faster to Design Than Other Types of Semiconductors
3. Why ARM with FPGA on the Intel Tri-Gate (FinFET) process, and why now?
Altera Announces Quad-Core 64-bit ARM Cortex-A53 for Stratix 10 SoCs [press release, Oct 29, 2013]
Manufactured on Intel’s 14 nm Tri-Gate Process, Altera Stratix® 10 SoCs Will Deliver Industry’s Most Versatile Heterogeneous Computing Platform
Altera Corporation (NASDAQ: ALTR) today announced that its Stratix 10 SoC devices, manufactured on Intel’s 14 nm Tri-Gate process, will incorporate a high-performance, quad-core 64-bit ARM Cortex™-A53 processor system, complementing the device’s floating-point digital signal processing (DSP) blocks and high-performance FPGA fabric. Coupled with Altera’s advanced system-level design tools, including OpenCL, this versatile heterogeneous computing platform will offer exceptional adaptability, performance, power efficiency and design productivity for a broad range of applications, including data center computing acceleration, radar systems and communications infrastructure.
The ARM Cortex-A53 processor, the first 64-bit processor used on a SoC FPGA, is an ideal fit for use in Stratix 10 SoCs due to its performance, power efficiency, data throughput and advanced features. The Cortex-A53 is among the most power efficient of ARM’s application-class processors, and when delivered on the 14 nm Tri-Gate process will achieve more than six times more data throughput compared to today’s highest performing SoC FPGAs. The Cortex-A53 also delivers important features, such as virtualization support, 256TB memory reach and error correction code (ECC) on L1 and L2 caches. Furthermore, the Cortex-A53 core can run in 32-bit mode, which will run Cortex-A9 operating systems and code unmodified, allowing a smooth upgrade path from Altera’s 28 nm and 20 nm SoC FPGAs.
“ARM is pleased to see Altera adopting the lowest power 64-bit architecture as an ideal complement to DSP and FPGA processing elements to create a cutting-edge heterogeneous computing platform,” said Tom Cronk, executive vice president and general manager, Processor Division, ARM. “The Cortex-A53 processor delivers industry-leading power efficiency and outstanding performance levels, and it is supported by the ARM ecosystem and its innovative software community.”
Leveraging Intel’s 14 nm Tri-Gate process and an enhanced high-performance architecture, Altera Stratix 10 SoCs will have a programmable-logic performance level of more than 1GHz; two times the core performance of current high-end 28 nm FPGAs.
“High-end networking and communications infrastructure are rapidly migrating toward heterogeneous computing architectures to achieve maximum system performance and power efficiency,” said Linley Gwennap, principal analyst at The Linley Group, a leading embedded research firm. “What Altera is doing with its Stratix 10 SoC, both in terms of silicon convergence and high-level design tool support, puts the company at the forefront of delivering heterogeneous computing platforms and positions them well to capitalize on myriad opportunities.”
By standardizing on ARM processors across its three-generation SoC portfolio, Altera will offer software compatibility and a common ARM ecosystem of tools and operating system support. Embedded developers will be able to accelerate debug cycles with Altera’s SoC Embedded Design Suite (EDS) featuring the ARM Development Studio 5 (DS-5™) Altera® Edition toolkit, the industry’s only FPGA-adaptive debug tool, as well as use Altera’s software development kit (SDK) for OpenCL to create heterogeneous implementations using the OpenCL high-level design language.
“With Stratix 10 SoCs, designers will have a versatile and powerful heterogeneous compute platform enabling them to innovate and get to market faster,” said Danny Biran, senior vice president, corporate strategy and marketing at Altera. “This will be very exciting for customers as converged silicon continues to be the best solution for complex, high-performance applications.”
Altera® programmable solutions enable designers of electronic systems to rapidly and cost effectively innovate, differentiate and win in their markets. Altera offers FPGAs, SoCs, CPLDs, ASICs and complementary technologies, such as power management, to provide high-value solutions to customers worldwide. Follow Altera via Facebook, Twitter, LinkedIn, Google+ and RSS, and subscribe to product update emails and newsletters. altera.com
Altera to Build Next-Generation, High-Performance FPGAs on Intel’s 14 nm Tri-Gate Technology [alteracorp YouTube channel, March 11, 2013]
From: Intel takes big step in chip foundry business [Reuters, Feb 25, 2013]
Altera Chief Executive John Daane told Reuters in a phone interview that Altera, which depends on communications infrastructure for about half of its business, is the only major programmable chipmaker that will have access to Intel’s plants.
“We are essentially getting access like an extra division of Intel. As soon as they’re making the technology available to their various groups to do design work, we’re getting the same,” he said.
Daane said Intel’s manufacturing technology will give Altera’s chips a several-year advantage against Xilinx, its main competitor in programmable chips. He said Altera would continue to make other chips with TSMC, its long-time foundry.
Altera to Build Next-Generation, High-Performance FPGAs on Intel’s 14 nm Tri-Gate Technology [press release, Feb 25, 2013]
Altera Corporation and Intel Corporation today announced that the companies have entered into an agreement for the future manufacture of Altera FPGAs on Intel’s 14 nm tri-gate transistor technology. These next-generation products, which target ultra high-performance systems for military, wireline communications, cloud networking, and compute and storage applications, will enable breakthrough levels of performance and power efficiencies not otherwise possible.
“Altera’s FPGAs using Intel 14 nm technology will enable customers to design with the most advanced, highest-performing FPGAs in the industry,” said John Daane, president, CEO and chairman of Altera. “In addition, Altera gains a tremendous competitive advantage at the high end in that we are the only major FPGA company with access to this technology.”
Altera’s next-generation products will now include 14 nm, in addition to previously announced 20 nm technologies, extending the company’s tailored product portfolio that meets myriad customer needs for performance, bandwidth and power efficiency across diverse end applications.
“We look forward to collaborating with Altera on manufacturing leading-edge FPGAs, leveraging Intel’s leadership in process technology,” said Brian Krzanich, chief operating officer, Intel. “Next-generation products from Altera require the highest performance and most power-efficient technology available, and Intel is well positioned to provide the most advanced offerings.”
Adding this world-class manufacturer to Altera’s strong foundation of leading-edge suppliers and partners furthers the company’s ability to deliver on the promise of silicon convergence; to integrate hardware and software programmability, microprocessors, digital signal processing, and ASIC capability into a single device; and deliver a more flexible and economical alternative to traditional ASICs and ASSPs.
Altera claims that only Intel’s 14 nm Tri-Gate Process offers a second generation of proven production technology:
Transistor Design Background
In 1947 the first transistor, a germanium ‘point-contact’ structure, was demonstrated at Bell Laboratories. Silicon was first used to produce bipolar transistors in 1954, but it was not until 1960 that the first silicon metal oxide semiconductor field-effect transistor (MOSFET) was built. The earliest MOSFETs were 2D planar devices with current flowing along the surface of the silicon under the gate. The basic structure of MOSFET devices has remained substantially unchanged for over 50 years.
Since the prediction or proclamation of Moore’s Law in 1965, many additional enhancements and improvements have been made to the manufacture and optimization of MOSFET technology in order to enshrine Moore’s Law in the vocabulary and product planning cycles of the semiconductor industry. In the last 10 years, the continued improvement in MOSFET performance and power has been achieved by breakthroughs in strained silicon, and High-K metal gate technology.
It was not until the publication of a paper by Digh Hisamoto and a team of other researchers at Hitachi Central Research Laboratory in 1991 that the potential for 3-D, or ‘wraparound’ gate transistor technology, to enhance MOSFET performance and eliminate short channel effects, was recognized. This paper called the proposed 3-D structure ‘depleted lean-channel transistor’, or DELTA(1). In 1997 the Defense Advanced Research Projects Agency (DARPA) awarded a contract to a research group at the University of California, Berkeley, to develop a deep sub-micron transistor based on the DELTA concept. One of the earliest publications resulting from this research in 1999 dubbed the device a ‘FinFET’ for the fin-like structure at the center of the transistor geometry(2).
Important Turning Point in Transistor Technology
Optimization and manufacturability studies on 3-D transistor structures continued at research and development organizations in leading semiconductor companies. Some of the process and patent development has been published and publicly shared, and some development remained in corporate labs.
The research investment interests of the semiconductor industry are driven by the International Technology Roadmap for Semiconductors (ITRS), which is coordinated and published by a consortium of manufacturers, suppliers, and research institutes. The ITRS defines transistor technology requirements to achieve continued improvement in performance, power, and density along with options which should be explored to achieve the goals. The ITRS and its public documentation captures conclusions and recommendations regarding manufacturing capabilities like strained silicon and High-K metal gate, and now the use of 3-D transistor technologies to maintain the benefits of Moore’s law. Based on documents produced by the ITRS and an examination of academic papers and patent filings, research into 3-D transistor technologies has grown dramatically in the last decade.
Adoption and Research
Two important pronouncements occurred in the last two years that have propelled the 3-D transistor structure into the industry spotlight, and into a permanent place in the technology story of MOSFET transistors.
The first announcement was by Intel Corporation on 4th of May, 2011, about their Tri-Gate transistor design that had been selected for the design and manufacture of their 22 nm semiconductor products. This was preceded by a decade of research and development taking advantage of the work of Hisamoto and others in FinFET development and optimization. It represented both a solid acknowledgment of the feasibility and cost-effectiveness of the Tri-Gate transistor structure in semiconductor production, as well as a continued declaration of leadership by Intel in semiconductor technology.
The second announcement was the publication of ITRS technology roadmaps, with contributions from many other semiconductor manufacturing companies that identified 3-D transistor technology as the primary enabler of all incremental semiconductor improvement beyond the 20 nm or 22 nm design node.
Intel’s Leadership in Transistor Technologies
In several public forums, including the Intel Developer’s Forums and investor’s conferences, Intel identifies where they have demonstrated technology leadership in a variety of advances that have sustained the pace of Moore’s Law. As shown in Figure 3, Intel has identified the number of years of production leadership they have achieved in bringing strained silicon and High-K metal gate technology to full production. In the case of 3-D Tri-Gate transistor technology, Intel estimates a lead of up to four years based on their production rollout of Tri-Gate technology at 22 nm in 2011.
According to former Intel CEO Paul Otellini in the company’s 16 April 2013 earnings call(8):
“In the first quarter [of 2013], we shipped our 100 millionth 22 nanometer [Tri-Gate] processor, using our revolutionary 3-D transistor technology, while the rest of the industry works to ship its first unit.”
Another leadership advantage that will be held by Intel in their rollout of 14 nm technology can be traced to their very public ‘Tick-Tock’ strategy in process and microarchitecture introduction. A ‘tick’ cycle of product introduction relies on a semiconductor process manufacturing geometry shrink, followed by a ‘tock’ cycle that implements microarchitecture changes in their CPU products. Intel is firmly committed to a full process shrink in their move from 22 nm to 14 nm; other manufacturers developing comparable semiconductor processes have been less clear about whether their process roadmaps include the benefits of a process shrink.
(Source: The Breakthrough Advantage for FPGAs with Tri-Gate Technology [v. 1.0 Altera whitepaper, June 2013])
Altera says beginning with 14 nm Tri-Gate technology, the highest performance FPGAs will simply be the ones built on demonstrably superior transistor technology:
Accessing the Benefits of Tri-Gate Technology Through Altera FPGAs
Taking advantage of the significant benefits of Intel’s Tri-Gate technology is only possible for users of Altera® high-density and high-performance FPGAs on the 14 nm technology process. This is the result of an exclusive manufacturing partnership between the two companies referenced in the introduction to this paper.
The substantial advantages of Tri-Gate silicon technologies will allow Altera to deliver previously unimaginable performance in FPGA and SoC products. This will include a historic doubling of core performance as compared to other high-end FPGAs, bringing FPGAs to the Gigahertz performance level. Overall active and static power numbers will reduce by 70 percent through a combination of process, architecture, and software advances.
Although the details and schedules of the 14 nm manufacturing process are not yet publicly available from Intel Corporation, Altera users can begin designs today that take advantage of the significant performance and power efficiency benefits of Tri-Gate technology in FPGAs. This is possible by beginning designs with the Arria® 10 portfolio of 20 nm FPGA devices. Users can then take advantage of pin-for-pin design migration pathways from Arria 10 FPGA and SoC products to Stratix® 10 FPGA and SoC products as they become available.
This allows you, as an FPGA user and system architect, to begin designing products that can accommodate both the Arria 10 and Stratix 10 product families with minimal changes, modifications, and reengineering. This will allow you to get products to market with the highest performance and lowest power FPGAs that leverage 20 nm process technology and power reduction techniques, then advance these same products to the previously unimaginable performance and power efficiency of Intel’s 14 nm Tri-Gate manufacturing process.
(Source: The Breakthrough Advantage for FPGAs with Tri-Gate Technology [v. 1.0 Altera whitepaper, June 2013])
Altera Announces Breakthrough Advantages with Generation 10 [press release, June 10, 2013]
- Stratix 10 FPGAs and SoCs leverage Intel’s 14 nm Tri-Gate process and an enhanced architecture to deliver core performance two times higher than current high-end FPGAs, while enabling up to 70 percent power savings.
- Arria 10 FPGAs and SoCs reinvent the midrange by simultaneously surpassing high-end FPGAs in performance while delivering 40 percent lower power than today’s midrange devices.
Altera Corporation (NASDAQ: ALTR) today introduced its Generation 10 FPGAs and SoCs, offering system developers breakthrough levels of performance and power efficiencies. Generation 10 devices are optimized based on process technology and architecture to deliver the industry’s highest performance and highest levels of system integration at the lowest power. Initial Generation 10 families include Arria® 10 and Stratix® 10 FPGAs and SoCs with embedded processors. Generation 10 devices leverage the most advanced process technologies in the industry, including Intel’s 14-nm Tri-Gate process and TSMC’s 20 nm process. Early access customers are currently using the Quartus® II software for Generation 10 product development.
“Our Generation 10 products will strengthen the penetration of programmable logic into new markets and applications and further accelerate the implementation of FPGAs into systems traditionally served by ASSPs and ASICs,” said Patrick Dorsey, senior director of product marketing at Altera. “The optimizations we made in our Generation 10 devices allow customers to develop highly customized solutions that dramatically increase system performance and system integration while lowering operating expenses.”
Delivering the Unimaginable with Stratix 10 FPGAs and SoCs
Stratix 10 FPGAs and SoCs are designed to enable the most advanced, highest performance applications in the communications, military, broadcast and compute and storage markets, while slashing system power. Leveraging Intel’s 14 nm Tri-Gate process and an enhanced high-performance architecture, Stratix 10 FPGAs and SoCs have an operating frequency over one gigahertz, 2X the core performance of current high-end 28 nm FPGAs. For high-performance systems that have the most strict power budgets, Stratix 10 devices allow customers to achieve up to a 70 percent reduction in power consumption at performance levels equivalent to the previous generation.
Altera is announcing the technology details of Stratix 10 FPGAs and SoCs today as part of the Generation 10 portfolio introduction, and will disclose more details on the product at a later date. Stratix 10 FPGAs and SoCs provide the industry’s highest performance and highest levels of system integration, including:
- More than four million logic elements (LEs) on a single die
- 56-Gbps transceivers
- More than 10-TeraFLOPs single-precision digital signal processing
- A third-generation ultra-high-performance processor system
- Multi-die 3D solutions capable of integrating SRAM, DRAM and ASICs
Reinventing the Midrange with Arria 10 FPGAs and SoCs
Arria 10 FPGAs and SoCs are the first device families to roll out as part of the Generation 10 portfolio. The device family sets a new bar for midrange programmable devices, delivering both the performance and capabilities of current high-end FPGAs at the lowest midrange power. Leveraging an enhanced architecture that is optimized for TSMC’s 20 nm process, Arria 10 FPGAs and SoCs deliver higher performance at up to 40 percent lower power compared to the previous device family.
Arria 10 devices offer more features and capabilities than today’s current high-end FPGAs, at 15 percent higher performance. Reflecting the trend toward silicon convergence, Arria 10 FPGAs and SoCs offer the highest degree of system integration available in midrange devices, including 1.15 million LEs, integrated hard intellectual property and a second-generation processor system that features a 1.5 GHz dual-core ARM® Cortex™-A9 processor. Arria 10 FPGAs and SoCs also provide 4X greater bandwidth compared to the current generation, including 28-Gbps transceivers, and 3X higher system performance, including 2666 Mbps DDR4 support and up to 15-Gbps Hybrid Memory Cube support.
Development Suite Delivers Breakthrough Productivity to Generation 10
Generation 10 devices are supported by Altera’s Quartus II development software and tools for higher level design flows that include a software development kit for OpenCL™, a SoC Embedded Design Suite and DSP Builder tool. This leading-edge development tool suite enables design teams to maximize productivity while making it easier for new design teams to adopt Generation 10 FPGAs and SoCs in their next-generation systems. The Quartus II software will continue to deliver the industry’s fastest compile times by providing Generation 10 FPGAs and SoCs an 8X improvement in compile times versus the previous generation. The substantial reduction in compile times is the result of leading-edge software algorithms that take advantage of modern multi-core computing technologies.
Early access customers are currently using the Quartus II software for development of Arria 10 FPGA and SoCs. Initial samples of Arria 10 devices will be available in early 2014. Altera will have 14 nm Stratix 10 FPGA test chips in 2013 and Quartus II software support for Stratix 10 FPGAs and SoCs in 2014. For more information, visit www.altera.com/gen10, or contact your local Altera sales representative.
Altera and TSMC Continue Long-Term Partnership [press release, Feb 25, 2013]
Altera Corporation (NASDAQ: ALTR) and TSMC (TWSE: 2330, NYSE: TSM) today reaffirmed their commitment to a long-term partnership to set new milestones in FPGA innovation. TSMC is Altera’s primary foundry, supplying a wide array of processes to fulfill Altera’s product portfolio, including soon-to-be released 20 nm products, existing mainstream products, and long-lived legacy components.
Altera is fully engaged with TSMC on developing products based on next-generation process technologies. Altera’s next major product family leverages TSMC’s cost-effective 20SoC process for optimal power and performance and will include several significant product and technology innovations for both companies. Altera will continue to leverage future TSMC process technologies in its tailored product portfolio for performance, bandwidth, and power efficiency needs across diverse end applications.
“Over the course of our 20-year collaboration, Altera and TSMC have achieved many industry milestones that have greatly benefitted both companies,” said John Daane, president, CEO and chairman of Altera. “TSMC remains an important part of our future product development. We look forward to continuing our close partnership to jointly develop technologies for next-generation products.”
Morris Chang, TSMC’s chairman and CEO added, “The history of collaboration between Altera and TSMC has exemplified the way fabless and foundry have nurtured each other to become a powerful force in the semiconductor industry. TSMC would not be where it is today without customers like Altera, and I firmly believe this partnership will continue to flourish.”
Altera Demonstrates Industry’s First 32-Gbps Transceiver with Leading-Edge 20 nm Device [press release, April 8, 2013]
Demonstration Highlights Latest Success in Altera’s 20 nm FPGA Early Access Program
San Jose, Calif., April 8, 2013– Altera Corporation (NASDAQ: ALTR) today announced the company achieved another significant milestone in transceiver technology by demonstrating the industry’s first programmable device with 32-Gbps transceiver capabilities. The demonstration uses a 20 nm device based on TSMC’s 20SoC process technology. This achievement validates the performance capabilities of 20 nm silicon and is a positive indicator to the more than 500 customers in Altera’s early access program who are looking to use next-generation Altera devices in the development of performance demanding, bandwidth-centric applications. A demonstration video showing the industry’s first operational 20 nm transceiver technology operating at 32 Gbps is available for viewing on Altera’s website at www.altera.com/32gbps-20nm.
Demonstrating 32-Gbps transceiver data rates provides Altera insight into how high-performance transceiver designs behave on TSMC’s 20SoC process. The transceiver technology Altera is demonstrating today will be integrated into its 20 nm FPGA products, fabricated on TSMC’s 20SoC process. These devices enable customers to design next-generation serial links with the lowest power consumption, fastest timing closure and the highest quality signal integrity. Altera has a proven track record in integrating leading-edge transceiver technology into its devices. Altera is the only company today shipping production 28 nm FPGAs with monolithically integrated low-power transceivers operating at 28 Gbps. Being the first FPGA vendor to reach the 32-Gbps milestone in 20 nm silicon further extends Altera’s leadership in transceiver technology.
The demonstration video on Altera’s web site shows 20 nm transceivers operating at 32 Gbps with just over nine picoseconds of total jitter and extremely low random jitter of 240 femtoseconds. The results show good margin to key industry specifications required for next-generation 100G systems.
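To put those jitter figures in context, they can be expressed as a fraction of one unit interval (UI), the duration of a single bit at the line rate. The short Python sketch below does that arithmetic. The function names are illustrative, and "just over nine picoseconds" is approximated as 9 ps, so the result is a rough lower bound rather than a measured value.

```python
def unit_interval_ps(line_rate_gbps: float) -> float:
    """Duration of one bit, in picoseconds, at the given line rate."""
    # 1 / (Gbit/s) gives nanoseconds per bit; multiply by 1000 for picoseconds.
    return 1000.0 / line_rate_gbps


def jitter_in_ui(jitter_ps: float, line_rate_gbps: float) -> float:
    """Express a jitter value (in ps) as a fraction of one unit interval."""
    return jitter_ps / unit_interval_ps(line_rate_gbps)


LINE_RATE_GBPS = 32.0      # from the press release
TOTAL_JITTER_PS = 9.0      # assumption: "just over nine picoseconds" taken as 9 ps
RANDOM_JITTER_PS = 0.240   # 240 femtoseconds

ui = unit_interval_ps(LINE_RATE_GBPS)
print(f"Unit interval: {ui:.2f} ps")
print(f"Total jitter:  {jitter_in_ui(TOTAL_JITTER_PS, LINE_RATE_GBPS):.3f} UI")
print(f"Random jitter: {jitter_in_ui(RANDOM_JITTER_PS, LINE_RATE_GBPS):.4f} UI")
```

At 32 Gbps one bit lasts 31.25 ps, so roughly nine picoseconds of total jitter consumes a little under 30 percent of the bit period.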
“Today’s news represents a significant milestone for the industry and for the transceiver development team at Altera,” said Vince Hu, vice president of product and corporate marketing at Altera. “These 20 nm devices contain the key IP components that will be included in our next-generation FPGAs and validating them now provides us confidence we will deliver to the market 20 nm FPGAs on schedule.”
Altera’s next-generation transceiver innovations enable system developers to support the rapidly increasing amount of data that is being transmitted through the world’s networks. The transceivers in Altera’s next-generation devices will drive more bandwidth with lower power per channel versus the previous nodes and will support increasing port density by interfacing directly to 100G CFP2 optical modules.
Altera and Micron Lead Industry with FPGA and Hybrid Memory Cube Interoperability [joint press release, Sept 4, 2013]
Altera Corporation (NASDAQ: ALTR) and Micron Technology, Inc. (NASDAQ: MU) (“Micron”) today announced they have jointly demonstrated successful interoperability between Altera Stratix® V FPGAs and Micron’s Hybrid Memory Cube (HMC). This technology achievement enables system designers to evaluate today the benefits of HMC with FPGAs and SoCs for next-generation communications and high-performance computing designs. The demonstration provides an early proof point that production support of HMC will be delivered with Altera’s Generation 10 portfolio, in alignment with market timing, and includes both Stratix 10 and Arria 10 FPGAs and SoCs.
HMC has been recognized by industry leaders and influencers as the long-awaited answer to address the limitations imposed by conventional memory technology, and provides ultra-high system performance with significantly lower power-per-bit. HMC delivers up to 15 times the bandwidth of a DDR3 module and uses 70 percent less energy and 90 percent less space than existing technologies. HMC’s abstracted memory allows designers to devote more time leveraging HMC’s revolutionary features and performance and less time navigating the multitude of memory parameters required to implement basic functions. It also manages error correction, resiliency, refresh, and other parameters exacerbated by memory process variation. Micron expects to begin sampling HMC later this year with volume production ramping in 2014.
“As one of the founding developers of the HMC Consortium, Altera’s support for and involvement with HMC has been invaluable,” said Brian Shirley, vice president of DRAM solutions for Micron Technology. “The combination of Altera FPGAs with Micron’s HMC solution will help customers leverage the technology’s performance and efficiency in a wide range of next generation networking and computing applications.”
Altera’s 28 nm Stratix V FPGAs are an ideal demonstration of HMC technology since they are the highest performance FPGAs in the industry with a two speed-grade advantage over the nearest competitor. This performance enables the FPGA to leverage the full bandwidth, efficiency and power benefits of HMC by using a full 16 transceiver HMC link.
“By demonstrating Stratix V and HMC working together now, we are enabling our customers to leverage their current development with Stratix V FPGAs and prepare for production deployment in Altera’s Generation 10 devices, knowing they will have proven HMC support,” said Danny Biran, senior vice president of marketing and corporate strategy at Altera. “The partnership between Altera and Micron to deliver this capability puts our customers at the forefront of innovation.”
Altera’s Generation 10 Devices Deliver Performance
Arria 10 FPGAs and SoCs are the first device families in the Generation 10 portfolio and will be the first devices to support HMC technology in volume production. Leveraging an enhanced architecture optimized for TSMC’s 20 nm process, Arria 10 FPGAs and SoCs will use HMC to extend the benefits by providing both 15 percent higher core performance than today’s highest performance Stratix V FPGAs and up to 40 percent lower power compared to the lowest power Arria V midrange FPGAs. Arria 10 FPGAs and SoCs will offer up to 96 transceiver channels, enabling customers to take full advantage of the bandwidth that HMC has to offer.
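As a back-of-the-envelope check on what those transceiver counts could mean for HMC, the Python sketch below combines three figures quoted in the press releases: a full HMC link of 16 transceivers, up to 96 transceiver channels on Arria 10, and up to 15-Gbps HMC support. It is only an upper-bound illustration. It assumes every transceiver is dedicated to HMC, ignores protocol and encoding overhead, and the source does not state how many HMC links a device can actually host. The function names are hypothetical.

```python
LANES_PER_FULL_HMC_LINK = 16    # "a full 16 transceiver HMC link"
ARRIA10_MAX_TRANSCEIVERS = 96   # "up to 96 transceiver channels"
HMC_LANE_RATE_GBPS = 15.0       # "up to 15-Gbps Hybrid Memory Cube support"


def max_full_links(transceivers: int, lanes_per_link: int) -> int:
    """Number of complete links that fit into the available transceivers."""
    return transceivers // lanes_per_link


def raw_link_gbps(lanes: int, lane_rate_gbps: float) -> float:
    """Raw line rate of one link in one direction, before any overhead."""
    return lanes * lane_rate_gbps


links = max_full_links(ARRIA10_MAX_TRANSCEIVERS, LANES_PER_FULL_HMC_LINK)
per_link = raw_link_gbps(LANES_PER_FULL_HMC_LINK, HMC_LANE_RATE_GBPS)

print(f"Full 16-lane links that fit in 96 transceivers: {links}")
print(f"Raw rate per link, per direction: {per_link:.0f} Gbps")
print(f"Raw aggregate, per direction: {links * per_link:.0f} Gbps")
```

A real design would reserve transceivers for other interfaces, so the practical number of links would be lower than this ceiling.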
Stratix 10 FPGAs and SoCs will enable the most advanced, highest performance applications across communications, military, broadcast and compute and storage markets. These high-performance applications often require the highest memory bandwidth, which drives the need for an HMC-ready architecture. Leveraging Intel’s 14 nm Tri-Gate process and an enhanced high-performance architecture that integrates with HMC technology, Stratix 10 FPGAs and SoCs will enable system solutions with an operating frequency over one gigahertz, and two times the core performance of current high-end 28 nm FPGAs. Stratix 10 devices will also allow customers to achieve up to a 70 percent reduction in power consumption at performance levels equivalent to the previous generation.
4. OpenCL for FPGAs
Altera SDK for OpenCL is First in Industry to Achieve Khronos Conformance for FPGAs [press release, Oct 16, 2013]
Altera Passes OpenCL Conformance with High-Performance Stratix V FPGA and Demonstrates SDK for OpenCL on ARM-based Cyclone V SoCs
San Jose, Calif., October 16, 2013—Altera Corporation (NASDAQ: ALTR) today announced its SDK for OpenCL is conformant to the OpenCL 1.0 standard and is now included on the Khronos Group list of OpenCL conformant products. Altera is the only company to offer an FPGA-optimized OpenCL solution, allowing software developers to harness the massively parallel architecture of an FPGA for system acceleration. Altera will demonstrate its OpenCL solutions at the 2013 Linley Processor Conference, being held October 16-17 in Santa Clara, Calif.
Achieving conformance allows Altera to provide a validated cross-platform programming environment that can be used to dramatically accelerate algorithms at significantly lower power versus alternative computer hardware architectures. To become conformant, Altera successfully completed more than 8500 conformance tests using its SDK for OpenCL, targeting a high-performance Stratix® V FPGA. The tests involved continuously running a Stratix V FPGA accelerator card in a server farm resulting in zero errors.
“Our continued investment in OpenCL is enabling Altera to drive the industry toward using FPGAs for acceleration of computationally-intensive applications,” said Alex Grbic, director of software, IP and DSP marketing at Altera. “Our SDK for OpenCL is used by some of the world’s leading developers of high-performance computing systems. These developers require Khronos group OpenCL conformance and Altera is the only FPGA vendor to achieve it, proving the readiness of our solution.”
Software developers can easily take advantage of the high performance and low power that FPGAs offer. Altera’s SDK for OpenCL provides an industry-standard open source programming interface and Altera’s Preferred Board Partner Program for OpenCL provides off-the-shelf FPGA boards that are optimized for Altera devices. A list of preferred board partners, as well as a variety of design examples that demonstrate the advantages of using FPGAs in high-performance systems, can be found at www.altera.com/opencl.
OpenCL Ray Tracer Demonstration Targeting Single-chip SoCs
In addition to support for its high-performance Stratix V FPGAs, Altera developed its SDK for OpenCL to support its low-power, low-cost Cyclone® V SoCs, which integrate an ARM® Cortex®-A9 processor into a 28 nm FPGA. Altera recently used its SDK for OpenCL to develop and demonstrate a complete heterogeneous system using a Cyclone V SoC. The demonstration shows how a ray tracing algorithm used to render 3D graphics can be accelerated using the Altera SDK for OpenCL and a Cyclone V SoC – achieving a speedup of 40X in comparison to running the same algorithm purely on a discrete ARM processor system. For software developers unfamiliar with hardware design languages, no hardware expertise is required to implement the OpenCL kernels.
Altera SDK for OpenCL at Linley Processor Conference
Altera will demonstrate its OpenCL solutions at the 2013 Linley Tech Processor Conference, being held October 16-17 in Santa Clara, Calif. Altera’s participation includes a presentation titled “Implementing Deep Packet Inspection Using OpenCL Channels” that will show how to express a DPI application using OpenCL with Altera FPGAs. Altera will also demonstrate its SDK for OpenCL solutions to attendees.
LEAP 2013 : Developing High-Performance Low-Power Solutions using FPGAs and OpenCL by Craig Davis — Altera Corporation [LEAPconf YouTube channel, recorded on May 21, 2013, published on Sept 12, 2013]
From presentation slides (PDF) I will copy here the following ones:
FPGA programming model: RTL
Involves state machines, datapaths, arbitration, buffering, and others
Processor programming model: C/C++
Typically sequential, involves subroutines and functions
Need a programming model that represents a heterogeneous system (CPU + FPGA)
A processor with hardware accelerators
A configurable multicore device
An ideal single hardware and software design environment
More information: Implementing FPGA Design with the OpenCL Standard [v. 2.0 Altera whitepaper, November 2012]
Altera SDK for OpenCL Combined with an Ecosystem of Development Boards Delivers Power-efficient, High-performance Solution for Heterogeneous Computing
Altera Corporation (NASDAQ: ALTR) today announced the broad availability of its SDK for OpenCL™ and supported third-party production boards. Availability of the SDK for OpenCL enables software programmers to access the high-performance capabilities of programmable logic devices. Also part of today’s news, Altera announced a Preferred Board Partner Program, allowing third-party board vendors to work closely with Altera to design optimized production boards based on Altera’s programmable devices. The availability of supported third-party boards through the Preferred Board Partner Program and an SDK for OpenCL enables software programmers to easily target high-performance FPGAs using a high-level language.
Altera’s SDK for OpenCL allows software programmers to take their OpenCL code and easily exploit the massively parallel architecture of an FPGA. Software programmers targeting FPGAs achieve higher performance at significantly lower power compared to alternative hardware architectures.
“Because FPGAs enable parallel processing, they are critical for specialized server workloads that demand real-time performance. We are pleased that our clients are now able to take full advantage of this technology on Power Systems using Altera’s SDK for OpenCL,” said Robert L. Swann, vice president, IBM Power Systems. “With this standards-based approach, our clients can leverage a vibrant ecosystem of commercial and research contributions to accelerate emerging compute intensive workloads.”
The SDK for OpenCL is designed to increase system performance in highly data-parallel computing applications featured in financial, military, broadcast, medical and a variety of other markets. Altera’s OpenCL solutions are supported by a robust ecosystem consisting of board partners, design partners, software tools and university collaboration. Altera and its partners provide the tools, hardware, libraries, reference designs and design resources necessary for developers to implement their OpenCL designs into FPGAs and reduce time-to-market.
The Altera Preferred Board Partner Program for OpenCL ensures third-party production boards are optimized for current Altera device architectures. Initial preferred board partners included in the program are BittWare, Nallatech and PLDA, with additional board partners to be added in the future.
“For years, Altera and BittWare have partnered to deliver timely high-end signal processing board-level solutions that significantly reduce technology risk for our mutual customers,” said Darren Taylor, senior vice president of sales and marketing at BittWare. “Leveraging the latest hardware technology from Altera, which now includes an SDK for OpenCL, we are able to dramatically reduce the complexity for applications in the computing, financial and military markets.”
“An OpenCL implementation provides an ideal fit for Nallatech’s hardware-accelerated computing solutions,” said Allan Cantle, president and founder of Nallatech. “We simplify the deployment of FPGAs in heterogeneous platforms via direct purchase of our cards or pre-integrated in leading vendors’ high density servers and blades. Customers developing high-performance computing applications using Altera’s SDK for OpenCL will benefit from a dramatic increase in performance per watt, per dollar over traditional computing architectures.”
“PLDA has a successful track record of supporting Altera’s customers with their high-performance applications,” said Stephane Hauradou, vice president and CTO of PLDA. “The SDK for OpenCL will open up a significantly broader group of software developers who can now fully leverage Altera’s leading-edge solutions.”
Pricing and Availability
The Altera SDK for OpenCL is currently available for download on Altera’s website. The annual software subscription for the SDK for OpenCL is $995 for a node-locked PC license. For additional information about the Altera Preferred Board Partner Program for OpenCL and its partner members, or to see a list of all supported boards and links to purchase, visit the OpenCL section on Altera’s website.
Software Development Kit for OpenCL Enables Developers to Take Advantage of the Performance and Power-efficiencies of FPGAs
Altera Corporation (Nasdaq: ALTR) today announced the FPGA industry’s first Software Development Kit (SDK) for OpenCL™ (Open Computing Language) which combines the massively parallel architecture of an FPGA with the OpenCL parallel programming model. The SDK allows system developers and programmers familiar with C to quickly and easily develop high-performance, power-efficient FPGA-based applications in a high-level language. The Altera SDK for OpenCL enables FPGAs to work in concert with the host processor to accelerate parallel computation, at a fraction of the power compared to hardware alternatives. Altera will demonstrate the performance and productivity benefits of OpenCL for FPGAs at SuperComputing 2012 in booth #430.
“The industry’s approach for boosting system performance has evolved over time from increasing frequency in single-core CPUs, to using multi-core CPUs, to using parallel processor arrays,” said Vince Hu, vice president of product and corporate marketing at Altera. “This evolution leads us to today’s modern FPGAs, which are fine-grained, massively parallel digital logic arrays architected to execute computations in parallel. Our SDK for OpenCL enables customers to easily adopt FPGAs and leverage the performance and power benefits the devices provide.”
Altera SDK for OpenCL Design Flow
OpenCL is an open, royalty-free standard for cross-platform, parallel programming of hardware accelerators, including CPUs, GPGPUs and FPGAs. The Altera SDK for OpenCL offers a unified, high-level design flow for hardware and software development that automates the time-consuming tasks required in typical hardware-design language (HDL) flows. The OpenCL tool flow automatically converts OpenCL kernel functions into custom FPGA hardware accelerators, adds interface IPs, builds interconnect logic and generates the FPGA programming file. The SDK includes libraries that link to OpenCL API calls within a host program running on the CPU. By automatically handling these steps, designers are able to focus their development efforts on defining and iterating their algorithms rather than designing hardware.
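The split between a kernel function and the host program that launches it can be illustrated without any FPGA tooling. The sketch below is a pure-Python stand-in for the OpenCL execution model, not the Altera SDK or the OpenCL API: the OpenCL C kernel in the comment is a generic vector-add written for illustration, and `run_ndrange` is a hypothetical helper that mimics what a runtime does when the host enqueues a kernel over a one-dimensional range of work-items.

```python
# The kernel a developer would write in OpenCL C looks roughly like this
# (generic example, not taken from Altera's documentation):
#
#   __kernel void vector_add(__global const float *a,
#                            __global const float *b,
#                            __global float *c)
#   {
#       int i = get_global_id(0);
#       c[i] = a[i] + b[i];
#   }


def vector_add(global_id, a, b, c):
    """Kernel body, executed once per work-item.

    global_id plays the role of get_global_id(0) in OpenCL C.
    """
    c[global_id] = a[global_id] + b[global_id]


def run_ndrange(kernel, global_size, *args):
    """Hypothetical host-side helper: call the kernel for every work-item.

    A real runtime would dispatch these work-items to an accelerator;
    here they simply run one after another on the CPU.
    """
    for gid in range(global_size):
        kernel(gid, *args)


# "Host program": prepare buffers, launch the kernel, read back the result.
a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
c = [0.0] * len(a)
run_ndrange(vector_add, len(a), a, b, c)
print(c)  # [11.0, 22.0, 33.0, 44.0]
```

The point of the model is that the developer writes only the per-work-item body and the host-side launch; how the work-items are mapped to hardware is left to the tool flow.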
The portability of the OpenCL code enables users to migrate their designs to different FPGAs or SoC FPGAs as their application requirements evolve. With SoC FPGAs, the CPU host is embedded into the FPGA, providing a single-chip solution that delivers significantly higher bandwidth and lower latency between the CPU host and the FPGA compared to using two discrete devices.
Using FPGAs to Extract Maximum Parallelism in Heterogeneous Platforms
The Altera SDK for OpenCL enables programmers to leverage the massively parallel, fine-grained architectures featured in FPGAs to accelerate parallel computation. Unlike CPUs and GPGPUs, where parallel threads are executed across an array of cores, FPGAs allow kernel functions to be transformed into dedicated, deeply pipelined hardware circuits that are multithreaded using the concept of pipeline parallelism. Each of these pipelines can be replicated many times to provide even more parallelism by allowing multiple threads to execute in parallel. The result is an FPGA-based solution that can deliver >5X performance/Watt compared to alternative hardware implementations.
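To make the pipeline-parallelism argument concrete, here is a minimal cycle-count sketch in plain Python. It is a toy model, not anything produced by the Altera SDK: the function names, the one-item-per-cycle assumption, and the example numbers are all illustrative assumptions.

```python
import math

def sequential_cycles(num_items: int, depth: int) -> int:
    """Cycles if each work-item must pass through all stages before the next starts."""
    return num_items * depth

def pipelined_cycles(num_items: int, depth: int, replicas: int = 1) -> int:
    """Cycles for an idealized pipeline that accepts one item per cycle per replica.

    The first item needs `depth` cycles to fill the pipeline; after that one
    result emerges per cycle. Replicated pipelines split the items evenly.
    """
    if num_items == 0:
        return 0
    items_per_replica = math.ceil(num_items / replicas)
    return depth + items_per_replica - 1

# Hypothetical workload: one million work-items through a 50-stage pipeline.
items, stages = 1_000_000, 50
print(sequential_cycles(items, stages))             # no overlap between items
print(pipelined_cycles(items, stages))              # one deep pipeline
print(pipelined_cycles(items, stages, replicas=4))  # four copies of that pipeline
```

In this model the pipeline depth is paid only once, so throughput approaches one result per cycle per replica regardless of how deep the pipeline is. Real designs are further limited by memory bandwidth and clock rate, which the sketch ignores.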
Altera is working with several board partners to deliver COTS board solutions to customers. Currently, boards from BittWare and Nallatech are designed to support Altera OpenCL. Additional third-party boards will be supported with future releases of the SDK.
Altera has performed a variety of benchmarks that show the productivity savings and the performance and power efficiency gained by using an OpenCL framework for FPGA development. Based on early benchmarks and working with customers in a variety of markets, the SDK shaved months off one customer’s development time for their video processing application and boosted performance by 9X versus a CPU in another customer’s financial application.
The Altera SDK for OpenCL is production ready and is available to customers through an early access program. To discover the high performance, power-efficient acceleration that OpenCL provides with FPGAs, contact a local Altera sales representative. For additional information regarding OpenCL and the benefits of targeting FPGA through an OpenCL implementation, visit http://www.altera.com/products/software/opencl/opencl-index.html.
OpenCL for Altera FPGAs: Accelerating Performance and Design Productivity [Altera, Nov 5, 2012]
Combining the Open Computing Language (OpenCL™) programming model with Altera’s massively parallel FPGA architecture provides a powerful solution for system acceleration. The Altera® SDK for OpenCL* provides a design environment for you to easily implement OpenCL applications on FPGAs.
Benefits of OpenCL on FPGAs
As a software developer, how can you benefit from OpenCL on FPGAs?
As the “power wall” continues to prevent processors from reaching higher frequencies, multi-core processors have become the norm. This has opened the door for parallel processing techniques and thus FPGAs, which are inherently parallel, to start playing a bigger role in the embedded systems world.
Finding parallelism can require a different way of thinking for some software programmers, whereas FPGA designers tend to think this way naturally. You can take the scatter-gather approach for data parallelism, sending input data to the appropriate parallel resources and combining the results later, or the divide-and-conquer method for task parallelism, where you decompose the problem into subproblems and run them on the appropriate resources.
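The scatter-gather approach for data parallelism can be sketched in a few lines of Python. This is a generic illustration using a thread pool as the "parallel resources"; the helper names (`scatter`, `partial_sum_of_squares`, `sum_of_squares`) and the sum-of-squares workload are assumptions made for the example, not part of any Altera tool.

```python
from concurrent.futures import ThreadPoolExecutor

def scatter(data, num_workers):
    """Split data into at most num_workers contiguous chunks of near-equal size."""
    if not data:
        return []
    chunk = -(-len(data) // num_workers)  # ceiling division
    return [data[i:i + chunk] for i in range(0, len(data), chunk)]

def partial_sum_of_squares(chunk):
    """Work done independently on each chunk."""
    return sum(x * x for x in chunk)

def sum_of_squares(data, num_workers=4):
    """Scatter the input, process the chunks in parallel, then gather the results."""
    chunks = scatter(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(partial_sum_of_squares, chunks))
    return sum(partials)  # gather step: combine the partial results

print(sum_of_squares(list(range(10))))
```

The key property is that each chunk can be processed without looking at any other chunk, so the same decomposition works whether the workers are threads, GPU cores, or replicated FPGA pipelines.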
Using OpenCL, you continue to develop your code in the familiar C programming language but target certain functions as OpenCL kernels using the additional OpenCL constructs. Then these kernels can be sent to the available system resources, such as an FPGA, without having to learn the low level Hardware Description Language (HDL) coding practices of FPGA designers.
- HDL coding is the equivalent of coding in assembly for software developers. OpenCL keeps you in a higher level coding language that you are already familiar with, C, with some new OpenCL constructs.
- Profile your code and determine the performance intensive inner loop functions that make sense to hardware accelerate as kernels in an FPGA.
- It’s about performance per watt. You’re balancing high performance with a power-efficient solution in an FPGA.
- With the FPGA’s fine-grained parallel architecture, the Altera SDK for OpenCL generates only the logic you need, delivering results with as little as 1/5 of the power of other hardware alternatives.
- Kernels can target FPGAs, CPUs, GPUs, and DSPs seamlessly to produce a truly heterogeneous system.
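To illustrate what "targeting a function as a kernel" means, here is a plain Python model of the idea, not OpenCL itself. In OpenCL C a kernel body runs once per work-item and reads its index with `get_global_id(0)`; below, that index is passed in as `gid`. The `enqueue_nd_range` helper is a hypothetical stand-in for the runtime, loosely named after the `clEnqueueNDRangeKernel` call.

```python
def vector_add_kernel(gid, a, b, result):
    """Kernel body: executed once per work-item, each handling one element."""
    result[gid] = a[gid] + b[gid]

def enqueue_nd_range(kernel, global_size, *args):
    """Toy runtime: invoke the kernel for every work-item index.

    A real device is free to run these invocations concurrently or in any
    order, which is why a kernel must not depend on other work-items' results.
    """
    for gid in range(global_size):
        kernel(gid, *args)

a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
out = [0] * len(a)
enqueue_nd_range(vector_add_kernel, len(a), a, b, out)
print(out)
```

The host code keeps its ordinary sequential structure; only the inner loop body is lifted out as a kernel, which is the part a profiler would have flagged as performance intensive.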
As an Embedded or DSP Designer, how can you benefit from OpenCL on FPGAs?
- Achieve significantly faster time to market compared to the traditional FPGA design flow.
- Describe your algorithms using the OpenCL C (based on ANSI C) parallel programming language instead of the traditional low-level HDL.
- Perform design exploration quickly by staying at a higher level of design abstraction.
- Obsolescence-proof your designs as you can retarget your OpenCL C code to current and future FPGAs.
- Generate an FPGA implementation of your OpenCL C code in a single step, bypassing the manual timing closure efforts and implementation of communication interfaces between the FPGA, host, and external memories.
The growing need for higher performance and faster time to market through parallel programming in software is seen in many markets, including the Computer & Storage, Military, Medical, and Broadcast markets.
- Buy a board from one of our preferred partners
- Download the Altera SDK for OpenCL
- Take an OpenCL training course
- Register for updates on Altera’s OpenCL solution for FPGAs
- Implementing FPGA Design with the OpenCL Standard (PDF)
- Fractal Video Compression in OpenCL: An Evaluation of CPUs, GPUs, and FPGAs as Acceleration Platforms (PDF)
- Using OpenCL to Evaluate the Efficiency of CPUs, GPUs, and FPGAs for Information Filtering (PDF)
- 40Gbit AES Encryption Using OpenCL and FPGAs (PDF)
Computer and Storage [Altera, Nov 5, 2012]
Computer and storage technology is evolving rapidly. Today, cloud computing is enabling the consolidation of traditional IT functions with entirely new capabilities. For example, many large-scale data centers are now providing traditional IT services along with new data analytics services.
Hence, these large-scale data centers require highly efficient server and storage systems. Traditional CPU technology limits performance, as the use of frequency scaling as a way to increase performance has ended. The end of frequency scaling has caused a shift to multicore processing. However, multicore processing has diminishing returns in terms of increasing true application performance due to limits in I/O and memory bandwidth.
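The diminishing-returns claim can be illustrated with a simple bound: achievable throughput is the smaller of what the cores can compute and what the memory system can feed them. The sketch below is a simplified model under assumed numbers (1 G operations per second per core, 25.6 GB/s of memory bandwidth, 8 bytes moved per operation); none of these figures come from Altera.

```python
def achievable_ops_per_sec(cores, ops_per_core, mem_bandwidth_bytes, bytes_per_op):
    """Throughput is capped by whichever runs out first: compute or memory traffic."""
    compute_limit = cores * ops_per_core
    memory_limit = mem_bandwidth_bytes / bytes_per_op
    return min(compute_limit, memory_limit)

# Hypothetical system: adding cores helps only until memory bandwidth saturates.
for cores in (1, 2, 4, 8, 16):
    rate = achievable_ops_per_sec(cores, 1_000_000_000, 25_600_000_000, 8)
    print(cores, rate)
```

With these assumed numbers the memory limit is 3.2 G operations per second, so going from 4 to 16 cores adds nothing. That is the gap the text says customized high-bandwidth connections are meant to close.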
Altera® FPGAs can be used to accelerate the performance of large-scale data systems. Altera FPGAs enable higher speed data processing by providing customized high-bandwidth, low-latency connections to network and storage systems. In addition, Altera FPGAs provide compression, data filtering, and algorithmic acceleration.
With the Altera SDK for OpenCL™, you can now rapidly develop acceleration solutions for computer and storage systems. The Altera SDK for OpenCL enables even software developers to easily design with FPGAs by allowing them to utilize a high-level programming language for developing acceleration functions.
OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.
OpenCL for Military [Altera, Oct 10, 2013]
Radar back-end processing is a compute-intensive operation using various algorithms, such as FIR filters, that exploit custom pipeline parallelism. Increased performance is achieved by offloading these algorithms from the host processor onto an FPGA.
Custom processors can be created using the OpenCL™ toolflow that are more efficient than multicore CPUs or GPUs both in computational capability and power requirements.
Figure 1: Radar Back-End Processing Alternatives Using OpenCL
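For readers unfamiliar with the FIR filter mentioned above, here is a minimal direct-form reference in plain Python. It is a textbook formulation, not Altera's implementation; the function name and the example coefficients are assumptions for illustration.

```python
def fir_filter(samples, taps):
    """Direct-form FIR: y[n] = sum over k of taps[k] * x[n - k], with x[n] = 0 for n < 0."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, coeff in enumerate(taps):
            if n - k >= 0:
                acc += coeff * samples[n - k]  # one multiply-accumulate per tap
        out.append(acc)
    return out

# Feeding a unit impulse returns the filter coefficients themselves.
print(fir_filter([1, 0, 0, 0], [0.5, 0.25, 0.25]))
```

Every output sample needs one multiply-accumulate per tap, and the taps are independent of each other. That structure is what lets hardware lay the inner loop out as a chain of multipliers and adders that accepts a new sample every clock cycle.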
For more information regarding Altera’s OpenCL for Military, please contact us at email@example.com.
Medical: Hardware Acceleration with OpenCL [Altera, Feb 16, 2013]
Ultrasound, X-ray, CT, and PET applications all require intensive back-end compute operations for algorithms such as fast Fourier transform (FFT) using custom pipeline parallelism. Increased algorithm performance is achieved by offloading from the host processor onto an FPGA.
Custom processors created using the OpenCL™ toolflow are more efficient than multicore CPUs or GPUs, both in computational capability and power requirements.
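As a reference for the FFT mentioned above, here is a short recursive radix-2 Cooley-Tukey sketch in plain Python, checked against a direct DFT. It is a textbook version for illustration only and says nothing about how an FPGA implementation generated by the OpenCL toolflow is actually structured.

```python
import cmath

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        term = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor times odd half
        out[k] = even[k] + term            # butterfly: upper output
        out[k + n // 2] = even[k] - term   # butterfly: lower output
    return out

def dft(x):
    """Direct O(n^2) discrete Fourier transform, used only as a cross-check."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

signal = [1, 2, 3, 4, 0, 0, 0, 0]
print(max(abs(a - b) for a, b in zip(fft(signal), dft(signal))))  # should be close to zero
```

The FFT consists of log2(n) stages of identical butterfly operations, and each stage depends only on the previous one, so the stages map naturally onto the kind of deep pipeline described in the text.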
OpenCL for Altera FPGAs web page
Broadcast: Advanced Systems Development Kit [Altera, Oct 25, 2012]
The Advanced Systems Development Kit is a platform that can pack multi-channel 4K video ingest, processing, and streaming into a server-ready board. It features an industry-leading PCIe gen3x16 interface, plus over 1 million FPGA logic elements to handle the toughest video processing algorithms, matched by over 1500Gbps of external memory bandwidth – enough to tackle 4 channels of 4K UHDTV video streams. This platform provides an order-of-magnitude improvement over existing development kit hardware capabilities, along with innovations in the soft content and business model that together significantly accelerate end-product deployment.
Figure 1: Altera’s Advanced Systems Development Kit
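To put the quoted memory bandwidth in perspective, here is a back-of-the-envelope calculation in Python. The source only states "4 channels of 4K" and "over 1500Gbps"; the resolution (3840×2160), frame rate (60 fps), and pixel format (20 bits per pixel, i.e. 10-bit 4:2:2) used below are my assumptions.

```python
def video_stream_gbps(width, height, fps, bits_per_pixel):
    """Raw (uncompressed) data rate of one video stream in Gbps."""
    return width * height * fps * bits_per_pixel / 1e9

one_stream = video_stream_gbps(3840, 2160, 60, 20)  # assumed 4K UHDTV format
four_streams = 4 * one_stream
passes = 1500 / four_streams  # how many times all four streams could cross memory

print(round(one_stream, 2), round(four_streams, 2), round(passes, 1))
```

Under these assumptions four raw streams need roughly 40 Gbps for a single pass, so the quoted bandwidth leaves room for the many frame-buffer reads and writes that scaling, format conversion, and temporal filtering require.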
Typical development kits are intended for lab-use only, because they lack the on-board resources to develop the entire end product. It is common for engineers to design their own board and software from scratch – until now. The Advanced Systems Development Kit breaks through all those barriers and significantly shortens your design cycle in many ways, including:
- A complete OmniTek BSP (board support package) for video applications, with firmware, and Windows and Linux drivers
- An evaluation design featuring OmniTek’s PCI Express DMA engine that efficiently streams multiple channels of video between I/O and host memory
- A flexible front-panel FMC I/O expansion connector, allowing for connectivity to popular standards such as SFP+, fiber, QSFP, gigabit Ethernet, etc.
- Dual Stratix V FPGAs to integrate functions such as multi-channel format conversions, video codecs, ingest/playout connectivity, etc.
- Over 1500Gbps of external memory bandwidth – enough to handle multiple 4K channels
- PCIe gen3x16 to handle even the most demanding video streaming and acceleration
- PCIe form-factor compliant for use in both custom-built chassis and commercial off-the-shelf (COTS) servers
- Licensable full manufacturing rights to the board design, which enables you to easily make cost-optimizations and derivatives for rapid deployment of your products.
The Advanced Systems Development Kit resolves common broadcast challenges related to:
- Increased channel density
- 4K and beyond-HD resolutions
- High frame rate applications
- The fine balance between future-proofing and cost-efficiency
A rich partner ecosystem significantly accelerates and simplifies system-level advanced development. For example, Embrionix’s emSFP modules convert SDI to a number of physical layer standards, allowing you to rapidly release products and still future-proof the hardware with a simple upgrade of the emSFP. This provides a new level of flexibility for manufacturers. The combination of capabilities and physical design positions this platform perfectly for the convergence of broadcast and IT technologies.
Figure 2: Embrionix’s embedded SFP modules for high-density video connectivity
Altera’s OpenCL Toolflow
In addition to accelerating hardware designs, the Advanced Systems Development Kit will also support Altera’s unique OpenCL™ toolflow to elevate software productivity. OpenCL enables viable software implementations of complex video algorithms, and dramatically lowers the cost of the end product. Examples of broadcast applications include:
- Acquisition: Real-time debayering of raw camera data, scaling for multiviewers, etc.
- Post-production: Color grading, motion estimation, special effects rendering, etc.
- Distribution: 3D/temporal noise reduction, H.264 compression, etc.
- Consumption: JPEG2000 decoding for 4K digital cinema playout, block artifact reduction filters, etc.
The OpenCL toolflow leverages parallel processing on the underlying hardware, and achieves an order of magnitude performance improvement compared to sequential CPU processing. Furthermore, running OpenCL on the Advanced Systems Development Kit gives you several unique advantages including:
- The best performance per watt consumed, so you enjoy OpenCL’s benefits without power and heat issues from GPUs
- The ability to assimilate, manipulate, and transport multichannel video on a single board
- The highest level of integration to achieve maximum channel density for your end product
5. Altera SoC FPGAs
Generation 10 FPGAs and SoCs [Altera, May 16, 2013]
Altera’s Generation 10 FPGAs and SoCs optimize process technology and architecture to deliver the industry’s highest performance and highest levels of system integration at the lowest power. Initial Generation 10 families include Stratix® 10 and Arria® 10 FPGAs and SoCs with embedded processors.
Read the White paper: Expect a Breakthrough Advantage in Next-Generation FPGAs (PDF) [June 2013]
Read the White paper: Meeting the Performance and Power Imperative of the Zettabyte Era with Generation 10 (PDF) [June 2013]
Watch the video: Arria 10 FPGAs and SoCs — Reinventing the Midrange [June 2013]
Read the White paper: The Breakthrough Advantage for FPGAs with Tri-Gate Technology (PDF) [June 2013]
Generation 10 FPGAs and SoCs are supported by a leading-edge suite of development tools delivering:
- 8x improvements in compile times
- Higher level design flows that support hardware and software designers
Stratix 10 FPGAs and SoCs [Altera, June 10, 2013]
Stratix® 10 FPGAs and SoCs offer breakthrough advantages in bandwidth and system integration, including the next-generation hard processor system (HPS), to deliver the industry’s highest performance and most power-efficient FPGAs and SoCs. Stratix 10 devices are manufactured on the revolutionary Intel 14 nm 3D Tri-Gate transistor technology, which delivers breakthrough levels of performance and power efficiencies that were previously unimaginable. When coupled with 64 bit quad-core ARM® Cortex™-A53 processors and advanced heterogeneous development and debug tools such as the Altera® SDK for OpenCL™ and SoC Embedded Design Suite (EDS), Stratix 10 devices offer the industry’s most versatile heterogeneous computing platform.
White paper: The Breakthrough Advantage for FPGAs with Tri-Gate Technology [June 2013]
Industry’s First Gigahertz FPGAs and SoCs
- New ultra-high performance FPGA architecture
- 2x the core performance of prior generation high-end FPGAs
- >10 TFLOPs of single-precision floating-point DSP performance
- >4x processor data throughput of prior-generation SoCs
Break the Bandwidth Barrier with Unimaginable High-Speed Interface Rates
- 4x serial transceiver bandwidth from previous generation FPGAs for high port count designs
- 28 Gbps backplane capability for versatile data switching applications
- 56 Gbps chip-to-chip/module capability for leading edge interface standards
- Over 2.5 Tbps bandwidth for serial memory with support for Hybrid Memory Cube
- Over 1.3 Tbps bandwidth for parallel memory interfaces with support for DDR4 at 3200 Mbps
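As a rough sanity check on the parallel-memory figure, the sketch below works out what 1.3 Tbps means in terms of DDR4 interfaces. The 3200 Mbps rate is per data pin; the 64-bit interface width is my assumption (ECC bits and protocol overhead are ignored), so treat the result as an order-of-magnitude estimate only.

```python
def ddr4_interface_gbps(data_width_bits, rate_mbps_per_pin):
    """Peak raw bandwidth of one DDR4 interface in Gbps."""
    return data_width_bits * rate_mbps_per_pin / 1000.0

def interfaces_needed(target_gbps, data_width_bits, rate_mbps_per_pin):
    """How many such interfaces add up to the target aggregate bandwidth."""
    return target_gbps / ddr4_interface_gbps(data_width_bits, rate_mbps_per_pin)

per_interface = ddr4_interface_gbps(64, 3200)   # assumed 64-bit wide interface
count = interfaces_needed(1300, 64, 3200)       # 1.3 Tbps aggregate from the list above

print(per_interface, round(count, 2))
```

Under these assumptions a single 64-bit DDR4-3200 interface peaks at about 205 Gbps, so the quoted aggregate corresponds to somewhat more than six such interfaces running in parallel.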
Lower Capital Expenditures (CapEx)
- Largest monolithic FPGA device with >4M logic elements offers an unprecedented level of integration capability
- Heterogeneous multi-die 3D solutions including SRAM, DRAM, and ASICs
- Next-generation HPS
Lower Operating Expenses (OpEx)
- Leveraging Intel’s leadership in process technology, Stratix 10 FPGAs offer the most power-efficient technologies
- 70% lower power than prior generation high-end FPGAs and SoCs
- 100 GFlops/Watt of single-precision floating-point efficiency
- Integrated host processor for operation, administration, and maintenance minimizes system down time
Versatile Heterogeneous Computing for Performance and Power-Efficient SoC Design
- 64 bit quad-core ARM Cortex-A53 processor optimized for ultra-high performance per watt
- Heterogeneous C-based modeling and hardware design with Altera SDK for OpenCL
- Heterogeneous debug, profiling, and whole chip visualization with Altera SoC EDS featuring ARM Development Studio 5 (DS-5™) Altera Edition Toolkit
- Fastest compile times in the industry
- C-based design entry using the Altera SDK for OpenCL, offering a design environment that is easy to implement on FPGAs
- Start developing with Arria 10 devices and then migrate to footprint-compatible Stratix 10 devices
- Complementary Enpirion PowerSoCs will offer customers higher performance, lower system power, higher reliability, smaller footprint, and faster time-to-market to power Stratix 10 FPGAs and SoCs
Altera to Build Next-Generation, High-Performance FPGAs on Intel’s 14nm Tri-Gate Technology
The Stratix 10 FPGA and SoC family is ideal for meeting your high-performance, high-bandwidth, and low-power requirements in communication infrastructure, cloud computing and data centers, high-performance computing, military, broadcast, test and measurement, and other applications.
- White paper: The Breakthrough Advantage for FPGAs with Tri-Gate Technology (PDF) [June 2013]
- White paper: Expect a Breakthrough Advantage in Next Generation FPGAs (PDF) [June 2013]
- White paper: Meeting the Power and Performance Imperative of the Zettabyte Era with Generation 10 (PDF) [June 2013]
- Press Release: Altera Announces Breakthrough Advantage with Generation 10 FPGAs and SoCs [June 2013]
- Generation 10 Portfolio
Arria 10 SoC [Altera, June 10, 2013]
Arria 10 SoCs: Reinventing the Midrange
The 20 nm Arria® 10 ARM-based SoCs deliver optimal performance, power efficiency, small form factor, and low cost for midrange applications. Arria 10 SoCs, based on TSMC’s 20 nm process technology, combine a dual-core ARM® Cortex™-A9 MPCore™ hard processor system (HPS) with industry-leading programmable logic technology. Arria 10 SoCs offer a processor with a rich feature set of embedded peripherals, variable-precision digital signal processing (DSP) blocks, embedded high-speed transceivers, hard memory controllers, and protocol IP controllers – all in a single highly integrated package.
Arria 10 SoCs: Across-the-Board Improvements
Arria 10 SoCs combine architectural innovations with TSMC’s 20 nm process technology to deliver improvements in performance and power reduction:
- 87% higher processor performance with up to 1.5 GHz CPU operation per core
- 60% higher performance versus the previous generation, over 500 MHz-capable core performance (15% higher performance than previous SoC)
- 4X more transceiver bandwidth versus the previous generation (2X more bandwidth versus previous high-end FPGAs)
- 4X higher system performance (2666 Mbps DDR4, Hybrid Memory Cube support)
- More than 3300 18×19 multipliers implemented on variable-precision DSP
- 40% lower power with process technology improvement and innovative techniques for power reduction
Note: See full list of memory devices supported
Designed for Productivity
Design productivity is one of the driving philosophies of the Arria 10 SoC architecture. Arria 10 SoCs offer full software compatibility with previous generation SoCs, a broad ecosystem of ARM software and tools, and the enhanced FPGA and DSP hardware design flow.
- Extensive ecosystem of ARM for software development
- Altera SoC Embedded Design Suite featuring the ARM Development Studio 5 (DS-5™) Altera Edition Toolkit
- Board support packages for popular operating systems including Linux, Wind River’s VxWorks, Micro-C OS II, and more
- Full software compatibility between 28 nm Cyclone V and Arria V SoCs and Arria 10 SoCs
- Quartus® II FPGA Design Suite featuring:
  - High-level automated design flow with OpenCL™ compiler from Altera
  - Model-based DSP hardware design with Altera DSP Builder
Arria 10 SoCs have been designed to meet the performance, power, and cost requirements for applications such as:
- Wireless infrastructure equipment including remote radio unit and mobile backhaul
- Compute and storage equipment including flash cache, cloud computing, and acceleration
- Broadcast studio and distribution equipment including professional A/V and video conferencing
- Military guidance, control and intelligence equipment
- Wireline 100G line cards, bridges and aggregation, 40G GPON
- Test and measurement equipment
- Diagnostic medical imaging equipment
- Arria 10 Advance Information Brief (PDF)
- White paper: Meeting the Performance and Power Imperative of the Zettabyte Era with Generation 10 (PDF) [June 2013]
- White paper: Expect Breakthrough Capabilities in Next Generation FPGAs (PDF) [June 2013]
- Video: Arria 10 FPGAs and SoCs – Reinventing the Midrange [June 2013]
- SoC overview [June 2013]