Laszlo Kishonti at MWC 2013 (see the video embedded later, as well as the CLBenchmark data supporting the below statement):
[1:20] Currently Mali T-600 is the first and only GPU which can run this desktop grade software. [1:27]
The Great Equalizer 3: How Fast is Your Smartphone/Tablet in PC GPU Terms [AnandTech, April 4, 2013]
… At the end of the day I’d say it’s safe to assume the current crop of high-end ultra mobile devices [T604 based Nexus 10, Adreno 320 as in Nexus 4, Tegra 3 T33 @1.6GHz as in HTC One X+] can deliver GPU performance similar to that of mid to high-end GPUs from 2006.
The caveat there is that we have to be talking about performance in workloads that don’t have the same memory bandwidth demands as the games from that same era. While compute power has definitely kept up (as has memory capacity), memory bandwidth is nowhere near as good as it was on even low end to mainstream cards from that time period. For these ultra mobile devices to really shine as gaming devices, it will take a combination of further increasing compute as well as significantly enhancing memory bandwidth. Apple (and now companies like Samsung as well) has been steadily increasing memory bandwidth on its mobile SoCs for the past few generations, but it will need to do more. I suspect the mobile SoC vendors will take a page from the console folks and/or Intel and begin looking at embedded/stacked DRAM options over the coming years to address this problem.
Hisilicon K3V3 to use Mali-T658 GPU, ten times the performance of Mali-400 MP [GSM Insider, March 27, 2013]
At the Mobile World Congress 2013, many people expected Huawei to unveil the Hisilicon K3V3 processor. But the upcoming processor from the Chinese company has yet to be unveiled.
According to sources from China [obviously from this SHUMABAOBEI.NET article of March 26], the Hisilicon K3V3 processor is based on 28nm technology and is a quad-core processor. The Hisilicon is able to clock up to 1.8GHz. It has two sets of dual-core processors. The first set is an A15 architecture dual-core and the second set is an A7 architecture dual-core processor.
The most important is the GPU inside the Hisilicon. Sources reported that the Hisilicon K3V3 comes with Mali-T658 GPU. ARM stated that the Mali-T658 has ten times better performance than the Mali-400 MP and four times better than the Mali-T604. The Exynos 4412 in Samsung Galaxy S3 and Samsung Galaxy Note 2 is using the Mali-400 MP GPU.
It looks like the Hisilicon K3V3 is focusing on graphics rather than on the number of cores. The Hisilicon K3V3 could launch in the second quarter of the year.
–Mali-T658 GPU Extends Graphics And GPU Compute Leadership For High Performance Devices [press release, Nov 10, 2011] “To address high-end consumer requirements, the Mali-T658 GPU delivers up to ten times the graphics performance of the Mali-400 MP GPU, found in a wide range of today’s mainstream consumer products. It also features four times the GPU Compute performance of the Mali-T604 GPU, enabling a raft of new use-cases outside of traditional graphics processing, including computational photography, image-processing and augmented reality. … The ability of the Mali-T658 GPU to scale up to eight cores provides unprecedented energy-efficiency, flexibility and scalability to match the CPU and GPU performance points through one coherent interface.”
– ARM Mali-T658 GPU Arrives at the Japan Technical Symposium [ARM Multimedia blog, Nov 10, 2011] “It’s all about higher performance – twice as many shader cores and double the arithmetic pipelines per core [as the Mali-T604].”
– ARM’s Mali-T658 GPU in 2013, Up to 10x Faster than Mali-400 [AnandTech, Nov 9, 2011], which contains the following ARM roadmap, clearly accelerated by a year or so, especially with the 2nd generation Mali-T600 Series arriving 9 months later. Currently it is not clear why Mali-T658 is missing as a product on the ARM site. One reason might be that it was replaced by the more flexible 2nd generation Mali-T600 Series, especially given that POP IP has been available for that series since January 2013 (see below).
– Hisilicon Licenses Range of ARM Mali Graphics Processors to Drive the Next-Generation of Smart Connected Devices [joint press release, May 21, 2012] “… including the market leading Mali-400 MP GPU and the latest high-performance Mali-T658 GPU.”
– Nufront and ARM Extend Partnership to Provide OEMs with Competitive Solutions for Next-Generation Smartphones, Tablets and Smart-TVs [joint press release, Sept 24, 2012] “Nufront has broadened its portfolio of ARM technology with licenses for the ARM® Cortex™-A15 MPCore™ Processor and ARM Mali™-T658 Graphics Processing Unit (GPU).”
Mali-T600 Series Completing the ARM 64-bit System Story [ARM Multimedia blog, Oct 30, 2012]
Today ARM announced the ARM® Cortex™-A50 processor series, which include ARMs first low-power 64-bit implementations of the ARMv8 architecture. These highly anticipated products bring with them not only an enhanced 32-bit CPU architecture but also open up the wider range of opportunities that 64-bit architectures offer for high performance energy efficient devices.
The second generation of the Cortex/Mali pairing – the Cortex-A15 and Mali-T604 is appearing now in consumer devices from Google (Samsung Chromebook and Nexus 10 Tablet) based on the Samsung Exynos 5250 which enables, like its predecessors, market leading devices in a wide range of markets
The combination of the Cortex-A50 and the Mali-T600 series brings to market the highest performance CPU/GPU pairing targeting energy efficient devices. The Mali-T600 series is already able to support 64-bit addressing and offers IEEE 754 compliant 64-bit floating point arithmetic; so really is “64-bit system” ready. This opens up the potential for developers to get started earlier on the GPU elements with real silicon. The Mali-T600 series of products have all been designed with support for the latest ARMv8 architecture for both 32-bit (AArch32) and 64-bit mode(AArch64). This close functional matching will become even more important as GPU Computing opens up more exciting use cases over the coming years, and ARM will continue to focus on delivering leading processor and system IP that silicon vendors can rapidly deploy. Keep watching..
Mali-T604 [ARM microsite, Nov 8, 2012]
This fourth-generation of Mali embedded graphics IP, designed to meet the needs of General Purpose computing on GPU (GPGPU), extends API support to include full profile as well as embedded Khronos™ OpenCL™ and Microsoft® DirectX®.
The Mali-T604 GPU delivers up to 5x performance improvement over previous Mali graphics processors and is scalable up to four cores
Mali Graphics plus GPU Compute
[ARM microsite, Nov 7, 2012]
ARM Mali Graphics with GPU Compute provides premium graphics solutions to high end electronic devices. The graphics performance capability of these products is higher than that of the Graphics only roadmap. ARM Mali Graphics with GPU Compute is based on the Midgard Tri-pipe architecture and includes the Mali-T678, Mali-T628 and the Mali-T624.
See also: “The GPU king is doing well, long live Mali-450 MP” [ARM Multimedia blog, June 18, 2012]
Each of the products features a 50% performance increase* and are the first to include Adaptive Scalable Texture Compression (ASTC), a texture compression technique that originated from ARM. ASTC significantly optimizes GPU performance and increases battery life in devices, enabling an always-on, always-connected experience, and has now been adopted by the Khronos™ Group, an important industry consortium that focuses on open standards.
ARM continues to invest in GPU compute capabilities by integrating the leadership that ARM has in the CPU space, with ARM Cortex™ processors, and applying it to the Mali GPU architecture. GPU compute enables greater control when balancing tasks between the CPU and GPU, allowing performance of the right task by the most efficient architecture. This enables improved energy-efficiency for current and new math intensive activities, such as:
Computational photography: computational methods of enhancing or extending digital photography
Multi perspective views: the ability to have multiple views from different positions
Real-time photo editing on mobile devices: photo editing at your fingertips on your smartphone, tablet, etc.
GPU compute also extends the range of use cases possible on mass-market mobile devices, allowing features like photo editing and video stabilization to be available in a wider range of consumer products.
*Each of the second generation Mali-T600 Series GPUs features a 50% performance increase compared to first generation Mali-T600 products (based on industry standard benchmarks), on the same silicon process. This 50% increase has been facilitated by a combination of frequency improvements, such as optimizing the register transfer level (RTL) for increased performance, and micro-architectural improvements so that graphics are executed more efficiently.
The design of each new product addresses different performance points:
The Mali-T624 GPU offers scalability from one to four cores, whilst the Mali-T628 from one to eight cores provides up to twice the graphics and GPU compute performance of the Mali-T624, extending the graphics potential for smartphones and smart-TVs. These products provide breathtaking graphical displays for advanced consumer applications, such as 3D graphics, visual computing and real time photo editing for smartphones and smart-TVs.
The ARM Mali-T678 GPU offers the highest GPU compute performance available in the Mali-T600 Series of products, delivering a four-fold increase when compared with the Mali-T624 GPU through features, such as increased ALU support. This brings a wide range of performance points to address the vibrant tablet market. The Mali-T678 offers energy-efficient high-end visual computing applications, such as computational photography, multi perspective views and augmented reality.
What is ASTC?
ASTC supports a very wide range of pixel formats and bit rates, and enables significantly higher quality than most other formats currently in use. This allows the designer to use texture compression throughout the application, and to choose the optimal format and bit rate for each use case. This highly efficient texture compression standard reduces the already market-leading Mali GPU memory bandwidth and memory footprint even further, while extending mobile battery life.
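To make the "wide range of bit rates" concrete, here is a minimal sketch of how an ASTC bit rate follows from the block footprint. It relies on two facts from the ASTC specification that are not stated in the text above: every compressed block occupies 128 bits, and 2D footprints range from 4x4 to 12x12 texels. The function names are illustrative only.

```python
# ASTC stores every block in a fixed 128 bits; the block footprint sets the bit rate.
ASTC_BLOCK_BITS = 128

# A few 2D block footprints from the ASTC specification (not listed in the text above).
FOOTPRINTS = [(4, 4), (5, 5), (6, 6), (8, 8), (10, 10), (12, 12)]


def bits_per_pixel(block_w, block_h):
    """Bit rate of an ASTC texture that uses block_w x block_h texel blocks."""
    return ASTC_BLOCK_BITS / (block_w * block_h)


def compressed_size_bytes(tex_w, tex_h, block_w, block_h):
    """Compressed size of one mip level; partial edge blocks are rounded up."""
    blocks_x = -(-tex_w // block_w)  # ceiling division
    blocks_y = -(-tex_h // block_h)
    return blocks_x * blocks_y * ASTC_BLOCK_BITS // 8


if __name__ == "__main__":
    for w, h in FOOTPRINTS:
        print(f"{w}x{h}: {bits_per_pixel(w, h):.2f} bpp")
```

Larger footprints trade image quality for a smaller memory footprint and less memory bandwidth, which is the per-use-case choice the paragraph above describes.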
All products are designed to support the following APIs: OpenGL® ES 1.1, OpenGL ES 2.0, OpenGL ES 3.0, DirectX 11 FL 9_3, DirectX® 11, OpenCL™ 1.1 Full Profile and Google Renderscript compute.
ARM Announces 8-core 2nd Gen Mali-T600 GPUs [AnandTech, Aug 6, 2012]
Both the T628 and T678 are eight-core parts, the primary difference between the two (and between graphics/GPU compute optimized ARM GPUs in general) is the composition of each shader core. The T628 features two ALUs, a LSU and texture unit per shader core, while the T678 doubles up the ALUs per core.
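The shader-core comparison above can be sanity-checked against the scaling claims in the ARM microsite text (up to twice, and four-fold, relative to the Mali-T624) by counting ALUs. This is a rough sketch only: it ignores clock frequency and memory bandwidth. The ALU count for the Mali-T624 is an assumption, taken to match the T628, and is not a figure from the quoted articles.

```python
# Relative arithmetic capacity estimated as shader cores x ALUs per core.
# Core counts are the maximum configurations quoted in the text.
GPUS = {
    "Mali-T624": {"max_cores": 4, "alus_per_core": 2},  # ALUs per core ASSUMED
    "Mali-T628": {"max_cores": 8, "alus_per_core": 2},
    "Mali-T678": {"max_cores": 8, "alus_per_core": 4},  # doubles the ALUs per core
}


def total_alus(name):
    """Total ALU count of the largest configuration of the named GPU."""
    gpu = GPUS[name]
    return gpu["max_cores"] * gpu["alus_per_core"]


def relative_to(name, baseline="Mali-T624"):
    """ALU count of `name` as a multiple of the baseline GPU's ALU count."""
    return total_alus(name) / total_alus(baseline)


if __name__ == "__main__":
    for name in GPUS:
        print(f"{name}: {total_alus(name)} ALUs, {relative_to(name):.1f}x Mali-T624")
```

Under these assumptions the ratios come out at 2x for the T628 and 4x for the T678, in line with the "up to twice" and "four-fold" figures quoted above.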
Long term you can expect high end smartphones to integrate cores from the graphics & compute optimized roadmap, while the mainstream and lower end smartphones will pick from the graphics-only roadmap. All of this sounds good on paper, however there’s still the fact that we’re talking about the second generation of Mali-T600 GPUs before the first generation has even shipped. We will see the first gen Mali-T600 parts before the end of the year, but there’s still a lot of room for improvement in the way mobile GPUs and SoCs are launched…
ARM Announces POP IP Technology for Mali-T600 Series GPUs [press release, Oct 11, 2012]
What: ARM® today introduced the first POP™ IP solution for ARM Mali™-T600 series graphics processor units (GPUs). This latest offering of POP IP — core-hardening acceleration technology that produces the best implementations of ARM processors in the fastest time-to-market — is optimized for the Mali-T628 and Mali-T678 on TSMC 28nm HPM process technology. Mali GPUs go into a variety of end devices, including a wide range of smartphones, from high performance to mass market, as well as tablets and smart TVs. It is critical that designers can optimize their Mali GPU for their selected end applications.
Developed in synergistic collaboration by ARM’s Media Processing and Physical IP divisions, the optimized POP IP technology has been created to produce the most efficient GPU implementations at 28nm. The POP IP enabled Mali-T600 series GPU implementation results in superior performance density/watt, and significant silicon savings. Benefits of this POP IP have been proven to deliver up to 27 percent higher frequency, 24 percent lower area, and 19 percent lower power than implementations which do not use POP IP.
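As a rough illustration of what "performance density/watt" could mean with the figures above, the sketch below combines the three quoted percentages. It rests on two assumptions the press release does not make: that frequency is a fair proxy for performance, and that all three "up to" gains hold at the same time. The function name is hypothetical.

```python
def pop_ip_gain(freq_gain=0.27, area_saving=0.24, power_saving=0.19):
    """Combine the quoted POP IP figures into relative efficiency ratios.

    ASSUMPTION: frequency stands in for performance and all three "up to"
    figures apply simultaneously; the press release claims neither.
    """
    freq = 1.0 + freq_gain       # relative frequency with POP IP
    area = 1.0 - area_saving     # relative silicon area
    power = 1.0 - power_saving   # relative power
    return {
        "perf_per_area": freq / area,
        "perf_per_watt": freq / power,
        "perf_per_area_per_watt": freq / (area * power),
    }


if __name__ == "__main__":
    for metric, ratio in pop_ip_gain().items():
        print(f"{metric}: {ratio:.2f}x")
```

Because each figure is an upper bound from a different optimization target, the combined ratio is best read as a ceiling, not as an expected result.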
POP IP technology is comprised of three critical elements necessary to achieve an optimized ARM processor or GPU implementation. First, it contains Artisan® physical IP standard cell logic and memory cache instances that are specifically tuned for a given ARM processor and foundry technology. Second, it includes a comprehensive benchmarking report to document the exact conditions and results ARM achieved for the processor implementation across an envelope of configuration and design targets. Finally, it includes the detailed implementation knowledge including floor plans, scripts, design utilities and a POP implementation guide, which enables the end customer to achieve similar results quickly and with lower risk.
Why: “As the industry moves toward 28nm, designers need options that can lower their risk and help them achieve the fastest time-to-market. ARM is pleased to bring the benefits that have been experienced with POP IP usage around Cortex process implementation to Mali GPUs,” said Pete Hutton, general manager, Media Processing Division at ARM. “POP IP for Mali GPUs is not about pre-determined benchmarks, it’s about giving our partners greater flexibility by leveraging ARM’s holistic approach to explore and find the right optimization customized to the specific end-application.”
When: The POP IP for Mali-T628 and T678 on TSMC 28HPM process is available for immediate license to both existing and new licensees. The IP will be available in January 2013.
How does Mali POP help …. from: Mali POP IP Efficient GPU implementations [Dec 5, 2012]
ARM Mali-T628 & TSMC 28nm HPM can be used in multiple target applications.
– The sheer number of available options can make selection difficult.
ARM has invested significant time & effort in investigating the ARM Mali-T62x PPA envelope
ARM have performed all our analysis using real GPU work load which has led to improvements in implementation and analysis
ARM and Synopsys Collaborate to Optimize ARM Mali GPU 20nm Implementation [joint press release, Feb 25, 2013]
- Combination of ARM® Artisan® physical IP, Mali™ GPU IP and Synopsys Galaxy Implementation Platform proven ready for 20nm and smaller
- On-going collaboration aims to optimize and deliver double patterning technology (DPT)-ready methodology for Mali GPU implementation
- First implementation of the Mali-T600 series of products in 20nm technologies, with learning from this implementation accelerating the product family into sub-20nm technologies
ARM (LON: ARM; Nasdaq: ARMH) and Synopsys, Inc. (Nasdaq: SNPS) today announced a collaboration to optimize performance of ARM® Mali™ graphics processing units (GPUs) in 20-nanometer (nm) and smaller process geometries using the Synopsys Galaxy™ Implementation Platform. The companies successfully taped out the first ARM Mali-T658 design using a 20nm process technology, ARM Artisan® physical IP and shader functionality. The resulting RTL-through-sign-off design flow includes double-patterning support throughout. The ongoing collaboration will help designers optimize the implementation of Mali GPUs for their target applications.
“Mali GPUs are found in most Android™ tablets and smart digital TVs currently shipping, and are one of the most popular graphics solutions for smartphones. Users’ demand for advanced graphics continues to increase, which means that optimizing GPUs for selected end devices is essential,” said Pete Hutton, general manager, Media Processing Division, ARM. “Building on a long history of successful collaborations with Synopsys, this implementation will enable designers to optimally implement ARM Mali-T600 family GPUs using Synopsys tools in sub 20nm leading-edge process technologies.”
The Mali-T600 series includes five members (Mali-T604, Mali-T624, Mali-T628, Mali-T658 and Mali-T678), which have all been designed to provide exceptional graphics performance and they feature the first graphics technology to bring GPU compute functionality into mobile devices. This combined functionality brings additional hardware complexity which is further compounded by the new double-patterning requirements introduced by 20nm and below technologies.
Smaller process technologies, such as 20nm and below, require a highly integrated design flow for fast closure while delivering optimal results. The collaboration used the Galaxy Implementation Platform to produce a methodology tuned for the Mali GPU with ARM Artisan physical IP in 20nm. Primary tools used included Synopsys’ Design Compiler® synthesis, Formality® formal verification, DFTMAX™ and TetraMAX® test, IC Compiler™ layout, StarRC™ extraction and PrimeTime® timing analysis and signoff. In addition, IC Validator In-Design capabilities for physical verification were used during the implementation process to speed design closure. The methodology also benefitted from the use of DC Explorer & Dataflow Analyzer to perform early exploration, especially of floorplans and macro placement so critical to GPU performance.
“Twenty-nanometer and smaller process technologies introduce new complexity requiring early and deep technical collaboration among semiconductor ecosystem partners,” said Antun Domic, senior vice president and general manager, Implementation Group, Synopsys. “Through this collaboration with ARM, the Synopsys Galaxy Implementation Platform with In-Design physical verification combines with the ARM Mali IP and Artisan physical IP to provide a proven, DPT-compliant solution that will help accelerate the time to design closure on complex SoCs at 20 nanometers and below.”
ARM Mali SeeMore Demo: Lighting Effects, OpenGL ES 3 & Enlighten Engine – GDC 2013 [ARMflix YouTube channel, March 28, 2013]
– Mali Developer Tools, Augmented Reality, Lighting, SDKs & More at GDC [ARM Multimedia blog, April 2, 2013]
– Meet the experts in mobile graphics at GDC 2013 [With Imagination Blog, March 20, 2013]
– Imagination delivers latest version of leading tools for game development at GDC 2013 [press release, March 25, 2013]
Kishonti CLBenchmark Mali-T600 GPU Compute (MWC 2013) [ARMflix YouTube channel, March 5, 2013]
Source: CLBenchmark Results Database as of April 6, 2013.
– Intel® Core™ i3-3240 Processor (2 cores, 4 threads, 3M Cache, 3.40 GHz)
– Intel® Celeron® Processor B820 (2 cores, 2 threads, 2M Cache, 1.70 GHz)
– AMD A4-5300 (2 cores, 1M Cache, 3.40 GHz)
– AMD A6-4400M (2 cores, 1M cache, 2.7 GHz)
For the interpretation of the above benchmark apps, see the very end of this post.
Note that in pure GLBenchmark performance the T604 underperforms the latest Apple tablet and is not even significantly ahead of some other tablets:
- Nexus 10 GPU: Mali T604 (four cores) @500MHz
- iPad Mini GPU: SGX543MP2 (two cores) @250MHz
- iPad (4th generation) GPU: SGX554MP4 (four cores) @300MHz
- iPad (iPad 3) GPU: SGX543MP4 (four cores) @250MHz
- Onda V812 and Onda V972 have an SGX544MP2 (two cores) GPU
This might explain quite well why ARM was heavily pushing ahead with its 2nd generation T600 Series. (See also AllWinner A31 and A31s with PowerVR graphics [my other ‘USD 99 Allwinner’ blog, Jan 3 – March 29, 2013] for a complete understanding of Imagination’s PowerVR competition).
OpenCL benchmark CLBenchmark running on Google Nexus 10 (Android 4.2.1)! [KishontiLtd YouTube channel, Feb 12, 2013]
The Future of Mobile Gaming Panel Interview at GDC 2013 [ARMflix YouTube channel, April 3, 2013]
More information: What is the Future of Mobile Gaming? GDC Panel Summary [ARM Multimedia blog, April 3, 2013]
… The panel got off to a fine start with a debate on the importance of AAA gaming in the mobile space. This brought out a range of opinions from AAA being the main path for mobile and the mobile experience, with many believing that consumers are looking for bigger and better experiences from gaming on their mobile devices, and that AAA is key in creating the ‘wow’ factor for the next generation mobile devices.
Consumers will need high-end content like AAA quality games to drive the use of higher performance mobile devices. The alternative opinion was that with innovation being applied to casual gaming, the expectation is that we will move away from the current categories of games with an even larger number of gaming categories – with elements of regional aspects being built into the gaming experience. David from Unity talked about how short the half-life of games were at only 2 years compared to films which are 5-10 years. …
Remark: AAA Game [By Warren Schultz, About.com Guide, May 23, 2012]
A AAA game, or pronounced “triple-A game”, is generally a title developed by a large studio, funded by a massive budget.
These games will have a marketing budget in the multiple-millions of dollars, and are planned to earn out in excess of one million titles sold. Investors/publishers expect a multiple-of-cost return on their investment. In order to recoup general development costs, publishers will generally produce the title for the major platforms (currently Xbox 360, PS3, and PC) to maximize profits, unless it is a console exclusive, in which case the console maker will pay for exclusivity to offset the loss of potential profit to the developer.
Pronunciation: triple-A game
The Glu Mobile representative at the beginning of the above video is essentially stating that mobile-only gaming will sooner or later disrupt the console industry. So it is worth taking a look at the relevant excerpts from Glu Mobile Corporate Overview, Presentation at Roth Capital Investor Conference [March 18, 2013]:
Interpret’s New GameByte™ Data Shows Only Half of All Gamers Play Retail Console Games [Interpret LLC press release via BusinessWire, April 4, 2013]
Interpret, a leading entertainment, media and technology market research firm, today announced top-level findings from GameByte™, a syndicated study designed to understand cross-platform digital gaming adoption and behavior in ten global markets.
The service, now in its second year, studies consumers (age 6-64) of every form of video gaming, including both traditional retail business models and digital business models. The latest data reveals that 96% of all US gamers have played some form of digital game in the past six months. By contrast, only 53% of US gamers have played a traditional retail console game in the same period.
“The trend carries across all ten countries covered by GameByte,” said Jason Coston, senior analyst at Interpret. “If you’re a gamer, you’re a digital gamer. Retail console games still capture a significant portion of gamers, but several digital business models now command just as much market share: mobile game apps, social network games on PC, and casual games on PC.”
GameByte data also confirms the ubiquity of digital gaming in other countries traditionally focused on consoles, such as the UK and Japan. Ninety-four percent of UK gamers now play digital games, as well as 87% of Japanese gamers.
Interpret will soon roll out in-depth reports covering revenue sizes and gaming attitude and behavior in each territory over the coming months.
EA: Demise of console gaming ‘very premature’ [GameSpot, April 1, 2013]
COO Peter Moore says even though mobile is growing, gamers continue to show enthusiasm for core titles.
The demise of traditional console gaming is not a reality the industry faces, according to Electronic Arts chief operating officer Peter Moore. Speaking with Bloomberg TV, Moore said even though the mobile space has grown, gamers still want core titles they can play on a big screen.
“The console business is still a core part of our business; it’s the majority of our business. The demise of console gaming is very premature as far as we’re concerned,” Moore said.
“We still have thousands of people focused on developing current-generation Xbox 360 and PS3 games, as well as people focused now on the next generation when that finally arrives,” he added. “And so, people still want core games. People want to sit back in their living rooms, take advantage of their HD TVs, and play fully immersive games like [Battlefield 4].”
Also during the interview, Moore said he expects EA’s digital sales–which includes mobile, downloadable content, and subscriptions–to possibly overtake its traditional packaged goods business by 2015.
“In two years we could be looking at the tipping point where digital becomes bigger than the traditional core,” Moore said.
Moore is believed to be a leading candidate to take over as the next EA CEO. He would not comment on this conjecture, but praised John Riccitiello for leaving the company in “tremendous shape.” Moore said one thing the new EA CEO needs to do is execute.
“We did not execute to the level that we needed to in [fiscal year 2013] and [John Riccitiello] took accountability for that. And I think the future CEO will focus on pure execution because all the ingredients are there; we have the world’s best developers, we have a tremendous publishing pipeline, and we’ve made the hard decisions about our platform.”
ARM TechCon 2012 – Consumer Products Announced based on ARM Mali-T604 [ARMflix YouTube channel, Nov 5, 2012]
The Mali-T604 is available only from Samsung Electronics, as per Global Businesses Select ARM Mali GPU Technology [News on the Mali Developer Center of ARM, Feb 25, 2013]
“Samsung Smart TV has been leading market in transforming the viewing experiences of consumers in the living room. Through the adoption of the quad-core ARM Cortex-A15 processor and Mali-T604 GPUs, Samsung Smart TV, including the world’s first quad-core built-in F8000, will enable a new way of enjoying content on TV with innovative user interfaces and faster performance,” said Cheul-Hee Hahm, Master of R&D Team, Visual Display Business, Samsung Electronics, Co., Ltd.
In 2013 there will be a significant increase in the number of mass market smartphones based on Mali-400 and Mali-450 GPUs, and of high-end phones taking advantage of the high performance of the Mali-T600 family.
ARM® Mali™ Timbuktu2 based on Samsung® Exynos™ 5 Dual [ARMflix YouTube channel, Sept 10, 2012]
Note that, beyond mobile gaming, one should talk about the new Mali products in a more general context, such as: ARM Mali GPUs turn GPU Compute into reality at MWC [News at Mali Developer Center, Feb 22, 2013]
25th – 28th February 2013, MWC, Barcelona, Spain.
ARM stand at Mobile World Congress, Hall 6 Stand 6A31.
ARM will showcase a range of Mali™ GPU Compute use cases running on devices, demonstrating the benefits of Renderscript and OpenCL.
ARM Mali GPUs are the first to bring the benefits of GPU Compute to mobile devices. ARM is also the first IP vendor to pass OpenCL 1.1 Full Profile Khronos conformance test. GPU Compute ensures that the right task is placed in the right place at the right time, enabling greater performance efficiencies.
In a world where smartphones and tablets act as our primary compute platform for more than accessing the internet and social media, but also used to create and view videos and experience on-the-go gaming, leading companies are discovering new ways to ensure technology is making the phone last longer and do far more than ever before
You’ll discover how running a task on a GPU is faster, while enabling other tasks to be run at the same time. See firsthand how smart allocation of the tasks is far more efficient and is seamless to the user. GPU compute opens up new use cases whilst existing tasks are done more efficiently.
Mali GPUs are the first graphics technology to support Google Renderscript Compute, enabling real devices to bring new exciting features to consumers.
ARM is the first to offer Full Profile OpenCL™ support for mobile devices. ARM will show how OpenCL can be used in applications including high accuracy facial detection and multi-face detection – improving photography on mobile devices as well as creating significant performance improvements.
ARM continues to build a thriving and strong ecosystem around Mali GPU Compute, with strategic collaborations from leaders and experts across the whole industry. This is opening new markets for ARM partners and adding value to Mali GPU Compute users.
A key initial area to benefit from GPU compute – you will also be able to see the performance improvement possible when real-time image filters are applied to a camera feed and the performance improvements possible by moving the task from the CPU to the GPU. This demonstration shows the accelerations in image processing content made possible by Renderscript. ARM is committed to delivering more performance within a mobile power budget through innovative technologies which ensure a compute task is completed on the most energy efficient processing element. GPU Compute and big.LITTLE™ processing are the most recent examples of new technologies ensuring the right task can be run in the right place in the system.
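Real-time image filters suit GPU compute because each output pixel depends only on a small neighbourhood of input pixels and can be computed independently of all the others. The sketch below shows that structure with a 3x3 box blur in plain Python. It is not Renderscript or OpenCL code, and the function names are hypothetical; on a GPU the per-pixel function would run as one work item per pixel.

```python
def box_blur_pixel(image, x, y):
    """Average of the 3x3 neighbourhood around (x, y), clamping at the borders."""
    height, width = len(image), len(image[0])
    total = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            ny = min(max(y + dy, 0), height - 1)  # clamp row index to the image
            nx = min(max(x + dx, 0), width - 1)   # clamp column index to the image
            total += image[ny][nx]
    return total // 9


def box_blur(image):
    """Apply the filter to every pixel; no output depends on another output."""
    return [
        [box_blur_pixel(image, x, y) for x in range(len(image[0]))]
        for y in range(len(image))
    ]


if __name__ == "__main__":
    frame = [[0, 0, 0], [0, 90, 0], [0, 0, 0]]  # tiny grayscale test frame
    for row in box_blur(frame):
        print(row)
```

The sequential loops in `box_blur` are exactly the part a compute API parallelizes, which is why moving such a filter from the CPU to the GPU can free the CPU for other tasks.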
By supporting GPU Compute ARM Mali GPUs are expanding the potential use cases for tablets and smartphones:
RS Benchmark from Kishonti will run for the first time on a mobile based GPU showing the key features that GPU enables – only possible with Mali-T604
GPU Compute is also improving the gaming experience. You will see how a combination of OpenGL® ES 3.0 and OpenCL APIs offer a wider range of effects not seen before on mobile devices. OpenCL opens new levels of physics simulations and OpenGL ES 3.0 showcases effects such as showing the application of high dynamic range, adaptive luminance tone mapping and atmospheric scattering – features only normally seen in PC or console level gaming experiences.
ARM Mali GPUs are the first GPUs focused on the mobile space showing GPU Compute is a reality. GPU compute will enable:
New use cases previously not possible to perform on a mobile device enhancing the user experience
Make previous tasks more efficient – in conjunction with ARM big.LITTLE technology, GPU Compute is critical to running tasks using the most efficient part of the SoC
Synthesis Super-Resolution Scaler Demo on Exynos 5 Dual Powered Tablet at MWC 2013 [SamsungExynos YouTube channel, March 19, 2013]
Note that Samsung selected a PowerVR SGX544MP GPU core from Imagination for its Samsung Exynos 5410 Octa processor (or simply Exynos 5 Octa), as indicated by The PowerVR SGX544, a modern GPU for today’s leading platforms [With Imagination blog, March 13, 2013]. For other information see Samsung Announces the Availability of Exynos 5 Octa for New Generation of Mobile Devices [Samsung Semiconductor press release, March 15, 2013]. This first big.LITTLE processor, also the first to be manufactured using Samsung’s latest 28-nanometer (nm) HKMG (High-k Metal Gate) low power process and power-saving design, was released with the latest high-end and high-volume smartphone from Samsung, the Galaxy S 4 (“Samsung Altius”, which in the other half of its models uses a quad-core Qualcomm Snapdragon 600 APQ8064T SoC, manufactured by TSMC). See also: Samsung Introduces the GALAXY S 4 – A Life Companion for a richer, simpler and fuller life [March 14, 2013].
Samsung Exynos 5 Dual [Samsung microsite, Feb 28, 2012]
World’s First ARM Cortex-A15 based 1.7 GHz Dual-Core Mobile Application Processor
Exynos 5 Dual is the world’s first Cortex-A15 dual-core mobile CPU, presented by Samsung Semiconductor. Using 32nm HKMG (High-K Metal Gate) process technology, the 1.7GHz dual core Exynos 5 Dual brings unmatched performance to your leading-edge mobile devices while maintaining low power consumption.
Multitask with a Powerful, Energy-Efficient SoC
Exynos 5 Dual, using 32nm HKMG*, is designed to meet your graphic-intensive, multi-task and power efficient requirements. It performs nearly two times faster than the existing Cortex-A9 based dual core processor, with an amazing 30% lower power consumption than our previous Exynos processor developed on a 45nm process. Exynos 5 Dual is well qualified to lead the high-end mobile application processor market.
*HKMG process: High-K Metal Gate process
See more: Process Technology – 32/28nm | Samsung Semiconductor [Feb 16, 2012]
Enjoy a new level of 3D gaming and reading experience
The world’s highest class mobile 3D graphics processor makes games and images come alive! You will feel like you’re actually part of the game. Featuring stereoscopic 3D, Exynos 5 Dual could take you right to the middle of the cheering audience of your favorite football game. Enjoy reading? The Exynos 5 Dual supports WQXGA, providing high resolution for clear readability. It’s nearly like reading an actual newspaper.
Get your Mobile devices well connected to WQXGA display!
With Exynos 5 Dual, enjoy web-surfing, e-mailing, photos and videos at the best possible resolution, WQXGA, currently available for mobile devices. Exynos 5 Dual is equipped with an embedded DisplayPort (eDP) interface, compliant with panel self refresh (PSR) technology. The PSR function instructs the application processor not to send image data to the LCD panel when the set is displaying a still image, reducing power consumption. Exynos 5 Dual provides 12.8 GB/s memory bandwidth with 2-port 800MHz LPDDR3 for heavy traffic operations. Plus, a wide range of booting interfaces (SATA, UART, USB3.0, eMMC4.5) guarantees our end users crisp and sharp multimedia transmission.
Play 3D Stereoscopic video smoothly on your Full HD display without ever Encoding
Exynos 5 Dual’s powerful 8 megapixel resolution image signal processor fully supports best-in-class cameras with high resolution video recording and playback. The 1080p 60 fps multi format codec enables the highest quality FHD videos. Additionally, your device will be able to play almost any type of video format with the integrated MFC (Multi Format Codec).
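The 12.8 GB/s figure can be cross-checked with simple arithmetic. The sketch below is only a rough check, assuming each of the two ports is 32 bits wide (the width listed in the product catalogue specification quoted later in this post) and that the memory transfers data twice per clock cycle (double data rate). The function name is made up for illustration.

```python
def peak_bandwidth_gb_per_s(ports, bus_width_bits, clock_mhz, transfers_per_clock=2):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 10**9 bytes).

    Assumption: double data rate, i.e. two transfers per clock cycle.
    """
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_second = clock_mhz * 1e6 * transfers_per_clock
    return ports * bytes_per_transfer * transfers_per_second / 1e9


# 2 ports x 32 bit x 800 MHz, double data rate (LPDDR3 configuration)
lpddr3 = peak_bandwidth_gb_per_s(ports=2, bus_width_bits=32, clock_mhz=800)
# 2 ports x 32 bit x 533 MHz, double data rate (LPDDR2 configuration)
lpddr2 = peak_bandwidth_gb_per_s(ports=2, bus_width_bits=32, clock_mhz=533)

print(f"LPDDR3: {lpddr3:.1f} GB/s")
print(f"LPDDR2: {lpddr2:.3f} GB/s")
```

Under these assumptions the LPDDR3 configuration works out to the quoted 12.8 GB/s. This is a theoretical peak, and sustained bandwidth in real workloads will be lower.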
Samsung Announces Industry First ARM Cortex-A15 Processor Samples for Tablet Computers [Samsung Semiconductor press release, Nov 30, 2011] “The Exynos 5250 [>>> Exynos 5 Dual] is currently sampling to customers and is scheduled for mass-production in the second quarter of 2012.”
Samsung Opens New Opportunities for Mobile App Development with the Arndale Community Development Board [Samsung Semiconductor press release, Oct 26, 2012] “Arndale enables next generation of mobile applications development with the Exynos 5 Dual SoC, the world’s first production solution based on the ARM Cortex-A15 MPCore processor”
[Exynos 5 Dual] Arndale Board Video is NOW available! [Samsungsemi1 YouTube, Feb 7, 2013]
Enjoy the Ultimate WQXGA [2560×1600] Solution with Exynos 5 Dual [Samsung whitepaper, July 9, 2012]
World’s Best 3D Performance
Currently, the 3D graphics engine in mobile operating systems is used for 3D rendering and for all basic graphic work on the screen. Because the 3D graphic engine operates UI overlay, homescreen, 3D games, and more, 3D performance has become a very important feature for measuring Mobile AP’s overall performance. The 3D performance in the Exynos series has always been beyond compare; however, Exynos 5 Dual will raise the bar for mobile AP’s 3D performance even higher.
Screen resolution is directly related to 3D performance. WQXGA [2560×1600] has four times as many pixels as WXGA [1280×800], meaning that mobile APs must deliver 3D performance at least two times better than the previous generation. To meet the standard of WQXGA resolution, a mobile AP requires a new 3D engine and architecture.
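The resolution comparison is a pixel-count comparison, which a few lines of arithmetic make concrete. This is a minimal sketch; the helper names are illustrative, and the 60 frames-per-second target is taken from the whitepaper's own 60fps claim.

```python
WXGA = (1280, 800)     # width, height in pixels
WQXGA = (2560, 1600)


def pixel_count(resolution):
    """Number of pixels in one frame."""
    width, height = resolution
    return width * height


def pixels_per_second(resolution, frames_per_second):
    """Pixels that must be produced each second to sustain the frame rate."""
    return pixel_count(resolution) * frames_per_second


ratio = pixel_count(WQXGA) / pixel_count(WXGA)
print(f"WQXGA has {ratio:.0f}x the pixels of WXGA")
print(f"WQXGA at 60 fps: {pixels_per_second(WQXGA, 60):,} pixels/s")
```

Doubling both the width and the height quadruples the number of pixels per frame, which is why the jump from WXGA to WQXGA is so demanding on the 3D engine.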
Samsung System LSI worked closely with ARM to achieve the quad core Mali-T604, the most advanced mobile 3D engine to date. With Mali-T604, Exynos 5 Dual delivers two times better GPU performance than Exynos 4. Since Exynos 4 has more than enough 3D performance to satisfy WXGA [1280×800] resolution, Exynos 5 Dual is the only mobile AP that can handle WQXGA content with 60fps updates.
In addition, the 3D feature of Exynos 5 Dual fully supports GPGPU, including OpenCL v1.1 full profile.
GPGPU is a solution that distributes the CPU’s computation workload to the GPU. In GPGPU support, the floating point performance and precision of GPUs are the key factors. While CPUs can handle 64-bit floating point (double-precision), most mobile GPUs can only handle 32-bit floating point (single-precision). Exynos 5 Dual is the first mobile AP that can run double precision floating point and full precision with outstanding 72GFlops floating point performance. With this functionality, a developer can handle more precise and heavier computation workloads by simultaneously using the performance of Exynos 5 Dual’s dual Cortex-A15 cores and quad Mali-T604 cores.
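To illustrate why the distinction between single and double precision matters for compute workloads, the sketch below emulates 32-bit rounding on the CPU with Python's standard `struct` module. It is not OpenCL code and says nothing about Mali-T604 itself. The helper names are made up for illustration.

```python
import struct


def to_float32(value):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def accumulate(step, count, single_precision):
    """Add `step` to a running total `count` times.

    With single_precision=True every intermediate result is rounded to
    32 bits, mimicking an accumulator that only has single precision.
    """
    if single_precision:
        step = to_float32(step)
    total = 0.0
    for _ in range(count):
        total += step
        if single_precision:
            total = to_float32(total)
    return total


# 2**24 + 1 cannot be represented in single precision.
print(to_float32(16777217.0))

# Rounding errors accumulate far faster in single precision.
print("double:", accumulate(0.1, 100_000, single_precision=False))
print("single:", accumulate(0.1, 100_000, single_precision=True))
```

The exact sum is 10000. The double-precision total stays extremely close to it, while the emulated single-precision total drifts visibly, which is the kind of error that long-running simulations or large reductions have to guard against.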
Arndale Board Exynos 5250 ARM Cortex-A15 Mali-T604 Development Board [Charbax YouTube channel, Nov 1, 2012]
Samsung’s Exynos 5 Dual Application Processor Drives Industry’s Highest Resolution on Google’s Nexus 10 Tablet [Samsung Semiconductor press release, Nov 13, 2012]
Samsung Exynos 5 Dual Powers the New Google Chromebook [Samsung Semiconductor press release, Nov 21, 2012]
Samsung Exynos 5 Dual processor [Samsungsemi1 YouTube, Nov 2, 2012]
Samsung Exynos 5 Dual Processor (ARM® Cortex™-A15 based Dual core processor) at ARM techcon [Samsungsemi1 YouTube, Nov 1, 2012]
Exynos 5 Dual [Application Processor Product Catalogue | Samsung Semiconductor, April 26, 2012]
An application processor, or SoC (System on a Chip), is a microprocessor with a specialized architecture for deployment in embedded systems, such as digital still/video cameras, digital/smart TVs and set-top boxes, and automotive systems, among others. An SoC operates at frequencies from several hundred MHz to a few GHz, and is architected to deliver significant computing performances at low power consumption levels in limited board spaces. High-end SoCs often contain multiple cores, enabling them to deliver exceptional performances in applications such as digital imaging and multimedia devices.
Current-generation SoCs are capable of running full-fledged versions of modern operating systems, providing the user a rich, interactive interface on devices such as smartphones and tablet computers. Almost all the latest SoCs have the ability to decode a majority of multimedia codecs, and contain hardware engines to deliver enhanced multimedia experiences to the user. They also contain dedicated MMUs (memory management units) to manage the memory for applications being run on the device. Recent SoCs also have a multitude of peripheral connectivity solutions on the chip, offering the designer extensive control in providing connectivity options on the device. SoCs are application specific, and contain features targeted towards the intended deployment segment. Thus, an SoC designed for a mobile handset would include front-end GSM RF functionalities on-chip, which would be absent in an SoC designed for deployment in a digital still camera. An increasing number of SoCs, however, are now offering a wide range of features, making the processor suitable for deployment on any application. Samsung is a worldwide leader in providing the most advanced, efficient, and customizable SoC solutions for deployment on a wide range of platforms, such as digital imaging, multimedia, and mobile communication and computing. Samsung’s line of SoCs offers the highest performance, thermal stability, reliability, and I/O density in the smallest form factors at the lowest power consumption levels. Worldwide, Samsung is the preferred provider for SoC solutions for a majority of developers and OEMs for deployment on the broadest computing and communication devices and platforms.
- CortexA15 dual core subsystem with 64-/128-bit SIMD NEON
- 32KB (Instruction)/32KB (Data) L1 Cache and 1MB L2 Cache
- 128-bit Multi-layered bus architecture
- Internal ROM and RAM for secure booting, security, and general purposes
- Memory Subsystem
– 2-ports 32-bit 800MHz LPDDR3/DDR3 Interfaces
– 2-ports 32-bit 533MHz LPDDR2 Interfaces
- 8-bit ITU 601 Camera Interface
- Multi-format Video Hardware Codec: 1080p 60fps (capable of decoding and encoding MPEG-4/H.263/H.264 and decoding only MPEG-2/VC1/VP8)
- 3D and 2D graphics hardware, supporting OpenGL ES 1.1/2.0/Halti, OpenVG 1.1 and OpenCL 1.1 full profile
- Image Signal Processor : supporting BayerRGB up to 14bit input with 14.6MP 15fps, 8MP 30fps through MIPI CSI2 & YUV 8bit interfaces and special functionalities such as 3-dimensional noise reduction (3DNR), video digital image stabilization (VDIS) and optical distortion compensation (ODC)
- JPEG Hardware Codec
- LCD single display, supporting max WQXGA, 24bpp RGB, YUV formats through MIPI DSI or eDP
- Simultaneous display on a single WQXGA LCD display and 1080p HDMI
- HDMI 1.4 interfaces with on-chip PHY
- 2-ports (4-lanes) MIPI CSI2 interfaces
- 1-port (4-lanes) eDisplayPort (eDP)
- 1-channel USB 3.0 Device or Host, supporting SS (5Gbps) with on-chip PHY
- 1-channel USB 2.0 Host or Device, supporting LS/FS/HS (1.5Mbps/12Mbps/480Mbps) with on-chip PHY
- 2-channel USB HSIC, supporting 480Mbps with on-chip PHY
- 1-channel HS-MMC 4.5
- 1-channel SDIO 3.0
- 2-channel SD 2.0 or HS-MMC4.41
- 4-channel high-speed UART (up to 3Mbps data rate for Bluetooth 2.1 EDR and IrDA 1.0 SIR)
- 3-channel SPI
- 1-channel AC-97, 2-channel PCM, and 3-channel 24-bit I2S audio interface, supporting 5.1 channel audio
- 1-channel S/PDIF interface support for digital audio
- 4-channel I2C interface support (up to 400kbps) for PMIC, HDMI, and general-purpose multi-master
- 4-channel HS-I2C (up to 3.1 Mbps)
- Samsung Reconfigurable Processor supports low power audio play
- MIPI-HSI v1.1, supporting 200Mbps full-duplex
- C2C, supporting through path between DRAM and MODEM
- Security subsystem supporting hardware crypto accelerators, ARM TrustZone and TZASC
- 32-channel DMA Controller
- Configurable GPIOs
- Real time clock, PLLs, timer with PWM, multi-core timer, and watchdog timer
CLBenchmark – High-performance compute benchmark for OpenCL 1.1 environment [CLBenchmark.com, Oct 16, 2012]
The first professional OpenCL benchmark for desktop OSes
CLBenchmark 1.1 Desktop Edition is an easy-to-use tool for comparing the computational performance of different platforms. It offers an unbiased way of testing and comparing the performance of implementations of OpenCL 1.1, a royalty-free standard for heterogeneous parallel programming maintained by Khronos Group. CLBenchmark compares the strengths and weaknesses of different hardware architectures such as CPUs, GPUs and APUs. The test results are listed in a transparent and public OpenCL performance database.
Physics: SPH Fluid Simulation
Physics simulation has a great history in computer science, as its original goal was to help scientists and engineers in their design efforts. With increased computing capacity, physics came within reach of virtual world simulations, for example games. Enabling physics simulation can lift in-game interactions into a new dimension.
In our SPH Fluid simulation, we’ve created a particle based simulation consisting of 32k particles. The results of the simulation are displayed on a surface calculated by a Marching Cubes implementation. This technique is widely adopted among games, for simulating the movement of fluids, and even smoke, or other gases.
Raytracing is an image synthesis technique used in a wide variety of applications such as simulation-visualization, design, and special effects in movie making. This technique is also getting more attention as it is going to be available in real-time rendering, especially for games, which will enable developers to implement life-like lighting and shading models in their titles.
Our ray trace test implements the traditional recursive ray trace algorithm and supports reflections and soft shadows and also uses global illumination rays to replace the ambient term. The renderer uses a kd-tree acceleration structure with the kd-restart traversal technique. The scene consists of 600k triangles and is rendered at 2048×1024 resolution.
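The core of an SPH (smoothed-particle hydrodynamics) step is a density estimate: each particle sums kernel-weighted contributions from its neighbours. The benchmark description does not say which smoothing kernel or neighbour search CLBenchmark uses, so the sketch below is only an assumption-laden illustration. It uses the commonly used "poly6" kernel and a brute-force loop over all pairs, and every name in it is hypothetical.

```python
import math


def poly6_kernel(distance, smoothing_length):
    """Poly6 smoothing kernel in 3D; zero outside the smoothing radius."""
    if distance < 0 or distance > smoothing_length:
        return 0.0
    coefficient = 315.0 / (64.0 * math.pi * smoothing_length ** 9)
    return coefficient * (smoothing_length ** 2 - distance ** 2) ** 3


def densities(positions, particle_mass, smoothing_length):
    """Density at every particle: sum of mass * W(|r_i - r_j|, h) over all j.

    Brute force, O(n^2). A real implementation with tens of thousands of
    particles would use a spatial grid to visit only nearby particles.
    """
    result = []
    for position_i in positions:
        rho = 0.0
        for position_j in positions:
            distance = math.dist(position_i, position_j)
            rho += particle_mass * poly6_kernel(distance, smoothing_length)
        result.append(rho)
    return result


particles = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0), (5.0, 5.0, 5.0)]
rho = densities(particles, particle_mass=1.0, smoothing_length=0.5)
print(rho)
```

The three clustered particles get a higher density than the isolated one, which only sees its own contribution. In a GPU implementation each particle's inner loop would typically run as a separate work-item.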
The problem domain is divided into a grid of tiles (or frustums) that are processed separately – this saves memory. In addition, multiple devices can process different tiles at the same time, so this test can stress even multi-GPU systems. Most of the calculations are happening in the ray traversal kernel, which tries to find the nearest triangle that intersects the ray.
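The ray traversal kernel's job, finding the nearest triangle hit by a ray, can be sketched as follows. The benchmark description does not name the intersection test it uses, so this sketch assumes the well-known Möller–Trumbore algorithm. It also checks every triangle in turn instead of traversing a kd-tree, so it illustrates the goal of the kernel rather than its acceleration structure. All function names are illustrative.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def ray_triangle_distance(origin, direction, triangle, epsilon=1e-9):
    """Moller-Trumbore test: distance along the ray to the hit, or None."""
    v0, v1, v2 = triangle
    edge1, edge2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, edge2)
    det = dot(edge1, p)
    if abs(det) < epsilon:          # ray is parallel to the triangle plane
        return None
    inv_det = 1.0 / det
    t_vec = sub(origin, v0)
    u = dot(t_vec, p) * inv_det     # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    q = cross(t_vec, edge1)
    v = dot(direction, q) * inv_det  # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(edge2, q) * inv_det
    return t if t > epsilon else None  # ignore hits behind the origin


def nearest_hit(origin, direction, triangles):
    """Index and distance of the closest triangle hit by the ray."""
    best_index, best_distance = None, float("inf")
    for index, triangle in enumerate(triangles):
        distance = ray_triangle_distance(origin, direction, triangle)
        if distance is not None and distance < best_distance:
            best_index, best_distance = index, distance
    return best_index, best_distance


far = ((0, 0, 5), (1, 0, 5), (0, 1, 5))
near = ((0, 0, 2), (1, 0, 2), (0, 1, 2))
print(nearest_hit((0.25, 0.25, -1.0), (0, 0, 1), [far, near]))
```

With 600k triangles, testing every triangle for every ray is far too slow, which is why the renderer described above relies on a kd-tree to discard most triangles without testing them.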
Optical flow: Feature Matching
With this application we calculate the motion of the depicted object on a series of input images. For each image we calculate a vector field, which associates a motion vector to every pixel. These motion vectors are represented in colorspace. The color map used for this can be seen in the bottom left corner of the calculated vector field image.
In computer vision, we can consider as a feature anything that has a high vertical and horizontal gradient and is thus easily recognizable. A good feature can be robustly detected over a sequence of images. By matching these features over these image sequences, we can track the movement of objects.
We implemented the Moravec interest operator for our application, because it is easily parallelizable and can be easily and effectively implemented for the OpenCL platform. We developed a block-based matching strategy for tracking features. We applied the results of feature matching in a sample application in which we aim to calculate the velocity for each pixel. For this, we use a patch-based approach, calculating the sum of square differences for the neighborhood of the features.
The algorithm works on pairs of images. The first step is feature detection and matching. Each pair of features defines a motion vector. This sparse field of motion vectors is then revised heuristically to remove false matches. The dense vector field is constructed from this revised field.
Feature detection and the dense vector field calculation heavily utilize the image IO of the device. The device should also handle an increased number of kernel launches during this application.
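The two building blocks named above, the Moravec interest operator and block matching by sum of squared differences (SSD), can be sketched in a few lines. This is a sequential illustration under stated assumptions, not the benchmark's OpenCL implementation: it uses 3×3 patches, eight shift directions and a small search window, all chosen here for illustration. The function names are hypothetical, and callers must keep patches inside the image bounds.

```python
SHIFTS = ((0, 1), (0, -1), (1, 0), (-1, 0), (1, 1), (1, -1), (-1, 1), (-1, -1))


def patch_ssd(image_a, row_a, col_a, image_b, row_b, col_b, radius=1):
    """Sum of squared differences between two square patches."""
    total = 0
    for dr in range(-radius, radius + 1):
        for dc in range(-radius, radius + 1):
            diff = image_a[row_a + dr][col_a + dc] - image_b[row_b + dr][col_b + dc]
            total += diff * diff
    return total


def moravec_interest(image, row, col, radius=1):
    """Moravec operator: the smallest SSD between a patch and its shifted copies.

    Flat regions and straight edges score 0 because at least one shift
    leaves the patch unchanged; corners score high in every direction.
    """
    return min(patch_ssd(image, row, col, image, row + dr, col + dc, radius)
               for dr, dc in SHIFTS)


def match_feature(previous, current, row, col, radius=1, search=2):
    """Block matching: the offset whose patch in `current` has the lowest SSD."""
    best = None
    for dr in range(-search, search + 1):
        for dc in range(-search, search + 1):
            score = patch_ssd(previous, row, col, current, row + dr, col + dc, radius)
            if best is None or score < best[0]:
                best = (score, dr, dc)
    return best[1], best[2]


def corner_image(size, top, left):
    """Test image: a bright block whose top-left corner is at (top, left)."""
    return [[255 if r >= top and c >= left else 0 for c in range(size)]
            for r in range(size)]


frame_a = corner_image(10, 5, 5)
frame_b = corner_image(10, 6, 7)   # same block moved by 1 row and 2 columns
print(moravec_interest(frame_a, 5, 5))   # corner: high response
print(match_feature(frame_a, frame_b, 5, 5))
```

Each pixel's interest value and each feature's search are independent of the others, which is what makes the approach easy to map onto parallel work-items.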
From UI visualizations to graphics content creation and photography, image filters are extensively applied. As the most frequently used image filters are suites of convolution filters, we have included the most important types in CLBenchmark. In order to thoroughly examine the capabilities of the underlying hardware architectures, we have developed multiple implementations for a single filter.
Gauss Filter A Gauss filter is widely used for “smoothing” effects and, as it is a low-pass filter in frequency domain, it is also useful as a pre-pass of image resizing (down-sampling).
Sobel Filter A Sobel filter detects edges, so it is used in anti-aliasing filters and a variety of object recognition algorithms.
Median Filter Although the Median filter is not a convolution filter, it is widely accepted in the area of noise reduction, and is particularly effective against salt and pepper noise.
As a priority, we are trying to provide relevant real-world applications for benchmarking purposes. However, even a well selected set of use cases cannot match every possible workload, so we have also added synthetic tests. These are included in the Programming Principles group, containing multiple implementations of general problems into which real-world parallel problems can be decomposed.
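The three filters can be illustrated with a minimal 3×3 sketch. The kernel weights below are the textbook 3×3 Gauss and horizontal Sobel kernels, assumed here for illustration (the benchmark's actual kernel sizes and implementations are not described). Only interior pixels are produced, and all names are made up.

```python
GAUSS_3X3 = ((1, 2, 1), (2, 4, 2), (1, 2, 1))     # weights sum to 16
SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))    # horizontal gradient


def neighborhood(image, row, col):
    """The 3x3 block of pixels around (row, col), row by row."""
    return [image[row + dr][col + dc] for dr in (-1, 0, 1) for dc in (-1, 0, 1)]


def filter_interior(image, pixel_function):
    """Apply pixel_function to every interior 3x3 neighbourhood.

    Border pixels are skipped, so the output is 2 rows and 2 columns smaller.
    """
    rows, cols = len(image), len(image[0])
    return [[pixel_function(neighborhood(image, row, col))
             for col in range(1, cols - 1)]
            for row in range(1, rows - 1)]


def weighted_sum(kernel, scale=1.0):
    """Build a pixel function that weights the neighbourhood by `kernel`.

    The kernel is applied without flipping (correlation); for the symmetric
    Gauss kernel this is identical to convolution.
    """
    weights = [w for kernel_row in kernel for w in kernel_row]

    def apply(values):
        return scale * sum(w * v for w, v in zip(weights, values))

    return apply


def median(values):
    """Middle value of the sorted neighbourhood (not a weighted sum)."""
    return sorted(values)[len(values) // 2]


noisy = [[10] * 5 for _ in range(5)]
noisy[2][2] = 255                                  # one "salt" pixel
step_edge = [[0, 0, 100, 100, 100] for _ in range(5)]

denoised = filter_interior(noisy, median)
smoothed = filter_interior(noisy, weighted_sum(GAUSS_3X3, 1 / 16))
edges = filter_interior(step_edge, weighted_sum(SOBEL_X))
print(denoised, smoothed, edges, sep="\n")
```

The median filter removes the outlier completely, whereas the Gauss filter only spreads it over its neighbours. The Sobel output is non-zero only where the intensity changes from one column to the next.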
Scanning Inclusive prefix sum calculation. It is the base operation of dynamic data generation and of various sorting algorithms such as radix sort. Multiple implementations are included, such as a parallel (logarithmic) scan on local memory chunks and a mostly sequential case.
Bucketing Splits a single heterogeneous array into 5 homogeneous, compacted streams. Only a parallel scan based version is implemented.
Reduction Many-to-one operators such as “sum of an array” are used in reductions. We’ve found addition ideal, as the operator’s computation cost is the lowest possible, and we can focus on the algorithm itself. A more specific sum is also included, implemented to measure atomic addition on both global and local memory addresses.
Bitonic Merge Sort Sorting algorithms are used in a wide variety of applications, for example data structures, databases and computer graphics. Bitonic merge sort is a parallel sorting algorithm that first orders subsequences in local memory, then merges the result in global memory.
Tree-search Parallel search for multiple elements on an unbalanced tree using a depth-first strategy. It’s ideal for stressing the device’s resistance to branch divergence.
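Two of these primitives, the logarithmic inclusive scan and the bitonic sort, are sketched below as sequential Python simulations. They are illustrations under assumptions, not the benchmark's kernels: the scan uses the Hillis–Steele formulation, the sort requires a power-of-two input length, and neither models local versus global memory. In both functions the inner `for i` loop is the part that would run as independent parallel work-items on a GPU.

```python
def inclusive_scan(values):
    """Inclusive prefix sum in ceil(log2(n)) passes (Hillis-Steele scheme)."""
    result = list(values)
    offset = 1
    while offset < len(result):
        previous = list(result)            # all reads see the previous pass
        for i in range(offset, len(result)):
            result[i] = previous[i] + previous[i - offset]
        offset *= 2
    return result


def bitonic_sort(values):
    """Ascending bitonic sorting network; length must be a power of two."""
    data = list(values)
    n = len(data)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    size = 2
    while size <= n:                       # size of the bitonic runs being merged
        stride = size // 2
        while stride >= 1:                 # compare-exchange distance
            for i in range(n):
                partner = i ^ stride
                if partner > i:
                    ascending = (i & size) == 0
                    if (data[i] > data[partner]) == ascending:
                        data[i], data[partner] = data[partner], data[i]
            stride //= 2
        size *= 2
    return data


print(inclusive_scan([3, 1, 4, 1, 5]))
print(bitonic_sort([7, 3, 9, 1, 8, 2, 6, 4]))
```

Both algorithms perform the same fixed pattern of operations regardless of the data, which avoids branch divergence and is one reason they suit GPU hardware.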
CLBenchmark 1.1 Desktop Edition is available for community use and can be downloaded free of charge. This edition requires a network connection and collects information about your OpenCL devices. This method lets us supply you with proper, device specific OpenCL binaries, enables CLBenchmark to fully utilize your device and helps it achieve its peak performance.
For more information about downloading CLBenchmark 1.1 Desktop Community Edition, please click here.
CLBenchmark 1.1 Desktop Edition is also available for licensing, which is aimed at industry-leading technology companies for testing and optimizing their OpenCL implementations and thus bringing stable and efficient solutions to the market. Click here for more details or send us a message at email@example.com! Supported platforms: Windows, OS X and generic Linux.
For journalists, CLBenchmark 1.1 Desktop Edition is available in a special Media Edition. For more information, email us at firstname.lastname@example.org!