… while neither Amazon nor Google publicize their server designs yet
Designing Cloud Infrastructure for 1m+ Server Scale [“cloud scale”] – Kushagra Vaid (General Manager, Cloud Server Engineering, Microsoft) [Open Compute Project YouTube channel, Jan 29, 2014]
Today Microsoft announced that it will be joining the Open Compute Project Foundation (OCP) and will be contributing hardware specifications, design collateral (CAD and Gerber files), and system management source code for Microsoft’s cloud server designs. These specifications apply to the server fleet being deployed for Microsoft’s largest global cloud services, including Windows Azure, Bing, and Office 365. This significant contribution demonstrates our continued commitment to sharing our key learnings and experiences from more than 19 years of operating online services with the industry.
Microsoft manages a global portfolio of datacenters across all continents, has an installed base of over one million servers, and delivers more than 200 services for 1+ billion customers and 20+ million businesses in 90+ markets. Deploying and operating a huge cloud-scale [Cloud-Scale Data Centers, Feb 11, 2013; see also Microsoft Cloud-Scale Data Center designs [Microsoft Data Centers Blog, March 26, 2013]] infrastructure requires careful attention to several system design principles:
- Simplicity of the design is essential, since at cloud-scale the smallest issues can get magnified and potentially cause unexpected availability issues for customers.
- Efficiency gains across cost, power, and performance vectors are required to deliver the lowest total cost of ownership (TCO).
- Modular system design provides flexibility to accommodate hardware changes necessary for evolving workload requirements, plus it helps streamline the integration of new technologies.
- Supply chain agility is essential for adapting to rapid variations in server capacity demand signals.
- Ease of operations is key to ensuring system management at scale and cost effective servicing for hardware failures in the datacenter.
- Environmental sustainability is an important part of our cloud strategy. This includes minimizing material use and ensuring re-use of components wherever possible across the server lifecycle.
Based on these guiding principles, Microsoft has designed an innovative system architecture that we believe will drive design and operational efficiency beyond the conventional commodity servers currently available in the market. The key design features include:
Chassis-based shared design for cost and power efficiency
- EIA rack mountable 12U Chassis leverages existing industry standards
- Modular design for simplified solution assembly: mountable sidewalls, 1U trays, high efficiency commodity power supplies, large fans for efficient air movement, management card
- Up to 24 commodity servers per chassis (two servers side-by-side), option for JBOD storage expansion
- Optimized for mass contract manufacturing
- Up to 40% cost savings and 15% power efficiency benefits vs. traditional enterprise servers
- Estimated to save 10,000 tons of metal per one million servers manufactured
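The density and material figures in this list can be cross-checked with simple arithmetic. The sketch below is illustrative only: it assumes the full 12U height is filled with 1U trays, that every chassis is fully populated, and that "tons" means metric tons (the announcement does not say which). The function names are hypothetical and not part of any Microsoft tooling.

```python
import math

CHASSIS_HEIGHT_U = 12   # from the list above: 12U chassis
TRAY_HEIGHT_U = 1       # from the list above: 1U trays
SERVERS_PER_TRAY = 2    # from the list above: two servers side-by-side


def servers_per_chassis() -> int:
    """Servers in one fully populated chassis (assumes all 12U hold trays)."""
    return (CHASSIS_HEIGHT_U // TRAY_HEIGHT_U) * SERVERS_PER_TRAY


def chassis_needed(server_count: int) -> int:
    """Chassis required for server_count servers, rounding up."""
    return math.ceil(server_count / servers_per_chassis())


def metal_saved_kg_per_server(tons_saved: float = 10_000,
                              servers: int = 1_000_000) -> float:
    """Metal saved per server in kg, assuming metric tons (1 t = 1000 kg)."""
    return tons_saved * 1000 / servers


print(servers_per_chassis())        # 24
print(chassis_needed(1_000_000))    # 41667
print(metal_saved_kg_per_server())  # 10.0
```

Under these assumptions, twelve trays of two servers reproduce the quoted maximum of 24 servers per chassis, and the quoted metal saving works out to roughly 10 kg per server.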
Blind-mated signal connectivity for servers
- Decoupled architecture for server node and chassis enabling simplified installation and repair
- Cable-free design results in significantly fewer operator errors during servicing
- Reduction of ‘No problem found’ incidents from loose cables
- Up to 50% improvement in deployment and service times
Network and storage cabling via backplane architecture
- Passive PCB backplane for simplicity and signal integrity risk reduction
- Architectural flexibility for multiple network types such as 10GbE/40GbE, copper/optical
- One-time cable install during chassis assembly at factory
- No cable touch required during production operations and on-site support
- Expected to save 1,100 miles of cable for a deployment of one million servers
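To put the cable figure in per-server terms, here is a rough conversion. It assumes international miles and an even spread of the saving across all servers; the 24-servers-per-chassis default is the maximum quoted in the announcement, and the function names are hypothetical.

```python
METERS_PER_MILE = 1609.344  # international mile


def cable_saved_m_per_server(miles_saved: float = 1_100,
                             servers: int = 1_000_000) -> float:
    """Average cable saved per server, in meters."""
    return miles_saved * METERS_PER_MILE / servers


def cable_saved_m_per_chassis(servers_per_chassis: int = 24) -> float:
    """Average cable saved per fully populated chassis, in meters."""
    return cable_saved_m_per_server() * servers_per_chassis


print(round(cable_saved_m_per_server(), 2))   # 1.77
print(round(cable_saved_m_per_chassis(), 1))  # 42.5
```

That is a little under two meters of cable per server, or about 42 meters per full chassis, if the saving is spread evenly.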
Secure and scalable systems management
- x86 SoC-based management card per chassis
- Multiple layers of security for hardware operations: TPM secure boot, SSL transport for commands, Role-based authentication via Active Directory domain
- REST API and CLI interfaces for scalable systems management
- Support for server diagnostics and self-health checks
- Up to 75% improvement in operational agility vs. traditional enterprise servers
The Microsoft cloud server is a revolutionary design that brings the benefits of commoditization and cloud-scale operations to the industry. The specifications we’re contributing to OCP embody our long history and deep experience in datacenter architecture and cloud computing, and our commitment to sharing our cloud infrastructure best practices with the industry since 2007. As part of joining OCP, Microsoft will be making the following contributions for our Microsoft cloud server design and manufacturing collateral:
- Hardware specifications
  - Server, mezzanine card, tray, chassis, and management card
  - Management APIs and protocols (for chassis and server)
- Mechanical CAD models
  - Chassis, server, chassis manager, and mezzanines
- Gerber files
  - Management card, power distribution board, and tray backplane
- Source code for Chassis infrastructure
  - Server management, fan and power supply control, diagnostics and repair policy
Microsoft will also be engaging in the OCP community via active participation in the various sub-committees and engineering forums. I am pleased to announce that Mark Shaw, Director of Hardware Development on my team, has been appointed as the Chair of the Server committee via the OCP community voting process. Additionally, MS Open Tech is releasing an open source implementation of the Chassis Manager specification [“As part of this effort, MS Open Tech is releasing an open source reference implementation of the Chassis Manager specification. Today, this code, is available on GitHub, and implements functions such as server management, and fan and power supply control.”]. We would like to help to build an open source software community for this project within OCP.
Our hardware partners are developing products for Microsoft based on these specifications and we look forward to availability of commercial offerings from our partners in the near future.
We are excited to share our cloud infrastructure learnings and operational experiences with the broader community to help drive the industry efficiencies forward, reduce the cost of hardware for all participants, and accelerate the adoption of cloud computing. You can find more information about the Microsoft cloud server specification via my customer discussion video, our white paper and at www.opencompute.org.
Compare this with the current (certified) and upcoming (new) boards from Intel based on the current OCP specifications: Decathlete for financial services; Windmill, the design Facebook developed with Intel for dense servers; and the upcoming Leopard, the next-generation compute module for Facebook:
But keep in mind Intel’s advanced interest in:
All from Designing the Datacenter of the Future – Eric Hooper (Director, Cloud Service Provider Optimization, Intel Corporation) [Open Compute Project YouTube channel, Jan 29, 2014] video:
In that video there is also a testimonial from Goldman Sachs, which uses the jointly developed Decathlete design (code-named “Swiss Army Knife”).