{"id":641,"date":"2022-10-18T16:30:13","date_gmt":"2022-10-18T16:30:13","guid":{"rendered":"https:\/\/fde.cat\/index.php\/2022\/10\/18\/ocp-summit-2022-open-hardware-for-ai-infrastructure\/"},"modified":"2022-10-18T16:30:13","modified_gmt":"2022-10-18T16:30:13","slug":"ocp-summit-2022-open-hardware-for-ai-infrastructure","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2022\/10\/18\/ocp-summit-2022-open-hardware-for-ai-infrastructure\/","title":{"rendered":"OCP Summit 2022: Open hardware for AI infrastructure"},"content":{"rendered":"<p><span><span>At OCP Summit 2022, we\u2019re announcing Grand Teton, our next-generation platform for AI at scale that we\u2019ll contribute to the OCP community.<\/span><\/span><\/p>\n<p><span>We\u2019re also sharing new innovations designed to support data centers as they advance to support new AI technologies:<\/span><\/p>\n<p>A new, more efficient version of Open Rack.<br \/>\n<span>Our Air-Assisted Liquid Cooling (AALC) design.<\/span><br \/>\n<span>Grand Canyon, our new HDD storage system.\u00a0<\/span><\/p>\n<p><span>You can view AR\/VR models of our latest hardware designs at:<\/span><a href=\"https:\/\/metainfrahardware.com\/\"> <span>https:\/\/metainfrahardware.com<\/span><\/a><\/p>\n<p><span>Empowering Open, the theme of this year\u2019s Open Compute Project (OCP) Global Summit, has always been at the heart of Meta\u2019s design philosophy. Open-source hardware and software is, and will always be, a pivotal tool to help the industry solve problems at large scale.<\/span><\/p>\n<p><span>Today, some of the greatest challenges our industry is facing at scale are around AI. How can we continue to facilitate and run the models that drive the experiences behind today\u2019s innovative products and services? And what will it take to enable the AI behind the innovative products and services of the future? 
As we move into the next computing platform, the metaverse, the need for new open innovations to power AI becomes even clearer.<\/span><\/p>\n<p><span>As a founding member of the OCP community, Meta has always embraced open collaboration. Our history of designing and contributing next-generation AI systems dates back to 2016, when we first announced<\/span><a href=\"https:\/\/engineering.fb.com\/2015\/12\/10\/ml-applications\/facebook-to-open-source-ai-hardware-design\/\" target=\"_blank\" rel=\"noopener\"><span> Big Sur<\/span><\/a><span>. That work continues today and is always evolving as we develop better systems to serve AI workloads.<\/span><\/p>\n<p><span>After <\/span><a href=\"https:\/\/tech.fb.com\/engineering\/2021\/11\/10-years-world-class-data-centers\/\" target=\"_blank\" rel=\"noopener\"><span>10 years of building world-class data centers<\/span> <\/a><span>and distributed compute systems, we\u2019ve come a long way from developing hardware independent of the software stack. Our AI and machine learning (ML) models are becoming increasingly powerful and sophisticated and need more high-performance infrastructure to match. Deep learning recommendation models (<\/span><a href=\"https:\/\/ai.facebook.com\/blog\/dlrm-an-advanced-open-source-deep-learning-recommendation-model\/\" target=\"_blank\" rel=\"noopener\"><span>DLRMs<\/span><\/a><span>), for example, have on the order of tens of trillions of parameters and can require a zettaflop of compute to train.<\/span><\/p>\n<p><span>At this year\u2019s OCP Summit, we\u2019re sharing our journey as we continue to enhance our data centers to meet Meta\u2019s, and the industry\u2019s, large-scale AI needs. 
From new platforms for training and running AI models, to power and rack innovations to help our data centers handle AI more efficiently, and even new developments with PyTorch, our signature machine learning framework \u2013 we\u2019re releasing open innovations to help solve industry-wide challenges and push AI into the future. <\/span><\/p>\n<h2><span>Grand Teton: AI platform<\/span><\/h2>\n<p><span>We\u2019re announcing Grand Teton, our next-generation, GPU-based hardware platform, a follow-up to our Zion-EX platform. Grand Teton has multiple performance enhancements over its predecessor,<\/span><a href=\"https:\/\/engineering.fb.com\/2019\/03\/14\/data-center-engineering\/accelerating-infrastructure\/\"> <span>Zion<\/span><\/a><span>, such as 4x the host-to-GPU bandwidth, 2x the compute and data network bandwidth, and 2x the power envelope. Grand Teton also has an integrated chassis in contrast to Zion-EX, which comprises multiple independent subsystems.<\/span><\/p>\n<p><span>As AI models become increasingly sophisticated, so will their associated workloads. Grand Teton has been designed with greater compute capacity to better support memory-bandwidth-bound workloads at Meta, such as our open source <\/span><a href=\"https:\/\/ai.facebook.com\/blog\/dlrm-an-advanced-open-source-deep-learning-recommendation-model\/\"><span>DLRMs<\/span><\/a><span>. Grand Teton\u2019s expanded operational compute power envelope also optimizes it for compute-bound workloads, such as content understanding.\u00a0<\/span><\/p>\n<p><span>The previous-generation Zion platform consists of three boxes: a CPU head node, a switch sync system, and a GPU system, and requires external cabling to connect everything. 
Grand Teton integrates this into a single chassis with fully integrated power, control, compute, and fabric interfaces for better overall performance, signal integrity, and thermal performance.\u00a0<\/span><\/p>\n<p><span>This high level of integration dramatically simplifies the deployment of Grand Teton, allowing it to be introduced into data center fleets faster and with fewer potential points of failure, while providing rapid scale with increased reliability.<\/span><span><br \/>\n<\/span><\/p>\n\n<h2><span>Rack and power innovations<\/span><\/h2>\n<h3><span>Open Rack v3<\/span><\/h3>\n<p><span>The latest edition of our Open Rack hardware is here to offer a common rack and power architecture for the entire industry. To bridge the gap between present and future data center needs, Open Rack v3 (ORV3) is designed with flexibility in mind, with a frame and power infrastructure capable of supporting a wide range of use cases \u2014 including support for Grand Teton.<\/span><\/p>\n<p><span>ORV3\u2019s power shelf isn\u2019t bolted to the busbar. Instead, the power shelf installs anywhere in the rack, which enables flexible rack configurations. Multiple shelves can be installed on a single busbar to support 30kW racks, while 48VDC output will support the higher power transmission needs of future AI accelerators.<\/span><\/p>\n<p><span>It also features an improved battery backup unit, upping the capacity to four minutes, compared with the previous model\u2019s 90 seconds, and with a power capacity of 15kW per shelf. Like the power shelf, this backup unit installs anywhere in the rack for customization and provides 30kW when installed as a pair.<\/span><\/p>\n<p><span>Meta chose to develop almost every component of the ORV3 design through OCP from the beginning. 
While an ecosystem-led design can result in a lengthier design process than that of a traditional in-house design, the end product is a holistic infrastructure solution that can be deployed at scale with improved flexibility, full supplier interoperability, and a diverse supplier ecosystem.\u00a0<\/span><\/p>\n<p><span>You can join our efforts at: <\/span><a href=\"https:\/\/www.opencompute.org\/projects\/rack-and-power\" target=\"_blank\" rel=\"noopener\"><span>https:\/\/www.opencompute.org\/projects\/rack-and-power<\/span><\/a><\/p>\n\n<h3><span>Machine learning cooling trends vs. cooling limits<\/span><\/h3>\n<p><span>With higher socket power comes increasingly complex thermal management overhead. The ORV3 ecosystem has been designed to accommodate several different forms of liquid cooling strategies, including air-assisted liquid cooling and facility water cooling. The ORV3 ecosystem also includes an optional blind mate liquid cooling interface design, providing dripless connections between the IT gear and the liquid manifold, which allows for easier servicing and installation of the IT gear.<\/span><\/p>\n<p><span>In 2020, we formed a new OCP focus group, the ORV3 Blind Mate Interfaces Group, with other industry experts, suppliers, solution providers, and partners, where we are developing interface specifications and solutions. These include rack interfaces and structural enhancements to support liquid cooling, blind mate quick (liquid) connectors, blind mate manifolds, hose and tubing requirements, blind mate IT gear design concepts, and various white papers on best practices.<\/span><\/p>\n<p><span>You might be asking yourself, why is Meta so focused on all these areas? The power trend increases we are seeing, and the need for liquid cooling advances, are forcing us to think differently about all elements of our platform, rack and power, and data center design. 
The chart below shows projections of increased high-bandwidth memory (HBM) and training module power growth over several years, as well as how these trends will require different cooling technologies over time and the limits associated with those technologies.<\/span><\/p>\n<p>You can see that with facility water cooling strategies, we can accommodate north of 1000W sockets, with as much as 50W of HBM per stack.<\/p>\n<p><span>You can join our efforts at: <\/span><a href=\"https:\/\/www.opencompute.org\/projects\/cooling-environments\" target=\"_blank\" rel=\"noopener\"><span>https:\/\/www.opencompute.org\/projects\/cooling-environments<\/span><\/a><span>\u00a0<\/span><\/p>\n<h2><span>Grand Canyon: Next-gen storage for AI infrastructure<\/span><\/h2>\n<p><span>Supporting ever-advancing AI models also means having the best possible storage solutions to support our AI infrastructure. Grand Canyon is Meta\u2019s next-generation storage platform, featuring improved hardware security and future upgrades of key commodities. The platform is designed to support higher-density HDDs without performance degradation and with improved power utilization.\u00a0<\/span><\/p>\n<h2><span>Launching the PyTorch Foundation<\/span><\/h2>\n<p><span>Since 2016, when we first partnered with the AI research community to create PyTorch, it has grown to become one of the leading platforms for AI research and production applications. In September of this year, we announced the next step in PyTorch\u2019s journey to accelerate innovation in AI. PyTorch is moving under the Linux Foundation umbrella as a new, independent <\/span><a href=\"https:\/\/ai.facebook.com\/blog\/pytorch-foundation\/\" target=\"_blank\" rel=\"noopener\"><span>PyTorch Foundation<\/span><\/a><span>.<\/span><\/p>\n<p><span>While Meta will continue to invest in PyTorch, and use it as our primary framework for AI research and production applications, the PyTorch Foundation will act as a responsible steward. 
It will support PyTorch through conferences, training courses, and other initiatives. The foundation\u2019s mission is to foster and sustain an ecosystem of vendor-neutral projects with PyTorch that will help drive industry-wide adoption of AI tooling.<\/span><\/p>\n<p><span>We remain fully committed to PyTorch. And we believe this approach is the fastest and best way to build and deploy new systems that will not only address real-world needs, but also help researchers answer fundamental questions about the nature of intelligence.\u00a0<\/span><\/p>\n<h2><span>The future of AI infrastructure<\/span><\/h2>\n<p><span>At Meta, we\u2019re all-in on AI. But the future of AI won\u2019t come from us alone. It\u2019ll come from collaboration \u2013 the sharing of ideas and technologies through organizations like OCP. We\u2019re eager to continue working together to build new tools and technologies to drive the future of AI. And we hope that you\u2019ll all join us in our various efforts. Whether it\u2019s developing new approaches to AI today or radically rethinking hardware design and software for the future, we\u2019re excited to see what the industry has in store next. <\/span><\/p>\n<p>The post <a href=\"https:\/\/engineering.fb.com\/2022\/10\/18\/open-source\/ocp-summit-2022-grand-teton\/\">OCP Summit 2022: Open hardware for AI infrastructure<\/a> appeared first on <a href=\"https:\/\/engineering.fb.com\/\">Engineering at Meta<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>At OCP Summit 2022, we\u2019re announcing Grand Teton, our next-generation platform for AI at scale that we\u2019ll contribute to the OCP community. We\u2019re also sharing new innovations designed to support data centers as they advance to support new AI technologies: A new, more efficient version of Open Rack. 
Our Air-Assisted Liquid Cooling (AALC) design.&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2022\/10\/18\/ocp-summit-2022-open-hardware-for-ai-infrastructure\/\">Continue reading <span class=\"screen-reader-text\">OCP Summit 2022: Open hardware for AI infrastructure<\/span><\/a><\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-641","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":501,"url":"https:\/\/fde.cat\/index.php\/2021\/11\/09\/ocp-summit-2021-open-networking-hardware-lays-the-groundwork-for-the-metaverse\/","url_meta":{"origin":641,"position":0},"title":"OCP Summit 2021: Open networking hardware lays the groundwork for the metaverse","date":"November 9, 2021","format":false,"excerpt":"Open infrastructure technologies and networking hardware will play an important role as we build new technologies for the metaverse, where billions of people will someday come together in virtual spaces. As we head toward the next major computing platform with a continued spirit of embracing openness and disaggregation, we\u2019re announcing\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":730,"url":"https:\/\/fde.cat\/index.php\/2023\/06\/29\/metas-evenstar-is-transitioning-to-ocp-to-accelerate-open-ran-adoption\/","url_meta":{"origin":641,"position":1},"title":"Meta\u2019s Evenstar is transitioning to OCP to accelerate open RAN adoption","date":"June 29, 2023","format":false,"excerpt":"Meta is transferring its IP for Evenstar, a program to accelerate the adoption of open RAN technologies, to the Open Compute Project (OCP). 
Meta will contribute Evenstar\u2019s radio unit design to OCP, giving the telecom industry its first open, white box radio unit solution. The TIP Open RAN community will\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":836,"url":"https:\/\/fde.cat\/index.php\/2024\/03\/12\/building-metas-genai-infrastructure\/","url_meta":{"origin":641,"position":2},"title":"Building Meta\u2019s GenAI Infrastructure","date":"March 12, 2024","format":false,"excerpt":"Marking a major investment in Meta\u2019s AI future, we are announcing two 24k GPU clusters. We are sharing details on the hardware, network, storage, design, performance, and software that help us extract high throughput and reliability for various AI workloads. We use this cluster design for Llama 3 training. We\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":458,"url":"https:\/\/fde.cat\/index.php\/2021\/09\/20\/cachelib-facebooks-open-source-caching-engine-for-web-scale-services\/","url_meta":{"origin":641,"position":3},"title":"CacheLib, Facebook\u2019s open source caching engine for web-scale services","date":"September 20, 2021","format":false,"excerpt":"Caching plays an important role in helping people access their information efficiently. For example, when an email app loads, it temporarily caches some messages, so the user can refresh the page without the app retrieving the same messages. However, large-scale caching has long been a complex engineering challenge. 
Companies must\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":346,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/open-sourcing-a-more-precise-time-appliance\/","url_meta":{"origin":641,"position":4},"title":"Open-sourcing a more precise time appliance","date":"August 31, 2021","format":false,"excerpt":"Facebook engineers have built and open-sourced an Open Compute Time Appliance, an important component of the modern timing infrastructure. To make this possible, we came up with the Time Card \u2014 a PCI Express (PCIe) card that can turn almost any commodity server into a time appliance. With the help\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":655,"url":"https:\/\/fde.cat\/index.php\/2022\/11\/21\/how-precision-time-protocol-is-being-deployed-at-meta\/","url_meta":{"origin":641,"position":5},"title":"How Precision Time Protocol is being deployed at Meta","date":"November 21, 2022","format":false,"excerpt":"Implementing Precision Time Protocol (PTP) at Meta allows us to synchronize the systems that drive our products and services down to nanosecond precision. 
PTP\u2019s predecessor, Network Time Protocol (NTP), provided us with millisecond precision, but as we scale to more advanced systems on our way to building the next computing\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/641","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=641"}],"version-history":[{"count":0,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/641\/revisions"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=641"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=641"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=641"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}