{"id":836,"date":"2024-03-12T15:00:49","date_gmt":"2024-03-12T15:00:49","guid":{"rendered":"https:\/\/fde.cat\/index.php\/2024\/03\/12\/building-metas-genai-infrastructure\/"},"modified":"2024-03-12T15:00:49","modified_gmt":"2024-03-12T15:00:49","slug":"building-metas-genai-infrastructure","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2024\/03\/12\/building-metas-genai-infrastructure\/","title":{"rendered":"Building Meta\u2019s GenAI Infrastructure"},"content":{"rendered":"<p><span>Marking a major investment in Meta\u2019s AI future, we are announcing two 24k GPU clusters. We are sharing details on the hardware, network, storage, design, performance, and software that help us extract high throughput and reliability for various AI workloads. We use this cluster design for Llama 3 training.<\/span><br \/>\n<span>We are strongly committed to open compute and open source. We built these clusters on top of <\/span><a href=\"https:\/\/engineering.fb.com\/2022\/10\/18\/open-source\/ocp-summit-2022-grand-teton\/\" target=\"_blank\" rel=\"noopener\"><span>Grand Teton<\/span><\/a><span>, <\/span><a href=\"https:\/\/www.opencompute.org\/wiki\/Open_Rack\/SpecsAndDesigns\" target=\"_blank\" rel=\"noopener\"><span>OpenRack<\/span><\/a><span>, and <\/span><a href=\"https:\/\/pytorch.org\/\" target=\"_blank\" rel=\"noopener\"><span>PyTorch<\/span><\/a><span> and continue to push open innovation across the industry.<\/span><br \/>\n<span>This announcement is one step in our ambitious infrastructure roadmap. By the end of 2024, we\u2019re aiming to continue to grow our infrastructure build-out that will include 350,000 NVIDIA H100 GPUs as part of a portfolio that will feature compute power equivalent to nearly 600,000 H100s.<\/span><\/p>\n<p><span>To lead in developing AI means leading investments in hardware infrastructure. Hardware infrastructure plays an important role in AI\u2019s future. 
Today, we\u2019re sharing details on two versions of our <\/span><span>24,576-GPU data center scale cluster at Meta. These clusters support our current and next-generation AI models, including Llama 3, the successor to<\/span><a href=\"https:\/\/ai.meta.com\/llama\/open-innovation-ai-research-community\"> <span>Llama 2<\/span><\/a><span>, our publicly released LLM, as well as AI research and development across GenAI and other areas.<\/span><\/p>\n<h2><span>A peek into Meta\u2019s large-scale AI clusters<\/span><\/h2>\n<p><span>Meta\u2019s long-term vision is to build artificial general intelligence (AGI) that is open and built responsibly so that it can be widely available for everyone to benefit from. As we work towards AGI, we have also worked on scaling our clusters to power this ambition. The progress we make towards AGI creates new products,<\/span><a href=\"https:\/\/about.fb.com\/news\/2023\/09\/introducing-ai-powered-assistants-characters-and-creative-tools\/\" target=\"_blank\" rel=\"noopener\"> <span>new AI features for our family of apps<\/span><\/a><span>, and new AI-centric computing devices.\u00a0<\/span><\/p>\n<p><span>While we\u2019ve had a long history of building AI infrastructure, we first shared details on our <\/span><a href=\"https:\/\/ai.meta.com\/blog\/ai-rsc\/\" target=\"_blank\" rel=\"noopener\"><span>AI Research SuperCluster (RSC)<\/span><\/a><span>, featuring 16,000 NVIDIA A100 GPUs, in 2022. RSC has accelerated our open and responsible AI research by helping us build our first generation of advanced AI models. 
It played and continues to play an important role in the development of <\/span><a href=\"https:\/\/arxiv.org\/abs\/2302.13971\" target=\"_blank\" rel=\"noopener\"><span>Llama<\/span><\/a><span> and <\/span><a href=\"https:\/\/arxiv.org\/abs\/2307.09288\" target=\"_blank\" rel=\"noopener\"><span>Llama 2<\/span><\/a><span>, as well as advanced AI models for applications ranging from computer vision, NLP, and speech recognition, to<\/span><a href=\"https:\/\/ai.meta.com\/blog\/emu-text-to-video-generation-image-editing-research\/\" target=\"_blank\" rel=\"noopener\"> <span>image generation<\/span><\/a><span>, and even<\/span> <a href=\"https:\/\/ai.meta.com\/blog\/code-llama-large-language-model-coding\/\" target=\"_blank\" rel=\"noopener\"><span>coding<\/span><\/a><span>.<\/span><\/p>\n\n<h2><span>Under the hood<\/span><\/h2>\n<p><span>Our newer AI clusters build upon the successes and lessons learned from RSC. We focused on building end-to-end AI systems with a major emphasis on researcher and developer experience and productivity. The efficiency of the high-performance network fabrics within these clusters and some of the key storage decisions, combined with the 24,576 NVIDIA Tensor Core H100 GPUs in each, allow both cluster versions to support models larger and more complex than could be supported in RSC, and pave the way for advancements in GenAI product development and AI research.<\/span><\/p>\n<h3><span>Network<\/span><\/h3>\n<p><span>At Meta, we handle hundreds of trillions of AI model executions per day. Delivering these services at a large scale requires a highly advanced and flexible infrastructure. 
Custom designing much of our own hardware, software, and network fabrics allows us to optimize the end-to-end experience for our AI researchers while ensuring our data centers operate efficiently.\u00a0<\/span><\/p>\n<p><span>With this in mind, we built one cluster with a remote direct memory access (RDMA) over converged Ethernet (RoCE) network fabric solution based on the <\/span><a href=\"https:\/\/www.arista.com\/assets\/data\/pdf\/Datasheets\/7800R3-Data-Sheet.pdf\" target=\"_blank\" rel=\"noopener\"><span>Arista 7800<\/span><\/a><span> with <\/span><a href=\"https:\/\/engineering.fb.com\/2021\/11\/09\/data-center-engineering\/ocp-summit-2021\/\" target=\"_blank\" rel=\"noopener\"><span>Wedge400<\/span><\/a><span> and <\/span><a href=\"https:\/\/engineering.fb.com\/2021\/11\/09\/data-center-engineering\/ocp-summit-2021\/\" target=\"_blank\" rel=\"noopener\"><span>Minipack2<\/span><\/a><span> OCP rack switches. The other cluster features an <\/span><a href=\"https:\/\/www.nvidia.com\/en-us\/networking\/quantum2\/\" target=\"_blank\" rel=\"noopener\"><span>NVIDIA Quantum2 InfiniBand<\/span><\/a><span> fabric. Both of these solutions interconnect 400 Gbps endpoints. With these two clusters, we are able to assess the suitability and scalability of these <\/span><a href=\"https:\/\/engineering.fb.com\/2023\/11\/15\/networking-traffic\/watch-metas-engineers-on-building-network-infrastructure-for-ai\/\" target=\"_blank\" rel=\"noopener\"><span>different types of interconnect for large-scale training,<\/span><\/a><span> giving us more insights that will help inform how we design and build even larger, scaled-up clusters in the future. 
Through careful co-design of the network, software, and model architectures, we have successfully used both RoCE and InfiniBand clusters for large GenAI workloads (including our ongoing training of Llama 3 on our RoCE cluster) without any network bottlenecks.<\/span><\/p>\n<h3><span>Compute<\/span><\/h3>\n<p><span>Both clusters are built using<\/span> <a href=\"https:\/\/engineering.fb.com\/2022\/10\/18\/open-source\/ocp-summit-2022-grand-teton\/\" target=\"_blank\" rel=\"noopener\"><span>Grand Teton<\/span><\/a><span>, our in-house-designed, open GPU hardware platform that we\u2019ve contributed to the Open Compute Project (OCP). Grand Teton builds on the many generations of AI systems that integrate power, control, compute, and fabric interfaces into a single chassis for better overall performance, signal integrity, and thermal performance. It provides rapid scalability and flexibility in a simplified design, allowing it to be quickly deployed into data center fleets and easily maintained and scaled. Combined with other in-house innovations like our<\/span> <a href=\"https:\/\/engineering.fb.com\/2022\/10\/18\/open-source\/ocp-summit-2022-grand-teton\/\" target=\"_blank\" rel=\"noopener\"><span>Open Rack<\/span><\/a><span> power and rack architecture, Grand Teton allows us to build new clusters in a way that is purpose-built for current and future applications at Meta.<\/span><\/p>\n<p><span>We have been openly designing our GPU hardware platforms beginning with our <\/span><a href=\"https:\/\/engineering.fb.com\/2015\/12\/10\/ml-applications\/facebook-to-open-source-ai-hardware-design\/\" target=\"_blank\" rel=\"noopener\"><span>Big Sur platform in 2015<\/span><\/a><span>.<\/span><\/p>\n<h3><span>Storage<\/span><\/h3>\n<p><span>Storage plays an important role in AI training, and yet is one of the least talked-about aspects. 
As GenAI training jobs become more multimodal over time, consuming large amounts of image, video, and text data, the need for data storage grows rapidly. The need to fit all that data storage into a performant, yet power-efficient footprint doesn\u2019t go away though, which makes the problem more interesting.<\/span><\/p>\n<p><span>Our storage deployment addresses the data and checkpointing needs of the AI clusters via a home-grown Linux Filesystem in Userspace (FUSE) API backed by a version of Meta\u2019s <\/span><a href=\"https:\/\/www.usenix.org\/conference\/fast21\/presentation\/pan\" target=\"_blank\" rel=\"noopener\"><span>\u2018Tectonic\u2019 distributed storage solution<\/span><\/a><span> optimized for Flash media. This solution enables thousands of GPUs to save and load checkpoints in a synchronized fashion (a <\/span><a href=\"https:\/\/en.wikipedia.org\/wiki\/Thundering_herd_problem#:~:text=In%20computer%20science%2C%20the%20thundering,able%20to%20handle%20the%20event.\"><span>challenge<\/span><\/a><span> for any storage solution) while also providing the flexible, high-throughput, exabyte-scale storage required for data loading.<\/span><\/p>\n<p><span>We have also partnered with <\/span><a href=\"https:\/\/hammerspace.com\/software\/\" target=\"_blank\" rel=\"noopener\"><span>Hammerspace<\/span><\/a><span> to co-develop and land a parallel network file system (NFS) deployment to meet the developer experience requirements for this AI cluster. Among other benefits, Hammerspace enables engineers to perform interactive debugging for jobs using thousands of GPUs as code changes are immediately accessible to all nodes within the environment. 
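Checkpointing thousands of ranks at once is exactly the thundering-herd pattern noted above. As a simplified, hypothetical sketch of one common mitigation (not Meta's actual implementation; the paths and time window are invented for illustration), each rank can write its own checkpoint shard and deterministically stagger its write within a small window so the storage system sees a spread-out burst rather than a single spike:

```python
import hashlib

def shard_path(job_id: str, step: int, rank: int) -> str:
    """Per-rank shard path so no two ranks contend on one object."""
    return f"/checkpoints/{job_id}/step_{step:08d}/rank_{rank:05d}.pt"

def write_delay_s(rank: int, window_s: float = 2.0) -> float:
    """Deterministic stagger in [0, window_s): hash the rank so each
    process computes its own delay with no coordination required."""
    h = int.from_bytes(hashlib.sha256(str(rank).encode()).digest()[:8], "big")
    return (h / 2**64) * window_s

# Illustrative only: 4 ranks checkpointing step 1000 of a hypothetical job.
for rank in range(4):
    print(shard_path("example-job", 1000, rank), round(write_delay_s(rank), 3))
```

Because the delay is a pure function of the rank, no extra communication is needed at checkpoint time, and the schedule is reproducible across restarts.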
Together, our Tectonic distributed storage solution and Hammerspace enable fast iteration velocity without compromising on scale.<\/span><\/p>\n<p><span>The storage deployments in our GenAI clusters, both Tectonic- and Hammerspace-backed, are based on the <\/span><a href=\"https:\/\/www.opencompute.org\/documents\/e1s-expansion-2ou-1s-server-design-specification-pdf\"><span>YV3 Sierra Point server platform<\/span><\/a><span>, upgraded with the latest high-capacity E1.S SSDs we can procure in the market today. Aside from the higher SSD capacity, the number of servers per rack was customized to achieve the right balance of throughput capacity per server, rack count reduction, and associated power efficiency. Utilizing the OCP servers as Lego-like building blocks, our storage layer is able to flexibly scale to future requirements in this cluster as well as in future, bigger AI clusters, while being fault-tolerant to day-to-day infrastructure maintenance operations.<\/span><\/p>\n<h3><span>Performance<\/span><\/h3>\n<p><span>One of the principles we have in building our large-scale AI clusters is to maximize performance and ease of use simultaneously without compromising one for the other. This is an important principle in creating best-in-class AI models.\u00a0<\/span><\/p>\n<p><span>As we push the limits of AI systems, the best way we can test our ability to scale up our designs is to simply build a system, optimize it, and actually test it (while simulators help, they only go so far). In this design journey, we compared the performance seen in our small clusters with that of our large clusters to see where our bottlenecks are. 
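For readers reproducing this kind of comparison, collective benchmarks such as NVIDIA's nccl-tests normalize AllGather results into a "bus bandwidth" figure so that numbers are comparable across cluster sizes. A minimal sketch of that standard convention follows; the sample numbers are illustrative, not measurements from these clusters, and this is not necessarily the exact normalization used in our figure:

```python
def allgather_busbw_gbps(bytes_per_rank: int, n_ranks: int, seconds: float) -> float:
    """Bus bandwidth for an AllGather, per the nccl-tests convention.

    After the collective, each rank holds n_ranks * bytes_per_rank of data.
    Algorithm bandwidth is total_bytes / time; bus bandwidth scales that by
    (n - 1) / n to reflect the data each rank actually moves over its links.
    """
    total_bytes = bytes_per_rank * n_ranks
    algbw = total_bytes / seconds            # bytes per second
    busbw = algbw * (n_ranks - 1) / n_ranks  # link-level bandwidth
    return busbw / 1e9                       # GB/s

# Illustrative: 128 MiB per rank across 8 ranks completing in 25 ms.
print(round(allgather_busbw_gbps(128 * 2**20, 8, 0.025), 1))  # -> 37.6
```

Dividing the resulting bus bandwidth by the link's roofline gives the kind of 0-100 utilization scale plotted in the graph.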
In the graph below, AllGather collective performance is shown (as normalized bandwidth on a 0-100 scale) when a large number of GPUs are communicating with each other at message sizes where roofline performance is expected.\u00a0<\/span><\/p>\n<p><span>Our out-of-the-box performance for large clusters was initially poor and inconsistent compared to our optimized small-cluster performance. To address this, we made several changes to how our internal job scheduler schedules jobs with network topology awareness, which resulted in latency benefits and minimized the amount of traffic going to the upper layers of the network. We also optimized our network routing strategy in combination with NVIDIA Collective Communications Library (NCCL) changes to achieve optimal network utilization. These changes pushed our large clusters to the same great, predictable performance as our small clusters.<\/span><\/p>\n<p>In the figure we see that small cluster performance (overall communication bandwidth and utilization) reaches 90%+ out of the box, while unoptimized large cluster performance has very poor utilization, ranging from 10% to 90%. After we optimize the full system (software, network, etc.), we see large cluster performance return to the ideal 90%+ range.<\/p>\n<p><span>In addition to software changes targeting our internal infrastructure, we worked closely with teams authoring training frameworks and models to adapt to our evolving infrastructure. For example, NVIDIA H100 GPUs open the possibility of leveraging new data types such as 8-bit floating point (FP8) for training. Fully utilizing larger clusters required investments in additional parallelization techniques, while new storage solutions provided opportunities to highly optimize checkpointing across thousands of ranks so that it runs in hundreds of milliseconds.<\/span><\/p>\n<p><span>We also recognize debuggability as one of the major challenges in large-scale training. 
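One building block behind such debugging tools, sketched here as a hypothetical illustration rather than the desync-debug implementation itself, is comparing how far each rank has progressed through its collectives: when one rank stops advancing, every other rank blocks waiting for it, so the lagging rank is the likely culprit:

```python
def find_stragglers(last_collective_seq: dict[int, int], lag: int = 1) -> list[int]:
    """Given each rank's last completed collective sequence number,
    return the ranks lagging more than `lag` behind the leader --
    the likely culprits when a training job stalls."""
    leader = max(last_collective_seq.values())
    return sorted(r for r, seq in last_collective_seq.items()
                  if leader - seq > lag)

# Illustrative: rank 2 never entered collective #1040, so ranks 0, 1,
# and 3 are blocked waiting for it.
print(find_stragglers({0: 1042, 1: 1042, 2: 1039, 3: 1042}))  # -> [2]
```

A small `lag` tolerance avoids flagging ranks that are merely mid-collective; anything further behind than that is worth inspecting first.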
Identifying a problematic GPU that is stalling an entire training job becomes very difficult at a large scale. We\u2019re building tools such as desync debug, or a distributed collective flight recorder, to expose the details of distributed training and help identify issues much faster and more easily.<\/span><\/p>\n<p><span>Finally, we\u2019re continuing to evolve PyTorch, the foundational AI framework powering our AI workloads, to make it ready for training on tens, or even hundreds, of thousands of GPUs. We have identified multiple bottlenecks in process group initialization and reduced the startup time from sometimes hours down to minutes.\u00a0<\/span><\/p>\n<h2><span>Commitment to open AI innovation<\/span><\/h2>\n<p><span>Meta maintains its commitment to open innovation in AI software and hardware. We believe open-source hardware and software will always be a valuable tool to help the industry solve problems at large scale.<\/span><\/p>\n<p><span>Today, we continue to support<\/span><a href=\"https:\/\/engineering.fb.com\/2022\/10\/18\/open-source\/ocp-summit-2022-grand-teton\/\" target=\"_blank\" rel=\"noopener\"> <span>open hardware innovation<\/span><\/a><span> as a founding member of OCP, where we make designs like Grand Teton and Open Rack available to the OCP community. We also continue to be the largest and primary contributor to <\/span><a href=\"https:\/\/pytorch.org\/\" target=\"_blank\" rel=\"noopener\"><span>PyTorch<\/span><\/a><span>, the AI software framework that is powering a large chunk of the industry.<\/span><\/p>\n<p><span>We also continue to be committed to open innovation in the AI research community. 
We\u2019ve launched the<\/span><a href=\"https:\/\/ai.meta.com\/llama\/open-innovation-ai-research-community\" target=\"_blank\" rel=\"noopener\"> <span>Open Innovation AI Research Community<\/span><\/a><span>, a partnership program for academic researchers to deepen our understanding of how to responsibly develop and share AI technologies, with a particular focus on LLMs.<\/span><\/p>\n<p><span>An open approach to AI is not new for Meta. We\u2019ve also launched the <\/span><a href=\"https:\/\/ai.meta.com\/blog\/ai-alliance\/\" target=\"_blank\" rel=\"noopener\"><span>AI Alliance<\/span><\/a><span>, a group of leading organizations across the AI industry focused on accelerating responsible innovation in AI within an open community. Our AI efforts are built on a philosophy of open science and cross-collaboration. An open ecosystem brings transparency, scrutiny, and trust to AI development and leads to innovations that everyone can benefit from, built with safety and responsibility top of mind.\u00a0<\/span><\/p>\n<h2><span>The future of Meta\u2019s AI infrastructure<\/span><\/h2>\n<p><span>These two AI training cluster designs are a part of our larger roadmap for the future of AI. By the end of 2024, we\u2019re aiming to continue growing our infrastructure build-out, which will include 350,000 NVIDIA H100s as part of a portfolio that will feature compute power equivalent to nearly 600,000 H100s.<\/span><\/p>\n<p><span>As we look to the future, we recognize that what worked yesterday or today may not be sufficient for tomorrow\u2019s needs. That\u2019s why we are constantly evaluating and improving every aspect of our infrastructure, from the physical and virtual layers to the software layer and beyond. 
Our goal is to create systems that are flexible and reliable enough to support fast-evolving new models and research.<\/span><\/p>\n<p>The post <a href=\"https:\/\/engineering.fb.com\/2024\/03\/12\/data-center-engineering\/building-metas-genai-infrastructure\/\">Building Meta\u2019s GenAI Infrastructure<\/a> appeared first on <a href=\"https:\/\/engineering.fb.com\/\">Engineering at Meta<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Marking a major investment in Meta\u2019s AI future, we are announcing two 24k GPU clusters. We are sharing details on the hardware, network, storage, design, performance, and software that help us extract high throughput and reliability for various AI workloads. We use this cluster design for Llama 3 training. We are strongly committed to open&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2024\/03\/12\/building-metas-genai-infrastructure\/\">Continue reading <span class=\"screen-reader-text\">Building Meta\u2019s GenAI Infrastructure<\/span><\/a><\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-836","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":787,"url":"https:\/\/fde.cat\/index.php\/2023\/11\/15\/watch-metas-engineers-on-building-network-infrastructure-for-ai\/","url_meta":{"origin":836,"position":0},"title":"Watch: Meta\u2019s engineers on building network infrastructure for AI","date":"November 15, 2023","format":false,"excerpt":"Meta is building for the future of AI at every level \u2013 from hardware like MTIA v1, Meta\u2019s first-generation AI inference accelerator to publicly released models like Llama 2, Meta\u2019s next-generation large language model, as well as new 
generative AI (GenAI) tools like Code Llama. Delivering next-generation AI products and\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":759,"url":"https:\/\/fde.cat\/index.php\/2023\/09\/07\/arcadia-an-end-to-end-ai-system-performance-simulator\/","url_meta":{"origin":836,"position":1},"title":"Arcadia: An end-to-end AI system performance simulator","date":"September 7, 2023","format":false,"excerpt":"We\u2019re introducing Arcadia, Meta\u2019s unified system that simulates the compute, memory, and network performance of AI training clusters. Extracting maximum performance from an AI cluster and increasing overall efficiency warrants a multi-input system that accounts for various hardware and software parameters across compute, storage, and network collectively. Arcadia gives Meta\u2019s\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":773,"url":"https:\/\/fde.cat\/index.php\/2023\/10\/18\/how-meta-is-creating-custom-silicon-for-ai\/","url_meta":{"origin":836,"position":2},"title":"How Meta is creating custom silicon for AI","date":"October 18, 2023","format":false,"excerpt":"With the recent launches of MTIA v1,\u00a0 Meta\u2019s first-generation AI inference accelerator, and Llama 2,\u00a0 the next generation of Meta\u2019s publicly available large language model, it\u2019s clear that Meta is focused on advancing AI for a more connected world. 
Fueling the success of these products are world-class infrastructure teams, including\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":641,"url":"https:\/\/fde.cat\/index.php\/2022\/10\/18\/ocp-summit-2022-open-hardware-for-ai-infrastructure\/","url_meta":{"origin":836,"position":3},"title":"OCP Summit 2022: Open hardware for AI infrastructure","date":"October 18, 2022","format":false,"excerpt":"At OCP Summit 2022, we\u2019re announcing Grand Teton, our next-generation platform for AI at scale that we\u2019ll contribute to the OCP community. We\u2019re also sharing new innovations designed to support data centers as they advance to support new AI technologies: A new, more efficient version of Open Rack. Our Air-Assisted\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":879,"url":"https:\/\/fde.cat\/index.php\/2024\/06\/12\/how-meta-trains-large-language-models-at-scale\/","url_meta":{"origin":836,"position":4},"title":"How Meta trains large language models at scale","date":"June 12, 2024","format":false,"excerpt":"As we continue to focus our AI research and development on solving increasingly complex problems, one of the most significant and challenging shifts we\u2019ve experienced is the sheer scale of computation required to train large language models (LLMs). 
Traditionally, our AI model training has involved a training massive number of\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":757,"url":"https:\/\/fde.cat\/index.php\/2023\/09\/07\/threads-the-inside-story-of-metas-newest-social-app\/","url_meta":{"origin":836,"position":5},"title":"Threads: The inside story of Meta\u2019s newest social app","date":"September 7, 2023","format":false,"excerpt":"Earlier this year, a small team of engineers at Meta started working on an idea for a new app. It would have all the features people expect from a text-based conversations app, but with one very key, distinctive goal \u2013 being an app that would allow people to share their\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/836","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=836"}],"version-history":[{"count":0,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/836\/revisions"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=836"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=836"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=836"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}