{"id":879,"date":"2024-06-12T22:45:11","date_gmt":"2024-06-12T22:45:11","guid":{"rendered":"https:\/\/fde.cat\/index.php\/2024\/06\/12\/how-meta-trains-large-language-models-at-scale\/"},"modified":"2024-06-12T22:45:11","modified_gmt":"2024-06-12T22:45:11","slug":"how-meta-trains-large-language-models-at-scale","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2024\/06\/12\/how-meta-trains-large-language-models-at-scale\/","title":{"rendered":"How Meta trains large language models at scale"},"content":{"rendered":"<p><span>As we continue to focus our AI research and development on solving increasingly complex problems, one of the most significant and challenging shifts we\u2019ve experienced is the sheer scale of computation required to train large language models (LLMs).<\/span><\/p>\n<p><span>Traditionally, our AI model training has involved training a massive number of models that each required a comparatively small number of GPUs. This was the case for our recommendation models (e.g., our feed and ranking models) that would ingest vast amounts of information to make accurate recommendations that power most of our products.<\/span><\/p>\n\n<p><span>With the advent of generative AI (GenAI), we\u2019ve seen a shift towards fewer jobs, but incredibly large ones. Supporting GenAI at scale has meant rethinking how our software, hardware, and network infrastructure come together.<\/span><\/p>\n<h2><span>The challenges of large-scale model training<\/span><\/h2>\n\n<p><span>As we increase the number of GPUs in a job, the likelihood of an interruption due to a hardware failure also increases. In addition, all of these GPUs still need to communicate on the same high-speed fabric to perform optimally. This underscores the importance of four factors:<\/span><\/p>\n<p>Hardware reliability<span>: We need to minimize the chances of a hardware failure interrupting a training job. 
This involves rigorous testing and quality control measures, and automation to quickly detect and remediate issues.<\/span><br \/>\nFast recovery on failure<span>: Despite our best efforts, hardware failures can and do occur. When they do, we need to be able to recover quickly. This involves reducing re-scheduling overhead and enabling fast training re-initialization.<\/span><br \/>\nEfficient preservation of the training state<span>: In the event of a failure, we need to be able to pick up where we left off. This means we need to regularly checkpoint our training state and efficiently store and retrieve training data.<\/span><br \/>\nOptimal connectivity between GPUs:<span> Large-scale model training involves transferring vast amounts of data between GPUs in a synchronized fashion. A slow data exchange between a subset of GPUs can compound and slow down the whole job. Solving this problem requires a robust and high-speed network infrastructure as well as efficient data transfer protocols and algorithms.\u00a0<\/span><\/p>\n<h2><span>Innovating across the infrastructure stack<\/span><\/h2>\n<p><span>Perfecting every layer of our infrastructure stack is important due to the demands of GenAI at scale. This has encompassed developments in a wide range of areas.<\/span><\/p>\n<h3><span>Training software<\/span><\/h3>\n<p><span>We enable researchers to use <\/span><a href=\"https:\/\/pytorch.org\/blog\/training-production-ai-models\/\"><span>PyTorch<\/span><\/a><span> and other new open source developments, facilitating extremely fast research-to-production development. This includes <\/span><span>developing new algorithms and techniques for efficient large-scale training and integrating new software tools and frameworks into our infrastructure.<\/span><\/p>\n<h3><span>Scheduling<\/span><\/h3>\n<p><span>Efficient scheduling helps ensure that our resources are used optimally. 
This involves <\/span><span>sophisticated algorithms that can allocate resources based on the needs of different jobs and dynamic scheduling to adapt to changing workloads.<\/span><\/p>\n<h3><span>Hardware\u00a0<\/span><\/h3>\n<p><span>We need high-performance hardware to handle the computational demands of large-scale model training. Beyond size and scale, many hardware configurations and attributes need to be optimized for GenAI. Given that hardware development times are traditionally long, we had to adapt existing hardware, and to this end we explored various dimensions including power, HBM capacity and speed, and I\/O.\u00a0<\/span><\/p>\n<p><span>We also pivoted by modifying the <\/span><a href=\"https:\/\/engineering.fb.com\/2022\/10\/18\/open-source\/ocp-summit-2022-grand-teton\/\"><span>Grand Teton<\/span><\/a><span> platform to use NVIDIA H100 GPUs, increasing the TDP of the GPUs to 700W and moving to HBM3. Since we did not have time to change the cooling infrastructure, we had to remain in an air-cooled environment. The mechanical and thermal designs had to change to accommodate this, and that triggered a validation cycle to support a large-scale deployment.\u00a0<\/span><\/p>\n<p><span>All of these hardware-related changes were challenging because we had to find a solution that fit within the existing resource constraints, with very little freedom to make changes while meeting a tight schedule.<\/span><\/p>\n<h3><span>Data center deployment<\/span><\/h3>\n<p><span>Once we\u2019ve chosen a GPU and system, the task of placing them in a data center for optimal usage of resources (power, cooling, networking, etc.) requires revisiting trade-offs made for other types of workloads. Data center power and cooling infrastructure cannot be changed quickly (or easily) and we had to find an optimal layout that allowed maximum compute capability within a data hall. 
This required relocating supporting services such as readers out of the data hall and packing in as many GPU racks as possible, maximizing power and network capability to achieve the highest compute density with the largest network cluster.\u00a0<\/span><\/p>\n<h3><span>Reliability\u00a0<\/span><\/h3>\n<p><span>We need to plan for detection and remediation to minimize downtime during hardware failures. The number of failures scales with the size of the cluster, and having a job that spans the cluster makes it necessary to keep adequate spare capacity to restart the job as soon as possible. In addition, we monitor failures and can sometimes take preventive measures to mitigate downtime.\u00a0<\/span><\/p>\n\n<p><span>Some of the most frequent failure modes we have observed are:<\/span><\/p>\n<p>GPUs falling off:<span> In this case, GPUs are not detected by the host on PCIe. There are several reasons for this failure, but this failure mode is seen more often early in a server\u2019s life and settles as the server ages.<\/span><br \/>\nDRAM &amp; SRAM UCE:<span> Uncorrectable errors are common in memories, and we monitor for repeat offenders, track error rates against vendor thresholds, and initiate RMAs when those thresholds are exceeded.<\/span><br \/>\nHW network cable:<span> In the general category of unreachable servers, these failures are also seen most often in the early life of the server.\u00a0<\/span><\/p>\n<h3><span>Network<\/span><\/h3>\n<p><span>Large-scale model training involves transferring vast amounts of data quickly between GPUs. This requires robust and high-speed network infrastructure as well as efficient data transfer protocols and algorithms.\u00a0<\/span><\/p>\n<p><span>There are two leading choices in the industry that fit these requirements: RoCE and InfiniBand fabrics. Both of these options had tradeoffs. On the one hand, Meta had built RoCE clusters for the past four years, but the largest of those clusters only supported 4K GPUs. 
We needed significantly larger RoCE clusters. On the other hand, Meta had built research clusters with InfiniBand as <\/span><a href=\"https:\/\/ai.meta.com\/blog\/ai-rsc\/\"><span>large as 16K GPUs<\/span><\/a><span>. However, those clusters were <\/span><span>not<\/span><span> tightly integrated into Meta\u2019s production environment, nor were they built for the latest generation of GPUs\/networking. This made choosing which fabric to build with a difficult decision.<\/span><\/p>\n<p><span>So we decided to build both: <\/span><a href=\"https:\/\/engineering.fb.com\/2024\/03\/12\/data-center-engineering\/building-metas-genai-infrastructure\/\"><span>two 24k clusters<\/span><\/a><span>, one with RoCE and another with InfiniBand. Our intent was to build and learn from the operational experience. These learnings will inform the future direction of GenAI fabrics. We optimized the RoCE cluster for quick build time, and the InfiniBand cluster for full-bisection bandwidth. We used both InfiniBand and RoCE clusters to train <\/span><a href=\"https:\/\/ai.meta.com\/blog\/meta-llama-3\/\"><span>Llama 3<\/span><\/a><span>, with the RoCE cluster used for training the largest model. Despite the underlying network technology differences between these clusters, we were able to tune both of them to provide equivalent performance for these large GenAI workloads.<\/span><\/p>\n\n<p><span>We optimized three aspects of the overall stack to make network communication for GenAI models performant on both clusters:<\/span><\/p>\n<p><span>We assigned communication patterns resulting from different model, data and pipeline parallelisms to different layers of the network topology so that the network capabilities were effectively exploited.<\/span><br \/>\n<span>We implemented collective communication patterns with network topology awareness so that they would be less latency-sensitive. 
We do this by replacing the default implementation of collectives with custom algorithms such as recursive doubling or halving instead of conventional algorithms like rings.<\/span><br \/>\n<span>Just like ranking jobs, GenAI jobs produce additional fat flows that make it hard to distribute traffic across all possible network paths. This required us to further invest in network load balancing and routing to achieve an optimal distribution of traffic across network resources.<\/span><\/p>\n<p><span>We spoke in depth about our <\/span><a href=\"https:\/\/atscaleconference.com\/videos\/scaling-roce-networks-for-ai-training\/\"><span>RoCE load-balancing techniques<\/span><\/a><span> at <\/span><a href=\"https:\/\/atscaleconference.com\/videos\/scaling-roce-networks-for-ai-training\/\"><span>Networking @Scale 2023<\/span><\/a><span>.<\/span><\/p>\n\n<h3><span>Storage<\/span><\/h3>\n<p><span>We need efficient data-storage solutions to store the vast amounts of data used in model training. This involves investing in high-capacity and high-speed storage technologies and developing new data-storage solutions for specific workloads.<\/span><\/p>\n<h2><span>Looking ahead<\/span><\/h2>\n<p><span>In the next few years we will be working with hundreds of thousands of GPUs, handling even larger volumes of data, and dealing with longer distances and latencies. We\u2019ll be adopting new hardware technologies\u2014including newer GPU architectures\u2014and evolving our infrastructure.\u00a0<\/span><\/p>\n<p><span>These challenges will push us to innovate and adapt in ways we can\u2019t fully predict yet. But one thing is certain: We are only at the beginning of this journey. 
As we continue to navigate the evolving landscape of AI, we remain committed to pushing the boundaries of what\u2019s possible.<\/span><\/p>\n<p>The post <a href=\"https:\/\/engineering.fb.com\/2024\/06\/12\/data-infrastructure\/training-large-language-models-at-scale-meta\/\">How Meta trains large language models at scale<\/a> appeared first on <a href=\"https:\/\/engineering.fb.com\/\">Engineering at Meta<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>As we continue to focus our AI research and development on solving increasingly complex problems, one of the most significant and challenging shifts we\u2019ve experienced is the sheer scale of computation required to train large language models (LLMs). Traditionally, our AI model training has involved training a massive number of models that required a comparatively&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2024\/06\/12\/how-meta-trains-large-language-models-at-scale\/\">Continue reading <span class=\"screen-reader-text\">How Meta trains large language models at scale<\/span><\/a><\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-879","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":787,"url":"https:\/\/fde.cat\/index.php\/2023\/11\/15\/watch-metas-engineers-on-building-network-infrastructure-for-ai\/","url_meta":{"origin":879,"position":0},"title":"Watch: Meta\u2019s engineers on building network infrastructure for AI","date":"November 15, 2023","format":false,"excerpt":"Meta is building for the future of AI at every level \u2013 from hardware like MTIA v1, Meta\u2019s first-generation AI inference accelerator to publicly released models like Llama 2, Meta\u2019s 
next-generation large language model, as well as new generative AI (GenAI) tools like Code Llama. Delivering next-generation AI products and\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":836,"url":"https:\/\/fde.cat\/index.php\/2024\/03\/12\/building-metas-genai-infrastructure\/","url_meta":{"origin":879,"position":1},"title":"Building Meta\u2019s GenAI Infrastructure","date":"March 12, 2024","format":false,"excerpt":"Marking a major investment in Meta\u2019s AI future, we are announcing two 24k GPU clusters. We are sharing details on the hardware, network, storage, design, performance, and software that help us extract high throughput and reliability for various AI workloads. We use this cluster design for Llama 3 training. We\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":811,"url":"https:\/\/fde.cat\/index.php\/2024\/01\/11\/how-meta-is-advancing-genai\/","url_meta":{"origin":879,"position":2},"title":"How Meta is advancing GenAI","date":"January 11, 2024","format":false,"excerpt":"What\u2019s going on with generative AI (GenAI) at Meta? And what does the future have in store? 
In this episode of the Meta Tech Podcast, Meta engineer Pascal Hartig (@passy) speaks with\u00a0Devi Parikh, an AI research director at Meta.\u00a0They cover a wide range of topics, including the history and future\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":802,"url":"https:\/\/fde.cat\/index.php\/2023\/12\/12\/developing-the-new-xgen-salesforces-foundational-large-language-models\/","url_meta":{"origin":879,"position":3},"title":"Developing the New XGen: Salesforce\u2019s Foundational Large Language Models","date":"December 12, 2023","format":false,"excerpt":"By Shafiq Rayhan Joty and Scott Nyberg In our \u201cEngineering Energizers\u201d Q&A series, we examine the professional journeys that have shaped Salesforce Engineering leaders. Meet Shafiq Rayhan Joty, a Director at Salesforce AI Research. Shafiq co-leads the development of XGen, a series of groundbreaking large language models (LLMs) of different\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":618,"url":"https:\/\/fde.cat\/index.php\/2022\/08\/10\/scaling-data-ingestion-for-machine-learning-training-at-meta\/","url_meta":{"origin":879,"position":4},"title":"Scaling data ingestion for machine learning training at Meta","date":"August 10, 2022","format":false,"excerpt":"Many of Meta\u2019s products, such as search and language translations, utilize AI models to continuously improve user experiences. As the performance of hardware we use to support training infrastructure increases, we need to scale our data ingestion infrastructure accordingly to handle workloads more efficiently. 
GPUs, which are used for training\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":332,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/fully-sharded-data-parallel-faster-ai-training-with-fewer-gpus\/","url_meta":{"origin":879,"position":5},"title":"Fully Sharded Data Parallel: faster AI training with fewer GPUs","date":"August 31, 2021","format":false,"excerpt":"Training AI models at a large scale isn\u2019t easy. Aside from the need for large amounts of computing power and resources, there is also considerable engineering complexity behind training very large models. At Facebook AI Research (FAIR) Engineering, we have been working on building tools and infrastructure to make training\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/879","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=879"}],"version-history":[{"count":0,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/879\/revisions"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=879"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=879"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=879"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}