{"id":787,"date":"2023-11-15T17:00:16","date_gmt":"2023-11-15T17:00:16","guid":{"rendered":"https:\/\/fde.cat\/index.php\/2023\/11\/15\/watch-metas-engineers-on-building-network-infrastructure-for-ai\/"},"modified":"2023-11-15T17:00:16","modified_gmt":"2023-11-15T17:00:16","slug":"watch-metas-engineers-on-building-network-infrastructure-for-ai","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2023\/11\/15\/watch-metas-engineers-on-building-network-infrastructure-for-ai\/","title":{"rendered":"Watch: Meta\u2019s engineers on building network infrastructure for AI"},"content":{"rendered":"<p><span>Meta is building for the future of AI at every level \u2013 from hardware like <\/span><a href=\"https:\/\/ai.meta.com\/blog\/meta-training-inference-accelerator-AI-MTIA\/\"><span>MTIA v1<\/span><\/a><span>, Meta\u2019s first-generation AI inference accelerator, to publicly released models like <\/span><a href=\"https:\/\/ai.meta.com\/llama\/\"><span>Llama 2<\/span><\/a><span>, Meta\u2019s next-generation large language model, as well as new generative AI (GenAI) tools like <\/span><a href=\"https:\/\/about.fb.com\/news\/2023\/08\/code-llama-ai-for-coding\/\"><span>Code Llama<\/span><\/a><span>.<\/span><\/p>\n<p><span>Delivering next-generation AI products and services at Meta\u2019s scale also requires a next-generation infrastructure.<\/span><\/p>\n<p><span>The 2023 edition of <a href=\"https:\/\/atscaleconference.com\/\" target=\"_blank\" rel=\"noopener\">Networking at Scale<\/a> focused on how Meta\u2019s engineers and researchers have spent the last several years designing and operating the network infrastructure for Meta\u2019s AI workloads, including our numerous ranking and recommendation workloads and the immense GenAI models. The talks cover a wide range of topics, including physical and logical network design, custom routing and load balancing solutions, performance tuning\/debugging\/benchmarking, and workload simulation and planning. 
We also look ahead to the requirements of GenAI models coming in the next several years.\u00a0<\/span><\/p>\n<h2><span>Networking for GenAI Training and Inference Clusters<\/span><\/h2>\n<p><span>Jongsoo Park, Research Scientist, Infrastructure<\/span><span><br \/>\n<\/span><span>Petr Lapukhov, Network Engineer<\/span><\/p>\n<p><span>Developing new GenAI technologies and incorporating them into product features is a top priority at Meta. But the sheer scale and complexity of GenAI models mean new challenges for Meta\u2019s network infrastructure.<\/span><\/p>\n<p><span>Jongsoo Park and Petr Lapukhov discuss the unique requirements of new large language models, and how Meta\u2019s infrastructure is changing for the new GenAI landscape.<\/span><\/p>\n<div class=\"fb-video\"><\/div>\n<h2><span>Meta\u2019s Network Journey to Enable AI<\/span><\/h2>\n<p><span>Hany Morsy, Network Engineer<\/span><span><br \/>\n<\/span><span>Susana Contrera, Network Engineer<\/span><\/p>\n<p><span>Over the years, Meta\u2019s AI infrastructure has transitioned from CPU-based to GPU-based training due to growing AI workloads. As a result, we have deployed large-scale, distributed, network-interconnected systems to support these workloads.<\/span><\/p>\n<p><span>Today, our training clusters use a RoCE-based network fabric with a Clos topology, where leaf switches connect to GPU hosts and spine switches provide scale-out connectivity to GPUs across the cluster.<\/span><\/p>\n<p><span>Hany Morsy and Susana Contrera delve into how Meta\u2019s network builds have evolved to support the needs of AI services. 
Along the way, they share challenges encountered, new solutions that were implemented, and the strategic considerations that have gone into building Meta\u2019s high-performance, efficient network fabric for AI workloads.<\/span><\/p>\n<div class=\"fb-video\"><\/div>\n<h2><span>Scaling RoCE Networks for AI Training<\/span><\/h2>\n<p><span>Adi Gangidi, Production Network Engineer<\/span><\/p>\n<p><span>Adi Gangidi provides an overview of Meta\u2019s RDMA deployment based on RoCEv2 transport for supporting our production AI training infrastructure. He sheds light on how Meta\u2019s infrastructure is designed to maximize both the raw performance and the consistency that are fundamental for AI workloads.<\/span><\/p>\n<p><span>The talk also covers challenges in the routing, transport, and hardware layers that were solved along the way to scale Meta\u2019s infrastructure, as well as opportunities for further progress over the next few years.<\/span><\/p>\n<div class=\"fb-video\"><\/div>\n<h2><span>Traffic Engineering for AI Training Networks<\/span><\/h2>\n<p><span>Shuqiang Zhang, Software Engineer<\/span><span><br \/>\n<\/span><span>Jingyi Yang, Software Engineer<\/span><\/p>\n<p><span>Meta has been operating RoCE-based distributed training clusters to serve its internal AI training workloads since 2020. But those early days saw challenges around maintaining consistent job performance.<\/span><\/p>\n<p><span>Shuqiang Zhang and Jingyi Yang discuss centralized traffic engineering, one of Meta\u2019s solutions to this challenge, which dynamically places traffic over all available paths in a load-balanced manner. 
They go over the centralized traffic engineering solution\u2019s design, development, evaluation, and operational experience.<\/span><\/p>\n<div class=\"fb-video\"><\/div>\n<h2><span>Network Observability for AI\/HPC Training Workflows<\/span><\/h2>\n<p><span>Shengbao Zheng, Research Scientist<\/span><\/p>\n<p><span>Having high-performance and reliable collective communication over Meta\u2019s AI-Zone RDMA network is foundational for enabling and scaling Meta\u2019s AI training and inference workloads. To facilitate this, it\u2019s necessary to capture top-down observability from workload to network for collective communication \u2013 this allows us to attribute performance regressions and training failures to the backend network when appropriate.<\/span><\/p>\n<p><span>Meta has introduced two important tools for this: ROCET and the PARAM benchmark, along with the <\/span><a href=\"https:\/\/engineering.fb.com\/2023\/09\/07\/networking-traffic\/chakra-execution-traces-benchmarking-network-performance-optimization\/\"><span>Chakra<\/span><\/a><span> ecosystem. We built ROCET to associate jobs with RDMA network metrics and provide analysis on top of them. In addition, we built the PARAM benchmark to allow analyzing and tuning collective communication operations through workload traces. We recently shared these systems with the community via <\/span><a href=\"https:\/\/engineering.fb.com\/2023\/09\/07\/networking-traffic\/chakra-execution-traces-benchmarking-network-performance-optimization\/\"><span>Chakra<\/span><\/a><span>, which enables co-designing efficient distributed ML systems. 
In this talk, Shengbao Zheng discusses the design and use cases for these tools.<\/span><\/p>\n<div class=\"fb-video\"><\/div>\n<h2><span>Arcadia: End-to-end AI System Performance Simulator: Fostering data-driven decision-making processes and promoting the future evolution of AI systems<\/span><\/h2>\n<p><span>Zhaodong Wang, Research Scientist<\/span><span><br \/>\n<\/span><span>Satyajeet Singh Ahuja, Networking Modeling and Optimization Engineer<\/span><\/p>\n<p><a href=\"https:\/\/engineering.fb.com\/2023\/09\/07\/data-infrastructure\/arcadia-end-to-end-ai-system-performance-simulator\/\"><span>Arcadia<\/span><\/a><span> is a unified system designed to simulate the compute, memory, and network performance of AI training clusters. By providing a multi-disciplinary performance analysis framework, Arcadia aims to facilitate the design and optimization of various system levels, including application, network, and hardware.<\/span><\/p>\n<p><span>With Arcadia, researchers and practitioners can gain valuable insights into the performance of future AI models and workloads on specific infrastructures and make data-driven decisions around how models and hardware will evolve.<\/span><\/p>\n<p><span>Arcadia allows Meta\u2019s engineers to simulate the performance impact of scheduled operational tasks on AI models that are running in production and helps them make job-aware decisions during day-to-day operational activity.<\/span><\/p>\n<p><span>Zhaodong Wang and Satyajeet Singh Ahuja discuss Arcadia\u2019s capabilities and its potential impact in advancing the field of AI systems and infrastructure.<\/span><\/p>\n<div class=\"fb-video\"><\/div>\n<p>The post <a href=\"https:\/\/engineering.fb.com\/2023\/11\/15\/networking-traffic\/watch-metas-engineers-on-building-network-infrastructure-for-ai\/\">Watch: Meta\u2019s engineers on building network infrastructure for AI<\/a> appeared first on <a href=\"https:\/\/engineering.fb.com\/\">Engineering at 
Meta<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Meta is building for the future of AI at every level \u2013 from hardware like MTIA v1, Meta\u2019s first-generation AI inference accelerator to publicly released models like Llama 2, Meta\u2019s next-generation large language model, as well as new generative AI (GenAI) tools like Code Llama. Delivering next-generation AI products and services at Meta\u2019s scale also&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2023\/11\/15\/watch-metas-engineers-on-building-network-infrastructure-for-ai\/\">Continue reading <span class=\"screen-reader-text\">Watch: Meta\u2019s engineers on building network infrastructure for AI<\/span><\/a><\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-787","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":836,"url":"https:\/\/fde.cat\/index.php\/2024\/03\/12\/building-metas-genai-infrastructure\/","url_meta":{"origin":787,"position":0},"title":"Building Meta\u2019s GenAI Infrastructure","date":"March 12, 2024","format":false,"excerpt":"Marking a major investment in Meta\u2019s AI future, we are announcing two 24k GPU clusters. We are sharing details on the hardware, network, storage, design, performance, and software that help us extract high throughput and reliability for various AI workloads. We use this cluster design for Llama 3 training. 
We\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":773,"url":"https:\/\/fde.cat\/index.php\/2023\/10\/18\/how-meta-is-creating-custom-silicon-for-ai\/","url_meta":{"origin":787,"position":1},"title":"How Meta is creating custom silicon for AI","date":"October 18, 2023","format":false,"excerpt":"With the recent launches of MTIA v1,\u00a0 Meta\u2019s first-generation AI inference accelerator, and Llama 2,\u00a0 the next generation of Meta\u2019s publicly available large language model, it\u2019s clear that Meta is focused on advancing AI for a more connected world. Fueling the success of these products are world-class infrastructure teams, including\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":670,"url":"https:\/\/fde.cat\/index.php\/2023\/01\/27\/watch-metas-engineers-discuss-optimizing-large-scale-networks\/","url_meta":{"origin":787,"position":2},"title":"Watch Meta\u2019s engineers discuss optimizing large-scale networks","date":"January 27, 2023","format":false,"excerpt":"Managing network solutions amidst a growing scale inherently brings challenges around performance, deployment, and operational complexities.\u00a0 At Meta, we\u2019ve found that these challenges broadly fall into three themes: 1.) 
\u00a0 Data center networking: Over the past decade, on the physical front, we have seen a rise in vendor-specific hardware that\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":852,"url":"https:\/\/fde.cat\/index.php\/2024\/04\/11\/building-new-custom-silicon-for-metas-ai-workloads\/","url_meta":{"origin":787,"position":3},"title":"Building new custom silicon for Meta\u2019s AI workloads","date":"April 11, 2024","format":false,"excerpt":"The post Building new custom silicon for Meta\u2019s AI workloads appeared first on Engineering at Meta. Engineering at Meta","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":501,"url":"https:\/\/fde.cat\/index.php\/2021\/11\/09\/ocp-summit-2021-open-networking-hardware-lays-the-groundwork-for-the-metaverse\/","url_meta":{"origin":787,"position":4},"title":"OCP Summit 2021: Open networking hardware lays the groundwork for the metaverse","date":"November 9, 2021","format":false,"excerpt":"Open infrastructure technologies and networking hardware will play an important role as we build new technologies for the metaverse, where billions of people will someday come together in virtual spaces. As we head toward the next major computing platform with a continued spirit of embracing openness and disaggregation, we\u2019re announcing\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":795,"url":"https:\/\/fde.cat\/index.php\/2023\/11\/21\/writing-and-linting-python-at-scale\/","url_meta":{"origin":787,"position":5},"title":"Writing and linting Python at scale","date":"November 21, 2023","format":false,"excerpt":"Python plays a big part at Meta. It powers Instagram\u2019s backend and plays an important role in our configuration systems, as well as much of our AI work. 
Meta even made contributions to Python 3.12, the latest version of Python. On this episode of the\u00a0Meta Tech Podcast, Meta engineer Pascal\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/787","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=787"}],"version-history":[{"count":0,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/787\/revisions"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=787"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=787"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=787"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}