{"id":759,"date":"2023-09-07T19:10:13","date_gmt":"2023-09-07T19:10:13","guid":{"rendered":"https:\/\/fde.cat\/index.php\/2023\/09\/07\/arcadia-an-end-to-end-ai-system-performance-simulator\/"},"modified":"2023-09-07T19:10:13","modified_gmt":"2023-09-07T19:10:13","slug":"arcadia-an-end-to-end-ai-system-performance-simulator","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2023\/09\/07\/arcadia-an-end-to-end-ai-system-performance-simulator\/","title":{"rendered":"Arcadia: An end-to-end AI system performance simulator"},"content":{"rendered":"<p><span>We\u2019re introducing Arcadia, Meta\u2019s unified system that simulates the compute, memory, and network performance of AI training clusters.<\/span><br \/>\n<span>Extracting maximum performance from an AI cluster and increasing overall efficiency warrants a multi-input system that accounts for various hardware and software parameters across compute, storage, and network collectively.<\/span><br \/>\n<span>Arcadia gives Meta\u2019s researchers and engineers valuable insights into the performance of AI models and workloads in an AI cluster \u2013 enabling data-driven decision making in the design of AI clusters.<\/span><\/p>\n<p><span>AI plays an important role in the work we do at Meta. We leverage AI to deliver more <\/span><a href=\"https:\/\/engineering.fb.com\/2023\/08\/09\/ml-applications\/scaling-instagram-explore-recommendations-system\/\"><span>personalized experiences and recommendations<\/span><\/a><span> for people across our family of apps. 
We\u2019re also committed to <a href=\"https:\/\/www.metaconnect.com\/en\/home?utm_source=fbengineering&amp;utm_medium=organic\" target=\"_blank\" rel=\"noopener\">advancing the state-of-the-art<\/a> in <\/span><a href=\"https:\/\/ai.meta.com\/blog\/code-llama-large-language-model-coding\/\"><span>generative AI<\/span><\/a><span>, computer vision, new augmented reality (AR) tools, natural language processing (NLP), and other core areas of AI for a wide range of applications.<\/span><\/p>\n<p><span>Delivering on these commitments means maximizing the performance of every GPU within our AI clusters across three performance pillars: compute, memory, and network.<\/span><\/p>\n<p><span>Within these pillars, AI cluster performance can be influenced by multiple factors, including model parameters, workload distribution, job scheduler logic, topology, and hardware specs. But focusing on these pillars in isolation leads to local performance optimization efforts that are unable to tap into the full extent of cluster performance. From an organizational perspective, this further leads to decreased efficiencies because multiple efforts with the same goal of increasing cluster performance aren\u2019t being holistically prioritized. These challenges will only grow as large language models (LLMs) become more prevalent.<\/span><\/p>\n<p><span>We need a systemized source of truth that can simulate various performance factors across compute, storage, and network collectively. That\u2019s where Arcadia, Meta\u2019s end-to-end AI system performance simulator, comes in. Arcadia is designed to create a unified simulation framework that accurately models the performance of compute, memory, and network components within large-scale AI training clusters. 
Using insights from Arcadia, our engineers and developers can make data-driven design decisions for AI clusters and the infrastructure that supports them while they are being developed.<\/span><\/p>\n<h2><span>Challenges of optimizing AI clusters<\/span><\/h2>\n<p><span>When we think about optimizing our AI clusters, there are several factors to take into consideration:<\/span><\/p>\n<p>Our large-scale distributed system<span>: Advancement in any area of AI, whether it is computer vision, speech, or NLP, requires training large and complex models. At Meta, this is facilitated by multiple high-performance computing clusters. For instance, <\/span><a href=\"https:\/\/ai.meta.com\/blog\/ai-rsc\/\"><span>the AI Research SuperCluster<\/span><\/a><span>.<\/span><br \/>\nOur multi-layered system<span>: At Meta, we control the stack from the physical network to applications. This translates into multiple tunable parameters across network, compute, memory, application, and scheduling to achieve the desired model performance. Finding the right set of parameters and inputs for achieving good model performance is an iterative task that can increase training time significantly.<\/span><br \/>\nOur operational workflows<span>: The availability of the underlying infrastructure is a major factor that can influence model training time. For instance, a component failure can trigger a job to be rolled back to a previous checkpoint, and the progress made since then would be lost. At our scale, operating such clusters without operational awareness data can lead to performance losses.<\/span><br \/>\nAI workload characteristics<span>: Our training clusters cater to workloads across multiple use cases that may exhibit different sets of characteristics, ranging from memory- and compute-intensive to latency-sensitive and parallelizable. Keeping track of these characteristics across multiple workloads is already challenging. 
But the problem\u2019s complexity increases by an order of magnitude given the uncertainty around future workloads and the need to predict their characteristics for optimal performance.<\/span><br \/>\nThe need for a common source of truth<span>: Interdisciplinary research efforts, such as AI cluster-performance optimization, span multiple teams across network, compute, and storage. These teams may be working from outdated assumptions about other pillars as they drive their own local optimization efforts. Lack of a holistic approach in such cases often leads to organizational inefficiencies such as decision-making challenges and duplicative efforts.<\/span><\/p>\n<h2><span>The Arcadia system<\/span><\/h2>\n<p><span>Our primary objective with Arcadia is to develop a multi-disciplinary performance analysis system that enables design and joint optimization across various system levels, including application, network, and hardware.<\/span><\/p>\n<p><span>Arcadia empowers stakeholders to examine and enhance different aspects such as machine learning (ML) model architectures, collective algorithms, job scheduling, hardware, and network architecture design. By providing insights into the impact of these factors on system performance, Arcadia facilitates data-driven decision-making processes and fosters the evolution of models and hardware.<\/span><\/p>\n\n<h3><span>Inputs<\/span><\/h3>\n<p><span>As shown in the architecture design above, the input to the Arcadia system encompasses a range of essential parameters, including the long-range plans of AI systems and models, network topology and routing protocols, data center floor plans, AI workload distributions, and hardware specifications. 
Additionally, Arcadia considers failure domains to provide a holistic view of the system\u2019s performance and reliability.<\/span><\/p>\n<h3><span>Core<\/span><\/h3>\n<p><span>At the core of Arcadia is an orchestrator that coordinates the simulation of various components, including job scheduling, compute and memory, and network behavior at different levels. The system employs an AI workload synthesizer that learns from production distributions and generates representative workloads as inputs, ensuring the simulation reflects real-world conditions.<\/span><\/p>\n\n<h3><span>Outputs<\/span><\/h3>\n<p><span>Arcadia offers a wide range of outputs, including AI training and inference performance metrics, resource utilization, and reliability and availability metrics. This comprehensive set of metrics empowers stakeholders to analyze the impact of different factors and make informed decisions to optimize system performance.<\/span><\/p>\n<h3><span>Feedback Loop<\/span><\/h3>\n<p><span>Unlike analytical roofline estimates, Arcadia takes into account the network and compute feedback loop, providing accurate estimations of performance that align with real-world production measurements. This capability allows for more precise predictions and a better understanding of the expected performance of AI models and workloads on a given infrastructure.<\/span><\/p>\n<h2><span>Arcadia\u2019s benefits<\/span><\/h2>\n<p><span>Arcadia provides operational insights and a level of flexibility in simulation that allows us to address several challenges around optimizing our clusters.<\/span><\/p>\n<p><span>Operational workflows benefit significantly from Arcadia\u2019s simulation capabilities, which provide enhanced visibility and a deeper understanding of risk and mitigation plans. Simulation-based audits for AI\/HPC network maintenance can be conducted to identify potential issues and devise appropriate solutions. 
Maintenance scheduling can be optimized by leveraging Arcadia\u2019s insights, ensuring minimal disruption to AI\/HPC jobs. Furthermore, Arcadia aids in debugging and root-causing production events, enabling efficient troubleshooting and preventing recurrence of issues.<\/span><\/p>\n<p><span>Arcadia offers flexibility in terms of simulation detail levels, catering to different user requirements and purposes. Users who focus solely on the application level can disregard lower-level details, enabling faster simulation runs. On the other hand, users requiring in-depth analysis of low-level network hardware behaviors can leverage Arcadia\u2019s packet-level network simulation to extract detailed insights.<\/span><\/p>\n<p><span>Furthermore, Arcadia serves as a single source of truth that is agreed upon by all stakeholders. This unified approach helps ensure consistent and reliable performance analysis across teams and disciplines, establishing a common framework for hardware, network, job-scheduling, and AI systems co-design.<\/span><\/p>\n<h2><span>Use cases for Arcadia<\/span><\/h2>\n<p><span>There are several use cases for the Arcadia system in pursuit of large-scale, efficient high-performance clusters:<\/span><\/p>\n<p><span>Cluster utilization and fragmentation insights<\/span><br \/>\n<span>Measuring the impact of network and hardware on AI\/HPC job performance<\/span><br \/>\n<span>AI\/HPC job profile analysis in the training clusters<\/span><br \/>\n<span>Assessing the reliability, availability, and efficiency of training clusters<\/span><br \/>\n<span>Optimization of training cluster maintenance<\/span><br \/>\n<span>Optimization of AI\/HPC job scheduling and configurations<\/span><\/p>\n<h2><span>Next steps for Arcadia<\/span><\/h2>\n<p><span>As we build out more use cases for Arcadia, we\u2019re also developing additional frameworks to expand on its capabilities. 
This will include a framework to support operational cases in production networks, such as optimizing training cluster maintenance and AI\/HPC job scheduling and configurations.<\/span><\/p>\n<p><span>We\u2019re also investigating a framework to provide design insights for different topology\/routing designs given a set of known models. This would be used to surface key bottlenecks in compute, memory, or network and provide insights on how different model parameters can be optimized for a given cluster.<\/span><\/p>\n<p><span>Finally, we\u2019re aiming for Arcadia to support inputs from <\/span><a href=\"https:\/\/mlcommons.org\/en\/groups\/research-chakratracebench\/\"><span>Chakra<\/span><\/a><span>, an open, graph-based representation of AI\/ML workloads being developed in a working group in MLCommons.<\/span><\/p>\n<h2><span>Acknowledgments<\/span><\/h2>\n<p><span>Many people contributed to this project, but we\u2019d particularly like to thank Naader Hasani, Petr Lapukhov, Mikel Jimenez Fernandez, Thomas Fuller, Xin Liu, Greg Steinbrecher, Yuhui Zhang, Max Noormohammadpour, Mustafa Ozdal, Kapil Bisht, Phil Buonadonna, Josh Gilliland, Abishek Gopalan, Biao Lu, Gaya Nagarajan, Steve Politis, Kevin Quirk, Jimmy Williams, Yi Zhang, and Ying Zhang.<\/span><\/p>\n<p>The post <a href=\"https:\/\/engineering.fb.com\/2023\/09\/07\/data-infrastructure\/arcadia-end-to-end-ai-system-performance-simulator\/\">Arcadia: An end-to-end AI system performance simulator<\/a> appeared first on <a href=\"https:\/\/engineering.fb.com\/\">Engineering at Meta<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>We\u2019re introducing Arcadia, Meta\u2019s unified system that simulates the compute, memory, and network performance of AI training clusters. 
Extracting maximum performance from an AI cluster and increasing overall efficiency warrants a multi-input system that accounts for various hardware and software parameters across compute, storage, and network collectively. Arcadia gives Meta\u2019s researchers and engineers valuable insights&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2023\/09\/07\/arcadia-an-end-to-end-ai-system-performance-simulator\/\">Continue reading <span class=\"screen-reader-text\">Arcadia: An end-to-end AI system performance simulator<\/span><\/a><\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-759","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":787,"url":"https:\/\/fde.cat\/index.php\/2023\/11\/15\/watch-metas-engineers-on-building-network-infrastructure-for-ai\/","url_meta":{"origin":759,"position":0},"title":"Watch: Meta\u2019s engineers on building network infrastructure for AI","date":"November 15, 2023","format":false,"excerpt":"Meta is building for the future of AI at every level \u2013 from hardware like MTIA v1, Meta\u2019s first-generation AI inference accelerator to publicly released models like Llama 2, Meta\u2019s next-generation large language model, as well as new generative AI (GenAI) tools like Code Llama. 
Delivering next-generation AI products and\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":836,"url":"https:\/\/fde.cat\/index.php\/2024\/03\/12\/building-metas-genai-infrastructure\/","url_meta":{"origin":759,"position":1},"title":"Building Meta\u2019s GenAI Infrastructure","date":"March 12, 2024","format":false,"excerpt":"Marking a major investment in Meta\u2019s AI future, we are announcing two 24k GPU clusters. We are sharing details on the hardware, network, storage, design, performance, and software that help us extract high throughput and reliability for various AI workloads. We use this cluster design for Llama 3 training. We\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":879,"url":"https:\/\/fde.cat\/index.php\/2024\/06\/12\/how-meta-trains-large-language-models-at-scale\/","url_meta":{"origin":759,"position":2},"title":"How Meta trains large language models at scale","date":"June 12, 2024","format":false,"excerpt":"As we continue to focus our AI research and development on solving increasingly complex problems, one of the most significant and challenging shifts we\u2019ve experienced is the sheer scale of computation required to train large language models (LLMs). Traditionally, our AI model training has involved a training massive number of\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":597,"url":"https:\/\/fde.cat\/index.php\/2022\/06\/09\/under-the-hood-metas-cloud-gaming-infrastructure\/","url_meta":{"origin":759,"position":3},"title":"Under the hood: Meta\u2019s cloud gaming infrastructure","date":"June 9, 2022","format":false,"excerpt":"The promise of cloud gaming is a promise to democratize gaming. 
Anyone who loves games should be able to enjoy them and share the experience with their friends, no matter where they\u2019re located, and even if they don\u2019t have the latest, most expensive gaming hardware. Facebook launched its cloud gaming\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":670,"url":"https:\/\/fde.cat\/index.php\/2023\/01\/27\/watch-metas-engineers-discuss-optimizing-large-scale-networks\/","url_meta":{"origin":759,"position":4},"title":"Watch Meta\u2019s engineers discuss optimizing large-scale networks","date":"January 27, 2023","format":false,"excerpt":"Managing network solutions amidst a growing scale inherently brings challenges around performance, deployment, and operational complexities.\u00a0 At Meta, we\u2019ve found that these challenges broadly fall into three themes: 1.) \u00a0 Data center networking: Over the past decade, on the physical front, we have seen a rise in vendor-specific hardware that\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":641,"url":"https:\/\/fde.cat\/index.php\/2022\/10\/18\/ocp-summit-2022-open-hardware-for-ai-infrastructure\/","url_meta":{"origin":759,"position":5},"title":"OCP Summit 2022: Open hardware for AI infrastructure","date":"October 18, 2022","format":false,"excerpt":"At OCP Summit 2022, we\u2019re announcing Grand Teton, our next-generation platform for AI at scale that we\u2019ll contribute to the OCP community. We\u2019re also sharing new innovations designed to support data centers as they advance to support new AI technologies: A new, more efficient version of Open Rack. 
Our Air-Assisted\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/759","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=759"}],"version-history":[{"count":0,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/759\/revisions"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=759"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=759"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=759"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}