{"id":758,"date":"2023-09-07T19:35:28","date_gmt":"2023-09-07T19:35:28","guid":{"rendered":"https:\/\/fde.cat\/index.php\/2023\/09\/07\/using-chakra-execution-traces-for-benchmarking-and-network-performance-optimization\/"},"modified":"2023-09-07T19:35:28","modified_gmt":"2023-09-07T19:35:28","slug":"using-chakra-execution-traces-for-benchmarking-and-network-performance-optimization","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2023\/09\/07\/using-chakra-execution-traces-for-benchmarking-and-network-performance-optimization\/","title":{"rendered":"Using Chakra execution traces for benchmarking and network performance optimization"},"content":{"rendered":"<p><span>Meta presents <\/span><a href=\"https:\/\/arxiv.org\/pdf\/2305.14516.pdf\"><span>Chakra execution traces<\/span><\/a><span>, an open graph-based representation of AI\/ML workload execution, laying the foundation for benchmarking and network performance optimization.<\/span><br \/>\n<span>Chakra execution traces represent key operations, such as compute, memory, and communication, data and control dependencies, timing, and resource constraints.<\/span><br \/>\n<a href=\"https:\/\/mlcommons.org\/en\/groups\/research-chakratracebench\/\"><span>In collaboration with MLCommons<\/span><\/a><span>, we are seeking industry-wide adoption for benchmarking.\u00a0<\/span><br \/>\n<span>Meta open sourced a set of tools to enable the collection, analysis, generation, and adoption of Chakra execution traces by a broad range of simulators, emulators, and replay tools.<\/span><\/p>\n<p><span>At Meta, our endeavors are not only geared towards <\/span><a href=\"https:\/\/www.metaconnect.com\/en\/home?utm_source=fbengineering&amp;utm_medium=organic\"><span>pushing the boundaries of AI\/ML<\/span><\/a><span> but also towards optimizing the vast networks that enable these computations. Our agile, reproducible, and standardized benchmarking system plays an important role in this. 
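The trace representation described above (operators for compute, memory, and communication, linked by data and control dependencies, with timing attached) can be illustrated with a small sketch. This is a hypothetical, simplified stand-in, not the actual Chakra schema; all field names here are assumptions for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a Chakra-style execution trace node: each node is an
# operator (compute, memory, or communication) carrying its dependencies and
# measured timing. Field names are illustrative, not the real Chakra schema.
@dataclass
class TraceNode:
    node_id: int
    name: str
    op_type: str        # e.g. "COMP", "MEM", "COMM"
    duration_us: float  # measured runtime of the operator
    deps: list = field(default_factory=list)  # ids of nodes that must finish first

def topo_order(nodes):
    """Return node ids in an order that respects data/control dependencies."""
    by_id = {n.node_id: n for n in nodes}
    visited, order = set(), []
    def visit(nid):
        if nid in visited:
            return
        visited.add(nid)
        for dep in by_id[nid].deps:
            visit(dep)
        order.append(nid)
    for n in nodes:
        visit(n.node_id)
    return order

# A three-node toy trace: an all_reduce that depends on a matmul and a lookup.
trace = [
    TraceNode(0, "matmul", "COMP", 120.0),
    TraceNode(1, "embedding_lookup", "MEM", 45.0),
    TraceNode(2, "all_reduce", "COMM", 300.0, deps=[0, 1]),
]
print(topo_order(trace))  # → [0, 1, 2]
```

A graph like this is what makes the traces useful to simulators and replay tools: any consumer that honors the dependency edges can reconstruct a valid execution order.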
Through our collaboration with MLCommons, and our deep insights into traditional benchmarking constraints, we have developed Chakra execution traces\u2014a graph-based representation of AI\/ML workloads. This approach aims to unify diverse execution trace schemas, seeking industry-wide adoption for enhanced AI efficiency analysis tools and holistic performance benchmarking.<\/span><\/p>\n<h2><span>The limitations of traditional AI benchmarking methodology<\/span><\/h2>\n<p><span>Traditionally, benchmarking AI systems has largely relied on running full ML workloads. Established benchmarking approaches, such as <\/span><a href=\"https:\/\/mlcommons.org\/en\/\"><span>MLPerf<\/span><\/a><span>, have provided invaluable insights into the behavior and performance of AI workloads and systems. However, traditional full workload benchmarking presents several challenges:<\/span><\/p>\n<p>Difficulty in forecasting future system performance<span>: When designing an AI system, engineers frequently face the challenge of predicting the performance of future systems. Such predictions become even more complex when the compute engines aren\u2019t ready or when changes in network topology and bandwidth become necessary. Relying on full workloads to evaluate the performance of these not-yet-realized systems is not feasible.<\/span><br \/>\nHigh compute cost<span>: Executing full workload benchmarks comes at a substantial compute cost. Given that training contemporary ML models often requires thousands of graphics processing units (GPUs), these benchmarks should ideally be executed on a similarly vast number of GPUs. Additionally, gauging the performance of a system using this method can be time-consuming.<\/span><br \/>\nInability to adapt to evolving workloads<span>: The landscape of ML workloads and their requirements is rapidly evolving. 
Traditional full workload benchmarks fall short when it comes to addressing these changing needs, primarily because they necessitate significant efforts to standardize workloads as benchmarks.<\/span><\/p>\n<h2><span>An overview of Chakra<\/span><\/h2>\n<p><span>Building upon our insights into the constraints of traditional benchmarking, we present the Chakra execution traces. This new approach provides an open, interoperable graph-based depiction of AI\/ML workload execution. The Chakra execution trace captures core operations\u2014including compute, memory, and <\/span> <span>communication\u2014along with their dependencies, timing, and metadata.\u00a0<\/span><\/p>\n<p><span>Though execution traces are a valuable representation of an ML task, the structure and metadata of the resulting traces can differ based on the ML framework utilized. Recognizing this, Chakra introduces a standardized schema for performance modeling, termed the Chakra execution trace. The below figure outlines the Chakra ecosystem, with execution traces as its central component. As depicted in the figure, Chakra also offers a range of tools to convert, visualize, generate, and simulate these execution traces.<\/span><\/p>\n\n<h2><span>How Meta leverages Chakra execution traces<\/span><\/h2>\n<p><span>At Meta, we collect execution traces from our production servers every day. These execution traces serve multiple purposes: Benchmarking, visualization, and performance optimization.<\/span><\/p>\n<h3><span>Benchmarking<\/span><\/h3>\n<p><span>Benchmarking is essential for improving current AI systems and planning future networks. We specifically utilize Chakra execution traces for this task. We have developed several benchmarking tools, including <\/span><a href=\"https:\/\/dl.acm.org\/doi\/abs\/10.1145\/3579371.3589072\"><span>Mystique<\/span><\/a><span> and <\/span><a href=\"https:\/\/github.com\/facebookresearch\/param\"><span>PARAM<\/span><\/a><span>. 
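Replay-style benchmarking tools like these walk the recorded operators in dependency order and accumulate their measured durations. The following is a minimal sketch of that idea, not Mystique's or PARAM's actual implementation; the trace tuple format and function name are assumptions.

```python
# Hedged sketch of operator-level trace replay: each operator finishes at
# max(finish time of its dependencies) + its own recorded duration, so the
# final result is the critical-path time of the trace.
def simulate_replay(ops):
    """ops: list of (op_id, duration_us, dep_ids), topologically sorted.
    Returns the critical-path completion time in microseconds."""
    finish = {}
    for op_id, duration, deps in ops:
        start = max((finish[d] for d in deps), default=0.0)
        finish[op_id] = start + duration
    return max(finish.values())

# Two independent ops followed by a communication op that waits on both.
ops = [
    (0, 120.0, []),      # matmul
    (1, 45.0, []),       # embedding lookup
    (2, 300.0, [0, 1]),  # all_reduce starts only when both finish
]
print(simulate_replay(ops))  # → 420.0 (all_reduce starts at 120, ends at 420)
```

Replaying recorded timings this way reproduces a workload's performance shape without re-running the model itself, which is what makes trace-based benchmarking so much cheaper than full-workload runs.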
Mystique allows us to replicate the performance of an ML workload by replaying both compute and communication operators found in execution traces. It leverages the Chakra execution trace to record runtime details of a model at the operator level and then replays them to reproduce the original performance. In line with our vision, the <\/span><a href=\"https:\/\/mlcommons.org\/en\/groups\/research-chakratracebench\/\"><span>MLCommons Chakra working group<\/span><\/a><span> is curating the \u2018Chakra trace benchmark suite\u2019 by gathering execution traces from various industry players.<\/span><\/p>\n\n<h3><span>Visualization and performance optimization<\/span><\/h3>\n<p><span>One example of visualization and performance optimization is the analysis of collective message sizes. We analyze production execution traces using an automated system. The visual data generated aids us in identifying any balance or imbalance in collective message sizes across different ranks. Our visualization tool can precisely highlight these imbalances, as shown by the below figure.\u00a0<\/span><\/p>\n\n<p><span>With this information at hand, Meta engineers are equipped to craft appropriate solutions, ensuring a balanced message size, as demonstrated in the below figure.<\/span><\/p>\n\n<h2><span>Future plans<\/span><\/h2>\n<h3><span>Enhancing the benchmarking capability of Chakra execution traces<\/span><\/h3>\n<p><span>While the execution trace replayer enables replay of execution traces, it brings forth challenges. A primary challenge is the intrinsic linkage of collected execution traces to specific systems. Because traces are gathered from actual machine runs, the kernels executed are optimized for the specific system at play. As a result, traces sourced from one system might not accurately simulate on another with a different GPU, network topology, and bandwidth.<\/span><\/p>\n<p><span>We\u2019re addressing this constraint in collaboration with the MLCommons Chakra working group. 
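The message-size analysis described above can be sketched as a simple per-rank aggregation. This is an illustrative toy, not Meta's automated system; the function name, input shape, and tolerance threshold are all assumptions.

```python
from statistics import median

# Illustrative sketch of collective message-size imbalance detection: given
# per-rank message sizes pulled from execution traces, flag ranks whose total
# bytes diverge from the median by more than a tolerance (assumed 10% here).
def message_imbalance(sizes_by_rank, tolerance=0.10):
    """sizes_by_rank: {rank: [message sizes in bytes]}. Returns outlier ranks."""
    totals = {r: sum(s) for r, s in sizes_by_rank.items()}
    med = median(totals.values())
    return sorted(r for r, t in totals.items() if abs(t - med) / med > tolerance)

sizes = {
    0: [1048576, 1048576],  # 2 MiB total
    1: [1048576, 1048576],
    2: [4194304, 4194304],  # 8 MiB total: the imbalanced rank
    3: [1048576, 1048576],
}
print(message_imbalance(sizes))  # → [2]
```

Using the median rather than the mean keeps a single heavily skewed rank from pulling the baseline toward itself, so only the genuine outlier is flagged.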
We aspire to gather execution traces prior to the operator optimization phase for any target system, as shown in the figure. These are termed pre-execution traces. In parallel, to enable benchmarking next-gen AI systems, we\u2019re streamlining the process from trace collection to simulation on a simulator.<\/span><\/p>\n\n<h3><span>Using AI to generate representative execution traces<\/span><\/h3>\n<p><span>Chakra execution traces are capable of identifying network bottlenecks in ML workload execution. However, optimizing SW\/HW stacks with production execution traces presents a practical challenge. The main challenge arises when trying to globally optimize our production systems. Given the sheer volume of production traces, exhaustively running them for system optimization is neither feasible nor efficient. Doing so would be both time-consuming and computationally expensive. Thus, selecting a representative subset of production execution traces becomes imperative.\u00a0<\/span><\/p>\n<p><span>However, there\u2019s a risk: The chosen traces might not holistically represent the global characteristics, potentially skewing optimization efforts towards only specific ML workloads. We envision a generative AI model that can identify and generate execution traces that are representative of the primary characteristics observed. We also plan to incorporate an obfuscation mechanism within the AI model. This will facilitate trace sharing without jeopardizing intellectual property, fostering SW\/HW co-design between different companies.<\/span><\/p>\n\n<p>\u00a0<\/p>\n<h2><span>Taking the leap with industry collaboration<\/span><\/h2>\n<p><span>For such an ecosystem to flourish, industry consensus is paramount. Our collaboration with the MLCommons consortium, an open engineering assembly of over 50 leading companies, is a testament to our commitment. 
This collaboration aims to establish Chakra within its fold, providing a framework for broad adoption.<\/span><\/p>\n<p><span>Chakra\u2019s working group under MLCommons will spearhead efforts to create and develop:<\/span><\/p>\n<p><span>A standardized schema that can capture and convert execution traces from diverse frameworks.<\/span><br \/>\n<span>ML models for creating representative Chakra execution traces \u2013 protecting proprietary information while also projecting future AI workloads.<\/span><br \/>\n<span>An open ecosystem of tools for benchmarks, simulations, and emulations.<\/span><br \/>\n<span>Comprehensive benchmarks with Chakra execution traces based on MLCommons\/MLPerf guidelines.<\/span><\/p>\n<h2><span>Join us on this journey<\/span><\/h2>\n<p><span>Our vision is to forge an agile, reproducible benchmarking and co-design system for AI. Collaboration with peers, academic institutions, and consortiums will be pivotal. We invite interested individuals and companies to become a part of the <\/span><a href=\"https:\/\/mlcommons.org\/en\/groups\/research-chakratracebench\/\"><span>Chakra working group<\/span><\/a><span>, to help contribute to the paradigm shift in benchmarking and network performance optimization.<\/span><\/p>\n<h2><span>Read the research paper<\/span><\/h2>\n<p><a href=\"https:\/\/arxiv.org\/pdf\/2305.14516.pdf\"><span>Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces<\/span><\/a><\/p>\n<h2><span>Acknowledgements<\/span><\/h2>\n<p><span>We would like to thank all contributors to the Chakra project within Meta: Taekyung Heo, Srinivas Sridharan, Brian Coutinho, Hiwot Kassa, Matt Bergeron, Parth Malani, Shashi Gandham, Omar Baldonado, our external partners in Georgia Tech and MLCommons, as well as external collaborators in AMD, CMU, Cornell, Enfabrica, Google, Harvard, HP Labs, Intel, Keysight Technologies, Microsoft, NVIDIA, OCP, and Stanford.<\/span><\/p>\n<p>The post <a 
href=\"https:\/\/engineering.fb.com\/2023\/09\/07\/networking-traffic\/chakra-execution-traces-benchmarking-network-performance-optimization\/\">Using Chakra execution traces for benchmarking and network performance optimization<\/a> appeared first on <a href=\"https:\/\/engineering.fb.com\/\">Engineering at Meta<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Meta presents Chakra execution traces, an open graph-based representation of AI\/ML workload execution, laying the foundation for benchmarking and network performance optimization. Chakra execution traces represent key operations, such as compute, memory, and communication, data and control dependencies, timing, and resource constraints. In collaboration with MLCommons, we are seeking industry-wide adoption for benchmarking.\u00a0 Meta open&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2023\/09\/07\/using-chakra-execution-traces-for-benchmarking-and-network-performance-optimization\/\">Continue reading <span class=\"screen-reader-text\">Using Chakra execution traces for benchmarking and network performance optimization<\/span><\/a><\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-758","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":787,"url":"https:\/\/fde.cat\/index.php\/2023\/11\/15\/watch-metas-engineers-on-building-network-infrastructure-for-ai\/","url_meta":{"origin":758,"position":0},"title":"Watch: Meta\u2019s engineers on building network infrastructure for AI","date":"November 15, 2023","format":false,"excerpt":"Meta is building for the future of AI at every level \u2013 from hardware like MTIA v1, Meta\u2019s first-generation AI inference accelerator 
to publicly released models like Llama 2, Meta\u2019s next-generation large language model, as well as new generative AI (GenAI) tools like Code Llama. Delivering next-generation AI products and\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":626,"url":"https:\/\/fde.cat\/index.php\/2022\/08\/31\/introducing-velox-an-open-source-unified-execution-engine\/","url_meta":{"origin":758,"position":1},"title":"Introducing Velox: An open source unified execution engine","date":"August 31, 2022","format":false,"excerpt":"Meta is introducing Velox, an open source unified execution engine aimed at accelerating data management systems and streamlining their development. Velox is under active development. Experimental results from our paper published at the International Conference on Very Large Data Bases (VLDB) 2022 show how Velox improves efficiency and consistency in\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":869,"url":"https:\/\/fde.cat\/index.php\/2024\/05\/22\/composable-data-management-at-meta\/","url_meta":{"origin":758,"position":2},"title":"Composable data management at Meta","date":"May 22, 2024","format":false,"excerpt":"In recent years, Meta\u2019s data management systems have evolved into a composable architecture that creates interoperability, promotes reusability, and improves engineering efficiency.\u00a0 We\u2019re sharing how we\u2019ve achieved this, in part, by leveraging Velox, Meta\u2019s open source execution engine, as well as work ahead as we continue to rethink our data\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":326,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/asicmon-a-platform-agnostic-observability-system-for-ai-accelerators\/","url_meta":{"origin":758,"position":3},"title":"Asicmon: A platform 
agnostic observability system for AI accelerators","date":"August 31, 2021","format":false,"excerpt":"We will be hosting a talk about our work on, \u201cA Platform Agnostic Observability System for AI Accelerators\u201d during our virtual Systems @Scale event at 10:20 a.m. PT on Wednesday, June 30, followed by a live Q&A session. Please submit any questions to systemsatscale@fb.com before the event. Accelerators are special-purpose\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":768,"url":"https:\/\/fde.cat\/index.php\/2023\/10\/05\/meta-contributes-new-features-to-python-3-12\/","url_meta":{"origin":758,"position":4},"title":"Meta contributes new features to Python 3.12","date":"October 5, 2023","format":false,"excerpt":"Python 3.12 is out! It includes new features and performance improvements \u2013 some contributed by Meta \u2013 that we believe will benefit all Python users. We\u2019re sharing details about these new features that we worked closely with the Python community to develop. This week\u2019s release of Python 3.12 marks a\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":301,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/reverse-debugging-at-scale\/","url_meta":{"origin":758,"position":5},"title":"Reverse debugging at scale","date":"August 31, 2021","format":false,"excerpt":"Say you receive an email notification that a service is crashing just after your last code change deploys. The crash happens in only 0.1 percent of the servers where it runs. 
But you\u2019re at a large-scale company, so 0.1 percent equals thousands of servers \u2014 and\u00a0this issue is going to\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/758","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=758"}],"version-history":[{"count":0,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/758\/revisions"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=758"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=758"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=758"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}