{"id":818,"date":"2024-01-29T17:00:57","date_gmt":"2024-01-29T17:00:57","guid":{"rendered":"https:\/\/fde.cat\/index.php\/2024\/01\/29\/improving-machine-learning-iteration-speed-with-faster-application-build-and-packaging\/"},"modified":"2024-01-29T17:00:57","modified_gmt":"2024-01-29T17:00:57","slug":"improving-machine-learning-iteration-speed-with-faster-application-build-and-packaging","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2024\/01\/29\/improving-machine-learning-iteration-speed-with-faster-application-build-and-packaging\/","title":{"rendered":"Improving machine learning iteration speed with faster application build and packaging"},"content":{"rendered":"<p><span>Slow build times and inefficiencies in packaging and distributing execution files were costing our ML\/AI engineers a significant amount of time while working on our training stack.<\/span><br \/>\n<span>By addressing these issues head-on, we were able to reduce this overhead by double-digit percentages.\u00a0<\/span><\/p>\n<p><span>In the fast-paced world of AI\/ML development, it\u2019s crucial to ensure that our infrastructure can keep up with the increasing demands and needs of our ML engineers, whose workflows include checking out code, writing code, building, packaging, and verification.<\/span><\/p>\n\n<p><span>In our efforts to maintain efficiency and productivity while empowering our ML\/AI engineers to deliver cutting-edge solutions, we found two major challenges that needed to be addressed head-on: slow builds and inefficiencies in packaging and distributing executable files.<\/span><\/p>\n<p><span>The frustrating problem of slow builds often arises when ML engineers work on older (\u201ccold\u201d) revisions for which our build infrastructure doesn\u2019t maintain a high cache hit rate, requiring us to repeatedly rebuild and relink many components. 
Moreover, build non-determinism further contributes to rebuilding by producing different outputs for the same source code, making previously cached results unusable.<\/span><\/p>\n<p><span>Executable packaging and distribution was another significant challenge because, historically, most ML Python executables were represented as<\/span><a href=\"https:\/\/engineering.fb.com\/2018\/07\/13\/data-infrastructure\/xars-a-more-efficient-open-source-system-for-self-contained-executables\/\"> <span>XAR files<\/span><\/a><span> (self-contained executables), and it is not always possible to leverage OSS layer-based solutions efficiently (see more details below). Unfortunately, creating such executables can be computationally costly, especially when dealing with a large number of files or substantial file sizes. Even if a developer modifies only a few Python files, a full XAR file reassembly and distribution is often required, delaying execution on remote machines.<\/span><\/p>\n<p><span>Our goal in improving build speed was to minimize the need for extensive rebuilding. 
To accomplish this, we streamlined the build graph by reducing dependency counts, mitigated the challenges posed by build non-determinism, and maximized the utilization of built artifacts.<\/span><\/p>\n<p><span>Simultaneously, our efforts in packaging and distribution aimed to introduce incrementality support, thereby eliminating the time-consuming overhead associated with XAR creation and distribution.<\/span><\/p>\n<h2><span>How we improved build speeds<\/span><\/h2>\n<p><span>To make builds faster, we wanted to ensure that we built as little as possible by addressing non-determinism and eliminating unused code and dependencies.<\/span><\/p>\n<p><span>We identified two sources of build non-determinism:<\/span><\/p>\n<p>Non-determinism in tooling.<span> Some compilers, such as Clang, Rustc, and NVCC, can produce different binary files for the same input, leading to non-deterministic results. Tackling these tooling non-determinism issues proved challenging, as they often required extensive root-cause analysis and time-consuming fixes.<\/span><br \/>\nNon-determinism in source code and build rules.<span> Developers, whether intentionally or unintentionally, introduced non-determinism by incorporating things like temporary directories, random values, or timestamps into build rule code. Addressing these issues posed a similar challenge, demanding a substantial investment of time to identify and fix.<\/span><\/p>\n<p><span>Thanks to<\/span><a href=\"https:\/\/engineering.fb.com\/2023\/04\/06\/open-source\/buck2-open-source-large-scale-build-system\/\"> <span>Buck2<\/span><\/a><span>,<\/span><span> which sends nearly all of the build actions to the<\/span><a href=\"https:\/\/buck2.build\/docs\/users\/remote_execution\/\"> <span>Remote Execution (RE) service<\/span><\/a><span>, we have been able to successfully implement non-determinism mitigation within RE. 
Now we provide consistent outputs for identical actions, paving the way for the adoption of a warm and stable revision for ML development. In practice, this approach eliminates build time entirely in many cases.<\/span><\/p>\n<p><span>Though removing the build process from the critical path of ML engineers might not be possible in all cases, we understand how important dependency management is for controlling build times. As dependency counts naturally grew, we enhanced our tooling to manage them better. These improvements helped us find and remove many unnecessary dependencies, significantly improving build graph analysis and overall build times. For example, we removed GPU code from the final binary when it wasn\u2019t needed, and we developed ways to identify which Python modules are actually used and to cut unused native code using linker maps.<\/span><\/p>\n<h2><span>Adding incrementality for executable distribution<\/span><\/h2>\n<p><span>A typical self-contained Python executable, when unarchived, is represented by thousands of Python files (.py and\/or .pyc), substantial native libraries, and the Python interpreter. The cumulative result is a multitude of files, often numbering in the hundreds of thousands, with a total size reaching tens of gigabytes.<\/span><\/p>\n<p><span>Engineers <\/span><span>spend a significant amount of time<\/span><span> dealing with incremental builds where the packaging and fetching overhead of such a large executable surpasses the build time. 
In response to this challenge, we implemented a new solution for the packaging and distribution of Python executables \u2013 the Content Addressable Filesystem (CAF).<\/span><\/p>\n<p><span>The primary strength of CAF lies in its ability to operate incrementally during the packaging and fetching stages:<\/span><\/p>\n<p>Packaging<span>: By adopting a content-aware approach, CAF can intelligently skip redundant uploads of files already present in Content Addressable Storage (CAS), whether as part of a different executable or the same executable with a different version.<\/span><br \/>\nFetching<span>: CAF maintains a cache on the destination host, ensuring that only content not already present in the cache needs to be downloaded.<\/span><\/p>\n<p><span>To optimize the efficiency of this system, we deploy a CAS daemon on the majority of Meta\u2019s data center hosts. The CAS daemon assumes multiple responsibilities, including maintaining the local cache on the host (materialization into the cache and cache GC) and organizing a P2P network with other CAS daemon instances using<\/span><a href=\"https:\/\/engineering.fb.com\/2022\/07\/14\/data-infrastructure\/owl-distributing-content-at-meta-scale\/\"> <span>Owl<\/span><\/a><span>, our high-fanout distribution system for large data objects. This P2P network enables direct content fetching from other CAS daemon instances, significantly reducing latency and storage bandwidth usage.<\/span><\/p>\n<p><span>In the case of CAF, an executable is defined by a flat manifest file detailing all symlinks, directories, hard links, and files, along with their digests and attributes. 
This manifest implementation allows us to deduplicate all unique files across executables and implement a smart affinity\/routing mechanism for scheduling, thereby minimizing the amount of content that needs to be downloaded by maximizing local cache utilization.<\/span><\/p>\n<p><span>While the concept may bear some resemblance to what<\/span><a href=\"https:\/\/gdevillele.github.io\/engine\/userguide\/storagedriver\/overlayfs-driver\/\"> <span>Docker achieves with OverlayFS<\/span><\/a><span>, our approach differs significantly. Organizing proper layers is not always feasible in our case due to the number of executables with diverse dependencies. In this context, layering becomes less efficient and more complex to organize. Additionally, direct access to files is essential for P2P support.<\/span><\/p>\n<p><span>We opted for<\/span><a href=\"https:\/\/docs.kernel.org\/filesystems\/btrfs.html#:~:text=Btrfs%20is%20a%20copy%20on,open%20for%20contribution%20from%20anyone.\"> <span>Btrfs<\/span><\/a><span> as our filesystem because of its<\/span><a href=\"https:\/\/btrfs.readthedocs.io\/en\/latest\/Compression.html\"> <span>compression<\/span><\/a><span> support, its ability to write compressed data directly to extents (bypassing redundant decompression and compression), and its<\/span><a href=\"https:\/\/btrfs.readthedocs.io\/en\/latest\/Reflink.html\"> <span>Copy-on-write (COW)<\/span><\/a><span> capabilities. 
These attributes allow us to maintain executables on block devices with a total size similar to those represented as XAR files, share the same files from cache across executables, and implement a highly efficient COW mechanism that, when needed, only copies affected file extents.<\/span><\/p>\n<h2><span>LazyCAF and enforcing uniform revisions: Areas for further ML iteration improvements<\/span><\/h2>\n<p><span>The improvements we implemented have proven highly effective, drastically reducing overhead and significantly elevating the efficiency of our ML engineers. Faster build times and more efficient packaging and distribution of executables have reduced overhead by double-digit percentages.<\/span><\/p>\n<p><span>Yet, our journey to slash build overhead doesn\u2019t end here. We\u2019ve identified several promising improvements that we aim to deliver soon. In our investigation into our ML workflows, we discovered that only a fraction of the entire executable content is utilized in certain scenarios. Recognizing that, we intend to start working on optimizations that fetch executable parts on demand, thereby significantly reducing materialization time and minimizing the overall disk footprint.<\/span><\/p>\n<p><span>We can also further accelerate the development process by enforcing uniform revisions. We plan to enable all our ML engineers to operate on the same revision, which will improve the cache hit ratios of our builds. 
This move will further increase the percentage of incremental builds since most of the artifacts will be cached.<\/span><\/p>\n<p>The post <a href=\"https:\/\/engineering.fb.com\/2024\/01\/29\/ml-applications\/improving-machine-learning-iteration-speed-with-faster-application-build-and-packaging\/\">Improving machine learning iteration speed with faster application build and packaging<\/a> appeared first on <a href=\"https:\/\/engineering.fb.com\/\">Engineering at Meta<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Slow build times and inefficiencies in packaging and distributing execution files were costing our ML\/AI engineers a significant amount of time while working on our training stack. By addressing these issues head-on, we were able to reduce this overhead by double-digit percentages.\u00a0 In the fast-paced world of AI\/ML development, it\u2019s crucial to ensure that our&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2024\/01\/29\/improving-machine-learning-iteration-speed-with-faster-application-build-and-packaging\/\">Continue reading <span class=\"screen-reader-text\">Improving machine learning iteration speed with faster application build and packaging<\/span><\/a><\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-818","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":897,"url":"https:\/\/fde.cat\/index.php\/2024\/07\/16\/ai-lab-the-secrets-to-keeping-machine-learning-engineers-moving-fast\/","url_meta":{"origin":818,"position":0},"title":"AI Lab: The secrets to keeping machine learning engineers moving fast","date":"July 16, 2024","format":false,"excerpt":"The key to developer velocity across AI lies in 
minimizing time to first batch (TTFB) for machine learning (ML) engineers. AI Lab is a pre-production framework used internally at Meta. It allows us to continuously A\/B test common ML workflows \u2013 enabling proactive improvements and automatically preventing regressions on TTFB.\u00a0\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":814,"url":"https:\/\/fde.cat\/index.php\/2024\/01\/18\/lazy-is-the-new-fast-how-lazy-imports-and-cinder-accelerate-machine-learning-at-meta\/","url_meta":{"origin":818,"position":1},"title":"Lazy is the new fast: How Lazy Imports and Cinder accelerate machine learning at Meta","date":"January 18, 2024","format":false,"excerpt":"At Meta, the quest for faster model training has yielded an exciting milestone: the adoption of Lazy Imports and the Python Cinder runtime. The outcome? Up to 40 percent time to first batch (TTFB) improvements, along with a 20 percent reduction in Jupyter kernel startup times. This advancement facilitates swifter\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":674,"url":"https:\/\/fde.cat\/index.php\/2023\/02\/06\/the-evolution-of-facebooks-ios-app-architecture\/","url_meta":{"origin":818,"position":2},"title":"The evolution of Facebook\u2019s iOS app architecture","date":"February 6, 2023","format":false,"excerpt":"Facebook for iOS (FBiOS) is the oldest mobile codebase at Meta. Since the app was rewritten in 2012, it has been worked on by thousands of engineers and shipped to billions of users, and it can support hundreds of engineers iterating on it at a time. 
After years of iteration,\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":893,"url":"https:\/\/fde.cat\/index.php\/2024\/07\/10\/metas-approach-to-machine-learning-prediction-robustness\/","url_meta":{"origin":818,"position":3},"title":"Meta\u2019s approach to machine learning prediction robustness","date":"July 10, 2024","format":false,"excerpt":"Meta\u2019s advertising business leverages large-scale machine learning (ML) recommendation models that power millions of ads recommendations per second across Meta\u2019s family of apps. Maintaining reliability of these ML systems helps ensure the highest level of service and uninterrupted benefit delivery to our users and advertisers. To minimize disruptions and ensure\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":751,"url":"https:\/\/fde.cat\/index.php\/2023\/08\/22\/how-is-einstein-gpt-shaping-the-future-of-salesforce-development-and-unleashing-developer-productivity\/","url_meta":{"origin":818,"position":4},"title":"How is Einstein GPT Shaping the Future of Salesforce Development and Unleashing Developer Productivity?","date":"August 22, 2023","format":false,"excerpt":"By Yingbo Zhou and Scott Nyberg In our \u201cEngineering Energizers\u201d Q&A series, we examine the professional life experiences that have shaped Salesforce Engineering leaders. 
Meet Yingbo Zhou, a Senior Director of Research for Salesforce AI Research, where he leads the team to develop the model for Einstein GPT for Developers\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":791,"url":"https:\/\/fde.cat\/index.php\/2023\/08\/22\/how-is-einstein-shaping-the-future-of-salesforce-development-and-unleashing-developer-productivity\/","url_meta":{"origin":818,"position":5},"title":"How is Einstein Shaping the Future of Salesforce Development and Unleashing Developer Productivity?","date":"August 22, 2023","format":false,"excerpt":"By Yingbo Zhou and Scott Nyberg In our \u201cEngineering Energizers\u201d Q&A series, we examine the professional life experiences that have shaped Salesforce Engineering leaders. Meet Yingbo Zhou, a Senior Director of Research for Salesforce AI Research, where he leads the team to develop the model for Einstein for Developers, a\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/818","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=818"}],"version-history":[{"count":0,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/818\/revisions"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=818"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=818"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=818"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}