{"id":897,"date":"2024-07-16T16:00:51","date_gmt":"2024-07-16T16:00:51","guid":{"rendered":"https:\/\/fde.cat\/index.php\/2024\/07\/16\/ai-lab-the-secrets-to-keeping-machine-learning-engineers-moving-fast\/"},"modified":"2024-07-16T16:00:51","modified_gmt":"2024-07-16T16:00:51","slug":"ai-lab-the-secrets-to-keeping-machine-learning-engineers-moving-fast","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2024\/07\/16\/ai-lab-the-secrets-to-keeping-machine-learning-engineers-moving-fast\/","title":{"rendered":"AI Lab: The secrets to keeping machine learning engineers moving fast"},"content":{"rendered":"<p><span>The key to developer velocity across AI lies in minimizing time to first batch (TTFB) for machine learning (ML) engineers.<\/span><br \/>\n<span>AI Lab is a pre-production framework used internally at Meta. It allows us to continuously A\/B test common ML workflows \u2013 enabling proactive improvements and automatically preventing regressions on TTFB.\u00a0<\/span><br \/>\n<span>AI Lab prevents TTFB regressions whilst enabling experimentation to develop improvements. For example, during the rollout of the open source <\/span><a href=\"https:\/\/github.com\/facebookincubator\/cinder\"><span>Python Cinder runtime<\/span><\/a><span>, AI Lab was used to yield a 2x increase on original TTFB improvements, reducing <\/span><a href=\"https:\/\/engineering.fb.com\/2024\/01\/18\/developer-tools\/lazy-imports-cinder-machine-learning-meta\/\"><span>TTFB by up to 40%<\/span><\/a><span>.<\/span><\/p>\n<p><span>Time to first batch (TTFB), the delay from when a workflow is submitted to the training job\u2019s first batch, plays an important role in accelerating our machine learning (ML) engineers\u2019 iteration speeds. Essentially, TTFB is the time elapsed from the moment you hit the \u201cstart\u201d button on your ML model training to the point when the first batch of data enters the model for processing. 
TTFB contributes overhead to every ML training job and marks the moment when developers first get a signal on their job.\u00a0<\/span><\/p>\n<p><span>By minimizing TTFB, we\u2019re unblocking our ML engineers, increasing the number of iterations they can do per day, and improving the overall speed of innovation at Meta.<\/span><\/p>\n<p><span>Supporting TTFB across Meta requires a scalable offering that not only enables proactive improvements on this valuable metric, but also keeps it healthy autonomously. To this end, we\u2019ve created AI Lab, a pre-production TTFB signal generation tool that empowers infra owners to ship new changes with high confidence, reducing <\/span><a href=\"https:\/\/engineering.fb.com\/2024\/01\/18\/developer-tools\/lazy-imports-cinder-machine-learning-meta\/\"><span>TTFB by up to 40%<\/span><\/a><span>. This, coupled with automatic prevention of regressions, keeps ML engineers moving fast across Meta.<\/span><\/p>\n<h2><span>Optimizing TTFB helps ML engineers move fast<\/span><\/h2>\n<p><span>The overhead from TTFB is on the critical path for most ML development. It is composed of components like config validation, feature pre-processing, and infra overhead (such as queuing for capacity). Optimizations to individual components of TTFB can even impact the entire training cycle of some models. 
At Meta\u2019s scale, the metric value of TTFB often subtly changes as developers iterate on their model, launcher, or architecture.<\/span><\/p>\n<p>Example TTFB measurement with components.<\/p>\n<p><span>To get and keep ML engineers moving fast, two things are required:<\/span><\/p>\n<p>Offensively improve TTFB: <span>We need an intuitive, easy-to-use experimentation framework that allows users to quantify the impact of their changes, enabling fast iteration and impact certification of new features, and empowering infra owners to ship new changes with high confidence.<\/span><br \/>\nDefensively prevent regressions on TTFB: <span>We need continuous regression prevention that tests the latest changes in a low-noise environment, whilst providing a way to monitor, detect, and prevent regressions from affecting ML engineers in the first place.<\/span><\/p>\n<h2><span>Introducing AI Lab<\/span><\/h2>\n<p><span>AI Lab is a specialized pre-production framework in which we continuously execute common ML workflows as an A\/B test to accurately measure the impact of recent changes on metrics like TTFB. Built on top of the same systems as <\/span><a href=\"https:\/\/engineering.fb.com\/2018\/10\/19\/android\/mobilelab\/\"><span>MobileLab<\/span><\/a><span>, AI Lab automatically defends TTFB by preventing regressions prior to release and opportunistically enables offensive TTFB improvements as an experimentation framework.\u00a0<\/span><\/p>\n<p><span>Building AI Lab presented unique challenges. Because GPU capacity is such a precious resource, we had to ensure we were a net positive to capacity usage across Meta. We took care to work with partners on shrunk models and simple configurations, some of which could run on CPUs alone, while still preventing the regressions that would regularly tie up GPUs. To this end, we created an auto-shrinker that aims to ensure tests run the same code and configurations as production while consuming fewer compute resources. 
It does things like reduce the number of training iterations and the model size, and it can even enable more deterministic behavior. These tests often run in &lt;10 minutes, which is beneficial for developers iterating on potential TTFB changes. We also needed a holistic strategy to scale to the size of Meta, something we\u2019ll cover in a later section.<\/span><\/p>\n<p>AI Lab finding a regression in TTFB.<\/p>\n<p><span>Let\u2019s jump into a real example of how we can leverage a tool like AI Lab to reduce TTFB.<\/span><\/p>\n<h2><span>Reducing TTFB with the Python Cinder runtime and AI Lab<\/span><\/h2>\n<p><span>Meta\u2019s open source <\/span><a href=\"https:\/\/github.com\/facebookincubator\/cinder\"><span>Python Cinder runtime<\/span><\/a><span> brought with it up to a <\/span><a href=\"https:\/\/engineering.fb.com\/2024\/01\/18\/developer-tools\/lazy-imports-cinder-machine-learning-meta\/\"><span>40% improvement in TTFB<\/span><\/a><span> thanks to its aggressive lazy imports. Here, we see the true utility of a framework like AI Lab and how it was used to facilitate this sweeping change.<\/span><\/p>\n<h3><span>Offensively<\/span><\/h3>\n<p><span>We can leverage AI Lab instead of experimenting on real ML engineers\u2019 workflows, which may require days or weeks of turnaround to validate a performance hypothesis. With AI Lab, in less than an hour, we\u2019re able to accurately test and measure the impact of a proposed Cinder version on TTFB across a comprehensive set of representative ML scenarios.\u00a0<\/span><\/p>\n<p><span>In practice, developers turned this into an iteration loop to test further optimizations and fine-tune Cinder, yielding a 2x increase on the original TTFB improvements they were seeing. For example, in early profiles with Cinder enabled, engineers found that up to 10% of one workflow\u2019s execution time was spent just pretty printing. 
It turned out that the memoization method used caused a <\/span><span>repr()<\/span><span> call on an underlying data structure that happened to be huge in typical ML scenarios. The engineers made an object wrapper around this data structure and performed memoization comparisons using <\/span><a href=\"https:\/\/docs.python.org\/3\/reference\/datamodel.html\"><span>object identities<\/span><\/a><span> instead.<\/span><\/p>\n<p><span>AI Lab verified the improvement, enabling them to proceed with rolling out the change.<\/span><\/p>\n<h3><span>Defensively<\/span><\/h3>\n<p><span>Around the time Cinder began rolling out, a regression occurred that was completely unrelated to the rollout. An engineer had added some logging that they believed was being done <\/span><a href=\"https:\/\/docs.python.org\/3\/glossary.html#term-coroutine\"><span>asynchronously<\/span><\/a><span>. Unbeknownst to them, the call was actually blocking because one of the nested clients it required was synchronous. AI Lab leveraged <\/span><a href=\"https:\/\/engineering.fb.com\/2020\/03\/05\/developer-tools\/incident-tracker\/\"><span>Incident Tracker<\/span><\/a><span> and automatically attributed the regression down to the specific change. The author was notified shortly afterwards and reverted the change before the release went out to production.<\/span><\/p>\n<p><span>Thanks to AI Lab, the engineers working on Cinder never had to worry about a TTFB regression landing in the same release they rolled out in, avoiding a potential rollback.<\/span><\/p>\n<p>AI Lab root causing a specific change that caused a TTFB regression.<\/p>\n<h2><span>How to achieve prevention at Meta\u2019s scale<\/span><\/h2>\n<p><span>We want to give accurate TTFB signals as early as possible in the development cycle, but it\u2019s infeasible to benchmark all ML scenarios for every change made by every engineer at Meta. 
Instead, similar to <\/span><a href=\"https:\/\/engineering.fb.com\/2018\/11\/21\/developer-tools\/predictive-test-selection\/\"><span>predictive test selection<\/span><\/a><span>, we establish a limit on the capacity used and set out to find as many regressions\/improvements as early in the development cycle as possible. In practice, this means:<\/span><\/p>\n<p>AI Lab integrating at various stages pre-production.<\/p>\n<p>O(Code Changes):<span> Running relevant, effective, and computationally efficient (often CPU-only) AI Lab tests on prospective changes before they are even reviewed.<\/span><br \/>\nO(Releases):<span> Running a more holistic set of AI Lab tests prior to release and performing a <\/span><a href=\"https:\/\/engineering.fb.com\/2020\/03\/05\/developer-tools\/incident-tracker\/\"><span>bisect-like<\/span><\/a><span> attribution process to find the root cause.<\/span><\/p>\n<p><span>Attribution in this manner is highly effective and efficient; it serves as a great fallback when we must run more computationally intensive tests to find a certain regression.<\/span><\/p>\n<p>AI Lab\u2019s high-level end-to-end flow.<\/p>\n<p><span>Should we find a statistically significant change per a <\/span><a href=\"https:\/\/en.wikipedia.org\/wiki\/Student%27s_t-test\"><span>t-test<\/span><\/a><span>, we perform additional checks before marking it as a regression\/improvement:<\/span><\/p>\n<p><span>Run confirmation runs to verify that we can reliably reproduce the expected regression\/improvement.<\/span><br \/>\n<span>Ensure the size of the regression\/improvement is above a dynamic threshold based on the standard deviation of the test and a tuned <\/span><a href=\"https:\/\/en.wikipedia.org\/wiki\/Receiver_operating_characteristic\"><span>receiver operating characteristic<\/span><\/a><span>. 
For example, a partner may require &lt;1 false positive per week, which sets the threshold for our tests to find as many true positives as possible whilst staying below that rate.<\/span><\/p>\n<h2><span>Inviting industry collaboration<\/span><\/h2>\n<p><span>While AI Lab is an internal-only tool at Meta, we would love to hear from members of the community who may be running similar platforms. Synthetic signal production is a boon to both developers and users. When developers can rapidly evaluate a hypothesis and users experience fewer regressions, AI innovation speeds up across the industry. We\u2019d love to collaborate with the industry to explore more ways we can improve on tools like AI Lab and optimize more metrics like TTFB.<\/span><\/p>\n<h2><span>Acknowledgements<\/span><\/h2>\n<p><span>AI Lab was made possible by the foundational work of <\/span><a href=\"https:\/\/engineering.fb.com\/2018\/10\/19\/android\/mobilelab\/\"><span>MobileLab<\/span><\/a><span>. As we aim to scale past TTFB, we look forward to tackling AI efficiency metrics too with <\/span><a href=\"https:\/\/www.usenix.org\/conference\/osdi24\/presentation\/chow\"><span>ServiceLab<\/span><\/a><span>. We\u2019d like to thank members of the AI Training Orchestration team for helping us build AI Lab and all of our users for leveraging the product to keep improving TTFB.<\/span><\/p>\n<p>The post <a href=\"https:\/\/engineering.fb.com\/2024\/07\/16\/developer-tools\/ai-lab-secrets-machine-learning-engineers-moving-fast\/\">AI Lab: The secrets to keeping machine learning engineers moving fast<\/a> appeared first on <a href=\"https:\/\/engineering.fb.com\/\">Engineering at Meta<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>The key to developer velocity across AI lies in minimizing time to first batch (TTFB) for machine learning (ML) engineers. AI Lab is a pre-production framework used internally at Meta. 
It allows us to continuously A\/B test common ML workflows \u2013 enabling proactive improvements and automatically preventing regressions on TTFB.\u00a0 AI Lab prevents TTFB regressions&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2024\/07\/16\/ai-lab-the-secrets-to-keeping-machine-learning-engineers-moving-fast\/\">Continue reading <span class=\"screen-reader-text\">AI Lab: The secrets to keeping machine learning engineers moving fast<\/span><\/a><\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-897","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":814,"url":"https:\/\/fde.cat\/index.php\/2024\/01\/18\/lazy-is-the-new-fast-how-lazy-imports-and-cinder-accelerate-machine-learning-at-meta\/","url_meta":{"origin":897,"position":0},"title":"Lazy is the new fast: How Lazy Imports and Cinder accelerate machine learning at Meta","date":"January 18, 2024","format":false,"excerpt":"At Meta, the quest for faster model training has yielded an exciting milestone: the adoption of Lazy Imports and the Python Cinder runtime. The outcome? Up to 40 percent time to first batch (TTFB) improvements, along with a 20 percent reduction in Jupyter kernel startup times. 
This advancement facilitates swifter\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":271,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/faster-more-efficient-systems-for-finding-and-fixing-regressions\/","url_meta":{"origin":897,"position":1},"title":"Faster, more efficient systems for finding and fixing regressions","date":"August 31, 2021","format":false,"excerpt":"Every workday, Facebook engineers commit thousands of diffs (which is a change consisting of one or more files) into production. This code velocity allows us to rapidly ship new features, deliver bug fixes and optimizations, and run experiments. However, a natural downside to moving quickly in any industry is the\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":893,"url":"https:\/\/fde.cat\/index.php\/2024\/07\/10\/metas-approach-to-machine-learning-prediction-robustness\/","url_meta":{"origin":897,"position":2},"title":"Meta\u2019s approach to machine learning prediction robustness","date":"July 10, 2024","format":false,"excerpt":"Meta\u2019s advertising business leverages large-scale machine learning (ML) recommendation models that power millions of ads recommendations per second across Meta\u2019s family of apps. Maintaining reliability of these ML systems helps ensure the highest level of service and uninterrupted benefit delivery to our users and advertisers. 
To minimize disruptions and ensure\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":818,"url":"https:\/\/fde.cat\/index.php\/2024\/01\/29\/improving-machine-learning-iteration-speed-with-faster-application-build-and-packaging\/","url_meta":{"origin":897,"position":3},"title":"Improving machine learning iteration speed with faster application build and packaging","date":"January 29, 2024","format":false,"excerpt":"Slow build times and inefficiencies in packaging and distributing execution files were costing our ML\/AI engineers a significant amount of time while working on our training stack. By addressing these issues head-on, we were able to reduce this overhead by double-digit percentages.\u00a0 In the fast-paced world of AI\/ML development, it\u2019s\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":618,"url":"https:\/\/fde.cat\/index.php\/2022\/08\/10\/scaling-data-ingestion-for-machine-learning-training-at-meta\/","url_meta":{"origin":897,"position":4},"title":"Scaling data ingestion for machine learning training at Meta","date":"August 10, 2022","format":false,"excerpt":"Many of Meta\u2019s products, such as search and language translations, utilize AI models to continuously improve user experiences. As the performance of hardware we use to support training infrastructure increases, we need to scale our data ingestion infrastructure accordingly to handle workloads more efficiently. 
GPUs, which are used for training\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":842,"url":"https:\/\/fde.cat\/index.php\/2024\/03\/20\/optimizing-rtc-bandwidth-estimation-with-machine-learning\/","url_meta":{"origin":897,"position":5},"title":"Optimizing RTC bandwidth estimation with machine learning","date":"March 20, 2024","format":false,"excerpt":"Bandwidth estimation (BWE) and congestion control play an important role in delivering high-quality real-time communication (RTC) across Meta\u2019s family of apps. We\u2019ve adopted a machine learning (ML)-based approach that allows us to solve networking problems holistically across cross-layers such as BWE, network resiliency, and transport. We\u2019re sharing our experiment results\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/897","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=897"}],"version-history":[{"count":0,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/897\/revisions"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=897"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=897"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=897"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}