{"id":893,"date":"2024-07-10T13:00:37","date_gmt":"2024-07-10T13:00:37","guid":{"rendered":"https:\/\/fde.cat\/index.php\/2024\/07\/10\/metas-approach-to-machine-learning-prediction-robustness\/"},"modified":"2024-07-10T13:00:37","modified_gmt":"2024-07-10T13:00:37","slug":"metas-approach-to-machine-learning-prediction-robustness","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2024\/07\/10\/metas-approach-to-machine-learning-prediction-robustness\/","title":{"rendered":"Meta\u2019s approach to machine learning prediction robustness"},"content":{"rendered":"<p><span>Meta\u2019s advertising business leverages large-scale machine learning (ML) recommendation models that power millions of ads recommendations per second across Meta\u2019s family of apps. Maintaining reliability of these ML systems helps <\/span><span>ensure the highest<\/span><span> level of service and uninterrupted benefit delivery to our users and advertisers. To minimize disruptions and ensure our ML systems are intrinsically resilient, we have built a comprehensive set of <\/span>prediction robustness<span> solutions that ensure stability without compromising performance or availability of our ML systems.\u00a0<\/span><\/p>\n<h2><span>Why is machine learning robustness difficult?<\/span><\/h2>\n<p><span>Solving for ML prediction stability has many unique characteristics, making it more complex than addressing stability challenges for traditional online services:\u00a0<\/span><\/p>\n<p>ML models are stochastic by nature.<span> Prediction uncertainty is inherent, which makes it difficult to define, identify, diagnose, reproduce, and debug prediction quality issues.\u00a0<\/span><span><br \/>\n<\/span><br \/>\nConstant and frequent refreshing of models and features.<span> ML models and features are continuously updated to learn from and reflect people\u2019s interests, which makes it challenging to locate prediction quality issues, contain their impact, and quickly resolve 
them.<\/span><span><br \/>\n<\/span><br \/>\nBlurred line between reliability and performance.<span> In traditional online services, reliability issues are easier to detect based on service metrics such as latency and availability. However, ML prediction stability implies a consistent prediction quality shift, which is harder to distinguish. For example, an \u201cavailable\u201d ML recommender system that reliably produces inaccurate predictions is actually \u201cunreliable.\u201d<\/span><span><br \/>\n<\/span><br \/>\nCumulative effect of small distribution shifts over time.<span> Due to the stochastic nature of ML models, small regressions in prediction quality are hard to distinguish from anticipated organic traffic-pattern changes. However, if undetected, such small prediction regressions could have a significant cumulative negative impact over time.\u00a0<\/span><br \/>\nLong chain of complex interactions.<span> The final ML prediction result is derived from a complex chain of processing and propagation across multiple ML systems. A regression in prediction quality could be traced back several hops upstream in the chain, making it hard to diagnose issues and attribute stability improvements to a specific ML system.\u00a0<\/span><br \/>\nSmall fluctuations can amplify to become big impacts.<span> Even small changes in the input data (e.g., features, training data, and model hyperparameters) can have a significant and unpredictable impact on the final predictions. 
This poses a major challenge in containing prediction quality issues at particular ML artifacts (model, feature, label), and it requires end-to-end global protection.\u00a0<\/span><br \/>\nRising complexity with rapid modeling innovations.<span> Meta\u2019s ML technologies are<\/span><a href=\"https:\/\/ai.meta.com\/blog\/meta-llama-3\/\"><span> evolving rapidly<\/span><\/a><span>, with increasingly larger and more complex models and <\/span><a href=\"https:\/\/engineering.fb.com\/2024\/03\/12\/data-center-engineering\/building-metas-genai-infrastructure\/\"><span>new architectures<\/span><\/a><span>. This requires prediction robustness solutions to evolve at the same fast pace.\u00a0<\/span><\/p>\n<h2><span>Meta\u2019s approach and progress towards prediction robustness<\/span><\/h2>\n<p><span>Meta has developed a systematic framework to build prediction robustness. This framework includes a set of <\/span>prevention guardrails<span> to build control from outside-in, <\/span>fundamental understanding<span> of the issues to gain ML insights, and a set of technical fortifications to establish <\/span>intrinsic robustness<span>.\u00a0<\/span><\/p>\n<p><span>These three approaches are exercised across models, features, training data, calibration, and interpretability to ensure all possible issues are covered throughout the ML ecosystem. 
<\/span><span>With prediction robustness, Meta\u2019s ML systems are robust by design, and any stability issues are actively monitored and resolved to ensure smooth ads delivery for our users and advertisers.\u00a0<\/span><\/p>\n<p>Figure 1: A simplified view of Meta\u2019s ads recommendation system shows the flow of complex interactions for producing the final predictions.<\/p>\n<p><span>Our prediction robustness solution systematically covers all areas of the recommender system \u2013 training data, features, models, calibration, and interpretability.\u00a0<\/span><\/p>\n<h3><span>Model robustness<\/span><\/h3>\n<p><span>Model robustness challenges include model snapshot quality, model snapshot freshness, and inference availability. We use Snapshot Validator, an internal-only, real-time, scalable, and low-latency model evaluation system, as the <\/span>prevention guardrail<span> on the quality of every single model snapshot before it ever serves production traffic.\u00a0<\/span><\/p>\n<p><span>Snapshot Validator runs evaluations with holdout datasets on newly published model snapshots in real time, and it determines whether each new snapshot can serve production traffic. Snapshot Validator has reduced model snapshot corruption by 74% in the past two years and has protected &gt;90% of Meta ads ranking models in production without prolonging Meta\u2019s real-time model refresh.\u00a0<\/span><\/p>\n<p><span>In addition, Meta engineers built new ML techniques to improve the<\/span> <span>intrinsic robustness of models, such as pruning less-useful modules inside models, better model generalization against overfitting, more effective quantization algorithms, and ensuring that model performance stays resilient even when a small amount of the input data is anomalous. 
Together, these techniques have improved ads ML model stability, making the models resilient against overfitting, loss divergence, and more.\u00a0<\/span><\/p>\n<h3><span>Feature robustness<\/span><\/h3>\n<p><span>Feature robustness focuses on guaranteeing the quality of ML features across coverage, data distribution, freshness, and training-inference consistency. As prevention guardrails, robust feature monitoring systems run in production to continuously detect anomalies in ML features. Because ML-feature-value distributions can shift widely, with non-deterministic effects on model performance, the anomaly detection systems have been tuned to accommodate the particular traffic and ML prediction patterns for accuracy.\u00a0<\/span><\/p>\n<p><span>Upon detection, automated preventive measures kick in to ensure abnormal features are not used in production. Furthermore, a real-time feature importance evaluation system has been built to provide a fundamental understanding of the correlation between feature quality and model prediction quality.\u00a0<\/span><\/p>\n<p><span>All these solutions have effectively contained ML feature issues such as coverage drops, data corruption, and inconsistency at Meta.\u00a0<\/span><\/p>\n<h3><span>Training data robustness<\/span><\/h3>\n<p><span>The wide spectrum of Meta ads products requires distinct labeling logic for model training, which significantly increases the complexity of labeling. In addition, the data sources for label calculation can be unstable due to the complicated logging infrastructure and organic traffic drifts. 
Dedicated training-data-quality systems were built as prevention guardrails that detect label drift over time with high accuracy, swiftly and automatically mitigate abnormal data changes, and prevent models from learning from the affected training data.\u00a0<\/span><\/p>\n<p><span>Additionally, a fundamental understanding of training data label consistency has resulted in optimizations in training data generation for better model learning.\u00a0<\/span><\/p>\n<h3><span>Calibration robustness<\/span><\/h3>\n<p><span>Calibration robustness builds real-time monitoring and auto-mitigation toolsets to guarantee that the final prediction is well calibrated, which is vital for advertiser experiences. The calibration mechanism is technically unique because it relies on real-time model training over unjoined data, which makes it more sensitive to traffic distribution shifts than the joined-data mechanism.\u00a0<\/span><\/p>\n<p><span>To improve the stability and accuracy of calibration, Meta has built prevention guardrails that consist of high-precision alert systems to minimize problem-detection time, as well as high-rigor, automatically orchestrated mitigations to minimize problem-mitigation time.<\/span><\/p>\n<h3><span>ML interpretability<\/span><\/h3>\n<p><span>ML interpretability focuses on identifying the root causes of all ML instability issues. <\/span><a href=\"https:\/\/engineering.fb.com\/2023\/12\/19\/data-infrastructure\/hawkeye-ai-debugging-meta\/\"><span>Hawkeye<\/span><\/a><span>, our internal AI debugging toolkit, allows engineers at Meta to root-cause tricky ML prediction problems. Hawkeye provides an end-to-end, streamlined diagnostic experience covering all ML artifacts at Meta, and it now covers &gt;80% of ads ML artifacts. 
It is now one of the most widely used tools in the Meta ML engineering community.\u00a0<\/span><\/p>\n<p><span>Beyond debugging, ML interpretability invests heavily in model internal state understanding \u2013<\/span><span> one of the most complex and technically challenging areas in the realm of ML stability. There are no standardized solutions to this challenge, but Meta uses model graph tracing, which <\/span><span>analyzes model internal states, such as activations and neuron importance, to accurately explain why models become corrupted.\u00a0<\/span><\/p>\n<p><span>Altogether, advancements in ML interpretability have reduced the time to root-cause ML prediction issues by 50%, and have significantly boosted the<\/span> <span>fundamental understanding of model behaviors.\u00a0<\/span><\/p>\n<h2><span>Improving ranking and productivity with prediction robustness<\/span><\/h2>\n<p><span>Going forward, we\u2019ll be extending our prediction robustness solutions to improve ML ranking performance and boost engineering productivity by accelerating ML development.<\/span><\/p>\n<p><span>Prediction robustness techniques can boost ML performance by making models more robust intrinsically, with more stable training, less normalized-entropy explosion or loss divergence, more resilience to data shift, and stronger generalizability. We\u2019ve seen performance gains from applying robustness techniques like gradient clipping and more robust quantization algorithms. And we will continue to identify more systematic improvement opportunities with model understanding techniques.<\/span><\/p>\n<p><span>In addition, model performance will be improved with less staleness and stronger consistency between serving and training environments across labels, features, inference platform, and more. 
We plan to continue upgrading Meta\u2019s ads ML services with stronger guarantees of training-serving consistency and more aggressive staleness SLAs.\u00a0<\/span><\/p>\n<p><span>Regarding ML development productivity, prediction robustness techniques can facilitate model development, and improve daily operations by reducing the time needed to address ML prediction stability issues. We\u2019re currently building an intelligent ML diagnostic platform that will leverage the latest ML technologies, in the context of prediction robustness, to help even engineers with little ML knowledge locate the root cause of ML stability issues within minutes.\u00a0<\/span><\/p>\n<p><span>The platform will also evaluate reliability risk continuously across the development lifecycle, minimizing delays in ML development due to reliability regressions. It will embed reliability into every ML development stage, from idea exploration all the way to online experimentation and final launches.\u00a0<\/span><\/p>\n<h2><span>Acknowledgements<\/span><\/h2>\n<p><span>We would like to thank all the team members and the leadership that contributed to make the Prediction Robustness effort successful in Meta. Special thanks to Alex Gong, Ashish Singh, Ashish Srivastava, Ben Dummitt, Booker Gong, David Serfass, David Thompson, Evan Poon, Govind Kabra, Haibo Lin, Haoyan Yuan, Igor Lytvynenko, Jie Zheng, Jin Zhu, Jing Chen, Junye Wang, Kapil Gupta, Kestutis Patiejunas, Konark Gill, Lanlan Liu, Lu Zheng, Maggie Ma, Marios Kokkodis, Namit Gupta, Ngoc Lan Nguyen, Pedro Perez de Tejada, Pratibha Udmalpet, Qiming Guo, Roopa Iyer, Rohit Iyer, Sam Elshamy, Sagar Chordia, Sheng Luo, Shuo Chang, Shupin Mao, Velavan Trichy, Weifeng Cui, Ximing Chen, Xin Zhao, Yalan Xing, Yiye Lin, Yongjun Xie, Yubin He, Yue Wang, Zewei Jiang, Santanu Kolay, Prabhakar Goyal, Neeraj Bhatia, Sandeep Pandey, Uladzimir Pashkevich, and Matt Steiner. 
<\/span><\/p>\n<p>The post <a href=\"https:\/\/engineering.fb.com\/2024\/07\/10\/data-infrastructure\/machine-learning-ml-prediction-robustness-meta\/\">Meta\u2019s approach to machine learning prediction robustness<\/a> appeared first on <a href=\"https:\/\/engineering.fb.com\/\">Engineering at Meta<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Meta\u2019s advertising business leverages large-scale machine learning (ML) recommendation models that power millions of ads recommendations per second across Meta\u2019s family of apps. Maintaining reliability of these ML systems helps ensure the highest level of service and uninterrupted benefit delivery to our users and advertisers. To minimize disruptions and ensure our ML systems are intrinsically&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2024\/07\/10\/metas-approach-to-machine-learning-prediction-robustness\/\">Continue reading <span class=\"screen-reader-text\">Meta\u2019s approach to machine learning prediction robustness<\/span><\/a><\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-893","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":806,"url":"https:\/\/fde.cat\/index.php\/2023\/12\/19\/ai-debugging-at-meta-with-hawkeye\/","url_meta":{"origin":893,"position":0},"title":"AI debugging at Meta with HawkEye","date":"December 19, 2023","format":false,"excerpt":"HawkEye is the powerful toolkit used internally at Meta for monitoring, observability, and debuggability of the end-to-end machine learning (ML) workflow that powers ML-based products. HawkEye supports recommendation and ranking models across several products at Meta. 
Over the past two years, it has facilitated order of magnitude improvements in the\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":842,"url":"https:\/\/fde.cat\/index.php\/2024\/03\/20\/optimizing-rtc-bandwidth-estimation-with-machine-learning\/","url_meta":{"origin":893,"position":1},"title":"Optimizing RTC bandwidth estimation with machine learning","date":"March 20, 2024","format":false,"excerpt":"Bandwidth estimation (BWE) and congestion control play an important role in delivering high-quality real-time communication (RTC) across Meta\u2019s family of apps. We\u2019ve adopted a machine learning (ML)-based approach that allows us to solve networking problems holistically across cross-layers such as BWE, network resiliency, and transport. We\u2019re sharing our experiment results\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":599,"url":"https:\/\/fde.cat\/index.php\/2022\/06\/14\/applying-federated-learning-to-protect-data-on-mobile-devices\/","url_meta":{"origin":893,"position":2},"title":"Applying federated learning to protect data on mobile devices","date":"June 14, 2022","format":false,"excerpt":"What the research is: Federated learning with differential privacy (FL-DP) is one of the latest privacy-enhancing technologies being evaluated at Meta as we constantly work to enhance user privacy and further safeguard users\u2019 data in the products we design, build, and maintain. 
FL-DP enhances privacy in two important ways: It\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":814,"url":"https:\/\/fde.cat\/index.php\/2024\/01\/18\/lazy-is-the-new-fast-how-lazy-imports-and-cinder-accelerate-machine-learning-at-meta\/","url_meta":{"origin":893,"position":3},"title":"Lazy is the new fast: How Lazy Imports and Cinder accelerate machine learning at Meta","date":"January 18, 2024","format":false,"excerpt":"At Meta, the quest for faster model training has yielded an exciting milestone: the adoption of Lazy Imports and the Python Cinder runtime. The outcome? Up to 40 percent time to first batch (TTFB) improvements, along with a 20 percent reduction in Jupyter kernel startup times. This advancement facilitates swifter\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":897,"url":"https:\/\/fde.cat\/index.php\/2024\/07\/16\/ai-lab-the-secrets-to-keeping-machine-learning-engineers-moving-fast\/","url_meta":{"origin":893,"position":4},"title":"AI Lab: The secrets to keeping machine learning engineers moving fast","date":"July 16, 2024","format":false,"excerpt":"The key to developer velocity across AI lies in minimizing time to first batch (TTFB) for machine learning (ML) engineers. AI Lab is a pre-production framework used internally at Meta. 
It allows us to continuously A\/B test common ML workflows \u2013 enabling proactive improvements and automatically preventing regressions on TTFB.\u00a0\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":229,"url":"https:\/\/fde.cat\/index.php\/2021\/02\/02\/ml-lake-building-salesforces-data-platform-for-machine-learning\/","url_meta":{"origin":893,"position":5},"title":"ML Lake: Building Salesforce\u2019s Data Platform for Machine Learning","date":"February 2, 2021","format":false,"excerpt":"Salesforce uses machine learning to improve every aspect of its product suite. With the help of Salesforce Einstein, companies are improving productivity and accelerating key decision-making. Data is a critical component of all machine learning applications and Salesforce is no exception. In this post I will share some unique challenges\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/893","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=893"}],"version-history":[{"count":0,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/893\/revisions"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=893"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=893"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=893"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}