{"id":886,"date":"2024-06-24T16:00:57","date_gmt":"2024-06-24T16:00:57","guid":{"rendered":"https:\/\/fde.cat\/index.php\/2024\/06\/24\/leveraging-ai-for-efficient-incident-response\/"},"modified":"2024-06-24T16:00:57","modified_gmt":"2024-06-24T16:00:57","slug":"leveraging-ai-for-efficient-incident-response","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2024\/06\/24\/leveraging-ai-for-efficient-incident-response\/","title":{"rendered":"Leveraging AI for efficient incident response"},"content":{"rendered":"<p><span>We\u2019re sharing how we streamline system reliability investigations using a new AI-assisted root cause analysis system.<\/span><br \/>\n<span>The system uses a combination of heuristic-based retrieval and large language model-based ranking to speed up root cause identification during investigations.<\/span><br \/>\n<span>Our testing has shown this new system achieves <\/span><span>42%<\/span><span> accuracy in identifying root causes, at the time an investigation is created, for issues related to our web monorepo.<\/span><\/p>\n<p><span>Investigation is a critical part of ensuring system reliability, and a prerequisite to mitigating issues quickly. This is why Meta is investing in advancing our suite of investigation tools, such as<\/span><a href=\"https:\/\/engineering.fb.com\/2023\/12\/19\/data-infrastructure\/hawkeye-ai-debugging-meta\/\"> <span>Hawkeye<\/span><\/a><span>, which we use internally for debugging end-to-end machine learning workflows.<\/span><\/p>\n<p><span>Now, we\u2019re leveraging AI to advance our investigation tools even further. We\u2019ve streamlined our investigations through a combination of heuristic-based retrieval and large language model (LLM)-based ranking to provide AI-assisted root cause analysis. 
During backtesting, this system has achieved promising results: 42% accuracy in identifying root causes, at the time an investigation is created, for issues related to our web monorepo.<\/span><\/p>\n\n<h2><span>Investigations at Meta<\/span><\/h2>\n<p><span>Every investigation is unique, but identifying the root cause of an issue is necessary to mitigate it properly. Investigating issues in systems dependent on <\/span><a href=\"https:\/\/en.wikipedia.org\/wiki\/Monorepo\"><span>monolithic repositories<\/span><\/a><span> can present scalability challenges due to the accumulating number of changes involved across many teams. In addition, responders need to build context on the investigation before they can start working on it, e.g., what is broken, which systems are involved, and who might be impacted.<\/span><\/p>\n<p><span>These challenges can make investigating anomalies a complex and time-consuming process. AI offers an opportunity to streamline the process, reducing the time needed and helping responders make better decisions. We focused on building a system capable of identifying potential code changes that might be the root cause of a given investigation.<\/span><\/p>\n<p>Figure 1: A responder\u2019s view of an investigation journey.<\/p>\n<h2><span>Our approach to root cause isolation<\/span><\/h2>\n<p><span>The system incorporates a novel heuristics-based retriever capable of reducing the search space from thousands of changes to a few hundred, without a significant reduction in accuracy, using signals such as code and directory ownership or the runtime code graph of impacted systems. 
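To make the ownership heuristic concrete, here is a minimal sketch; the function and data names (retrieve_candidates, dir_owners, system_owners) and the data shapes are hypothetical illustrations, not the actual implementation:

```python
# Hypothetical sketch of an ownership-based retrieval heuristic: keep only
# the changes whose touched directories share an owning team with one of
# the impacted systems. All names and data shapes here are illustrative.

def retrieve_candidates(changes, impacted_systems, dir_owners, system_owners):
    # Teams that own any of the impacted systems.
    relevant_teams = {system_owners[s] for s in impacted_systems if s in system_owners}
    candidates = []
    for change in changes:
        touched_teams = {dir_owners[d] for d in change["directories"] if d in dir_owners}
        if touched_teams & relevant_teams:
            candidates.append(change)
    return candidates

changes = [
    {"id": 1, "directories": ["www/feed"]},
    {"id": 2, "directories": ["www/ads"]},
    {"id": 3, "directories": ["infra/build"]},
]
dir_owners = {"www/feed": "feed-team", "www/ads": "ads-team", "infra/build": "infra-team"}
system_owners = {"feed-ranking": "feed-team"}

print([c["id"] for c in retrieve_candidates(changes, ["feed-ranking"], dir_owners, system_owners)])
# [1]
```

In practice such a filter would be one of several signals combined with, for example, the runtime code graph of the impacted systems.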
Once we have reduced the search space to a few hundred changes relevant to the ongoing investigation, we rely on an LLM-based ranker system to identify the root cause across these changes.<\/span><\/p>\n<p>Figure 2: The system flow for our AI-assisted root cause analysis system.<\/p>\n<p><span>The ranker system uses a <\/span><a href=\"https:\/\/llama.meta.com\/\"><span>Llama<\/span><\/a><span> model to further reduce the search space from hundreds of potential code changes to a list of the top five. We explored different ranking algorithms and prompting scenarios and found that ranking through election was most effective at accommodating context-window limitations and enabling the model to reason across different changes. To rank the changes, we structure prompts to contain a maximum of 20 changes at a time, asking the LLM to identify the top five. The outputs across these LLM requests are aggregated, and the process is repeated until only five candidates are left. In exhaustive backtesting with historical investigations, using only the information available at their start, 42% of these investigations had the root cause among the top five suggested code changes.<\/span><\/p>\n<p>Figure 3: Ranking possible code changes through election.<\/p>\n<h2><span>Training<\/span><\/h2>\n<p><span>The biggest lever to achieving <\/span><span>42%<\/span><span> accuracy was fine-tuning a Llama 2 (7B) model using historical investigations for which we knew the underlying root cause. We started by running continued pre-training (CPT) using limited and approved internal wikis, Q&amp;As, and code to expose the model to Meta artifacts. 
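The election-style ranking described in the previous section can be sketched as follows; this is a hedged illustration in which rank_batch stands in for the real LLM prompt, with a toy relevance score replacing the model's judgment:

```python
# Sketch of election-style ranking: changes are ranked in batches of at
# most 20, each batch's top five survive, and rounds repeat until only
# five candidates remain. All names here are illustrative.

BATCH_SIZE = 20  # maximum number of changes per prompt
TOP_K = 5        # winners kept from each batch

def rank_batch(batch):
    # Stand-in for an LLM request asking for the top five of <= 20 changes;
    # a toy "score" field replaces the model's judgment.
    return sorted(batch, key=lambda c: c["score"], reverse=True)[:TOP_K]

def rank_by_election(changes):
    candidates = list(changes)
    while len(candidates) > TOP_K:
        winners = []
        for i in range(0, len(candidates), BATCH_SIZE):
            winners.extend(rank_batch(candidates[i:i + BATCH_SIZE]))
        candidates = winners  # aggregate winners and run another round
    return candidates

changes = [{"id": i, "score": i} for i in range(200)]
top5 = rank_by_election(changes)
print([c["id"] for c in top5])
# [199, 198, 197, 196, 195]
```

With a real model the aggregation step would also need to deduplicate and validate the LLM's answers, since its output is not guaranteed to be a clean subset of the prompt.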
Later, we ran a supervised fine-tuning (SFT) phase where we mixed <\/span><span>Llama<\/span><span> 2\u2019s original SFT data with more internal context and a dedicated investigation root cause analysis (RCA) SFT dataset to teach the model to follow RCA instructions.<\/span><\/p>\n<p>Figure 4: The Llama 2 (7B) root cause analysis training process.<\/p>\n<p><span>Our RCA SFT dataset consists of ~5,000 instruction-tuning examples, each containing details of 2-20 changes from our retriever, including the known root cause, along with the information known about the investigation at its start, e.g., its title and observed impact. Naturally, the available information density is low at this point; however, training under these conditions helps the model perform better in the similar real-world scenario where limited information is available at the beginning of an investigation.<\/span><\/p>\n<p><span>Using the same fine-tuning data format for each possible culprit then allows us to gather the model\u2019s log probabilities (logprobs) and rank our search space by relevance to a given investigation. We then curated a set of similar fine-tuning examples where we expect the model to yield a list of the potential code changes most likely responsible for the issue, ordered by their logprob-ranked relevance, with the expected root cause first. Appending this new dataset to the original RCA SFT dataset and re-running SFT gives the model the ability to respond appropriately to prompts asking for ranked lists of changes relevant to an investigation.<\/span><\/p>\n<p>Figure 5: The process for generating fine-tuning prompts to enable the LLM to produce ranked lists.<\/p>\n<h2><span>The future of AI-assisted investigations<\/span><\/h2>\n<p><span>The application of AI in this context presents both opportunities and risks. For instance, it can significantly reduce the effort and time needed to root-cause an investigation, but it can also suggest wrong root causes and mislead engineers. 
To mitigate this, we ensure that all employee-facing features prioritize closed feedback loops and explainability of results. This strategy ensures that responders can independently reproduce the results generated by our systems and validate them. We also rely on confidence measurement methodologies to detect low-confidence answers and avoid recommending them to users \u2013 sacrificing reach in favor of precision.<\/span><\/p>\n<p><span>By integrating AI-based systems into our internal tools, we\u2019ve successfully leveraged them for tasks like onboarding engineers to investigations and root cause isolation. Looking ahead, we envision expanding the capabilities of these systems to autonomously execute full workflows and validate their own results. Additionally, we anticipate that we can further streamline the development process by utilizing AI to detect potential incidents prior to code push, thereby proactively mitigating risks before they arise.<\/span><\/p>\n<h2>Acknowledgements<\/h2>\n<p><span>We wish to thank contributors to this effort across many teams throughout Meta, particularly Alexandra Antiochou, Beliz Gokkaya, Julian Smida, Keito Uchiyama, Shubham Somani; and our leadership: Alexey Subach, Ahmad Mamdouh Abou, Shahin Sefati, Shah Rahman, Sharon Zeng, and Zach Rait. <\/span><\/p>\n<p>The post <a href=\"https:\/\/engineering.fb.com\/2024\/06\/24\/data-infrastructure\/leveraging-ai-for-efficient-incident-response\/\">Leveraging AI for efficient incident response<\/a> appeared first on <a href=\"https:\/\/engineering.fb.com\/\">Engineering at Meta<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>We\u2019re sharing how we streamline system reliability investigations using a new AI-assisted root cause analysis system. The system uses a combination of heuristic-based retrieval and large language model-based ranking to speed up root cause identification during investigations. 
Our testing has shown this new system achieves 42% accuracy in identifying root causes for investigations at their&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2024\/06\/24\/leveraging-ai-for-efficient-incident-response\/\">Continue reading <span class=\"screen-reader-text\">Leveraging AI for efficient incident response<\/span><\/a><\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-886","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":850,"url":"https:\/\/fde.cat\/index.php\/2024\/04\/09\/enhancing-aiops-efficiency-salesforces-new-similarity-model-overcomes-4-major-incident-management-challenges\/","url_meta":{"origin":886,"position":0},"title":"Enhancing AIOps Efficiency: Salesforce\u2019s New Similarity Model Overcomes 4 Major Incident Management Challenges","date":"April 9, 2024","format":false,"excerpt":"Optimizing the management of alerts from monitoring tools is crucial for efficient operations. However, it can be challenging due to the lack of confirmation on whether subsequent alerts indicate the same underlying problem. 
This leads to a repetitive and time-consuming process for an organization\u2019s operations team \u2014 including site reliability\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":624,"url":"https:\/\/fde.cat\/index.php\/2022\/08\/29\/improving-metas-slo-workflows-with-data-annotations\/","url_meta":{"origin":886,"position":1},"title":"Improving Meta\u2019s SLO workflows with data annotations","date":"August 29, 2022","format":false,"excerpt":"When we focus on minimizing errors and downtime here at Meta, we place a lot of attention on service-level indicators (SLIs) and service-level objectives (SLOs). Consider Instagram, for example. There, SLIs represent metrics from different product surfaces, like the volume of error response codes to certain endpoints, or the number\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":806,"url":"https:\/\/fde.cat\/index.php\/2023\/12\/19\/ai-debugging-at-meta-with-hawkeye\/","url_meta":{"origin":886,"position":2},"title":"AI debugging at Meta with HawkEye","date":"December 19, 2023","format":false,"excerpt":"HawkEye is the powerful toolkit used internally at Meta for monitoring, observability, and debuggability of the end-to-end machine learning (ML) workflow that powers ML-based products. HawkEye supports recommendation and ranking models across several products at Meta. 
Over the past two years, it has facilitated order of magnitude improvements in the\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":759,"url":"https:\/\/fde.cat\/index.php\/2023\/09\/07\/arcadia-an-end-to-end-ai-system-performance-simulator\/","url_meta":{"origin":886,"position":3},"title":"Arcadia: An end-to-end AI system performance simulator","date":"September 7, 2023","format":false,"excerpt":"We\u2019re introducing Arcadia, Meta\u2019s unified system that simulates the compute, memory, and network performance of AI training clusters. Extracting maximum performance from an AI cluster and increasing overall efficiency warrants a multi-input system that accounts for various hardware and software parameters across compute, storage, and network collectively. Arcadia gives Meta\u2019s\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":172,"url":"https:\/\/fde.cat\/index.php\/2020\/12\/09\/how-facebook-keeps-its-large-scale-infrastructure-hardware-up-and-running\/","url_meta":{"origin":886,"position":4},"title":"How Facebook keeps its large-scale infrastructure hardware up and running","date":"December 9, 2020","format":false,"excerpt":"Facebook\u2019s services rely on fleets of servers in data centers all over the globe \u2014 all running applications and delivering the performance our services need. 
This is why we need to make sure our server hardware is reliable and that we can manage server hardware failures at our scale with\u2026","rel":"","context":"In &quot;External&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":541,"url":"https:\/\/fde.cat\/index.php\/2022\/02\/14\/sre-weekly-issue-309\/","url_meta":{"origin":886,"position":5},"title":"SRE Weekly Issue #309","date":"February 14, 2022","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, Rootly: Manage incidents directly from Slack with Rootly \ud83d\ude92. Automate manual admin tasks like creating incident channel, Jira and Zoom, paging the right team, postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly shirt): https:\/\/rootly.com\/demo\/?utm_source=sreweekly Articles\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/886","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=886"}],"version-history":[{"count":0,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/886\/revisions"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=886"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=886"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=886"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}