{"id":850,"date":"2024-04-09T17:18:37","date_gmt":"2024-04-09T17:18:37","guid":{"rendered":"https:\/\/fde.cat\/index.php\/2024\/04\/09\/enhancing-aiops-efficiency-salesforces-new-similarity-model-overcomes-4-major-incident-management-challenges\/"},"modified":"2024-04-09T17:18:37","modified_gmt":"2024-04-09T17:18:37","slug":"enhancing-aiops-efficiency-salesforces-new-similarity-model-overcomes-4-major-incident-management-challenges","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2024\/04\/09\/enhancing-aiops-efficiency-salesforces-new-similarity-model-overcomes-4-major-incident-management-challenges\/","title":{"rendered":"Enhancing AIOps Efficiency: Salesforce\u2019s New Similarity Model Overcomes 4 Major Incident Management Challenges"},"content":{"rendered":"<p>Optimizing the management of alerts from monitoring tools is crucial for efficient operations. However, it can be challenging due to the lack of confirmation on whether subsequent alerts indicate the same underlying problem. This leads to a repetitive and time-consuming process for an organization\u2019s operations team \u2014 including site reliability engineers, performance engineers and others \u2014 who must manually analyze each alert, often discovering duplicated issues. To address this, organizations are prioritizing automation (66%) and enhancing productivity (61%), as revealed by a recent <a href=\"https:\/\/devops.com\/large-organizations-are-embracing-aiops\/\">survey<\/a>. These statistics emphasize the daily hurdles faced by operations teams.<\/p>\n<p>To address these challenges, organizations are increasingly adopting Artificial Intelligence for IT Operations (AIOps), which leverages AI to streamline operations and enhance network performance. 
At Salesforce, our DBAIOps (Database Artificial Intelligence for Operations) team has taken AIOps to the next level and revolutionized database operations by implementing the similarity model.<\/p>\n<p>This model uses advanced techniques such as <a href=\"https:\/\/pyshark.com\/cosine-similarity-explained-using-python\/\">cosine similarity<\/a> and <a href=\"https:\/\/pyshark.com\/jaccard-similarity-and-jaccard-distance-in-python\/\">Jaccard similarity<\/a> to measure the similarity in meaning between two pieces of text. By comparing root causes and assigning similarity scores, this model streamlines incident resolution and marks a significant transition in incident management.<\/p>\n<p>This approach helps identify commonalities among incidents, preventing alert overload and facilitating more effective resolution processes. This ultimately improves operational efficiency and reduces the manual workload for the operations team.<\/p>\n<p>Read on to discover how the similarity model helped DBAIOps overcome its four toughest technical challenges.<\/p>\n<p><strong>Challenge #1: Reducing Alerts and Manual Effort<\/strong><\/p>\n<p>Anomaly detection systems traditionally generate an alert for each abnormal pattern detected, without considering whether subsequent anomalies share the same root cause as the initial anomaly.<\/p>\n<p>DBAIOps faced a daily influx of alerts across multiple <a href=\"https:\/\/help.salesforce.com\/s\/articleView?id=000384755&amp;type=1\">instances<\/a>, often duplicating issues and requiring manual analysis. Identical performance problems in different instances, like SQL-related issues, led to redundant alerts and manual verification by each performance engineer.<\/p>\n<p>To address this, DBAIOps\u2019 similarity model compares the root causes of alerts. By analyzing the Root Cause Analysis (RCA) of the current and previous alerts, the model determines whether alerts are duplicates, effectively suppressing subsequent cases. 
<strong>Validations showed a 23% reduction in duplicate cases<\/strong>, identifying investigations with shared causes and intelligently ignoring them.<\/p>\n<p>This approach enhances incident management efficiency, reduces manual labor, and minimizes noise, allowing operational teams to focus on resolving actual issues.<\/p>\n<p><strong>Challenge #2: Using Historical Context to Solve New Cases<\/strong><\/p>\n<p>In scenarios with multiple alerts from different sources, it is crucial to determine whether they are related to the same issue. Traditional approaches lack the capability to do this, leading to duplicated efforts and decreased productivity. Each alert is analyzed individually, without knowledge of how it relates to the others.<\/p>\n<p>To deal with this, DBAIOps\u2019 similarity model automatically tags current investigations with relevant past resolutions if a similar issue has occurred before. This Salesforce technology enables the team to track large volumes of historical information, share knowledge, access past resolutions quickly, and improve the incident resolution process. Approximately <strong>50% of proactive investigations were matched to past similar cases<\/strong> through this efficient tagging, streamlining incident resolution.<\/p>\n<p><strong>Challenge #3: Efficient Assignment Triage<\/strong><\/p>\n<p>Inefficient assignment of engineers with the necessary expertise can cause delays in issue resolution. Previously, investigations were typically assigned to the instance owner by default, with potential reassignment to another engineer based on their availability. However, this approach may overlook important factors like past experience with similar issues.<\/p>\n<p>To tackle this, DBAIOps\u2019 similarity model analyzes historical data and incident patterns to intelligently assign new cases to experts who possess the specific expertise required. 
This automated triaging process ensures that the right engineer is assigned to each task, leading to faster issue resolution and improved overall productivity. The positive feedback received from the Performance Engineering team further validates the efficacy of our model in accurately triaging cases based on tagged instances, while also reducing the Mean Time To Assign (MTTA).<\/p>\n<p><strong>Challenge #4: Increasing Severity Ranking<\/strong><\/p>\n<p>Frequent alerts on the same issue can indicate a potential customer incident waiting to happen. By default, proactive alerts are often assigned lower severity levels. However, this approach may not effectively handle recurring incidents.<\/p>\n<p>To resolve this, DBAIOps\u2019 similarity model intelligently ranks incident severity by detecting patterns in incidents with the same RCA. For example, by identifying frequently occurring alerts, the severity ranking can be automatically increased.<\/p>\n<p>The instant update in severity ranking is crucial for efficiently identifying and prioritizing critical incidents, leading to a more efficient resolution process. Our implementation of this model has resulted in a <strong>significant 23% improvement in incident severity ranking<\/strong>, enabling quicker actions when incidents occur repeatedly. 
This means that if DBAIOps has 100 investigations in a month and 23 of them experience frequent alerts until the main problem is resolved, the similarity model recognizes these patterns and recommends increasing the severity of such incidents.<\/p>\n<p>By proactively addressing these high severity alerts, we can minimize the Mean Time To Detect (MTTD) and Mean Time To Resolve (MTTR), thus improving service reliability and availability.<\/p>\n<p><strong>Diving deeper: Understanding how similarity scores are calculated<\/strong><\/p>\n<div class=\"wp-block-group is-layout-constrained wp-container-1 wp-block-group-is-layout-constrained\">\n<p>Understanding how DBAIOps calculates similarity scores is crucial for efficient incident resolution. Here\u2019s a breakdown of the steps involved:<\/p>\n<p><strong>New alert identified<\/strong>: Our detection runbook systematically gathers data from monitoring tools. If abnormalities are detected during the analysis time range, an alert is triggered. Once an alert is detected, DBAIOps triggers the RCA workflow. This workflow identifies an initial diagnosis, such as determining the type of SQL contributing to the alert or which org is contributing to the issue. This alert marks the start of our investigation and the incident resolution process.<\/p>\n<p><strong>Data cleansing<\/strong>: The RCA text undergoes a cleansing process to refine it. This includes removing special characters and stopwords to streamline the analysis. Keyword extraction is also performed to enhance the computation of similarity scores.<\/p>\n<p><strong>Alert comparison<\/strong>: The alert is compared with data stored in the Knowledge Repository, a comprehensive database capturing detailed information about alerts, RCAs, and historical insights. The RCA workflow triggers when an alert is detected, updating the Knowledge Repository with the latest data for accurate comparison. 
A similarity model generates meaningful scores for efficient alert comparison.<\/p>\n<p><strong>Score generation<\/strong>: The purpose-built similarity model calculates scores that guide subsequent actions when comparing RCAs.<\/p>\n<\/div>\n<h4 class=\"wp-block-heading\"><strong>Learn More<\/strong><\/h4>\n<p>Hungry for more AIOps stories? Check out how AIOps slashes thousands of manual hours annually in this <a href=\"https:\/\/engineering.salesforce.com\/aiops-engineering-secrets-revealed-how-ai-and-automation-slash-thousands-of-manual-hours-annually\/\">blog<\/a>.<\/p>\n<p>Stay connected \u2014 join our <a href=\"https:\/\/flows.beamery.com\/salesforce\/eng-social-2023\">Talent Community<\/a>!<\/p>\n<p>Check out our <a href=\"https:\/\/www.salesforce.com\/company\/careers\/teams\/tech-and-product\/?d=cta-tms-tp-2\">Technology and Product<\/a> teams to learn how you can get involved.<\/p>\n<p>The post <a href=\"https:\/\/engineering.salesforce.com\/enhancing-aiops-efficiency-salesforces-new-similarity-model-overcomes-4-major-incident-management-challenges\/\">Enhancing AIOps Efficiency: Salesforce\u2019s New Similarity Model Overcomes 4 Major Incident Management Challenges<\/a> appeared first on <a href=\"https:\/\/engineering.salesforce.com\/\">Salesforce Engineering Blog<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Optimizing the management of alerts from monitoring tools is crucial for efficient operations. However, it can be challenging due to the lack of confirmation on whether subsequent alerts indicate the same underlying problem. 
This leads to a repetitive and time-consuming process for an organization\u2019s operations team \u2014 including site reliability engineers, performance engineers and others&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2024\/04\/09\/enhancing-aiops-efficiency-salesforces-new-similarity-model-overcomes-4-major-incident-management-challenges\/\">Continue reading <span class=\"screen-reader-text\">Enhancing AIOps Efficiency: Salesforce\u2019s New Similarity Model Overcomes 4 Major Incident Management Challenges<\/span><\/a><\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-850","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":745,"url":"https:\/\/fde.cat\/index.php\/2023\/08\/08\/what-is-slacks-secret-for-enhancing-accessibility-to-empower-people-with-disabilities\/","url_meta":{"origin":850,"position":0},"title":"What is Slack\u2019s Secret for Enhancing Accessibility to Empower People with Disabilities?","date":"August 8, 2023","format":false,"excerpt":"By Sommer Panage and Scott Nyberg. In our \u201cEngineering Energizers\u201d Q&A series, we examine the professional life experiences that have shaped Salesforce Engineering leaders. Meet Sommer Panage, a Senior Manager of Software Engineering for Slack at Salesforce, where she focuses on accessibility initiatives. 
Sommer and her team maximize the accessibility\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":792,"url":"https:\/\/fde.cat\/index.php\/2023\/08\/08\/what-is-slacks-secret-for-enhancing-accessibility-to-empower-people-with-disabilities-2\/","url_meta":{"origin":850,"position":1},"title":"What is Slack\u2019s Secret for Enhancing Accessibility to Empower People with Disabilities?","date":"August 8, 2023","format":false,"excerpt":"By Sommer Panage and Scott Nyberg. In our \u201cEngineering Energizers\u201d Q&A series, we examine the professional life experiences that have shaped Salesforce Engineering leaders. Meet Sommer Panage, a Senior Manager of Software Engineering for Slack at Salesforce, where she focuses on accessibility initiatives. Sommer and her team maximize the accessibility\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":798,"url":"https:\/\/fde.cat\/index.php\/2023\/12\/04\/sre-weekly-issue-401\/","url_meta":{"origin":850,"position":2},"title":"SRE Weekly Issue #401","date":"December 4, 2023","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, FireHydrant: Join FireHydrant Dec.14 for a conversation about on-call culture and its effect on engineering organizations, featuring special guests from Outreach and Udemy. Gain a better understanding of what makes excellent on-call culture and how to implement practices to improve yours. 
https:\/\/app.livestorm.co\/firehydrant\/better-incidents-winter-bonfire-inside-on-call?type=detailed\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":848,"url":"https:\/\/fde.cat\/index.php\/2024\/04\/01\/unveiling-the-cutting-edge-features-of-ml-console-for-ai-model-lifecycle-management\/","url_meta":{"origin":850,"position":3},"title":"Unveiling the Cutting-Edge Features of ML Console for AI Model Lifecycle Management","date":"April 1, 2024","format":false,"excerpt":"In our \u201cEngineering Energizers\u201d Q&A series, we explore the journeys of engineering leaders who have made remarkable contributions in their fields. Today, we meet Venkat Krishnamani, a Lead Member of the Technical Staff for Salesforce Engineering and the lead engineer for Salesforce Einstein\u2019s Machine Learning (ML) Console. This vital tool\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":892,"url":"https:\/\/fde.cat\/index.php\/2024\/07\/08\/unlocking-data-clouds-secret-for-scaling-massive-data-volumes-and-slashing-processing-bottlenecks\/","url_meta":{"origin":850,"position":4},"title":"Unlocking Data Cloud\u2019s Secret for Scaling Massive Data Volumes and Slashing Processing Bottlenecks","date":"July 8, 2024","format":false,"excerpt":"In our Engineering Energizers Q&A series, we explore engineers who have pioneered advancements in their fields. Today, we meet Rahul Singh, Vice President of Software Engineering, leading the India-based Data Cloud team. 
His team is focused on delivering a robust, scalable, and efficient Data Cloud platform that consolidates customer data\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":762,"url":"https:\/\/fde.cat\/index.php\/2023\/09\/12\/slack-behind-the-scenes-overcoming-key-challenges-to-craft-a-seamless-mobile-app\/","url_meta":{"origin":850,"position":5},"title":"Slack Behind the Scenes: Overcoming Key Challenges to Craft a Seamless Mobile App","date":"September 12, 2023","format":false,"excerpt":"By Tracy Stampfli and Scott Nyberg In our \u201cEngineering Energizers\u201d Q&A series, we examine the professional life experiences that have shaped Salesforce Engineering leaders. Meet Tracy Stampfli, a Principal Software Engineer for Slack at Salesforce. Tracy works behind the scenes on Slack\u2019s mobile infrastructure team \u2014 an elite group of\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/850","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=850"}],"version-history":[{"count":0,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/850\/revisions"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=850"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=850"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=850"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}