{"id":263,"date":"2021-08-31T14:40:46","date_gmt":"2021-08-31T14:40:46","guid":{"rendered":"https:\/\/fde.cat\/?p=263"},"modified":"2021-08-31T14:40:46","modified_gmt":"2021-08-31T14:40:46","slug":"minesweeper-automates-root-cause-analysis-as-a-first-line-defense-against-bugs","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/minesweeper-automates-root-cause-analysis-as-a-first-line-defense-against-bugs\/","title":{"rendered":"Minesweeper automates root cause analysis as a first-line defense against bugs"},"content":{"rendered":"<p><span>Root cause analysis (RCA) is an important part of fixing any bug. After all, you can\u2019t solve a problem without getting to the heart of it. But RCA <a href=\"https:\/\/engineering.fb.com\/2019\/11\/08\/developer-tools\/fast-dimensional-analysis\/\">isn\u2019t always simple<\/a>, especially at a scale like Facebook\u2019s. When billions of people are using an app on a variety of platforms and devices, a single bug can create several different issues on its own and multiple bugs can happen simultaneously.<\/span><\/p>\n<p><span>There was a time when on-call engineers had to spend hours, or even days, manually combing through error reports, looking for patterns to help them debug. Today, they can use<\/span> <span>Minesweeper<\/span><span> \u2014 <a href=\"https:\/\/arxiv.org\/abs\/2010.09974\">a technique we\u2019ve developed<\/a> for automating RCA that identifies the causes of bugs based on their symptoms.<\/span><\/p>\n<p><span>Minesweeper-based RCA is completely automated and scalable, and it\u2019s grounded in formal statistical concepts. Our own evaluations of Minesweeper using real-world bug reports from Facebook\u2019s apps have proven that it can perform RCA for tens of thousands of reports in minutes and can identify the root cause of bugs with 85 percent accuracy.<\/span><\/p>\n<p><span>Since it was developed, Minesweeper has become Facebook\u2019s first line of defense against bugs and has helped us prevent potentially wide-scale disruptions from affecting people on our apps.<\/span><\/p>\n<h2><span>How does Minesweeper work?<\/span><\/h2>\n<p><span>Anytime someone reports a bug through the Facebook app, the <a href=\"https:\/\/engineering.fb.com\/2020\/12\/09\/data-center-engineering\/how-facebook-keeps-its-large-scale-infrastructure-hardware-up-and-running\/\">error reporting system<\/a> typically captures a chronological trace of actions (or \u201cevents\u201d) that person performed in the app prior to encountering the bug. The idea is to get a snapshot of what might have caused the error to happen, such as the example below:<\/span><\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-17181\" src=\"https:\/\/i0.wp.com\/engineering.fb.com\/wp-content\/uploads\/2021\/01\/ScalableRCA_1.png?resize=750%2C71&#038;ssl=1\" alt=\"\" width=\"750\" height=\"71\"  data-recalc-dims=\"1\"><\/p>\n<p><span>Minesweeper scans these traces to look for distinctive patterns that could point to the cause of a bug. Traces that contain the bug (the test group) are compared with traces that do not (the control group). Minesweeper finds patterns of events that are statistically distinctive to the test group as opposed to the control group. These patterns are likely to be correlated to the bug and can thereby point toward its root cause.<\/span><\/p>\n<p><span>Here\u2019s an example. Suppose (hypothetically) that 10 people are using the Facebook app. Five of these people report a problem, and there are eight possible events tracked in the app (<\/span><i><span>a<\/span><\/i><span>, <\/span><i><span>b<\/span><\/i><span>, \u2026<\/span> <span>through <\/span><i><span>h<\/span><\/i><span>). We have 10 traces in total on these events, five in the test group (<\/span><i><span>T<\/span><\/i><span>) from people who encountered the bug, and five in the control group (<\/span><i><span>C<\/span><\/i><span>) from the remaining people who did not.<\/span><\/p>\n<p><span>When Minesweeper is applied to these traces, it begins by extracting sequential patterns in <\/span><i><span>T<\/span><\/i><span> and <\/span><i><span>C<\/span><\/i><span>. A sequential pattern is simply a chronological sequence of events that happened, but not necessarily one after another (i.e., there might be other events in between that were not significant).<\/span><\/p>\n<p><span>We could visualize the data like this:<\/span><\/p>\n<p><span>Events: { <\/span><i><span>a, b, \u2026, h <\/span><\/i><span>}<\/span><\/p>\n<table>\n<tbody>\n<tr>\n<td><b>Test group<\/b> <i><span>T<\/span><\/i><\/td>\n<td><b>\u00a0 \u00a0 \u00a0 Control group<\/b> <i><span>C<\/span><\/i><\/td>\n<\/tr>\n<tr>\n<td><i><span>t<\/span><\/i><span><sub>1<\/sub><\/span><span>:\u00a0 <\/span><i><span>a <\/span><\/i><span>\u2192<\/span><i><span> b <\/span><\/i><span>\u2192<\/span><i><span> c <\/span><\/i><span>\u2192<\/span><i><span> d<\/span><\/i><\/td>\n<td><i><span>\u00a0 \u00a0 \u00a0 \u00a0t<\/span><\/i><span><sub>6<\/sub><\/span><span>:\u00a0 <\/span><i><span>a <\/span><\/i><span>\u2192<\/span><i><span> b <\/span><\/i><span>\u2192<\/span><i><span> d<\/span><\/i><\/td>\n<\/tr>\n<tr>\n<td><i><span>t<\/span><\/i><span><sub>2<\/sub><\/span><span>:\u00a0 <\/span><i><span>a <\/span><\/i><span>\u2192<\/span><i><span> b <\/span><\/i><span>\u2192<\/span><i><span> c<\/span><\/i><\/td>\n<td><i><span>\u00a0 \u00a0 \u00a0 \u00a0t<\/span><\/i><span><sub>7<\/sub><\/span><span>:\u00a0 <\/span><i><span>a <\/span><\/i><span>\u2192<\/span><i><span> c <\/span><\/i><span>\u2192<\/span><i><span> d<\/span><\/i><\/td>\n<\/tr>\n<tr>\n<td><i><span>t<\/span><\/i><span><sub>3<\/sub><\/span><span>:\u00a0 <\/span><i><span>b <\/span><\/i><span>\u2192<\/span><i><span> c<\/span><\/i><\/td>\n<td><i><span>\u00a0 \u00a0 \u00a0 t<\/span><\/i><span><sub>8<\/sub><\/span><span>:\u00a0 <\/span><i><span>a <\/span><\/i><span>\u2192<\/span><i><span> c<\/span><\/i><\/td>\n<\/tr>\n<tr>\n<td><i><span>t<\/span><\/i><span><sub>4<\/sub><\/span><span>:\u00a0 <\/span><i><span>e <\/span><\/i><span>\u2192<\/span><i><span> f <\/span><\/i><span>\u2192<\/span><i><span> g <\/span><\/i><span>\u2192<\/span><i><span> h<\/span><\/i><\/td>\n<td><i><span>\u00a0 \u00a0 \u00a0 t<\/span><\/i><span><sub>9<\/sub><\/span><span>:\u00a0 <\/span><i><span>f <\/span><\/i><span>\u2192<\/span><i><span> g<\/span><\/i><\/td>\n<\/tr>\n<tr>\n<td><i><span>t<\/span><\/i><span><sub>5<\/sub><\/span><span>:\u00a0 <\/span><i><span>e <\/span><\/i><span>\u2192<\/span><i><span> g<\/span><\/i><\/td>\n<td><i><span>\u00a0 \u00a0 \u00a0 t<\/span><\/i><span><sub>10<\/sub><\/span><span>:\u00a0 <\/span><i><span>e <\/span><\/i><span>\u2192<\/span><i><span> f <\/span><\/i><span>\u2192<\/span><i><span> h<\/span><\/i><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>\u00a0<\/p>\n<p><span>An example of a pattern that the system can extract here is <\/span><i><span>a <\/span><\/i><span>\u2192<\/span><i><span> c<\/span><\/i><span>. For each pattern, it computes the number of traces in which it appears in each group (<\/span><i><span>T<\/span><\/i><span> and <\/span><i><span>C<\/span><\/i><span>). In the example above, this pattern appears twice in <\/span><i><span>T<\/span><\/i><span> (<\/span><i><span>t<\/span><\/i><span><sub>1<\/sub><\/span><span>and <\/span><i><span>t<\/span><\/i><span><sub>2<\/sub><\/span><span>), and so its support in <\/span><i><span>T<\/span><\/i><span> is 2. In this case, its support in <\/span><i><span>C<\/span><\/i><span> is also 2 (<\/span><i><span>t<\/span><\/i><span><sub>7 <\/sub><\/span><span>and <\/span><i><span>t<\/span><\/i><span><sub>8<\/sub><\/span><span>). Note that the pattern space is combinatorial in nature, and therefore it is crucial to employ algorithms that can search this space efficiently without an exponential blowup.<\/span><\/p>\n<p><span>Once all patterns are extracted along with their supports in <\/span><i><span>T<\/span><\/i><span> and <\/span><i><span>C<\/span><\/i><span>, the system performs statistical isolation. For each pattern, <\/span><i><span>P<\/span><\/i><span>, it computes precision and recall using its support:<\/span><\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-17184 size-medium\" src=\"https:\/\/i0.wp.com\/engineering.fb.com\/wp-content\/uploads\/2021\/01\/ScalableRCA_4-e1610748472874.png?resize=750%2C198&#038;ssl=1\" alt=\"\" width=\"750\" height=\"198\"  data-recalc-dims=\"1\"><\/p>\n<p><span>Informally, precision describes how accurate <\/span><i><span>P<\/span><\/i><span> is in detecting whether a given trace is in the test group rather than the control group, and recall describes how much of the test group <\/span><i><span>P<\/span><\/i><span> can cover. For example, the pattern <\/span><i><span>b<\/span><\/i><span> has a precision of 0.75 because it occurs in four traces in total, three of which are in the test group (i.e., it is 75 percent specific to the test group). It also has a recall of 0.6 because it occurs in three out of five traces in the test group (i.e., it covers 60 percent of the test group).<\/span><\/p>\n<p><span>The harmonic mean of the two, 0.67, is the F1-score. The system computes the F1-score of all patterns in this manner and returns the list of all patterns ranked by F1-score, as shown in the table below.<\/span><\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-17185 size-medium\" src=\"https:\/\/i0.wp.com\/engineering.fb.com\/wp-content\/uploads\/2021\/01\/ScalableRCA_5.png?resize=750%2C275&#038;ssl=1\" alt=\"\" width=\"750\" height=\"275\"  data-recalc-dims=\"1\"><\/p>\n<p><span>In this example, the result shows that the pattern <\/span><i><span>b <\/span><\/i><span>\u2192<\/span><i><span> c<\/span><\/i><span> is the highest ranked. In a real setting, an engineer debugging the reports can infer that events <\/span><i><span>b<\/span><\/i> <span>and <\/span><i><span>c<\/span><\/i><span>, occurring in that order, are suspicious and deserve close inspection.<\/span><\/p>\n<h2><span>Automating RCA at scale<\/span><\/h2>\n<p><span>To make Minesweeper work at Facebook\u2019s scale and complexity, we borrowed an idea from the data mining community \u2014<\/span><a href=\"https:\/\/en.wikipedia.org\/wiki\/Sequential_pattern_mining\"> <span>sequential pattern mining<\/span><\/a><span> \u2014 to help track the order that events appear in traces.<\/span><\/p>\n<p><span>At Facebook, it\u2019s not unusual to encounter tens of thousands of traces related to a bug. To handle this, we leveraged the<\/span><a href=\"https:\/\/ieeexplore.ieee.org\/abstract\/document\/914830\"> <span>PrefixSpan<\/span><\/a><span> algorithm, which is well known for being highly efficient at sequential pattern mining. To rank patterns according to their distinctiveness to the test group, thereby being useful for RCA, we utilized the aforementioned statistical approach based on the precision and recall of patterns. Finally, to make Minesweeper practical from a usability standpoint, we addressed several human-centric challenges, such as avoiding \u201credundant\u201d patterns that are similar in explaining the root cause.<\/span><\/p>\n<p><span>The figure below gives a high-level overview of the automated RCA architecture.\u00a0<\/span><\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-17182\" src=\"https:\/\/i0.wp.com\/engineering.fb.com\/wp-content\/uploads\/2021\/01\/ScalableRCA_2.png?resize=750%2C325&#038;ssl=1\" alt=\"\" width=\"750\" height=\"325\"  data-recalc-dims=\"1\"><\/p>\n<h2><span>The impact of Minesweeper<\/span><\/h2>\n<p><span>Minesweeper has played an important role in helping Facebook\u2019s engineers analyze and diagnose <a href=\"https:\/\/engineering.fb.com\/2020\/03\/05\/developer-tools\/incident-tracker\/\">regressions<\/a> \u2014 sudden spikes in a group of crash or bug reports \u2014 providing insights in minutes that once could have taken days to gather. With the complexity of Facebook\u2019s apps and the product release cycle, it is common for multiple regressions to take place simultaneously, especially after a new version is released. But thanks to Minesweeper, engineers working on user-facing regressions can analyze them promptly, making it easier than ever for them to reduce and mitigate disruptions to Facebook\u2019s services. <\/span><\/p>\n<p><span>More technical information is available in our paper \u201c<\/span><a href=\"https:\/\/arxiv.org\/abs\/2010.09974\"><span>Scalable statistical root cause analysis on app telemetry<\/span><\/a><span>.\u201d<\/span><\/p>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/engineering.fb.com\/2021\/02\/09\/uncategorized\/minesweeper\/\">Minesweeper automates root cause analysis as a first-line defense against bugs<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/engineering.fb.com\/\">Facebook Engineering<\/a>.<\/p>\n<p><a href=\"https:\/\/engineering.fb.com\/2021\/02\/09\/uncategorized\/minesweeper\/\" target=\"_blank\" rel=\"noopener\">Read More<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Root cause analysis (RCA) is an important part of fixing any bug. After all, you can\u2019t solve a problem without getting to the heart of it. But RCA isn\u2019t always simple, especially at a scale like Facebook\u2019s. When billions of people are using an app on a variety of platforms and devices, a single bug&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2021\/08\/31\/minesweeper-automates-root-cause-analysis-as-a-first-line-defense-against-bugs\/\">Continue reading <span class=\"screen-reader-text\">Minesweeper automates root cause analysis as a first-line defense against bugs<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-263","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":269,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/sre-weekly-issue-257\/","url_meta":{"origin":263,"position":0},"title":"SRE Weekly Issue #257","date":"August 31, 2021","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, StackHawk: Keeping your APIs secure requires thoughtful design and testing. Learn how to protect your REST, SOAP and GraphQL APIs from security vulnerabilities with StackHawk http:\/\/sthwk.com\/api-protection Articles Sometimes alerts have inobvious reasons for existing This one really got me thinking. Make sure\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":886,"url":"https:\/\/fde.cat\/index.php\/2024\/06\/24\/leveraging-ai-for-efficient-incident-response\/","url_meta":{"origin":263,"position":1},"title":"Leveraging AI for efficient incident response","date":"June 24, 2024","format":false,"excerpt":"We\u2019re sharing how we streamline system reliability investigations using a new AI-assisted root cause analysis system. The system uses a combination of heuristic-based retrieval and large language model-based ranking to speed up root cause identification during investigations. Our testing has shown this new system achieves 42% accuracy in identifying root\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":850,"url":"https:\/\/fde.cat\/index.php\/2024\/04\/09\/enhancing-aiops-efficiency-salesforces-new-similarity-model-overcomes-4-major-incident-management-challenges\/","url_meta":{"origin":263,"position":2},"title":"Enhancing AIOps Efficiency: Salesforce\u2019s New Similarity Model Overcomes 4 Major Incident Management Challenges","date":"April 9, 2024","format":false,"excerpt":"Optimizing the management of alerts from monitoring tools is crucial for efficient operations. However, it can be challenging due to the lack of confirmation on whether subsequent alerts indicate the same underlying problem. This leads to a repetitive and time-consuming process for an organization\u2019s operations team \u2014 including site reliability\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":272,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/how-not-why-an-alternative-to-the-five-whys-for-post-mortems\/","url_meta":{"origin":263,"position":3},"title":"How, Not Why: An Alternative to the Five Whys for Post-Mortems","date":"August 31, 2021","format":false,"excerpt":"When I got into the DevOps field, I was exposed to The Five Whys\u200a\u2014\u200aa popular analytical method used in incident postmortems. The Five Whys is one type of root cause analysis (RCA): \u201cThe primary goal of the technique is to determine the root cause of a defect or problem by\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":253,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/sre-weekly-issue-254\/","url_meta":{"origin":263,"position":4},"title":"SRE Weekly Issue #254","date":"August 31, 2021","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, StackHawk: Need to run a standalone Kotlin app as a fat jar in a Gradle project? Check out how we handled that! http:\/\/sthwk.com\/kotlin-with-gradle Articles Coinbase Incident Post Mortem: January 6\u20137, 2021 This one\u2019s juicy. At one point, the front-end was blocked up,\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":172,"url":"https:\/\/fde.cat\/index.php\/2020\/12\/09\/how-facebook-keeps-its-large-scale-infrastructure-hardware-up-and-running\/","url_meta":{"origin":263,"position":5},"title":"How Facebook keeps its large-scale infrastructure hardware up and running","date":"December 9, 2020","format":false,"excerpt":"Facebook\u2019s services rely on fleets of servers in data centers all over the globe \u2014 all running applications and delivering the performance our services need. This is why we need to make sure our server hardware is reliable and that we can manage server hardware failures at our scale with\u2026","rel":"","context":"In &quot;External&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/263","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=263"}],"version-history":[{"count":1,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/263\/revisions"}],"predecessor-version":[{"id":444,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/263\/revisions\/444"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=263"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=263"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=263"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}