{"id":272,"date":"2021-08-31T14:40:46","date_gmt":"2021-08-31T14:40:46","guid":{"rendered":"https:\/\/fde.cat\/?p=272"},"modified":"2021-08-31T14:40:46","modified_gmt":"2021-08-31T14:40:46","slug":"how-not-why-an-alternative-to-the-five-whys-for-post-mortems","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/how-not-why-an-alternative-to-the-five-whys-for-post-mortems\/","title":{"rendered":"How, Not Why: An Alternative to the Five Whys for Post-Mortems"},"content":{"rendered":"<p>When I got into the DevOps field, I was exposed to <a href=\"https:\/\/en.wikipedia.org\/wiki\/Five_whys\">The Five Whys<\/a>\u200a\u2014\u200aa popular analytical method used in incident postmortems. The Five Whys is one type of <a href=\"https:\/\/en.wikipedia.org\/wiki\/Root_cause_analysis\">root cause analysis<\/a> (RCA): \u201cThe primary goal of the technique is to determine the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Root_cause\">root cause<\/a> of a <a href=\"https:\/\/en.wiktionary.org\/wiki\/defect\">defect<\/a> or problem by repeating the question \u2018Why?.\u2019 Each answer forms the basis of the next question (<a href=\"https:\/\/en.wikipedia.org\/wiki\/Five_whys\">link<\/a>).\u201d <\/p>\n<p>In a body of research about how systems <em>really<\/em> fail, I discovered a powerful critique of root cause analysis. The critique is known as \u201c<a href=\"http:\/\/www.humanfactors.lth.se\/fileadmin\/lusa\/Sidney_Dekker\/articles\/2007\/SafetyScienceMonitor.pdf\">the new view<\/a> of<a href=\"https:\/\/erikhollnagel.com\/ideas\/no-view-of-human-error.html#:~:text=The%20'new'%20view%20softened%20the,decisions%20by%20the%20blunt%20end.\"> human error<\/a>.\u201d Any place where accidents or unwanted outages can occur has a lot in common with managing information technology systems. 
While most readers of this post will be in the software industry, much of the research comes from other fields such as industrial accidents, medicine, shipping, and aeronautics. After learning about this research, I came to the conclusion that root cause analysis is misleading, and even harmful. In this piece, I will explain this critique of RCA, and I will present what I think is a better\u00a0idea.<\/p>\n<figure><img decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1*s7ihapR_qpiGTZ6-cUMgrQ.jpeg?w=750&#038;ssl=1\" data-recalc-dims=\"1\"><\/figure>\n<p>To illustrate the conventional thinking, I will start with media coverage of one IT outage. There is nothing special about the story that I have chosen\u200a\u2014\u200aevery company that operates software in production has had something similar happen. <em>Business Insider<\/em> ran the headline <a href=\"https:\/\/www.businessinsider.com\/amazon-aws-internet-outage-caused-by-engineer-typing-wrong-command-2017-3\">Amazon took down parts of the internet because an employee fat-fingered the wrong command<\/a>. According to writer Kif Leswing, an employee made a mistake, and consequently a part of Amazon\u2019s cloud crashed. It sounds very simple. Every story turns out to be about how some careless\u200a\u2014\u200aor perhaps incompetent\u200a\u2014\u200aperson spilled their coffee on a network switch and the internet went down. But in reality, it\u2019s not that simple.<\/p>\n<p>Writing about the same story for <em>The New Stack<\/em>, Lee Atchison leads with <a href=\"https:\/\/thenewstack.io\/dont-write-off-aws-s3-outage-fat-finger-folly\/\">Don\u2019t Write Off the AWS S3 Outage as a Fat-Finger Folly.<\/a> In my opinion, Atchison had a better explanation than the other writer. He points out that AWS follows DevOps best practices, such as scripting instead of typing, validation of inputs, and audit trailing. 
It is not entirely untrue that the ops engineer made a mistake: the engineer did type the wrong command. However, there was also a bug in the validation part of the script, so it did not reject the incorrect command. The command kicked off a cascade of failures that took down part of their system. The full story shows a reality more complex than one user, one mistake. <\/p>\n<p>An important distinction in the research about failures is the division between simple and complex systems. Simple systems are linear and sequential. Each part is connected to one adjacent part. Simple systems fail in simple ways\u200a\u2014\u200alike a row of dominoes falling over. Researcher Dr. Eric Hollnagel has noted the pervasiveness of the \u201crow of dominoes\u201d metaphor in the coverage of all kinds of incidents. It\u2019s the easiest metaphor to reach for. Hollnagel has a great <a href=\"https:\/\/pixabay.com\/photos\/domino-dominoes-game-playing-row-21176\/\">slide<\/a> on this point that incorporates pictures of dominoes from a variety of sources. <\/p>\n<p>The computer systems that include our applications are complex, not simple. As Kevin Heselin says in <a href=\"https:\/\/journal.uptimeinstitute.com\/examining-and-learning-from-complex-systems-failures\/\">Examining and Learning from Complex Systems Failures<\/a>, \u201ccomplex systems fail in complex ways.\u201d Heselin continues: \u201cthe hallmarks of complex systems are a large number of interacting components, emergent properties, difficult to anticipate from the knowledge of single components, ability to absorb random disruptions and highly vulnerable to widespread failure under adverse conditions.\u201d Some of the best thinking in this area is from Dr. Richard L. Cook, a medical doctor and author of the landmark paper <a href=\"https:\/\/www.researchgate.net\/publication\/228797158_How_complex_systems_fail\">How Complex Systems Fail<\/a>. In this source, Dr. 
Cook covers 18 points that characterize complex systems. I\u2019ll go through some of them below, but I encourage everyone to read the entire paper to get the others. <\/p>\n<p>John Allspaw, another thought leader in this field, <a href=\"https:\/\/www.oreilly.com\/library\/view\/velocity-conference-new\/9781491900406\/\">characterizes<\/a> complex systems as those built from components having certain properties: they are diverse, interdependent, adaptive, and connected. Allspaw emphasizes that complex systems exhibit very nonlinear behaviors, meaning a small action or small change can lead to what seems like a disproportionate and unpredictably large event.<\/p>\n<p>In a complex system, it is difficult\u200a\u2014\u200aor impossible\u200a\u2014\u200ato understand the relationships between all of the parts, and therefore how the system as a whole will respond to one single small disturbance. One of the constant themes in this literature is that <em>there is no single cause for an incident<\/em>. Dr. Cook says: \u201csmall, apparently innocuous failures join to create [the] opportunity for a systemic accident.\u201d He continues, \u201ceach small failure is necessary, but only the combination is sufficient to permit failure.\u201d Multiple things must go wrong in order to produce a systemic outage. <\/p>\n<p>Dr. Cook emphasizes that we\u200a\u2014\u200adesigners and operators of systems\u200a\u2014\u200atake proactive steps to protect our systems, and we are pretty good at it. A big part of our job consists of designing systems to be resilient. We have a good idea of many of the things that can go wrong, and we have a \u201cPlan B\u201d in place for the more common failure modes\u200a\u2014\u200aand a good many of the less common ones too. Our efforts succeed\u200a\u2014\u200aoften: the systems we manage don\u2019t fall over at the slightest sign of trouble. 
If one thing goes wrong, we are pretty good at preventing that from turning into a complete outage. If a server, a database host, or even a whole datacenter fails, most production systems will keep humming along. According to Dr. Cook, \u201cthere are many more failure opportunities than overt system accidents.\u201d<\/p>\n<p>I hope it is now evident why the idea of a single cause does not reflect reality. Mathias Lafeldt writes in <a href=\"https:\/\/www.scalyr.com\/blog\/the-myth-of-the-root-cause-how-complex-web-systems-fail\/\">The Myth of the Root Cause<\/a> that \u201cSingle-point failures alone are not enough to trigger an incident. Instead, incidents require multiple contributors, each necessary but only jointly sufficient. It is the combination of these causes\u200a\u2014\u200aoften small and innocuous failures like a memory leak, a server replacement, and a bad DNS update\u200a\u2014\u200athat is the prerequisite for an incident. We therefore can\u2019t isolate a single root cause.\u201d <\/p>\n<p>For example, let\u2019s look at a well-known <a href=\"https:\/\/www.independent.co.uk\/news\/uk\/home-news\/zeebrugge-ferry-disaster-ms-herald-free-enterprise-uk-30-years-maritime-tragedy-killed-a7583131.html\">accident<\/a> in which a ship capsized in the port of Zeebrugge. Researcher Takafumi Nakamura <a href=\"https:\/\/www.semanticscholar.org\/paper\/METHOD-FOR-MITIGATING-SYSTEM-FAILURES-Nakamura\/0130cc17af7556dbee253fd297089ea4ea6f55e2\">looked at the entire system<\/a>. He created a diagram explaining how the failure occurred, with the final failure on the far right. You can see the multiple causes and the relationships between them, each of which contributed in some way to the failure. 
If you take away even just one of them, the failure might have been avoided.<\/p>\n<figure><img decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1*ayPYsFckmqZ6SbiogJIGMw.png?w=750&#038;ssl=1\" data-recalc-dims=\"1\"><\/figure>\n<p>Systems contain what Dr. Cook refers to as latent failures. Because a realized failure requires multiple interacting causes, complex systems are always in a state in which some\u200a\u2014\u200abut not all\u200a\u2014\u200aof the contributing causes necessary for a complete failure have already occurred; there are just not yet enough of them for a systemic failure. It is as if a few threads in your sweater are broken, but you don\u2019t see the hole, yet. It is impossible to eliminate all of these partial failures. <\/p>\n<p>According to Dr. Cook, \u201cdisaster can occur at any time and in nearly any place. The potential for catastrophic outcome is a hallmark of complex systems. It is impossible to eliminate the potential for such catastrophic failure; the potential for such failure is always present; by the system\u2019s own nature.\u201d <\/p>\n<p>Some examples of latent failures\u00a0are:<\/p>\n<ul>\n<li>DNS records resolve to an incorrect IP\u00a0address<\/li>\n<li>Degraded hardware has not yet been removed from the\u00a0cluster<\/li>\n<li>Configuration management failed to run, so the host has not received security patches or other\u00a0changes<\/li>\n<li>A deploy did not complete in a consistent state but failed to report\u00a0it<\/li>\n<li>A software upgrade contained an unreported bug<\/li>\n<\/ul>\n<p>Latent failures may exist in a system you maintain without causing an outage. When an outage does occur, you may find that these latent failures had been present for quite some time and contributed to the outage, but were not discovered until after the fact. Latent failures exist because complexity prevents us from understanding and eliminating all of them. 
And if we could, our changes to the system would introduce other\u200a\u2014\u200adifferent\u200a\u2014\u200alatent failures. <\/p>\n<p>We are constantly changing our systems. We must improve our products in order to stay competitive. No one can afford to sit still for too long. One of the major motivations for making changes is to deliver more business value to our customers. Another key reason for changes is to avoid failure. We are constantly fixing things that are partially broken, improving the robustness of systems, adding more monitoring, improving alert thresholds, and in other ways trying to reduce the chance of failure. And while we often succeed, avoiding one failure paradoxically introduces new and different failure modes into the system. Every change represents a new opportunity for misconfiguration, introduces a new failure mode, or creates different interactions between parts of the system in ways that we may not fully understand. <\/p>\n<p>Point #2 in Dr. Cook\u2019s paper is \u201cComplex systems are heavily and successfully defended against failure.\u201d Defending against failure is a big part of our job. We defend systems against failure with techniques like redundancy, auto-scaling, load balancing, load shedding, monitoring, health checks, and backups. One of the main elements in the defense against failure is ops and DevOps, the people who manage systems. Because we are creative, we can make decisions on the spot to mitigate potential failures and prevent them from turning into outages. According to Dr. Cook, \u201cthe effect of these measures is to provide a series of shields that normally divert operations away from accidents.\u201d Because of the efforts of system operators, many accidents and outages do not occur.<\/p>\n<p>Dr. Cook also emphasizes that not everything has to be perfect. 
Number 5 among his 18 points is \u201ccomplex systems can run in a degraded mode.\u201d Complex systems are partially broken\u200a\u2014\u200aall the time. They are always somewhere in the grey area between succeeding and failing\u200a\u2014\u200aand being improved or replaced. This works well enough, more often than not, in part because the system operators are able to work around the flaws. Hollnagel, in <a href=\"https:\/\/www.uis.no\/getfile.php\/1322751\/Konferanser\/Presentasjoner\/Ulykkesgransking%202010\/EH_AcciLearn_short.pdf\">How Not to Learn from Accidents<\/a>, characterizes systems as moving somewhat randomly in a two-dimensional color-coded space scattered with potholes representing possible outages. Sometimes you are lucky: you pass near a pothole but don\u2019t step in it. Other times, not so much. <\/p>\n<p>The problem with root cause analysis is that RCA assumes one originating cause for each failure, which then ripples down the row of dominoes, ending in an outage when the last domino falls. The <a href=\"https:\/\/workplacepsychology.net\/2012\/02\/07\/the-5-whys-and-some-limitations\/\">five whys<\/a> assume that each cause has one antecedent, and, when you step through it five times, you find the root. Why five? Five is a completely arbitrary number. If we did not stop at five, we could have The Six Whys, or The Seven Whys\u200a\u2014\u200aand we would get a different root cause each time. Real-world systems are not made out of five dominoes arranged in a line. The system probably could have survived any one of the five causes by itself. All of the five (or more) contributing causes were in some way part of the story. There is nothing that makes any one of them the root. <\/p>\n<p>To understand the nature of a failure, you must understand how the jointly contributing causes combined. According to Dr. Cook, root cause analysis is fundamentally wrong because \u201covert failure requires multiple faults. 
There is no isolated cause of an accident. There are multiple contributors to accidents. Each of these is necessary, but insufficient in itself, to create an accident. Only jointly are these causes sufficient to create an accident.\u201d <a href=\"https:\/\/www.kitchensoap.com\/2012\/02\/10\/each-necessary-but-only-jointly-sufficient\/\">According to John Allspaw<\/a>, \u201cthese linear chain of events approaches are akin to viewing the past as a lineup of dominoes and, in reality, complex systems simply don\u2019t work like that. Looking at an accident this way ignores surrounding circumstances in favor of a cherry-picked list of events; it validates hindsight and outcome bias and focuses too much on components and not enough on the interconnectedness of components.\u201d<\/p>\n<p>Some of the researchers in this area emphasize the subjectivity of root cause analysis. RCA is non-repeatable. When an event has multiple jointly cooperating causes, you are forced to pick one. Which one you pick is completely arbitrary, and may be a result of your personal view of things. If someone else examined the same incident, they could equally well pick a different one from among the multiple joint precedents. Eric Hollnagel has coined the acronym <a href=\"https:\/\/safetydifferently.com\/wylfiwyf\/\">WYLFIWYF<\/a>: What You Look For Is What You Find. Hollnagel\u2019s point is that an investigator who enters the investigation of an accident looking for a certain thing is likely to choose, from among many possible contributors, the one that matches their preconceived biases.<\/p>\n<p>I can illustrate this point with the following diagram. The post-mortem follows the familiar method, truncated here to a \u201cThree Whys\u201d for the sake of illustration. Causality runs from left to right. During the post-mortem, we work backwards to unearth the root cause, moving from right to left. The incident has three immediate jointly contributing causes [F1, F2, F3]. The investigation settles on F1. 
Proceeding in the same manner from F1, which has three causes [F4, F5, F6], the investigators arbitrarily and subjectively settle on F4, and then, in the third step, on F9. Is F9 the \u201croot cause\u201d? What about the three antecedents prior to F2, three more prior to F3, and so on for all the other nodes that I did not draw? With a fan-out of three, the tree has 27 nodes at the third level alone (not to mention, why stop after three?). So out of quite a lot of nodes, you arbitrarily picked one and called it the root cause. But there was nothing special about that one. Every node in the graph participated to some extent, and the incident might not have occurred had even one of those nodes not been\u00a0present.<\/p>\n<figure><img decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1*dHmuBQ1vv5zEqtI2vPTJ3A.png?w=750&#038;ssl=1\" data-recalc-dims=\"1\"><\/figure>\n<p>Let\u2019s look at another <a href=\"https:\/\/www.telegraph.co.uk\/travel\/comment\/tenerife-airport-disaster\/\">example from the horrific Tenerife air traffic accident<\/a>, in which two 747s collided, said to be the deadliest air traffic accident in history. It was astonishingly bad luck, partly a matter of these two planes being rerouted to the same airport, one that did not typically have 747s landing there. The air traffic controllers weren\u2019t used to them. A lot went wrong, and it all contributed to this result. Nearly everything had to happen as it did or the accident would not have occurred. I went through some stories about this and made my own graph showing the antecedents.<\/p>\n<figure><img decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1*-psJfxQk3MVKaNPKE_4oWg.jpeg?w=750&#038;ssl=1\" data-recalc-dims=\"1\"><\/figure>\n<p>If you agree with me so far, then you might ask, \u201cWhy do we do RCA anyway?\u201d Researchers have identified several reasons. 
One explanation <a href=\"https:\/\/player.oreilly.com\/videos\/9781491900390\">comes from John Allspaw<\/a>: \u201cEngineers don\u2019t like complexity. We want to think that things are simple and root cause does give us a simple answer. It might not be correct, but it is simple. So it does address that need. And if we find the root cause and we fix it, at least for a time, we can tell ourselves that we\u2019ve prevented a recurrence of the accident.\u201d<\/p>\n<p>Other reasons for RCA are organizational and political. Some companies require an RCA. The customer may want to know what happened. Finality also helps: if we can declare the incident explained and get past it, it is easier to deal with the trauma. And it may create the illusion that we found something we can fix.<\/p>\n<p>Other researchers have pointed out different kinds of cognitive biases, such as <a href=\"https:\/\/www.investopedia.com\/terms\/h\/hindsight-bias.asp#:~:text=Hindsight%20bias%20is%20a%20psychological,can%20accurately%20predict%20other%20events.\">hindsight bias<\/a>. Hindsight bias makes things appear much simpler in retrospect than they were at the time. From Dr. Cook\u2019s paper: \u201cknowledge of the outcome makes it seem that events leading to the outcome should have appeared more salient to the practitioners at the time than was actually the case.\u201d It always appears after the fact that somebody made a mistake\u200a\u2014\u200a<em>if you are predisposed to look at it that way<\/em>. But at the time, when people made the decisions that the post-mortem later identifies as causes of failure, it was much less clear that anyone was making a mistake. The operator did not have all the information that we now have in retrospect. The operator thought that there was a good chance that the action would fix a problem and avoid an outage. Their actions were taken under a much greater degree of uncertainty. <\/p>\n<p>Human error is often surfaced as a cause of failure during an RCA. 
Recall the Amazon engineer who fat-fingered a command as one example of an outage being attributed to human error. This is yet another cognitive bias: when mistakes by humans are present among the contributing factors, we tend to single them out. The human decisions stand out to us as distinct from other, perhaps equally important, types of problems. <\/p>\n<p>However, the attribution of human error as a cause of failure is overstated. We need to look at the awareness of the people operating systems and understand their actions at the time. Often, what we (after the fact) call human error was a reasonable decision that a person made under conditions of uncertainty and stress. John Allspaw stated in a presentation at Velocity Conf [link no longer available]: \u201cif you list human error as a cause, you\u2019re saying, \u2018I can\u2019t be bothered to dig any deeper.\u2019 Human error consists of normal human behavior in real world situations, under conditions of adaptation and learning. What we do every day is make mistakes and adapt.\u201d<\/p>\n<p>Dr. Cook emphasizes that all operators have two roles. One is producing outputs and the other is avoiding errors. Producing outputs means keeping the system running so our customers can use it. Avoiding errors means preventing the system from falling over. Note that a straightforward way to avoid all errors would be turning off the entire system. However, this is not a realistic option because a business must deliver products in order to create value. These two conflicting objectives must always be balanced against each other: we must produce outputs, and some errors are a cost of that. But as Dr. Cook emphasizes, after an accident, we tend to place more emphasis on avoiding errors and\u200a\u2014\u200ato some extent\u200a\u2014\u200aforget (or underweight) the importance of producing outputs. 
The many times that operators produced business value and avoided errors are not counted because we don\u2019t do post-mortems when things go well. <\/p>\n<p>Hollnagel, whom I cited earlier, emphasizes that avoiding error <em>is itself a cause of error<\/em>. System operators cannot be told never to take any action at all. When issues are discovered, we must assess whether a corrective action is a risk worth taking. System operators take actions which are aimed at preserving output and avoiding error\u200a\u2014\u200aand some of those changes, while succeeding in their own context, will cause a different set of errors, either immediately or by creating latent failures which later result in actual failures. <\/p>\n<p>Dr. Cook explains that every operator action is a gamble. Anything that you change may destabilize the system. In a post-mortem, it looks like the operator did something stupid and careless that caused the outage, but at the time it was an educated guess, a gamble, with the aim of preserving the successful operation of the system. When your job is to take calculated risks, some of the time you lose\u200a\u2014\u200abut that doesn\u2019t mean that taking some risks is a bad idea, nor that we should never change anything. And given our limited understanding of complex systems, it doesn\u2019t mean that the operator was incompetent. <\/p>\n<p>Human understanding of complex systems is incomplete for many reasons: other operators may have made changes without telling everyone; situations are high-stress; the operator may be short on sleep after being alerted during off hours. <a href=\"https:\/\/www.youtube.com\/watch?v=rHeukoWWtQ8\">Dr. 
Johan Bergstr\u00f6m<\/a>, one of the advocates of the new view, says, \u201chuman error is never a cause; instead, it is an attribution of other problems located deeper in or higher up the system, a symptom of those problems, not a cause.\u201d <\/p>\n<p>We have to step back and ask, \u201cWhat is the point of doing post-mortems?\u201d It is not to find the root cause. It is to find the most unstable areas of the system: where can we make improvements that increase stability or remove latent errors? Given the vulnerabilities that we identify, and the finite resources that we can devote to improvement, we should ask, \u201cWhere is the greatest return on our efforts?\u201d Focusing on the two goals of maintaining stability and avoiding error, we should be asking, \u201cHow did the system fail?\u201d We need to understand more about the complexity of our systems and how the parts are interrelated. <\/p>\n<p>I have one last example here. A vendor integration stopped working on a product, and an outage resulted. The simple explanation for what happened is that the vendor changed the behavior of their API and the system could not handle that. Here\u2019s a diagram which shows multiple jointly contributing causes. You can see that any one of these might be something that the organization could improve upon. You might decide not to address all of them, but you could certainly pick the top few improvements that will make your system more\u00a0stable.<\/p>\n<figure><img decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1*EasASI0ph3CelqQUv-jiiA.jpeg?w=750&#038;ssl=1\" data-recalc-dims=\"1\"><\/figure>\n<p>My suggestion for ops teams who use the five whys is to instead ask, \u201cHow did the system fail?\u201d Identify the factors that jointly contributed to the outage. Draw a diagram like the ones in this article. 
And then, through that diagram, identify the maximum leverage points to make improvements in your system. Create tickets, and put those in your tracking\u00a0system.<\/p>\n<p><em>Robert Blumen is a DevOps Engineer at Salesforce<\/em><\/p>\n<hr>\n<p><a href=\"https:\/\/engineering.salesforce.com\/how-not-why-an-alternative-to-the-five-whys-for-post-mortems-4518098cca17\">How, Not Why: An Alternative to the Five Whys for Post-Mortems<\/a> was originally published in <a href=\"https:\/\/engineering.salesforce.com\/\">Salesforce Engineering<\/a> on Medium, where people are continuing the conversation by highlighting and responding to this story.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>When I got into the DevOps field, I was exposed to The Five Whys\u200a\u2014\u200aa popular analytical method used in incident postmortems. 
The Five Whys is one type of root cause analysis (RCA): \u201cThe primary goal of the technique is to determine the root cause of a defect or problem by repeating the question \u2018Why?.\u2019 Each&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2021\/08\/31\/how-not-why-an-alternative-to-the-five-whys-for-post-mortems\/\">Continue reading <span class=\"screen-reader-text\">How, Not Why: An Alternative to the Five Whys for Post-Mortems<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-272","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":507,"url":"https:\/\/fde.cat\/index.php\/2021\/11\/22\/sre-weekly-issue-297\/","url_meta":{"origin":272,"position":0},"title":"SRE Weekly Issue #297","date":"November 22, 2021","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, Rootly: Manage incidents directly from Slack with Rootly \ud83d\ude92. Automate manual admin tasks like creating incident channel, Jira and Zoom, paging the right team, postmortem timeline, setting up reminders, and more. Book a demo:https:\/\/rootly.com\/?utm_source=sreweekly Articles Avoid frostbite: Stop doing code freezes It\u2019s\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":771,"url":"https:\/\/fde.cat\/index.php\/2023\/10\/15\/sre-weekly-issue-394\/","url_meta":{"origin":272,"position":1},"title":"SRE Weekly Issue #394","date":"October 15, 2023","format":false,"excerpt":"View on sreweekly.com A warm welcome to my new sponsor, FireHydrant! 
A message from our sponsor, FireHydrant: The 2023 DORA report has two conclusions with big impacts on incident management: incremental steps matter, and good culture contributes to performance. Dig into both topics and explore ideas for how to start\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":886,"url":"https:\/\/fde.cat\/index.php\/2024\/06\/24\/leveraging-ai-for-efficient-incident-response\/","url_meta":{"origin":272,"position":2},"title":"Leveraging AI for efficient incident response","date":"June 24, 2024","format":false,"excerpt":"We\u2019re sharing how we streamline system reliability investigations using a new AI-assisted root cause analysis system. The system uses a combination of heuristic-based retrieval and large language model-based ranking to speed up root cause identification during investigations. Our testing has shown this new system achieves 42% accuracy in identifying root\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":263,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/minesweeper-automates-root-cause-analysis-as-a-first-line-defense-against-bugs\/","url_meta":{"origin":272,"position":3},"title":"Minesweeper automates root cause analysis as a first-line defense against bugs","date":"August 31, 2021","format":false,"excerpt":"Root cause analysis (RCA) is an important part of fixing any bug. After all, you can\u2019t solve a problem without getting to the heart of it. But RCA isn\u2019t always simple, especially at a scale like Facebook\u2019s. 
When billions of people are using an app on a variety of platforms\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":541,"url":"https:\/\/fde.cat\/index.php\/2022\/02\/14\/sre-weekly-issue-309\/","url_meta":{"origin":272,"position":4},"title":"SRE Weekly Issue #309","date":"February 14, 2022","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, Rootly: Manage incidents directly from Slack with Rootly \ud83d\ude92. Automate manual admin tasks like creating incident channel, Jira and Zoom, paging the right team, postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly shirt): https:\/\/rootly.com\/demo\/?utm_source=sreweekly Articles\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":455,"url":"https:\/\/fde.cat\/index.php\/2021\/09\/20\/sre-weekly-issue-285\/","url_meta":{"origin":272,"position":5},"title":"SRE Weekly Issue #285","date":"September 20, 2021","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, StackHawk: Check out the latest from StackHawk\u2019s Chief Security Officer, Scott Gerlach, on why security should be part of building software, and how StackHawk helps teams catch vulns before prod. 
https:\/\/sthwk.com\/cloudnative Articles Computers are the easy part What\u2019s so great about this\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/272","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=272"}],"version-history":[{"count":1,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/272\/revisions"}],"predecessor-version":[{"id":438,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/272\/revisions\/438"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=272"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=272"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=272"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}