{"id":312,"date":"2021-08-31T14:40:03","date_gmt":"2021-08-31T14:40:03","guid":{"rendered":"https:\/\/fde.cat\/?p=312"},"modified":"2021-08-31T14:40:03","modified_gmt":"2021-08-31T14:40:03","slug":"sre-weekly-issue-272","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/sre-weekly-issue-272\/","title":{"rendered":"SRE Weekly Issue #272"},"content":{"rendered":"<p><a href=\"https:\/\/sreweekly.com\/sre-weekly-issue-272\/\" title=\"Permalink to SRE Weekly Issue #272\" class=\"email_only\">View on sreweekly.com<\/a><\/p>\n<div class=\"sreweekly-sponsor-message\">\n<h2>A message from our sponsor, StackHawk:<\/h2>\n<p>See how automated security testing can change how your teams find and fix security vulnerabilities.<br \/>\n<a href=\"http:\/\/sthwk.com\/security-automation\">http:\/\/sthwk.com\/security-automation<\/a><\/p>\n<\/div>\n<h2>Articles<\/h2>\n<div class=\"sreweekly-entry\">\n<div class=\"sreweekly-title\"><a href=\"https:\/\/help.salesforce.com\/articleView?id=000358392&amp;type=1&amp;mode=1\">[Salesforce] Multi-Instance Service Disruption on May 11-12, 2021<\/a><\/div>\n<div class=\"sreweekly-description\">\n<p>Salesforce has posted a ton of information about their major outage two weeks ago.<br \/>\nIt involved a change to their DNS system that combined with an issue in BIND daemon shutdown that prevented it from starting back up.<\/p>\n<p>The analysis goes into great detail on the fact that an engineer used the Emergency Break-Fix (EBF) process to rush out the DNS configuration change.<\/p>\n<p>In this case, the engineer subverted the known policy and the appropriate disciplinary action has been taken to ensure this does not happen in the future.<\/p>\n<p><em>Thanks to an anonymous reader for pointing this out to me.<\/em><\/p>\n<p>Salesforce<\/p>\n<\/div>\n<\/div>\n<div class=\"sreweekly-entry\">\n<div class=\"sreweekly-title\"><a href=\"https:\/\/www.theregister.com\/AMP\/2021\/05\/19\/salesforce_root_cause\/?__twitter_impression=true\">That Salesforce outage: Global DNS downfall started by one engineer trying a quick fix<\/a><\/div>\n<div class=\"sreweekly-description\">\n<p>This article calls out the heavily blame-ridden language in the above incident analysis and the briefing given by Salesforce\u2019s Chief Reliability Officer.<\/p>\n<p>I\u2019m dismayed to see such language from someone who is at the C-level for reliability.<\/p>\n<p>\u201cFor whatever reason that we don\u2019t understand, the employee decided to do a global deployment,\u201d Dieken went on.<\/p>\n<p>Richard Speed \u2014 The Register<\/p>\n<\/div>\n<\/div>\n<div class=\"sreweekly-entry\">\n<div class=\"sreweekly-title\"><a href=\"https:\/\/mobile.twitter.com\/ReinH\/status\/1395906200210837510\">@ReinH on Twitter Re: Salesforce Outage<\/a><\/div>\n<div class=\"sreweekly-description\">\n<p>\u2026and the Twittersphere agrees with me.<\/p>\n<p>If you want to blame someone, maybe try blaming the \u201cchief availability officer\u201d who oversees a system so fragile that one action by one engineer can cause this much damage. But it\u2019s never that simple, is it.<\/p>\n<p>@ReinH on Twitter<\/p>\n<\/div>\n<\/div>\n<div class=\"sreweekly-entry\">\n<div class=\"sreweekly-title\"><a href=\"https:\/\/surfingcomplexity.blog\/2021\/05\/25\/subverting-the-process\/\">Subverting the process<\/a><\/div>\n<div class=\"sreweekly-description\">\n<p>Another really great take on the Salesforce outage followup.<\/p>\n<p>Lorin Hochstein<\/p>\n<\/div>\n<\/div>\n<div class=\"sreweekly-entry\">\n<div class=\"sreweekly-title\"><a href=\"https:\/\/www.blameless.com\/blog\/sre-team\">Building an SRE Team? How to Hire, Assess, &amp; Manage SREs<\/a><\/div>\n<div class=\"sreweekly-description\">\n<p>I like how this article covers the different roles that SREs play.<\/p>\n<p>Emily Arnott \u2014 Blameless<\/p>\n<\/div>\n<\/div>\n<div class=\"sreweekly-entry\">\n<div class=\"sreweekly-title\"><a href=\"https:\/\/www.verica.io\/blog\/the-advanced-principles-of-chaos-engineering\/\">The Advanced Principles of Chaos Engineering<\/a><\/div>\n<div class=\"sreweekly-description\">\n<p>The principles covered in this article are:<\/p>\n<p>Build a hypothesis around steady-state behavior<br \/>\nVary real-world events<br \/>\nRun experiments in production<br \/>\nAutomate experiments to run continuously<br \/>\nMinimize blast radius<\/p>\n<p>Casey Rosenthal \u2014 Verica<\/p>\n<\/div>\n<\/div>\n<div class=\"sreweekly-entry\">\n<div class=\"sreweekly-title\"><a href=\"https:\/\/surfingcomplexity.blog\/2021\/05\/29\/why-do-config-changes-keep-coming-up-in-major-incidents\/\">Why do config changes keep coming up in major incidents?<\/a><\/div>\n<div class=\"sreweekly-description\">\n<p>This post is full of thought-provoking questions on the nature of configuration changes and incidents.<\/p>\n<p>Lorin Hochstein<\/p>\n<\/div>\n<\/div>\n<h2>Outages<\/h2>\n<p><a href=\"https:\/\/www.datacenterdynamics.com\/en\/news\/ibm-cloud-suffers-second-outage-in-five-days\/\">IBM Cloud<\/a><br \/>\n<a href=\"https:\/\/www.klarna.com\/us\/blog\/written-statement-on-app-bug\/\">Klarna<\/a><\/p>\n<p>Klarna showed users information related to other users, as detailed in this followup post.<\/p>\n<p>SRE WEEKLY<\/p>\n","protected":false},"excerpt":{"rendered":"<p>View on sreweekly.com A message from our sponsor, StackHawk: See how automated security testing can change how your teams find and fix security vulnerabilities. http:\/\/sthwk.com\/security-automation Articles [Salesforce] Multi-Instance Service Disruption on May 11-12, 2021 Salesforce has posted a ton of information about their major outage two weeks ago. It involved a change to their DNS&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2021\/08\/31\/sre-weekly-issue-272\/\">Continue reading <span class=\"screen-reader-text\">SRE Weekly Issue #272<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[8],"tags":[],"class_list":["post-312","post","type-post","status-publish","format-standard","hentry","category-sre","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":318,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/sre-weekly-issue-274\/","url_meta":{"origin":312,"position":0},"title":"SRE Weekly Issue #274","date":"August 31, 2021","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, StackHawk: Join the GraphQL Security Testing Learning Lab on June 29 at 9 AM PT. Learn how to run automated security testing against your GraphQL APIs so you can find and fix vulnerabilities fast. http:\/\/sthwk.com\/graphql-learning-lab Articles Chicken Soup for the SLO The\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":273,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/sre-weekly-issue-258\/","url_meta":{"origin":312,"position":1},"title":"SRE Weekly Issue #258","date":"August 31, 2021","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, StackHawk: On February 25 at 10 am PT we are going to show you how easy it is to add application security testing to a #GitLab pipeline. Save your spot for our live session http:\/\/sthwk.com\/gitlab-stackhawk-automation Articles Practiced Humility in Retrospectives When acting\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":297,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/sre-weekly-issue-265\/","url_meta":{"origin":312,"position":2},"title":"SRE Weekly Issue #265","date":"August 31, 2021","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, StackHawk: Join StackHawk and WhiteSource tomorrow morning to learn about automated security testing in the DevOps pipeline. With automated dynamic testing and software composition analysis, you can be sure you\u2019re shipping secure APIs and applications. Grab your spot: http:\/\/sthwk.com\/stackhawk-whitesource Articles Insights into\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":455,"url":"https:\/\/fde.cat\/index.php\/2021\/09\/20\/sre-weekly-issue-285\/","url_meta":{"origin":312,"position":3},"title":"SRE Weekly Issue #285","date":"September 20, 2021","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, StackHawk: Check out the latest from StackHawk\u2019s Chief Security Officer, Scott Gerlach, on why security should be part of building software, and how StackHawk helps teams catch vulns before prod. https:\/\/sthwk.com\/cloudnative Articles Computers are the easy part What\u2019s so great about this\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":740,"url":"https:\/\/fde.cat\/index.php\/2023\/08\/07\/sre-weekly-issue-384\/","url_meta":{"origin":312,"position":4},"title":"SRE Weekly Issue #384","date":"August 7, 2023","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, Rootly: When incidents impact your customers, failing to communicate with them effectively can erode trust even further and compound an already difficult situation. Learn the essentials of customer-facing incident communication in Rootly\u2019s latest blog post: https:\/\/rootly.com\/blog\/the-medium-is-the-message-how-to-master-the-most-essential-incident-communication-channels Articles Scaling merge-ort across GitHub They\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":487,"url":"https:\/\/fde.cat\/index.php\/2021\/10\/11\/sre-weekly-issue-291\/","url_meta":{"origin":312,"position":5},"title":"SRE Weekly Issue #291","date":"October 11, 2021","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, Rootly: Manage incidents directly from Slack with Rootly \ud83d\ude92. Automate manual admin tasks like creating incident channel, Jira and Zoom, paging the right team, postmortem timeline, setting up reminders, and more. Book a demo: https:\/\/rootly.io\/?utm_source=sreweekly Articles Understanding How Facebook Disappeared from the\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/312","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=312"}],"version-history":[{"count":1,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/312\/revisions"}],"predecessor-version":[{"id":398,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/312\/revisions\/398"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=312"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=312"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=312"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}