{"id":574,"date":"2022-05-09T00:53:13","date_gmt":"2022-05-09T00:53:13","guid":{"rendered":"https:\/\/fde.cat\/index.php\/2022\/05\/09\/sre-weekly-issue-321\/"},"modified":"2022-05-09T00:53:13","modified_gmt":"2022-05-09T00:53:13","slug":"sre-weekly-issue-321","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2022\/05\/09\/sre-weekly-issue-321\/","title":{"rendered":"SRE Weekly Issue #321"},"content":{"rendered":"<p><a href=\"https:\/\/sreweekly.com\/sre-weekly-issue-321\/\" title=\"Permalink to SRE Weekly Issue #321\" class=\"email_only\">View on sreweekly.com<\/a><\/p>\n<div class=\"sreweekly-sponsor-message\">\n<h2>A message from our sponsor, <a href=\"https:\/\/rootly.com\/demo\/?utm_source=sreweekly\">Rootly<\/a>:<\/h2>\n<p>Manage incidents directly from Slack with Rootly \ud83d\ude92. Automate manual admin tasks like creating incident channel, Jira and Zoom, paging and adding responders, postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly lego set):<br \/>\n<a href=\"https:\/\/rootly.com\/demo\/\">https:\/\/rootly.com\/demo\/<\/a><\/p>\n<\/div>\n<h2>Articles<\/h2>\n<div class=\"wp-block-group\">\n<div class=\"wp-block-group__inner-container\">\n<div class=\"sreweekly-entry\">\n<div class=\"sreweekly-title\"><a href=\"https:\/\/doordash.engineering\/2022\/04\/25\/using-fault-injection-testing-to-improve-doordash-reliability\/\" target=\"_blank\" rel=\"noopener\">Using Fault Injection Testing to Improve DoorDash Reliability\u00a0<\/a><\/div>\n<div class=\"sreweekly-description\">\n<p>A researcher explains how they implemented their microservice failure testing tool at DoorDash.  
The tool, Filibuster, automatically discovers microservice dependencies and injects faults, avoiding the need to design specific individual failure scenarios.<\/p>\n<p>\u00a0\u00a0Christopher Meiklejohn \u2014 DoorDash<\/p>\n<\/div>\n<\/div>\n<div class=\"sreweekly-entry\">\n<div class=\"sreweekly-title\"><a href=\"https:\/\/twitter.com\/ReinH\/status\/1520530487663480832\" target=\"_blank\" rel=\"noopener\">Twitter: @ReinH on Atlassian\u2019s incident write-up<\/a><\/div>\n<div class=\"sreweekly-description\">\n<p>Last week, I shared Atlassian\u2019s outage write-up.  This link is a Twitter thread with a critique.<\/p>\n<p>I feel like it is perhaps not a \u201cgood look\u201d to repeatedly try to sell your product in your writeup about your product\u2019s catastrophic outage<\/p>\n<p>\u00a0\u00a0@ReinH<\/p>\n<\/div>\n<\/div>\n<div class=\"sreweekly-entry\">\n<div class=\"sreweekly-title\"><a href=\"http:\/\/wiki.dbbs.co\/view\/usefulness-of-error\" target=\"_blank\" rel=\"noopener\">usefulness of error<\/a><\/div>\n<div class=\"sreweekly-description\">\n<p>\u201cError\u201d serves a number of functions for an organization: as a defense against entanglement, the illusion of control, as a means for distancing, and as a marker for a failed investigation.<\/p>\n<p>\u00a0\u00a0Eric Dobbs<\/p>\n<\/div>\n<\/div>\n<div class=\"sreweekly-entry\">\n<div class=\"sreweekly-title\"><a href=\"https:\/\/enomstatus.com\/incidents\/03q064h6rb7x\" target=\"_blank\" rel=\"noopener\">Incident Report for Enom (January 15, 2022)<\/a><\/div>\n<div class=\"sreweekly-description\">\n<p>This is a write-up posted in January for an incident that occurred during an infrastructure migration.  
I feel like I can relate to every one of the learnings.<\/p>\n<p>\u00a0\u00a0Enom (Tucows)<\/p>\n<\/div>\n<\/div>\n<div class=\"sreweekly-entry\">\n<div class=\"sreweekly-title\"><a href=\"https:\/\/ernestas.me\/on-call-leave-it-better-than-you-found-it\" target=\"_blank\" rel=\"noopener\">On-Call: Leave It Better Than You Found It<\/a><\/div>\n<div class=\"sreweekly-description\">\n<p>In the past two years, I\u2019ve been participating in on-call rotations as a Site Reliability Engineer at Vinted. Here are some of the practical lessons I\u2019ve learned about the process.<\/p>\n<p>\u00a0\u00a0Ernestas Narmontas<\/p>\n<\/div>\n<\/div>\n<div class=\"sreweekly-entry\">\n<div class=\"sreweekly-title\"><a href=\"https:\/\/cloud.google.com\/blog\/products\/devops-sre\/how-sres-analyze-risks-to-evaluate-slos\/\" target=\"_blank\" rel=\"noopener\">How SREs analyze risks to evaluate SLOs<\/a><\/div>\n<div class=\"sreweekly-description\">\n<p>This article is all about finding out what risks exist that may impact your ability to meet your SLOs.  Once you\u2019ve done that, you can determine whether your SLOs are realistic.<\/p>\n<p>\u00a0\u00a0Ayelet Sachto \u2014 Google<\/p>\n<\/div>\n<\/div>\n<div class=\"sreweekly-entry\">\n<div class=\"sreweekly-title\"><a href=\"https:\/\/engineering.klarna.com\/how-we-aligned-200-teams-to-monitor-services-with-slos-1-2-1552fab0faab\" target=\"_blank\" rel=\"noopener\">How we aligned 200 teams to monitor services with SLOs <\/a><\/div>\n<div class=\"sreweekly-description\">\n<p>When your organization chooses to implement SLOs, how do you get everyone on board?  
This two-part series has an in-depth look at how Klarna did it.<\/p>\n<p>\u00a0\u00a0Andrew Cartine \u2014 Klarna<\/p>\n<\/div>\n<\/div>\n<div class=\"sreweekly-entry\">\n<div class=\"sreweekly-title\"><a href=\"https:\/\/www.detech.ai\/blog\/what-is-an-sre-product-manager\" target=\"_blank\" rel=\"noopener\">What is an SRE Product Manager?<\/a><\/div>\n<div class=\"sreweekly-description\">\n<p>Subtitle: <em>And why do SRE teams need PMs?<\/em><\/p>\n<p>After laying out the reasons why SREs need PMs, this article goes into detail about what a PM can bring to an SRE team.<\/p>\n<p>\u00a0\u00a0Ant\u00f3nio Ara\u00fajo \u2014 detech.ai<\/p>\n<\/div>\n<\/div>\n<div class=\"sreweekly-entry\">\n<div class=\"sreweekly-title\"><a href=\"https:\/\/engineering.fb.com\/2022\/05\/05\/developer-tools\/belljar\/\" target=\"_blank\" rel=\"noopener\">BellJar: A new framework for testing system recoverability at scale<\/a><\/div>\n<div class=\"sreweekly-description\">\n<p>BellJar helps users find cyclic dependencies in their services, by running totally isolated VMs and requiring users to explicitly enable every external dependency they need in order to bootstrap each service.  It has a really neat feature of automatically generating runbooks based on these test cases.<\/p>\n<p>\u00a0\u00a0Christopher Bunn and Jie Huang \u2014 Meta<\/p>\n<\/div>\n<\/div>\n<div class=\"sreweekly-entry\">\n<div class=\"sreweekly-title\"><a href=\"https:\/\/www.netflix.com\/title\/81198239\" target=\"_blank\" rel=\"noopener\">Meltdown: Three Mile Island<\/a><\/div>\n<div class=\"sreweekly-description\">\n<p>This week, I watched Netflix\u2019s <em>Meltdown: Three Mile Island<\/em>, a documentary about the nuclear accident in the US in 1979.   
It\u2019s not exactly a post-incident write-up, but there\u2019s a lot in there about normalization of deviance, situational awareness, and risk-taking (both in and out of incidents).<\/p>\n<p>\u00a0\u00a0Netflix<\/p>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<h2>Outages<\/h2>\n<p><a href=\"https:\/\/status.slack.com\/\/2022-05\/8c4ea6c334ebe48a\">Slack<\/a><\/p>\n<p>and <a href=\"https:\/\/status.slack.com\/\/2022-05\/cc9534bd47a6152e\">this one<\/a><\/p>\n<p><a href=\"https:\/\/status.heroku.com\/incidents\/2413?updated\">Heroku<\/a><\/p>\n<p>Heroku\u2019s been dealing with a security incident since April 13.  They performed a mass password reset of all accounts and their GitHub integration has been disabled for days.<\/p>\n<p><a href=\"https:\/\/status.roblox.com\/pages\/incident\/59db90dbcdeb2f04dadcf16d\/6271c785900216053283caf8\">Roblox<\/a><br \/>\nSRE WEEKLY<\/p>","protected":false},"excerpt":{"rendered":"<p>View on sreweekly.com A message from our sponsor, Rootly: Manage incidents directly from Slack with Rootly \ud83d\ude92. Automate manual admin tasks like creating incident channel, Jira and Zoom, paging and adding responders, postmortem timeline, setting up reminders, and more. 
Book a demo (+ get a snazzy Rootly lego set): https:\/\/rootly.com\/demo\/ Articles Using Fault Injection Testing&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2022\/05\/09\/sre-weekly-issue-321\/\">Continue reading <span class=\"screen-reader-text\">SRE Weekly Issue #321<\/span><\/a><\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[8],"tags":[],"class_list":["post-574","post","type-post","status-publish","format-standard","hentry","category-sre","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":543,"url":"https:\/\/fde.cat\/index.php\/2022\/02\/21\/sre-weekly-issue-310\/","url_meta":{"origin":574,"position":0},"title":"SRE Weekly Issue #310","date":"February 21, 2022","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, Rootly: Manage incidents directly from Slack with Rootly \ud83d\ude92. Automate manual admin tasks like creating incident channel, Jira and Zoom, paging the right team, postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly shirt): https:\/\/rootly.com\/demo\/?utm_source=sreweekly Articles\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":579,"url":"https:\/\/fde.cat\/index.php\/2022\/05\/30\/sre-weekly-issue-324\/","url_meta":{"origin":574,"position":1},"title":"SRE Weekly Issue #324","date":"May 30, 2022","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, Rootly: Manage incidents directly from Slack with Rootly \ud83d\ude92. Automate manual admin tasks like creating incident channel, Jira and Zoom, paging and adding responders, postmortem timeline, setting up reminders, and more. 
Book a demo (+ get a snazzy Rootly lego set): https:\/\/rootly.com\/demo\/\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":535,"url":"https:\/\/fde.cat\/index.php\/2022\/01\/24\/sre-weekly-issue-306\/","url_meta":{"origin":574,"position":2},"title":"SRE Weekly Issue #306","date":"January 24, 2022","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, Rootly: Manage incidents directly from Slack with Rootly \ud83d\ude92. Automate manual admin tasks like creating incident channel, Jira and Zoom, paging the right team, postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly shirt): https:\/\/rootly.com\/demo\/?utm_source=sreweekly Articles\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":537,"url":"https:\/\/fde.cat\/index.php\/2022\/01\/31\/sre-weekly-issue-307\/","url_meta":{"origin":574,"position":3},"title":"SRE Weekly Issue #307","date":"January 31, 2022","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, Rootly: Manage incidents directly from Slack with Rootly \ud83d\ude92. Automate manual admin tasks like creating incident channel, Jira and Zoom, paging the right team, postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly shirt): https:\/\/rootly.com\/demo\/?utm_source=sreweekly Articles\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":546,"url":"https:\/\/fde.cat\/index.php\/2022\/03\/07\/sre-weekly-issue-312\/","url_meta":{"origin":574,"position":4},"title":"SRE Weekly Issue #312","date":"March 7, 2022","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, Rootly: Manage incidents directly from Slack with Rootly \ud83d\ude92. 
Automate manual admin tasks like creating incident channel, Jira and Zoom, paging the right team, postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly shirt): https:\/\/rootly.com\/demo\/?utm_source=sreweekly Articles\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":603,"url":"https:\/\/fde.cat\/index.php\/2022\/07\/04\/sre-weekly-issue-329\/","url_meta":{"origin":574,"position":5},"title":"SRE Weekly Issue #329","date":"July 4, 2022","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, Rootly: Manage incidents directly from Slack with Rootly \ud83d\ude92. Automate manual admin tasks like creating incident channel, Jira and Zoom, paging and adding responders, postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly lego set): https:\/\/rootly.com\/demo\/\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/574","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=574"}],"version-history":[{"count":0,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/574\/revisions"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=574"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=574"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=574"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}