{"id":333,"date":"2021-08-31T14:39:28","date_gmt":"2021-08-31T14:39:28","guid":{"rendered":"https:\/\/fde.cat\/?p=333"},"modified":"2021-08-31T14:39:28","modified_gmt":"2021-08-31T14:39:28","slug":"sre-weekly-issue-279","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/sre-weekly-issue-279\/","title":{"rendered":"SRE Weekly Issue #279"},"content":{"rendered":"<p><a href=\"https:\/\/sreweekly.com\/sre-weekly-issue-279\/\" title=\"Permalink to SRE Weekly Issue #279\" class=\"email_only\">View on sreweekly.com<\/a><\/p>\n<div class=\"sreweekly-sponsor-message\">\n<h2>A message from our sponsor, StackHawk:<\/h2>\n<p>On July 28, ZAP Creator Simon Bennetts is giving a first look at ZAP\u2019s new automation framework. Grab your spot:<br \/>\n<a href=\"https:\/\/sthwk.com\/ZAP-Automation\">https:\/\/sthwk.com\/ZAP-Automation<\/a><\/p>\n<\/div>\n<h2>Articles<\/h2>\n<div class=\"sreweekly-entry\">\n<div class=\"sreweekly-title\"><a href=\"https:\/\/www.infoq.com\/presentations\/cascading-failure-risk\/?topicPageSponsorship=e60745be-2b04-48b2-b701-c941b7afb84c&amp;itm_source=presentations_about_architecture-design&amp;itm_medium=link&amp;itm_campaign=architecture-design\">Managing the Risk of Cascading Failure <\/a><\/div>\n<div class=\"sreweekly-description\">\n<p>This is a presentation by Laura Nolan (with text transcript) all about cascading failure, what causes it, how to avoid it, and how to deal with it when it happens.<\/p>\n<p>I love how succinct this is:<\/p>\n<p>[\u2026] in any system where we design to fail over, so any mechanism at all that redistributes load from a failed component to still working components, we create the potential for a cascading failure to happen.<\/p>\n<p>Laura Nolan \u2014 Slack (presented at InfoQ)<\/p>\n<\/div>\n<\/div>\n<div class=\"sreweekly-entry\">\n<div class=\"sreweekly-title\"><a href=\"https:\/\/surfingcomplexity.blog\/2021\/07\/11\/the-greedy-exec-trap\/\">The greedy exec trap<\/a><\/div>\n<div 
class=\"sreweekly-description\">\n<p>It\u2019s so easy to explain an incident by describing how management could have prevented it by investing additional resources.<\/p>\n<p>Lorin goes on to explain the \u201ctrap\u201d part: it\u2019s easy to stop investigating an incident too soon and declare the cause \u201cgreedy executives\u201d, preventing us from learning more.<\/p>\n<p>Lorin Hochstein<\/p>\n<\/div>\n<\/div>\n<div class=\"sreweekly-entry\">\n<div class=\"sreweekly-title\"><a href=\"https:\/\/www.reddit.com\/r\/RedditEng\/comments\/o4yjpd\/rwallstreetbets_incident_anthology_what_worked\/\">r\/WallStreetBets Incident Anthology (What Worked Edition): Recently Consumed<\/a><\/div>\n<div class=\"sreweekly-description\">\n<p>They redesigned one of their caching systems in 2020, and it paid off handsomely during the GameStop saga. This article discusses the redesign and considers what would have happened without it.<\/p>\n<p>Garrett Hoffman \u2014 Reddit<\/p>\n<\/div>\n<\/div>\n<div class=\"sreweekly-entry\">\n<div class=\"sreweekly-title\"><a href=\"https:\/\/firehydrant.io\/blog\/pragmatic-incident-response\/\">Pragmatic Incident Response: 3 Lessons Learned from Failures<\/a><\/div>\n<div class=\"sreweekly-description\">\n<p>The lessons are:<\/p>\n<p>Do retrospectives for small incidents first.<br \/>\nDo a retrospective soon after the incident.<br \/>\nAlert on the user experience.<\/p>\n<p>All great advice, and #1 is an interesting idea I hadn\u2019t heard before.<\/p>\n<p>Robert Ross \u2014 FireHydrant<\/p>\n<\/div>\n<\/div>\n<div class=\"sreweekly-entry\">\n<div class=\"sreweekly-title\"><a href=\"https:\/\/rootly.io\/blog\/de-siloing-incident-management-how-to-make-reliability-engineering-everyone-s-job\">De-Siloing Incident Management: How to Make Reliability Engineering Everyone\u2019s Job<\/a><\/div>\n<div class=\"sreweekly-description\">\n<p>We can\u2019t engineer reliability in a vacuum. 
This is a great explainer on how SRE siloing happens, the problems it causes, and how to break SRE out of its shell.<\/p>\n<p>JJ Tang \u2014 Rootly<\/p>\n<\/div>\n<\/div>\n<div class=\"sreweekly-entry\">\n<div class=\"sreweekly-title\"><a href=\"https:\/\/myemail.constantcontact.com\/CALLBACK-498--July-2021---Aircrew-Resilience.html?soid=1101073741327&amp;aid=3yDpH3qvWg8\">CALLBACK 498, July 2021 \u2013 Aircrew Resilience<\/a><\/div>\n<div class=\"sreweekly-description\">\n<p>This ASRS (Aviation Safety Reporting System) Callback issue has some real-world examples of resilient systems in action.<\/p>\n<p>NASA ASRS<\/p>\n<\/div>\n<\/div>\n<div class=\"sreweekly-entry\">\n<div class=\"sreweekly-title\"><a href=\"https:\/\/blog.cloudflare.com\/automatic-remediation-of-kubernetes-nodes\/\">Automatic Remediation of Kubernetes Nodes<\/a><\/div>\n<div class=\"sreweekly-description\">\n<p>Facing common Kubernetes node failure modes, Cloudflare uses open source tools (one published by them) to perform automatic restarts.<\/p>\n<p>In the past 30 days, we\u2019ve used the above automatic node remediation process to action 571 nodes. 
That has saved our humans a considerable amount of time.<\/p>\n<p>Andrew DeMaria \u2014 Cloudflare<\/p>\n<\/div>\n<\/div>\n<h2>Outages<\/h2>\n<p><a href=\"https:\/\/piunikaweb.com\/2021\/07\/12\/amazon-and-aws-down-and-not-working-for-many\/\">Amazon.com<\/a><br \/>\n<a href=\"https:\/\/www.macrumors.com\/2021\/07\/12\/icloud-mail-is-down\/\">iCloud Mail<\/a><br \/>\n<a href=\"https:\/\/piunikaweb.com\/2021\/07\/12\/adobe-creative-cloud-down-or-not-working-youre-not-alone\/\">Adobe Creative Cloud<\/a><br \/>\n<a href=\"https:\/\/www.theaustralian.com.au\/news\/latest-news\/betting-company-sportsbet-hit-by-massive-outage\/news-story\/4d9dab4fb8df85acc21e6919e33fd789\">Sportsbet<\/a><br \/>\nSRE WEEKLY<\/p>\n","protected":false},"excerpt":{"rendered":"<p>View on sreweekly.com A message from our sponsor, StackHawk: On July 28, ZAP Creator Simon Bennetts is giving a first look at ZAP\u2019s new automation framework. Grab your spot: https:\/\/sthwk.com\/ZAP-Automation Articles Managing the Risk of Cascading Failure This is a presentation by Laura Nolan (with text transcript) all about cascading failure, what causes it, how&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2021\/08\/31\/sre-weekly-issue-279\/\">Continue reading <span class=\"screen-reader-text\">SRE Weekly Issue #279<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[8],"tags":[],"class_list":["post-333","post","type-post","status-publish","format-standard","hentry","category-sre","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":343,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/sre-weekly-issue-282\/","url_meta":{"origin":333,"position":0},"title":"SRE Weekly Issue #282","date":"August 31, 2021","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, StackHawk: ICYMI ZAP 
Creator and Project Lead Simon Bennetts recently unveiled ZAP\u2019s new automation framework. Watch the session and see how it works: https:\/\/sthwk.com\/Automation-Framework Articles A thorough introduction to bpftrace I really need to learn bpftrace, and this article is a great\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":298,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/sre-weekly-issue-266\/","url_meta":{"origin":333,"position":1},"title":"SRE Weekly Issue #266","date":"August 31, 2021","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, StackHawk: Are you a ZAP user looking to automate your security testing? Make sure to tune in to ZAPCon After Hours on Tuesday at 8 am PT to see how you can use Jenkins and Zest scripts to automate ZAP. http:\/\/sthwk.com\/zapcon-ah Articles\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":276,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/sre-weekly-issue-259\/","url_meta":{"origin":333,"position":2},"title":"SRE Weekly Issue #259","date":"August 31, 2021","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, StackHawk: Mark your calendars! The first conference for OWASP ZAP users is taking place March 9. 
Get your free ticket to connect with other ZAP users and learn about the project\u2019s roadmap http:\/\/sthwk.com\/zapcon-sreweekly Articles Increment: Reliability This quarter\u2019s Increment issue is about\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":766,"url":"https:\/\/fde.cat\/index.php\/2023\/10\/02\/sre-weekly-issue-392\/","url_meta":{"origin":333,"position":3},"title":"SRE Weekly Issue #392","date":"October 2, 2023","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, Rootly: Rootly is proud to have been recognized by G2 as a High Performer and Enterprise Leader in Incident Management for the sixth consecutive quarter! In total, we received nine G2 awards in the Summer Report. As a thank-you to our community,\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":320,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/sre-weekly-issue-275\/","url_meta":{"origin":333,"position":4},"title":"SRE Weekly Issue #275","date":"August 31, 2021","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, StackHawk: Join ZAP Founder & Project Lead Simon Bennetts on June 30 for a live AMA where he will be answering questions on all things open source and AppSec. Register: http:\/\/sthwk.com\/Simon-AMA Articles Practical Guide to SRE: Incident Severity Levels Here\u2019s a take\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":261,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/sre-weekly-issue-256\/","url_meta":{"origin":333,"position":5},"title":"SRE Weekly Issue #256","date":"August 31, 2021","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, StackHawk: Register now for the first-ever ZAPCon taking place March 9th. The free event will focus on OWASP ZAP and application security best practices. 
You wont want to miss it! http:\/\/sthwk.com\/zapcon-sre-weekly Articles Slack\u2019s Outage on January 4th 2021 Here\u2019s a blog post\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/333","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=333"}],"version-history":[{"count":1,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/333\/revisions"}],"predecessor-version":[{"id":377,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/333\/revisions\/377"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=333"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=333"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=333"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}