{"id":481,"date":"2021-10-04T02:01:14","date_gmt":"2021-10-04T02:01:14","guid":{"rendered":"https:\/\/fde.cat\/index.php\/2021\/10\/04\/sre-weekly-issue-290\/"},"modified":"2021-10-04T02:01:14","modified_gmt":"2021-10-04T02:01:14","slug":"sre-weekly-issue-290","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2021\/10\/04\/sre-weekly-issue-290\/","title":{"rendered":"SRE Weekly Issue #290"},"content":{"rendered":"<p><a href=\"https:\/\/sreweekly.com\/sre-weekly-issue-290\/\" title=\"Permalink to SRE Weekly Issue #290\" class=\"email_only\">View on sreweekly.com<\/a><\/p>\n<div class=\"sreweekly-sponsor-message\">\n<h2>A message from our sponsor, <a href=\"https:\/\/rootly.io\/?utm_source=sreweekly\">Rootly<\/a>:<\/h2>\n<p>Manage incidents directly from Slack with Rootly \ud83d\ude92. Automate manual admin tasks like creating incident channel, Jira and Zoom, paging the right team, postmortem timeline, setting up reminders, and more. Book a demo:<br \/>\n<a href=\"https:\/\/rootly.io\/?utm_source=sreweekly\">https:\/\/rootly.io\/?utm_source=sreweekly<\/a><\/p>\n<\/div>\n<h2>Articles<\/h2>\n<div class=\"sreweekly-entry\">\n<div class=\"sreweekly-title\"><a href=\"https:\/\/ayende.com\/blog\/194785-B\/postmortem-partial-ravendb-cloud-outage?Key=b35f9f49-5930-4782-9f07-513c1437b21e\"> Postmortem: Partial RavenDB Cloud outage<\/a><\/div>\n<div class=\"sreweekly-description\">\n<p>Despite carefully testing how they would handle this week\u2019s expiration of the root CA that cross-signed Let\u2019s Encrypt\u2019s CA certificate, they had an outage. The reason? Poor behavior in OpenSSL. 
See the next article for a deeper explanation of what went wrong with OpenSSL.<\/p>\n<p>Oren Eini \u2014 RavenDB<\/p>\n<\/div>\n<\/div>\n<div class=\"sreweekly-entry\">\n<div class=\"sreweekly-title\"><a href=\"https:\/\/medium.com\/@sleevi_\/path-building-vs-path-verifying-the-chain-of-pain-9fbab861d7d6\">Path Building vs Path Verifying: The Chain of Pain<\/a><\/div>\n<div class=\"sreweekly-description\">\n<p>This article explains why some versions of OpenSSL are unable to validate certificates issued by Let\u2019s Encrypt now, even though the certificates should be considered valid.<\/p>\n<p>Ryan Sleevi<\/p>\n<\/div>\n<\/div>\n<div class=\"sreweekly-entry\">\n<div class=\"sreweekly-title\"><a href=\"https:\/\/www.techrepublic.com\/article\/stop-adopting-multicloud-to-achieve-application-resilience-says-honeycombs-charity-majors\/\">Stop adopting multicloud to achieve application resilience, says Honeycomb\u2019s Charity Majors<\/a><\/div>\n<div class=\"sreweekly-description\">\n<p>This says it all:<\/p>\n<p>It turns out that the path to safety isn\u2019t increased complexity.<\/p>\n<p>Matt Asay \u2014 TechRepublic<\/p>\n<\/div>\n<\/div>\n<div class=\"sreweekly-entry\">\n<div class=\"sreweekly-title\"><a href=\"https:\/\/firehydrant.io\/blog\/reliability-is-not-an-engineering-metric\/\">Reliability is not an engineering metric<\/a><\/div>\n<div class=\"sreweekly-description\">\n<p>The thrust of this article is that reliability applies to and should matter to the entire company, not just engineering. 
I really like the term \u201cpitchfork alerting\u201d.<\/p>\n<p>Robert Ross \u2014 FireHydrant<\/p>\n<\/div>\n<\/div>\n<div class=\"sreweekly-entry\">\n<div class=\"sreweekly-title\"><a href=\"https:\/\/iximiuz.com\/en\/posts\/reverse-proxy-http-keep-alive-and-502s\/?utm_medium=reddit&amp;utm_source=r_sre\">How HTTP Keep-Alive can cause TCP race condition<\/a><\/div>\n<div class=\"sreweekly-description\">\n<p>Lesson learned: always make your application server\u2019s timeout longer than your reverse proxy\u2019s.<\/p>\n<p>Ivan Velichko<\/p>\n<\/div>\n<\/div>\n<div class=\"sreweekly-entry\">\n<div class=\"sreweekly-title\"><a href=\"https:\/\/surfingcomplexity.blog\/2021\/09\/26\/the-strange-beauty-of-strange-loop-failure-modes\/\">The strange beauty of strange loop failure modes<\/a><\/div>\n<div class=\"sreweekly-description\">\n<p>Who deploys the deploy tool? The deploy tool, obviously \u2014 unless it\u2019s down.<\/p>\n<p>Lorin Hochstein<\/p>\n<\/div>\n<\/div>\n<div class=\"sreweekly-entry\">\n<div class=\"sreweekly-title\"><a href=\"https:\/\/github.blog\/2021-09-27-partitioning-githubs-relational-databases-scale\/\">Partitioning GitHub\u2019s relational databases to handle scale<\/a><\/div>\n<div class=\"sreweekly-description\">\n<p>Their approach: group tables into \u201cschema domains\u201d, make sure that queries don\u2019t span schema domains, and then move a schema domain to its own separate database cluster.<\/p>\n<p>Thomas Maurer \u2014 GitHub<\/p>\n<\/div>\n<\/div>\n<div class=\"sreweekly-entry\">\n<div class=\"sreweekly-title\"><a href=\"https:\/\/tech.ebayinc.com\/research\/groot-ebays-event-graph-based-approach-for-root-cause-analysis\/\">Groot: eBay\u2019s Event-graph-based Approach for Root Cause Analysis<\/a><\/div>\n<div class=\"sreweekly-description\">\n<p>Groot is about helping figure out what\u2019s wrong during an incident, not about analyzing an incident after the fact. 
I totally get why they need this tool, since they have over <strong>5000 microservices<\/strong>!<\/p>\n<p>Hanzhang Wang \u2014 eBay<\/p>\n<\/div>\n<\/div>\n<div class=\"sreweekly-entry\">\n<div class=\"sreweekly-title\"><a href=\"https:\/\/www.cruform.com\/sre-not-monolithic-role\/\">SRE is not a monolithic role<\/a><\/div>\n<div class=\"sreweekly-description\">\n<p>SRE is a broad, overarching responsibility that needs a multitude of role considerations to pull off properly.<\/p>\n<p>Ash P \u2014 Cruform<\/p>\n<\/div>\n<\/div>\n<h2>Outages<\/h2>\n<p><a href=\"https:\/\/status.heroku.com\/incidents\/2362\">Heroku<\/a><\/p>\n<p>(also <a href=\"https:\/\/status.heroku.com\/incidents\/2361\">this one<\/a>) Heroku had a major outage that coincided with an Amazon EBS failure in a single availability zone in us-east-1. Customers of Heroku such as <a href=\"https:\/\/status.deadmanssnitch.com\/incidents\/ctkhr3fq72t3\">Dead Man\u2019s Snitch<\/a> were impacted.<\/p>\n<p><a href=\"https:\/\/status.slack.com\/2021-09\/06c1e17de93e7dc2\">Slack<\/a><\/p>\n<p>Slack had a big disruption related to DNSSEC. 
Here\u2019s an interesting analysis of what may have gone wrong (<a href=\"https:\/\/lists.dns-oarc.net\/pipermail\/dns-operations\/2021-September\/021340.html\">link<\/a>).<\/p>\n<p><a href=\"https:\/\/letsencrypt.status.io\/pages\/incident\/55957a99e800baa4470002da\/6155ea52e170e905361624e3\">Let\u2019s Encrypt<\/a><\/p>\n<p>Let\u2019s Encrypt saw heavy traffic as everyone clamored to renew their certificates, causing certificate issuance to slow down.<\/p>\n<p><a href=\"https:\/\/www.bleepingcomputer.com\/news\/microsoft\/microsoft-365-mfa-outage-locks-users-out-of-their-accounts\/\">Microsoft 365<\/a><br \/>\n<a href=\"https:\/\/www.theapplepost.com\/2021\/10\/02\/apples-find-my-service-down-for-some-users\/\">Apple\u2019s \u201cFind My\u201d service<\/a><br \/>\n<a href=\"https:\/\/twitter.com\/signalapp\/status\/1442354759009247232\">Signal<\/a><br \/>\n<a href=\"https:\/\/status.xero.com\/incidents\/06qfl3kj4p0j\">Xero<\/a><\/p>\n<p>This one coincided with the same Amazon EBS outage mentioned above. Xero also had <a href=\"https:\/\/status.xero.com\/incidents\/d24f1j2kxq5v\">another<\/a> outage on October 1.<\/p>\n<p>SRE WEEKLY<\/p>","protected":false},"excerpt":{"rendered":"<p>View on sreweekly.com A message from our sponsor, Rootly: Manage incidents directly from Slack with Rootly \ud83d\ude92. Automate manual admin tasks like creating incident channel, Jira and Zoom, paging the right team, postmortem timeline, setting up reminders, and more. 
Book a demo: https:\/\/rootly.io\/?utm_source=sreweekly Articles Postmortem: Partial RavenDB Cloud outage Despite carefully testing how they would&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2021\/10\/04\/sre-weekly-issue-290\/\">Continue reading <span class=\"screen-reader-text\">SRE Weekly Issue #290<\/span><\/a><\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[8],"tags":[],"class_list":["post-481","post","type-post","status-publish","format-standard","hentry","category-sre","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":602,"url":"https:\/\/fde.cat\/index.php\/2022\/06\/27\/sre-weekly-issue-328\/","url_meta":{"origin":481,"position":0},"title":"SRE Weekly Issue #328","date":"June 27, 2022","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, Rootly: Manage incidents directly from Slack with Rootly \ud83d\ude92. Automate manual admin tasks like creating incident channel, Jira and Zoom, paging and adding responders, postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly lego set): https:\/\/rootly.com\/demo\/\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":540,"url":"https:\/\/fde.cat\/index.php\/2022\/02\/07\/sre-weekly-issue-308\/","url_meta":{"origin":481,"position":1},"title":"SRE Weekly Issue #308","date":"February 7, 2022","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, Rootly: Manage incidents directly from Slack with Rootly \ud83d\ude92. Automate manual admin tasks like creating incident channel, Jira and Zoom, paging the right team, postmortem timeline, setting up reminders, and more. 
Book a demo (+ get a snazzy Rootly shirt): https:\/\/rootly.com\/demo\/?utm_source=sreweekly Articles\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":546,"url":"https:\/\/fde.cat\/index.php\/2022\/03\/07\/sre-weekly-issue-312\/","url_meta":{"origin":481,"position":2},"title":"SRE Weekly Issue #312","date":"March 7, 2022","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, Rootly: Manage incidents directly from Slack with Rootly \ud83d\ude92. Automate manual admin tasks like creating incident channel, Jira and Zoom, paging the right team, postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly shirt): https:\/\/rootly.com\/demo\/?utm_source=sreweekly Articles\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":645,"url":"https:\/\/fde.cat\/index.php\/2022\/10\/31\/sre-weekly-issue-345\/","url_meta":{"origin":481,"position":3},"title":"SRE Weekly Issue #345","date":"October 31, 2022","format":false,"excerpt":"View on sreweekly.com SRE Weekly is now on Mastodon at @SREWeekly@social.linux.pizza! Follow to get notified of each new issue as it comes out. This replaces the Twitter account @SREWeekly, which I am now retiring in favor of Mastodon. For those of you following @SREWeekly on Twitter, you\u2019ll need to choose\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":552,"url":"https:\/\/fde.cat\/index.php\/2022\/03\/14\/sre-weekly-issue-313\/","url_meta":{"origin":481,"position":4},"title":"SRE Weekly Issue #313","date":"March 14, 2022","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, Rootly: Manage incidents directly from Slack with Rootly \ud83d\ude92. 
Automate manual admin tasks like creating incident channel, Jira and Zoom, paging and adding responders, postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly lego set): https:\/\/rootly.com\/demo\/\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":487,"url":"https:\/\/fde.cat\/index.php\/2021\/10\/11\/sre-weekly-issue-291\/","url_meta":{"origin":481,"position":5},"title":"SRE Weekly Issue #291","date":"October 11, 2021","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, Rootly: Manage incidents directly from Slack with Rootly \ud83d\ude92. Automate manual admin tasks like creating incident channel, Jira and Zoom, paging the right team, postmortem timeline, setting up reminders, and more. Book a demo: https:\/\/rootly.io\/?utm_source=sreweekly Articles Understanding How Facebook Disappeared from the\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/481","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=481"}],"version-history":[{"count":0,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/481\/revisions"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=481"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=481"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=481"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}