SRE Weekly Issue #290

Articles

Postmortem: Partial RavenDB Cloud outage

Despite carefully testing how they would handle this week’s expiration of the root CA that cross-signed Let’s Encrypt’s CA certificate, they had an outage. The reason? Poor behavior in OpenSSL. See the next article for a deeper explanation of what went wrong with OpenSSL.

Oren Eini — RavenDB

Path Building vs Path Verifying: The Chain of Pain

This article explains why some versions of OpenSSL are unable to validate certificates issued by Let’s Encrypt now, even though the certificates should be considered valid.

Ryan Sleevi

Stop adopting multicloud to achieve application resilience, says Honeycomb’s Charity Majors

This says it all:

It turns out that the path to safety isn’t increased complexity.

Matt Asay — TechRepublic

Reliability is not an engineering metric

The thrust of this article is that reliability applies to and should matter to the entire company, not just engineering. I really like the term “pitchfork alerting”.

Robert Ross — FireHydrant

How HTTP Keep-Alive can cause TCP race condition

Lesson learned: always make your application server’s keep-alive (idle) timeout longer than your reverse proxy’s, so the proxy never reuses a connection that the server is about to close.

Ivan Velichko
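As a rough illustration of the keep-alive lesson above, the check below compares the two idle timeouts. It is only a sketch: the function name, parameter names, and the one-second safety margin are assumptions made for this example, not taken from the article or from any particular proxy or server.

```python
def keepalive_timeouts_safe(proxy_idle_s: float, app_idle_s: float,
                            margin_s: float = 1.0) -> bool:
    """Return True when the app server keeps idle connections open
    longer than the reverse proxy does (plus a safety margin).

    If the app server closes idle connections first, the proxy may send
    a request on a connection the server is closing at that same moment.
    The margin is an arbitrary illustrative value.
    """
    return app_idle_s >= proxy_idle_s + margin_s


if __name__ == "__main__":
    # Hypothetical settings: proxy closes idle upstream connections after 60s.
    print(keepalive_timeouts_safe(proxy_idle_s=60, app_idle_s=75))
    print(keepalive_timeouts_safe(proxy_idle_s=60, app_idle_s=5))
```

A check like this could run in CI against whatever configuration values your proxy and application server actually use.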

The strange beauty of strange loop failure modes

Who deploys the deploy tool? The deploy tool, obviously — unless it’s down.

Lorin Hochstein

Partitioning GitHub’s relational databases to handle scale

Their approach: group tables into “schema domains”, make sure that queries don’t span schema domains, and then move a schema domain to its own separate database cluster.

Thomas Maurer — GitHub
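The "queries don’t span schema domains" rule from the summary above can be pictured with a small check like the one below. This is a sketch under stated assumptions, not GitHub’s tooling: the table names, domain names, and function names are all hypothetical.

```python
from typing import Dict, Iterable, Set

# Hypothetical mapping of tables to schema domains (illustrative only).
TABLE_TO_DOMAIN: Dict[str, str] = {
    "gists": "gists",
    "gist_comments": "gists",
    "repositories": "repositories",
    "issues": "repositories",
}


def domains_touched(tables: Iterable[str]) -> Set[str]:
    """Return the set of schema domains that the given tables belong to."""
    domains = set()
    for table in tables:
        if table not in TABLE_TO_DOMAIN:
            raise KeyError(f"table {table!r} is not assigned to a schema domain")
        domains.add(TABLE_TO_DOMAIN[table])
    return domains


def spans_schema_domains(tables: Iterable[str]) -> bool:
    """Return True if a query touching these tables crosses domain boundaries."""
    return len(domains_touched(tables)) > 1


if __name__ == "__main__":
    print(spans_schema_domains(["gists", "gist_comments"]))  # one domain
    print(spans_schema_domains(["gists", "issues"]))         # two domains
```

Once no query crosses a boundary, a domain’s tables can in principle be moved to a separate cluster without breaking joins or transactions.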

Groot: eBay’s Event-graph-based Approach for Root Cause Analysis

Groot is about helping figure out what’s wrong during an incident, not about analyzing an incident after the fact. I totally get why they need this tool, since they have over 5000 microservices!

Hanzhang Wang — eBay

SRE is not a monolithic role

SRE is a broad, overarching responsibility that needs a multitude of role considerations to pull off properly.

Ash P — Cruform

Outages

Heroku

(also this one) Heroku had a major outage that coincided with an Amazon EBS failure in a single availability zone in us-east-1. Customers of Heroku such as Dead Man’s Snitch were impacted.

Slack

Slack had a big disruption related to DNSSEC. Here’s an interesting analysis of what may have gone wrong (link).

Let’s Encrypt

Let’s Encrypt saw heavy traffic as everyone clamored to renew their certificates, causing certificate issuance to slow down.

Microsoft 365
Apple’s “Find My” service
Signal
Xero

Xero’s outage coincided with the same Amazon EBS outage mentioned above. Xero also had another outage on October 1.
