Despite carefully testing how they would handle this week’s expiration of the root CA that cross-signed Let’s Encrypt’s CA certificate, they had an outage. The reason? Poor behavior in OpenSSL. See the next article for a deeper explanation of what went wrong with OpenSSL.
Oren Eini — RavenDB
This article explains why some versions of OpenSSL are now unable to validate certificates issued by Let’s Encrypt, even though those certificates should be considered valid.
This says it all:
It turns out that the path to safety isn’t increased complexity.
Matt Asay — TechRepublic
The thrust of this article is that reliability applies to and should matter to the entire company, not just engineering. I really like the term “pitchfork alerting”.
Robert Ross — FireHydrant
Lesson learned: always make your application server’s timeout longer than your reverse proxy’s.
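As a sketch of that rule, assuming nginx as the reverse proxy and gunicorn as the application server (both illustrative choices, as are the specific values): the proxy gives up first, so it can return a controlled error instead of being cut off mid-response by the app server.

```
# nginx (reverse proxy): stop waiting after 30s -- illustrative value
proxy_read_timeout 30s;

# gunicorn (application server): allow up to 60s per request,
# i.e. longer than the proxy's timeout
--timeout 60
```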
Who deploys the deploy tool? The deploy tool, obviously — unless it’s down.
Their approach: group tables into “schema domains”, make sure that queries don’t span schema domains, and then move a schema domain to its own separate database cluster.
Thomas Maurer — GitHub
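The article doesn’t show GitHub’s actual tooling, but the domain-crossing check can be sketched in a few lines of Python — the table-to-domain mapping and table names here are hypothetical:

```python
# Minimal sketch of a "schema domain" check, with a hypothetical
# table -> domain mapping (not GitHub's real tooling).
SCHEMA_DOMAINS = {
    "repositories": "repos",
    "issues": "repos",
    "users": "accounts",
    "billing_plans": "accounts",
}

def domains_for(tables):
    """Return the set of schema domains the given tables belong to."""
    return {SCHEMA_DOMAINS[t] for t in tables}

def crosses_domains(tables):
    """True if a query joins tables from more than one domain, which
    would block moving a domain to its own database cluster."""
    return len(domains_for(tables)) > 1
```

For example, a join between `repositories` and `issues` stays inside one domain, while a join between `issues` and `users` would be flagged as spanning two.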
Groot is about helping figure out what’s wrong during an incident, not about analyzing an incident after the fact. I totally get why they need this tool, since they have over 5000 microservices!
Hanzhang Wang — eBay
SRE is a broad, overarching responsibility, and pulling it off properly requires considering a multitude of roles.
Ash P — Cruform
Slack had a big disruption related to DNSSEC. Here’s an interesting analysis of what may have gone wrong (link).
Let’s Encrypt saw heavy traffic as everyone clamored to renew their certificates, causing certificate issuance to slow down.
This one coincided with the same Amazon EBS outage mentioned above. Xero also had another outage on October 1.