Less than 12 hours after their outage, Cloudflare posted this detailed run-down of what happened.
Tom Strick and Jeremy Hartman — Cloudflare
Metastable failures occur in open systems with an uncontrolled source of load where a trigger causes the system to enter a bad state that persists even when the trigger is removed.
By “derivatives”, the author means rate-of-change, like Prometheus’s irate(). Derivatives have their place, but this article has good reasons to reconsider using them for alerts.
In this article, I’ll dive into what it takes to get into site reliability engineering, how to adopt it within your own organization and some of the core principles and best practices you’ll need to keep in mind as you move forward in your SRE maturity journey.
Jemiah Sius — devops.com
I have given and received this question in many SRE interviews, and it’s famously used by Google in their interviews. This article dissects the question and its merits and downsides for the benefit of both interviewers and interviewees.
Cloudflare had a major outage, taking many sites and services with it.