This research paper summary covers mode error and the dangers of adding features to a system in the form of modes, especially when the system can change modes on its own.
Fred Hebert (summary)
Dr. Nadine B. Sarter (original paper)
Cloudflare suffered a power outage in one of the datacenters housing their control and data planes. The outage itself is intriguing, and in its aftermath, Cloudflare learned that their system wasn’t as highly available as they thought.
Lots of great lessons here, and if you want more, they posted another incident writeup recently.
Matthew Prince — Cloudflare
Separating write from read workloads can increase complexity but also open the door to greater scalability, as this article explains.
Pier-Jean Malandrino
Covers four strategies for load shedding, with code examples:
Random Shedding
Priority-Based Shedding
Resource-Based Shedding
Node Isolation
Code Reliant
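To make the first two strategies concrete, here is a minimal sketch (not taken from the linked article). It assumes `load` is a utilization figure between 0 and 1 and that priority 0 means critical; the function names, the 0.8 threshold, and the priority scale are all hypothetical choices for illustration.

```python
import random


def should_shed_random(load, threshold=0.8, rng=random.random):
    """Random shedding: above the threshold, drop a growing share of requests.

    Assumes 0 <= load <= 1 and threshold < 1 (both are illustrative choices).
    """
    if load <= threshold:
        return False
    # Drop probability rises linearly from 0 at the threshold to 1 at full load.
    drop_probability = min(1.0, (load - threshold) / (1.0 - threshold))
    return rng() < drop_probability


def should_shed_priority(load, priority, threshold=0.8):
    """Priority-based shedding: under pressure, keep only critical requests.

    priority 0 = critical; larger numbers = less important (hypothetical scale).
    """
    if load <= threshold:
        return False
    return priority > 0
```

Random shedding treats every request alike, which keeps it simple; the priority variant needs callers to label requests but protects the traffic that matters most.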
Lots of juicy details about the three outages, including a link to AWS’s write-up of their Lambda outage in June.
Gergely Orosz
The diagrams in this article are especially useful for understanding how the circuit-breaker pattern works.
Pier-Jean Malandrino
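For readers who prefer code to diagrams, the pattern's usual three states (closed, open, half-open) can be sketched roughly as follows. This is an illustrative sketch rather than the article's implementation; the class name, thresholds, and injectable `clock` parameter are assumptions made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            # Fail fast instead of hitting a dependency that is known to be unhealthy.
            raise CircuitOpenError("circuit is open; call skipped")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping a flaky dependency then looks like `breaker.call(fetch_profile, user_id)`, where `fetch_profile` stands in for whatever remote call you want to protect.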
This one’s about how on-call can go bad, and how to structure your team’s on-call so that it is livable and sustainable.
Michael Hart
Execs cast a big shadow in an incident, so it’s important to have a plan for how to communicate with them, as this article explains.
Ashley Sawatsky — Rootly
SRE WEEKLY