Wow, this interactive tool for choosing SLOs is fun to play with! Dragging the sliders really gives you a feel for the math involved, and then you get a formula that you can actually use.
Alex Ewerlöf
A riveting story of a service that was the victim of its own success, a potential solution, and then further challenges to overcome.
Tanat Lokejaroenlarb — Adevinta
Here’s a classic example of “work as imagined” vs “work as done”, as health care workers struggle against difficult security constraints while trying to care for patients.
Fred Hebert — summary
Ross Koppel, Sean Smith, Jim Blythe, and Vijay Kothari — original paper
This article covers a lot of ground, touching on many components of a successful SRE program, and even includes a code example for SLO calculation.
Vishal Padghan — Squadcast
More on the weird EBS performance regression I linked to last week. Still no full explanation of what changed, but at least they have a solution (gp3 volumes).
Dustin Brown — dolthub
After a massive 73-hour outage, Roblox set out to redesign their infrastructure to make that kind of incident much less likely. They’ve charted a path through several intermediate architectures, with the ultimate goal of active-active datacenters.
Daniel Sturman, Max Ross, and Michael Wolf — Roblox
Now here’s one that really makes me think. I can’t summarize it in a sentence, so just go read it.
Lorin Hochstein
SRE WEEKLY