Articles
Why do we use the term “root cause”? I especially love the opening analogy.
John Allspaw — The ReadME Project
Our Reliability Manifesto is a succinct collection of rules, guidelines, and best practices that reflect our current thinking on what it takes to build a reliable system.
Christian Hardenberg — Delivery Hero
Here’s how New Relic sets their S*Os.
Set SLIs and SLOs against system boundaries
Dan Holloran and Elisa Binette — New Relic
It involves lots of machine learning and a “team resilience score”.
Jennifer Riggins — The New Stack
Every incident is unique, so incident analysis is about learning in order to improve resilience, rather than trying to “fix” a “root cause”.
Laura Maguire — Jeli
A lot of incident management guides out there are aimed at large, established companies. Things are different when you’re in startup mode.
Chris Evans — incident.io
This is so cool! It’s a guide to the kinds of incidents you’re likely to learn the most from, with a long list of things to look out for and an explanation of each.
Laura Maguire and Vanessa Huerta Granda — Jeli
Outages
Twitter
PokerStars
reddit
SRE WEEKLY