Datadog posted a report on their major outage in March, and it’s a doozy. An unattended updates system that they didn’t even want, need, or know about triggered across all hosts in multiple clouds nearly simultaneously, causing a regression.
Alexis Lê-Quôc — Datadog
GitHub has had a string of apparently unrelated outages recently, and they’ve posted this description.
Mike Hanley — GitHub
Oh look, another awesome-* repo relevant to our interests!
A repo of links to articles, papers, conference talks, and tooling related to load management in software services: load shedding, circuit breaking, quota management, and throttling. PRs welcome.
Laura Nolan and Niall Murphy — Stanza Systems
This interview covers a lot of ground, including looking beyond just “up or down” when considering reliability.
Prathamesh Sonpatki — SRE Stories
If you’re in the mood for a deep systems debugging story, you’re in for a treat. The author takes you along for the ride with a wealth of detailed code snippets.
Tycho Andersen — Netflix
Regardless of the replication mechanism, you must fsync() your data to prevent global data loss in non-Byzantine protocols.
Denis Rystsov and Alexander Gallego — Redpanda
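For anyone who hasn't had to think about fsync() lately, here is a minimal sketch of what "fsync your data" looks like on a single node. It is not taken from the Redpanda article: the function name `append_durably` and the newline-delimited log format are assumptions made purely for illustration. The point is only that a write should not be acknowledged until both Python's buffer and the OS page cache have been pushed to the device.

```python
import os


def append_durably(log_path: str, record: bytes) -> int:
    """Append one record and return the log size, but only after fsync() returns.

    Hypothetical helper for illustration; real replicated logs do far more
    (checksums, batching, also fsyncing the parent directory for new files).
    """
    with open(log_path, "ab") as log:
        log.write(record + b"\n")
        log.flush()              # move data from Python's buffer into the OS page cache
        os.fsync(log.fileno())   # ask the OS to persist the page cache to the device
        # Only at this point is it reasonable to acknowledge the write to a peer.
        return os.fstat(log.fileno()).st_size
```

A node in a replicated system would call something like this before replying "ok" to the leader. Skipping the `os.fsync` step is the shortcut the quote above warns about.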
Emotional intelligence is a critical skill for SREs, especially when we interact with other teams in fraught situations.
Amin Astaneh — Certo Modo
Wow! Spotify created a set of tools to perform automated refactoring of thousands of repositories at once. This includes the ability to run tests, automatically merge pull requests without human review, and roll refactorings out gradually.
Matt Brown — Spotify
Jeli has published a one-page cheat sheet for their highly detailed Howie guide for running incident retrospectives.