Even when your system has redundancy, sometimes all the redundant copies fail at once because of what they share in common.
Feature flags make it easy to roll out database schema migrations without downtime. This example uses double-writing and a data migration script.
Tom Hombergs — Reflectoring
Like some kind of Netflix of SRE writing, incident.io just dropped an entire guide on incident management, ready for bingeing. My favorite is the section on on-call compensation.
Chris Evans — incident.io
A major part of SRE is deciding what level of reliability makes sense, and how prepared you should be. This article drives that point home with an analogy to the James Webb Space Telescope.
Robert Barron — IBM
Ably posted this design overview of their HA real-time messaging system, with lots of juicy details.
Jo Stichbury — Ably
An advice columnist helps a newbie on-caller ease into the pager life.
Liz Fong-Jones — Honeycomb
I like that this article advocates using different templates for different kinds of retrospectives with different goals.
Myra Nizami — Blameless
Yes, we need more of this! The skills covered are: Communication, Empathy, Teamwork, Motivation, and Documentation.
Paul Marsicovetere — Formidable