SRE Weekly Issue #327

Articles

Even when your system has redundancy, all the redundant copies can fail at once because of something they have in common.

Marc Brooker

Zero Downtime Database Changes with Feature Flags

Feature flags make it easy to roll out database schema migrations without downtime. This example uses double-writing and a data migration script.

Tom Hombergs — Reflectoring
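
For readers who haven't seen the double-writing pattern mentioned in the blurb above, here is a minimal sketch of the general idea. It is not the code from the linked article: the table names, the `FLAGS` dictionary, and every function name are hypothetical, and an in-memory SQLite database stands in for a real one. A real rollout would read its flags from a feature flag service instead of a module-level dictionary.

```python
import sqlite3

# Hypothetical flags, stored in a dict purely for illustration.
FLAGS = {"double_write": False, "read_new": False}


def setup():
    """Create the old and new schemas side by side (both names are assumptions)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users_old (id INTEGER PRIMARY KEY, full_name TEXT)")
    db.execute(
        "CREATE TABLE users_new (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
    )
    return db


def split_name(full_name):
    """Split on the first space; the remainder becomes the last name."""
    first, _, last = full_name.partition(" ")
    return first, last


def save_user(db, user_id, full_name):
    # The old schema keeps receiving every write until the rollout is finished.
    db.execute("INSERT OR REPLACE INTO users_old VALUES (?, ?)", (user_id, full_name))
    if FLAGS["double_write"]:
        # With the flag on, the same write also lands in the new schema.
        first, last = split_name(full_name)
        db.execute(
            "INSERT OR REPLACE INTO users_new VALUES (?, ?, ?)", (user_id, first, last)
        )


def load_user(db, user_id):
    if FLAGS["read_new"]:
        row = db.execute(
            "SELECT first_name, last_name FROM users_new WHERE id = ?", (user_id,)
        ).fetchone()
        return f"{row[0]} {row[1]}".strip() if row else None
    row = db.execute(
        "SELECT full_name FROM users_old WHERE id = ?", (user_id,)
    ).fetchone()
    return row[0] if row else None


def backfill(db):
    """Data migration: copy rows written before double-writing was enabled."""
    rows = db.execute(
        "SELECT id, full_name FROM users_old "
        "WHERE id NOT IN (SELECT id FROM users_new)"
    ).fetchall()
    for user_id, full_name in rows:
        first, last = split_name(full_name)
        db.execute("INSERT INTO users_new VALUES (?, ?, ?)", (user_id, first, last))
    return len(rows)
```

The order of operations is what keeps the change online: turn on `double_write`, run `backfill` to catch older rows, then flip `read_new`. Either flag can be switched back off if something looks wrong, since the old table is still being written.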

Incident Management Guide

Like some kind of Netflix of SRE writing, incident.io just dropped an entire guide on incident management, ready for bingeing. My favorite is the section on on-call compensation.

Chris Evans — incident.io

Known Unknowns — Webb Struck by Meteoroid!

A major part of SRE is deciding what level of reliability makes sense, and how prepared you should be. This article drives that point home with an analogy to the James Webb Space Telescope.

Robert Barron — IBM

A multi-region AWS architecture for low latency edge messaging

Ably posted this design overview of their HA real-time messaging system, with lots of juicy details.

Jo Stichbury — Ably

Going On Call for the First Time

An advice columnist helps a newbie on-caller ease into the pager life.

Liz Fong-Jones — Honeycomb

Retrospective Template (What They Are & How To Use One)

I like that this article advocates using different templates for different kinds of retrospectives with different goals.

Myra Nizami — Blameless

Five Essential Non-technical Skills for SRE Success

Yes, we need more of this! The skills covered are: Communication, Empathy, Teamwork, Motivation, and Documentation.