SRE Weekly Issue #327

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating the incident channel, Jira ticket, and Zoom meeting, paging and adding responders, building the postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly LEGO set):
https://rootly.com/demo/

Articles

Even when your system has redundancy, sometimes all the redundant copies fail at once because of what they share in common.

  Marc Brooker

Feature flags make it easy to roll out database schema migrations without downtime. This example uses double-writing and a data migration script.

  Tom Hombergs — Reflectoring
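
For readers who have not seen the pattern, here is a minimal sketch of a flag-guarded double-write migration. It is not taken from the article: the `users` table, the `full_name` to `first_name`/`last_name` split, the in-process `FLAGS` dict, and every function name are hypothetical, and an in-memory SQLite database stands in for a real one.

```python
import sqlite3

# Hypothetical in-process flag store; a real setup would ask a feature-flag service.
FLAGS = {"double_write": False, "read_new_columns": False}


def create_schema(conn):
    # Old column (full_name) and new columns coexist while the migration runs.
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT, "
        "first_name TEXT, last_name TEXT)"
    )


def split_name(full_name):
    # Naive split on the first space; good enough for the illustration.
    first, _, last = full_name.partition(" ")
    return first, last


def save_user(conn, full_name):
    # With the flag on, every write populates both the old and the new shape.
    if FLAGS["double_write"]:
        first, last = split_name(full_name)
        cur = conn.execute(
            "INSERT INTO users (full_name, first_name, last_name) VALUES (?, ?, ?)",
            (full_name, first, last),
        )
    else:
        cur = conn.execute("INSERT INTO users (full_name) VALUES (?)", (full_name,))
    return cur.lastrowid


def backfill(conn):
    # Data migration script: fill the new columns for rows written before the flag flipped.
    rows = conn.execute(
        "SELECT id, full_name FROM users WHERE first_name IS NULL"
    ).fetchall()
    for user_id, full_name in rows:
        first, last = split_name(full_name)
        conn.execute(
            "UPDATE users SET first_name = ?, last_name = ? WHERE id = ?",
            (first, last, user_id),
        )
    return len(rows)


def load_user(conn, user_id):
    # Reads switch to the new columns only once a second flag is turned on.
    if FLAGS["read_new_columns"]:
        first, last = conn.execute(
            "SELECT first_name, last_name FROM users WHERE id = ?", (user_id,)
        ).fetchone()
        return f"{first} {last}".strip()
    (full_name,) = conn.execute(
        "SELECT full_name FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    return full_name


def run_demo():
    FLAGS.update(double_write=False, read_new_columns=False)
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    old_id = save_user(conn, "Ada Lovelace")   # written before double-writing starts
    FLAGS["double_write"] = True
    new_id = save_user(conn, "Grace Hopper")   # written to both shapes
    migrated = backfill(conn)                  # catches the pre-flag row
    FLAGS["read_new_columns"] = True
    return migrated, load_user(conn, old_id), load_user(conn, new_id)
```

Calling `run_demo()` walks the whole sequence: write with the old shape, turn on double-writing, backfill, then switch reads. Keeping the write flag and the read flag separate means either step can be rolled back on its own if something looks wrong.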

Like some kind of Netflix of SRE writing, incident.io just dropped an entire guide on incident management, ready for bingeing. My favorite is the section on on-call compensation.

  Chris Evans — incident.io

A major part of SRE is deciding what level of reliability makes sense, and how prepared you should be. This article drives that point home with an analogy to the James Webb Space Telescope.

  Robert Barron — IBM

Ably posted this design overview of their HA real-time messaging system, with lots of juicy details.

  Jo Stichbury — Ably

An advice columnist helps a newbie on-caller ease into the pager life.

  Liz Fong-Jones — Honeycomb

I like that this article advocates using different templates for different kinds of retrospectives with different goals.

  Myra Nizami — Blameless

Yes, we need more of this! The skills covered are: Communication, Empathy, Teamwork, Motivation, and Documentation.

  Paul Marsicovetere — Formidable

Outages

DigitalOcean
Google Cloud Service Health
Amazon.com
Adobe Creative Cloud
Xiaomi Mijia (home automation)
SoundCloud
Reddit
Cloudflare