SRE Weekly Issue #306

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating an incident channel, Jira, and Zoom, paging the right team, building the postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly shirt):


In the past, NASA has increased the likelihood of mission success by sending duplicate spacecraft. In the case of the JWST, that’s not an option.

  Robert Barron

This article makes a case that agile development practices depend on SRE.

  Ash P — Cruform Newsletter

This history covers the advent of the Incident Command System (ICS) and subsequently the National Incident Management System (NIMS).

  JJ Tang — Rootly
This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.

Meta migrated their Facebook Ordered Queueing Service (FOQS) to a global, highly available deployment. This article describes the original architecture, lists its shortcomings, and explains how they performed the migration with zero downtime.

  Jasmit Kaur Saluja and Dillon George — Meta

This is the first time I’ve heard of a “Problem Manager” role, and I like it.

  Laurel Frazier — Transposit

How do you make an SLO for a service with long-running requests? One method is to report metrics at regular time intervals.

  Liz Fong-Jones — Honeycomb
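
To make the interval idea concrete, here is a minimal sketch in Python. It is not taken from the linked article: the names `IntervalEvent`, `slice_request`, and `interval_sli`, the 60-second interval, and the "unhealthy span" input are all hypothetical, chosen only for illustration. The assumption is that each long-running request is cut into fixed-length slices, each slice counts as one good or bad event, and the SLI is the fraction of good slices.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class IntervalEvent:
    """One fixed-length slice of a long-running request (hypothetical shape)."""
    request_id: str
    start: float   # seconds since the request began
    end: float     # seconds since the request began
    healthy: bool


def slice_request(
    request_id: str,
    duration_s: float,
    unhealthy_spans: List[Tuple[float, float]],
    interval_s: float = 60.0,
) -> List[IntervalEvent]:
    """Split one request into interval events.

    A slice is marked unhealthy if it overlaps any (start, end) span
    during which the request was known to be failing or degraded.
    """
    events = []
    t = 0.0
    while t < duration_s:
        end = min(t + interval_s, duration_s)
        bad = any(s < end and e > t for s, e in unhealthy_spans)
        events.append(IntervalEvent(request_id, t, end, not bad))
        t = end
    return events


def interval_sli(events: List[IntervalEvent]) -> Optional[float]:
    """Fraction of healthy slices, or None when there are no slices."""
    if not events:
        return None
    return sum(e.healthy for e in events) / len(events)


# A 5-minute request that was degraded from 70s to 80s: one of its five
# 60-second slices is bad, so the request contributes 4 good events and 1 bad.
events = slice_request("req-1", 300.0, [(70.0, 80.0)])
print(interval_sli(events))
```

Counting slices rather than whole requests means a request that runs for hours affects the SLO in proportion to how long it was unhealthy, instead of counting as a single success or failure when it finally ends.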

A failure in their Software-Defined Networking (SDN) configuration system required manual recovery.



Amazon Alexa

This link points to their post-incident report including a detailed section on what they learned from the incident.

