In the past, NASA has increased the likelihood of mission success by sending duplicate spacecraft. In the case of the JWST, that’s not an option.
This article makes a case that agile development practices depend on SRE.
Ash P — Cruform Newsletter
This history covers the advent of the Incident Command System (ICS) and, later, the National Incident Management System (NIMS).
JJ Tang — Rootly
This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.
Meta migrated their Facebook Ordered Queueing Service (FOQS) to a global, highly available deployment. This article describes the original architecture, lists its shortcomings, and explains how they performed the migration with zero downtime.
Jasmit Kaur Saluja and Dillon George — Meta
This is the first time I’ve heard of a “Problem Manager” role, and I like it.
Laurel Frazier — Transposit
How do you make an SLO for a service with long-running requests? One method is to report metrics at regular time intervals.
Liz Fong-Jones — Honeycomb
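To make the interval idea concrete, here is a minimal sketch. It is not taken from the article or from Honeycomb's tooling. It assumes you can tell which time spans of a request were unhealthy. The names `IntervalEvent`, `slice_request`, and `sli` are hypothetical. Each long-running request is cut into fixed-length slices, and the SLI counts good slices instead of good requests.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class IntervalEvent:
    request_id: str
    interval_start: float  # seconds since the request began
    healthy: bool


def slice_request(
    request_id: str,
    duration_s: float,
    unhealthy_spans: List[Tuple[float, float]],
    interval_s: float = 60.0,
) -> List[IntervalEvent]:
    """Emit one event per fixed-length interval of a long-running request.

    An interval counts as unhealthy if it overlaps any (start, end) span
    in unhealthy_spans. How those spans are detected is left as an assumption.
    """
    events = []
    start = 0.0
    while start < duration_s:
        end = min(start + interval_s, duration_s)
        bad = any(s < end and e > start for s, e in unhealthy_spans)
        events.append(IntervalEvent(request_id, start, not bad))
        start += interval_s
    return events


def sli(events: List[IntervalEvent]) -> float:
    """Fraction of intervals that were healthy (1.0 when there is no data)."""
    if not events:
        return 1.0
    return sum(e.healthy for e in events) / len(events)
```

With this shape, a five-minute request that was unhealthy for ten seconds contributes five events, one of them bad, so it neither counts as a single failure nor waits until completion to be measured.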
A failure in their Software-Defined Networking (SDN) configuration system required manual recovery.
This link points to their post-incident report, which includes a detailed section on what they learned from the incident.