SRE Weekly Issue #306

Articles

The James Webb Space Telescope — Success through Redundancy

In the past, NASA has increased the likelihood of mission success by sending duplicate spacecraft. In the case of the JWST, that’s not an option.

Robert Barron

Agile and SRE are NOT mutually exclusive

This article makes a case that agile development practices depend on SRE.

Ash P — Cruform Newsletter

A Primer on the History and Evolution of Incident Management to Today

This history covers the advent of the Incident Command System (ICS) and subsequently the National Incident Management System (NIMS).

JJ Tang — Rootly
This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.

FOQS: Making a distributed priority queue disaster-ready

Meta migrated their Facebook Ordered Queueing Service (FOQS) to a global, highly available deployment. This article describes the original architecture, lists its shortcomings, and explains how they performed the migration with zero downtime.

Jasmit Kaur Saluja and Dillon George — Meta

Problem Manager 2.0: Resilience Engineering Advocate

This is the first time I’ve heard of a “Problem Manager” role, and I like it.

Laurel Frazier — Transposit

Ask Miss O11y: Long-Running Requests

How do you define an SLO for a service with long-running requests? One method is to report metrics at regular time intervals.

Liz Fong-Jones — Honeycomb
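
To make the interval idea in the long-running requests item concrete, here is a minimal sketch of one way it could work. It is not taken from the linked article: the names `IntervalReport`, `interval_reports`, and `good_interval_ratio` are hypothetical, and the rule that an interval counts as bad if it overlaps any stall is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class IntervalReport:
    """One report emitted per interval while a request is in flight (hypothetical shape)."""
    request_id: str
    start_s: float
    healthy: bool


def interval_reports(
    request_id: str,
    start_s: float,
    end_s: float,
    interval_s: float,
    stalls: List[Tuple[float, float]],
) -> List[IntervalReport]:
    """Split one long request into fixed intervals and mark each as healthy or not.

    Assumption: an interval is unhealthy if it overlaps any (start, end) stall window.
    """
    reports = []
    t = start_s
    while t < end_s:
        window_end = min(t + interval_s, end_s)
        stalled = any(s < window_end and e > t for s, e in stalls)
        reports.append(IntervalReport(request_id, t, not stalled))
        t += interval_s
    return reports


def good_interval_ratio(reports: List[IntervalReport]) -> float:
    """SLI sketch: the fraction of reported intervals that were healthy."""
    if not reports:
        return 1.0  # no traffic: treat the SLI as met (a design choice, not a rule)
    return sum(r.healthy for r in reports) / len(reports)


# A 5-minute request reporting every 60 seconds, with one stall from 130s to 170s.
reports = interval_reports("req-1", 0, 300, 60, [(130, 170)])
print(good_interval_ratio(reports))
```

Because every request contributes one data point per interval rather than one per completion, a request that runs for hours shows up in the SLI while it is still in flight.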

GCP Incident Report — us-west1-b incident on January 8, 2022

A failure in their Software-Defined Networking (SDN) configuration system required manual recovery.

Google

Outages

NS1
Telegram
Amazon Alexa
Twitter
Roku
Twitch
Enom

This link points to Enom's post-incident report, which includes a detailed section on what they learned from the incident.
