SRE Weekly Issue #307

Articles

Roblox Return to Service 10/28-10/31 2021

This followup to their initial incident report offers a lot to learn from, especially if you run Consul at scale.

Daniel Sturman and others — Roblox

This week, I came across the Byford Dolphin diving bell incident. At face value, this accident looks like “human error”, but there’s much more to it. Content warning: the accident was quite grisly.

Wikipedia

What is Canary Analysis?

Canary testing is more than just deploying your code to a small part of your fleet. You need a plan for how you’re going to spot problems.

Jyoti Sahoo — OpsMx
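
To make the "plan for how you're going to spot problems" concrete, here is a minimal sketch of one possible check: comparing the canary's error rate against the baseline fleet's. It is not taken from the article. The function name, the `max_ratio`, `min_requests`, and `floor` parameters, and their default values are all assumptions chosen for illustration.

```python
def canary_passes(baseline_errors, baseline_requests,
                  canary_errors, canary_requests,
                  max_ratio=1.5, min_requests=100, floor=0.001):
    """Return True/False for pass/fail, or None if there is too little data.

    All thresholds here are illustrative assumptions, not recommendations.
    """
    # Refuse to judge until both groups have served enough traffic.
    if baseline_requests < min_requests or canary_requests < min_requests:
        return None

    baseline_rate = baseline_errors / baseline_requests
    canary_rate = canary_errors / canary_requests

    # Allow the canary to be somewhat worse than baseline, with a small
    # absolute floor so a near-zero baseline does not fail on one error.
    allowed = max(baseline_rate * max_ratio, floor)
    return canary_rate <= allowed


if __name__ == "__main__":
    print(canary_passes(10, 10_000, 1, 1_000))   # canary matches baseline
    print(canary_passes(10, 10_000, 5, 1_000))   # canary is 5x worse
    print(canary_passes(10, 10_000, 0, 50))      # not enough canary traffic
```

A real canary analysis would likely look at several metrics and use a statistical test rather than a fixed ratio, but the shape is the same: decide in advance what "worse than baseline" means.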

Fixing Performance Regressions Before they Happen

My favorite part is how they look for changes in performance, rather than using a static threshold.

Angus Croll — Netflix
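
The idea of looking for changes rather than comparing against a static threshold can be sketched roughly as follows. This is not Netflix's implementation, just a simple illustration that flags a new measurement when it sits well above the recent history. The function name and the `sigmas` parameter are assumptions.

```python
from statistics import mean, stdev


def is_regression(history, latest, sigmas=3.0):
    """Flag `latest` if it exceeds the recent mean by `sigmas` standard deviations.

    `history` is a list of recent measurements (for example, latency in ms).
    The threshold moves with the data instead of being fixed in advance.
    """
    # stdev needs at least two points; with less history, do not flag anything.
    if len(history) < 2:
        return False

    mu = mean(history)
    sd = stdev(history)

    # Note: if history is perfectly flat (sd == 0), any increase is flagged.
    return latest > mu + sigmas * sd


if __name__ == "__main__":
    recent = [100, 102, 98, 101, 99]
    print(is_regression(recent, 103))  # within normal variation
    print(is_regression(recent, 110))  # well above recent behavior
```

A fixed limit of, say, 150 ms would have missed the jump to 110 entirely, which is the appeal of comparing against recent behavior instead.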

How to Answer Network Outage Questions

It pays to think ahead about how you’ll answer questions from execs during an incident.

Chris Fenning — DZone

Incorrect proxying of 24 hostnames on January 24, 2022

On January 24, 2022, as a result of an internal Cloudflare product migration, 24 hostnames (including www.cloudflare.com) that were actively proxied through the Cloudflare global network were mistakenly redirected to the wrong origin.

Jeremy Hartman — Cloudflare

Analyzing SRE Job Postings – From Amazon to Microsoft

An analysis of SRE job descriptions from 4 companies highlights what businesses actually expect SREs to do.

JP Cheung — Rootly
This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.

Google Team That Keeps Services Online Rocked by Mental Health Crisis

Members of the search giant’s site reliability group say managers fostered a toxic environment. Google says a ‘safe, inclusive workplace’ is a top priority.

Nico Grant — Bloomberg

Outages

Solana
iCloud
Ticketmaster
Amazon Alexa
Reddit
Discord