SRE Weekly Issue #315

I’m going on vacation, so I’m going to prepare next week’s issue in advance. It’ll look much like most issues, except there won’t be an Outages section. See you all in two weeks!

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating the incident channel, Jira ticket, and Zoom meeting, paging and adding responders, building the postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly LEGO set):
https://rootly.com/demo/

Articles

The previous articles in this series described a process of interviewing incident responders before a full retrospective meeting. This one discusses what to do if you can’t conduct those interviews, the particular challenges that brings, and how to deal with them.

  Emily Ruppe — Jeli

Some interesting ideas on potential downsides of circuit breakers and how we might ameliorate them.

  Marc Brooker

GitHub has had a bit of a hard time lately. Here’s an update on what they’re dealing with and how they’re planning to address it.

  Keith Ballinger — GitHub

All sorts of “mean time to” metrics, including 6(!) different MTTR metrics and how they might be used.

  Alex Ewerlöf — InfoQ

This is a huge 100+-page report on the benefits of a model in which development teams own the operation of their systems. There’s a lot in here, with carefully spelled-out pros/cons and cost/benefit analyses. Need to convince someone? Send them this.

We’ve written this playbook for CxOs, product managers, delivery managers, and operations managers.

  Bethan Timmins and Steve Smith — Equal Experts

It’s easy to miss MTUs, until they sneak up on you and cause really confusing problems.

  Aaron Kalair — Hudl

Should you compensate for on-call? How? I really want to see more articles about this, so send them my way if you see or write any.

  Chris Evans — Incident.io

Some good tips in this article, and I love the case studies.

  Prathamesh Sonpatki — Last9

Outages

PagerDuty
Apple App Store, Apple Music and iCloud
GitHub

They had several incidents this week.

.au TLD

DNSSEC.

Sportsbook.ag