SRE Weekly Issue #319

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating the incident channel, Jira ticket, and Zoom meeting, paging and adding responders, building the postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly LEGO set):
https://rootly.com/demo/

Articles

Be judicious when you generate remediation tasks from incidents, or you can end up investing in the wrong area.

  Marc Brooker

Zendesk SRE has a set of 8 reliability principles that guide what they do.

  Jason Smale — Zendesk

We’re going to talk about a few necessities that enable exceptional incident management.

Service ownership
Incident roles
The incident declaration process
Running incident drills

  Robert Ross — FireHydrant

I don’t think you’re supposed to use Consul that way…

Read this article to follow along on an interesting design journey.

  Thomas Ptacek — Fly.io

One single metric for availability probably can’t tell you the whole story.

  Stephen Townshend — Slight Reliability

We can learn from the process another engineer takes to debug a problem. But often, a ticket or problem description is stripped of the process and just has the answer, hampering learning.

  Lorin Hochstein — The ReadME Project (GitHub)

We’re still not 100% there as a team, but I hope this article will serve as a reference for anyone who might create an SRE team in the future.

  @tjun — Mercari

This article gives 6 different ways to organize the findings from your retrospective to share with different audiences.

  Vanessa Huerta Granda — Jeli

There’s a great reliability story in the way that the Hubble telescope and the Apollo missions used gimbals — and in the way that the JWST doesn’t.

  Robert Barron — IBM

Outages

Hulu
IRS

The US Internal Revenue Service’s systems went down on the due date for tax filing.

Instagram