Articles
Be judicious when you generate remediation tasks from incidents, or you can end up investing in the wrong area.
  Marc Brooker
Zendesk SRE has a set of 8 reliability principles that guide what they do.
  Jason Smale — Zendesk
We’re going to talk about a few necessities that enable exceptional incident management: service ownership, incident roles, the incident declaration process, and running incident drills.
  Robert Ross — FireHydrant
I don’t think you’re supposed to use Consul that way…
Read this article to follow along on an interesting design journey.
  Thomas Ptacek — Fly.io
One single metric for availability probably can’t tell you the whole story.
  Stephen Townshend — Slight Reliability
We can learn from the process another engineer takes to debug a problem. But often, a ticket or problem description is stripped of the process and just has the answer, hampering learning.
  Lorin Hochstein — The ReadME Project (GitHub)
We’re still not 100% there as a team, but I hope this article will serve as a reference for anyone who might create an SRE team in the future.
  @tjun — Mercari
This article gives 6 different ways to organize the findings from your retrospective to share with different audiences.
  Vanessa Huerta Granda — Jeli
There’s a great reliability story in the way that the Hubble telescope and the Apollo missions used gimbals — and in the way that the JWST doesn’t.
  Robert Barron — IBM
Outages
The US Internal Revenue Service’s systems went down on the due date for tax filing.
Instagram
SRE WEEKLY