SRE Weekly Issue #310

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating the incident channel, Jira ticket, and Zoom meeting, paging the right team, building the postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly shirt):


Here’s the next incredibly useful article in Jeli’s Incident Analysis 101 series. This one covers the skills and traits of a good incident analyst, along with what not to look for.

  Laura Maguire — Jeli

This article has a remarkable level of detail on 13 incidents at Twitter that were related to cache. The authors open with an explanation of why they focused on cache-related incidents.

  Dan Luu and Yao Yue

[…] the same three pillars form the core of any good process, whether it’s for the largest e-commerce giant or a scrappy SaaS startup.

The three pillars are:


  Lisa Karlin Curtis —

This one recommends doing away with severity labels like “P0” and “P5” in favor of plain words like “Low” and “High”.

  Stephen Whitworth —

Feature flags can be a useful way to resolve user impact during an incident.

  Weihan Li — Rootly
This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.
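
To make the idea concrete, here is a minimal sketch of a feature flag used as a kill switch during an incident. It is not taken from the linked article: the `FeatureFlags` class, the `recommendations` flag name, and `render_homepage` are all hypothetical, and a real setup would typically read flags from a flag service or config store rather than from memory.

```python
from typing import Dict, List


class FeatureFlags:
    """Hypothetical in-memory flag store, for illustration only."""

    def __init__(self, defaults: Dict[str, bool]) -> None:
        self._flags = dict(defaults)

    def is_enabled(self, name: str) -> bool:
        # Unknown flags default to off, so a typo cannot enable a feature.
        return self._flags.get(name, False)

    def set(self, name: str, enabled: bool) -> None:
        self._flags[name] = enabled


def render_homepage(flags: FeatureFlags) -> List[str]:
    """Build the page, skipping the optional section when its flag is off."""
    sections = ["header", "catalog"]
    if flags.is_enabled("recommendations"):
        sections.append("recommendations")
    return sections


flags = FeatureFlags({"recommendations": True})
page_before = render_homepage(flags)

# During an incident: turn the misbehaving feature off without a deploy.
flags.set("recommendations", False)
page_after = render_homepage(flags)
```

The point of the sketch is that the degraded path (the page without the optional section) already exists in the code, so responders can reduce user impact by flipping a flag instead of shipping a fix under pressure.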

Implementing a dead-switch (a dead man's switch) for your alerting tool is important: if the alerting pipeline itself fails silently, you could blissfully sleep through an outage.

  Chris Loukas — HelloFresh
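
As a rough illustration of the dead-switch idea, the sketch below inverts the usual alerting logic: the alerting pipeline sends a regular heartbeat, and an independent watcher raises the alarm when heartbeats stop. This is an assumption-laden sketch, not the HelloFresh implementation; the `DeadMansSwitch` class and the 60-second timeout are hypothetical.

```python
import time
from typing import Optional


class DeadMansSwitch:
    """Trips when the monitored alerting pipeline stops checking in."""

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout_seconds = timeout_seconds
        self.last_heartbeat: Optional[float] = None

    def heartbeat(self, now: Optional[float] = None) -> None:
        # Called by the alerting pipeline on a fixed schedule.
        self.last_heartbeat = time.monotonic() if now is None else now

    def is_tripped(self, now: Optional[float] = None) -> bool:
        # Called by a watcher that runs independently of the pipeline.
        current = time.monotonic() if now is None else now
        if self.last_heartbeat is None:
            return True  # never heard from the pipeline: treat it as down
        return current - self.last_heartbeat > self.timeout_seconds


switch = DeadMansSwitch(timeout_seconds=60)
switch.heartbeat(now=100.0)
healthy = switch.is_tripped(now=150.0)  # 50s of silence, within the timeout
silent = switch.is_tripped(now=161.0)   # 61s of silence, switch trips
```

The watcher needs to run somewhere that does not share failure modes with the alerting tool, otherwise the same outage can silence both.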

As SRE #1, the author of this article got to define the SRE role from the ground up.

  Fred Hebert — Honeycomb

In this article, I will share five lessons I learned about starting SRE teams (or engagements, or organizations).

This article is all about the shape of an SRE team, rather than technical details like SLOs and such.

  Andrea Spadaccini — USENIX ;login:


Oracle Cloud Infrastructure DNS
