SRE Weekly Issue #309

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating the incident channel, Jira, and Zoom; paging the right team; building the postmortem timeline; setting up reminders; and more. Book a demo (+ get a snazzy Rootly shirt).


Why do we use the term “root cause”? I especially love the opening analogy.

  John Allspaw — The ReadME Project

Our Reliability Manifesto is a succinct collection of rules, guidelines, and best practices that reflect our current thinking on what it takes to build a reliable system.

  Christian Hardenberg — Delivery Hero

Here’s how New Relic sets their S*Os.

Set SLIs and SLOs against system boundaries

  Dan Holloran and Elisa Binette — New Relic

It involves lots of machine learning and a “team resilience score”.

  Jennifer Riggins — The New Stack

Every incident is unique, so incident analysis is about learning in order to improve resilience, rather than trying to “fix” a “root cause”.

  Laura Maguire — Jeli

A lot of incident management guides out there are aimed at established, big-scale companies. Things are different when you’re in startup mode.

  Chris Evans —

This is so cool! It’s a guide for what kinds of incidents you’re likely to learn the most from. There’s a long list of things to look out for with explanations.

  Laura Maguire and Vanessa Huerta Granda — Jeli


