SRE Weekly Issue #309

Articles

What we talk about when we talk about ‘root cause’

Why do we use the term “root cause”? I especially love the opening analogy.

John Allspaw — The ReadME Project

Our Reliability Manifesto is a succinct collection of rules, guidelines, and best practices that reflect our current thinking on what it takes to build a reliable system.

Christian Hardenberg — Delivery Hero

Best practices for setting SLOs and SLIs for modern, complex systems

Here’s how New Relic sets their S*Os.

Set SLIs and SLOs against system boundaries

Dan Holloran and Elisa Binette — New Relic

How Amazon Prime Video’s Engineering Teams Build Resilience

It involves lots of machine learning and a “team resilience score”.

Jennifer Riggins — The New Stack

What is incident analysis and why should we do it?

Every incident is unique, so incident analysis is about learning in order to improve resilience, rather than trying to “fix” a “root cause”.

Laura Maguire — Jeli

The startup guide to sensible incident management

A lot of incident management guides out there are aimed at large, established companies. Things are different when you’re in startup mode.

Chris Evans — incident.io

Incident Analysis 101: Which incidents should you investigate?

This is so cool! It’s a guide to which kinds of incidents you’re likely to learn the most from, with a long list of things to look out for and an explanation of each.

Laura Maguire and Vanessa Huerta Granda — Jeli

Outages

Twitter
PokerStars
reddit