SRE Weekly Issue #316


I’m on vacation, so I prepared this issue in advance. Practically speaking, that just means there’s no Outages section this week. See you all next week!

P.S. Okay, I know I said no outages, but I will say that I’m keeping an eye on the Southwest Airlines outage, because we’re kind of counting on them to get home in a few days…

Articles

Do I need an incident debrief?

Yes.

Chris Evans — incident.io

How Disaster Ready Are Your Backup Systems, Really?

If you don’t test them, you don’t have backups; you have a lottery ticket. Except the chance of winning is high. And the prize is data loss.

Emily Arnott — Blameless

Self-Compassion Instead of Self-Blame

Being blameless does not mean blaming no one outwardly and blaming yourself inside your head.

Emily Arnott — Blameless

Spike detection in Alert Correlation

LinkedIn’s Alert Correlation system posts recommendations to Slack about which microservice may be at the heart of an incident.

Nishant Singh — LinkedIn

Runbooks, Playbooks, & SOPs — What’s the Difference?

I always get the two confused. This article explains the difference and gives tips for writing runbooks. The same folks have also written more on runbooks.

Jessica Abelson — Transposit

Dissecting the S3 SLA

There are many intricate details in there! For example, the S3 SLA is per calendar month, not a rolling window, so the SLA of your product based on it might need to match.

Alex Ewerlöf
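
To make the calendar-month point concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the outage times, the 99.5% target, and the `availability` helper are made up for illustration, and it measures plain time-based downtime, which is a simplification and not necessarily how the S3 SLA itself defines uptime.

```python
from datetime import datetime


def availability(outages, start, end):
    """Percent of the window [start, end) not covered by an outage.

    Assumes the outages do not overlap each other.
    """
    total = (end - start).total_seconds()
    down = 0.0
    for o_start, o_end in outages:
        # Count only the part of the outage that falls inside the window.
        overlap = (min(o_end, end) - max(o_start, start)).total_seconds()
        if overlap > 0:
            down += overlap
    return 100.0 * (1 - down / total)


# Hypothetical 6-hour outage straddling the January/February boundary.
outages = [(datetime(2022, 1, 31, 21), datetime(2022, 2, 1, 3))]

TARGET = 99.5  # hypothetical target, not a figure from any real SLA

# Calendar months: each month only sees 3 of the 6 hours.
jan = availability(outages, datetime(2022, 1, 1), datetime(2022, 2, 1))
feb = availability(outages, datetime(2022, 2, 1), datetime(2022, 3, 1))

# Rolling 30-day window: sees all 6 hours at once.
rolling = availability(outages, datetime(2022, 1, 16), datetime(2022, 2, 15))

for name, value in [("January", jan), ("February", feb), ("Rolling 30d", rolling)]:
    status = "meets" if value >= TARGET else "misses"
    print(f"{name}: {value:.3f}% ({status} {TARGET}%)")
```

In this made-up scenario both calendar months meet the target while the rolling window misses it, which is why a product promising a rolling-window SLA on top of a calendar-month one could end up owing credits it cannot claim upstream.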

Best Practices for Postmortems: A guide

The ins and outs of conducting an effective postmortem. Ready templates and examples from leading organizations around the world!

Prathamesh Sonpatki — Last9
