SRE Weekly Issue #324

Articles

The Need to Decouple Human Error from Incident Response

We’ll start off this week with a recap of a KubeCon talk that urges leaving the concept of “human error” behind.

Jennifer Riggins — The New Stack
Talk by Silvia Pina

5 Tips If You’re the 1st SRE Hire by Instacart’s First SRE

Just to be clear, the tips are written by Instacart’s first SRE; they’re not aimed, oddly specifically, at Instacart’s second SRE. Good tips, too.

Quentin Rousseau — Rootly
This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.

Systems should expose a (simple) overall health metric as well as specifics

This is a really good point, and well argued. Then there’s an amusing bit at the end about alerting on the number of WARNING-level log messages generated by the system as a proxy for overall health.

Chris Siebenmann

Tracking On-Call Health

In this post, I’m going to expand on the values we’re currently using at Honeycomb to monitor on-call health, why we think they’re good, and some of the challenges we’re still encountering.

Fred Hebert — Honeycomb

When incident response requires business response, who should you notify?

Internal and external communication are critical in an incident, second (perhaps) only to actually resolving the problem. Read this article to learn about who you need to communicate with, how to talk to them, and how to prepare in advance.

Hannah Culver — PagerDuty

We can’t all be Shaq: why it’s time for the SRE hero to pass the ball and how to get there

If you’re playing the hero role at your organization, you might be unintentionally masking the need for better incident management practices.

Malcolm Preston — FireHydrant

Outages

Oracle Dyn
Meta
Starbucks
easyJet
Google BigQuery
GitHub