SRE Weekly Issue #390


Many apologies to my email subscribers, who have seen two accidental re-sends of old issues recently due to a weird glitch in my automation. I think I’ve gotten a handle on it, and I’ll run an internal retrospective of this incident, of course.

Articles

SRE vs Platform Engineer: Can’t We All Just Get Along?

Is it really SRE vs platform engineer? Or is there a way platforms can take site reliability to the next level?

Jennifer Riggins — The New Stack

Our Prerequisites are Never Enough for High Risk

A surgeon delves into the key component that allows a group of skilled individuals to work effectively and safely together, using the term “heed” to describe this special interaction.

Sidenote: in a hilarious coincidence, this article managed to spoil a movie I was in the middle of watching (Arrival), but it also put me in a really cool mindset to watch the rest of the film.

Dr. Rob Poston

Degraded Performance: Square Services

More details on Square’s outage from a couple of weeks ago (it was DNS).

Square

Azure status history

Azure had an interesting outage in its Australia East region involving a power failure and the order in which cooling units were restored.

Microsoft Azure

How Did It Make Sense at the Time? Understanding Incidents As They Occurred, Not as They Are Remembered

Asking this question is how you unlock the hidden essence of an incident. This talk compares two public incident reports to show what it looks like when you dig into this question and when you don’t.

Jacob Scott — InfoQ

The Fallible Mind: The crash of Comair flight 5191

In this air accident, the pilots made a seemingly inexplicable mistake.

This sentence really stood out to me, especially after reading the “How Did It Make Sense at the Time?” article:

When we inexplicably grab the wrong utensil when cooking or accidentally start taking our dirty dishes to the bathroom instead of the kitchen, we should be thankful that we aren’t responsible for a plane full of people.

Admiral Cloudberg

GitHub Availability Report: August 2023

There’s an interesting failure mode in this one that might stand out for the Kafka admins among us:

The Kafka consumer ended up stuck in a loop, unable to stabilize fast enough before timing out and restarting the coordination process.

Jakub Oleksy — GitHub

The connection between incident management and problem management

After explaining the difference between the ITIL terms “incident management” and “problem management”, this article goes into a discussion of recent trends and whether it still makes sense to draw a distinction between the two.

Luis Gonzalez — incident.io


