SRE Weekly Issue #356

Thanks to all of you who took the time to share your ideas about choosing incidents to investigate! I got some great answers and I’m looking forward to pulling them together into an article.

I decided to give this GPT-3 thing a spin. It turns out that it absolutely can assemble a newsletter with links to the week’s top SRE stories, each with a short description. It even includes authors. The authors are even real people. The URLs, though… well, they look real, but they’re mysteriously all 404s, and the articles don’t actually exist. Guess you’re stuck with me for now!

Articles

Platform Engineering as a Startup

This article takes the idea of “internal customers” to its logical conclusion, by treating the platform in the same way as a startup company.

Adam Buggia — Sym

Blameless Postmortems & Bayes’ Theorem

This article uses nifty probability formulas to show that blaming an engineer for an incident may well result in diminished reliability and efficiency.

Dan Slimmon

CircleCI incident report for January 4, 2023 security incident

Here’s a report on the CircleCI security incident at the start of the year. There’s some good stuff in there about not blaming the specific engineer whose device was attacked.

Rob Zuber — CircleCI

Counting Forest Fires

A hot take on how not to measure your incident response process.

Fred Hebert — Honeycomb
Full disclosure: Honeycomb is my employer.

How eBay’s Notification Platform Used Fault Injection in New Ways

eBay’s notification platform team built a fault-tolerant, resilient system by injecting faults at the application level.

Wei Chen — eBay

A small mistake does not a complex systems failure make

This one succinctly sums up why I haven’t covered the NOTAM outage much yet.

If a small mistake was sufficient to take down a complex system, then our systems would be crashing all of the time.

Lorin Hochstein

Production postmortem: The heisenbug server

Don’t you love when merely running strace fixes the problem?

Oren Eini

Question of Intent: The crash of Garuda Indonesia flight 200

This air accident seems on its face to be a clear-cut story of negligence. There’s far more to it, and the author goes into detail on why blaming the captain can damage air safety industry-wide.

Admiral Cloudberg
