SRE Weekly Issue #421

Last week, I mistakenly attributed an article to PagerDuty. Actually, it was by Paige Cruz, whose clever blog name I didn’t pay anywhere near close enough attention to! Thanks to the several readers who nudged me gently about my error.

A message from our sponsor, FireHydrant:

FireHydrant is now AI-powered for faster, smarter incidents! Power up your incidents with auto-generated real-time summaries, retrospectives, and status page updates.

If you’ve been in this business long enough, you’ve almost certainly run into an incident where one of the contributors was an implicit invariant that was violated by a new change.

Easily the majority of incidents I’ve been in.

  Lorin Hochstein

This article is about trying to solve for this problem:

a potentially significant number of customers or queries can be affected by an outage and this won’t trigger an SLO violation.

  Niall Murphy

A surgeon grapples with the difficulty of building a culture of retrospectives and introspection in his surgical team by running a fascinating retro on himself in this blog post.

  Robert Poston, MD

An argument for buying yourself time to slow down and make decisions carefully, as a way of ultimately speeding up incident resolution.

  Shayon Mukherjee

Disasters threatening a business’ ability to operate core functions don’t occur that often (phew!), but we do want to ensure we are prepared to keep our business running if they do. To practice disaster response skills, we run business continuity drills, and you can too with our 10-step plan!

  Janna Brummel — WeTransfer

How people think about reliability varies between companies. Which of the four different perspectives laid out in this article does your company fit into, if any?

  Ross Brodbeck

Honeycomb posted this followup on their April 9 outage, explaining what went wrong and how they’re responding.


  Full disclosure: Honeycomb is my employer.

The author of this article posed a question on r/sre:

What matters most for your success as an SRE?

They share a summary of the answers they got, with their commentary.

  Nočnica Mellifera — Checkly

