SRE Weekly Issue #334

I’ll be on vacation starting next Sunday (yay!). That means the next two issues will be prepared in advance, so there won’t be an Outages section.

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating the incident channel, Jira ticket, and Zoom call; paging and adding responders; building the postmortem timeline; setting up reminders; and more. Book a demo (+ get a snazzy Rootly lego set):


Should you go multi-cloud? What should you do during an incident involving a third-party dependency? What about after? Read this one for all that and more.

  Lisa Karlin Curtis —
Full disclosure: Fastly, my employer, is mentioned.

An introduction to the concept of common ground breakdown, using the Uvalde shooting in the US as a case study.

  Lorin Hochstein

The comments section is full of some pretty great advice, including questions you can ask while interviewing to suss out whether the on-call culture is going to be livable.

  u/dicksoutfoeharambe (and others) — reddit

From the archives, this is an analysis of a report on the 2018 major outage at TSB Bank in the UK.

  Jon Stevens-Hall

You can determine whether backoff will actually help your system, and this article does a great job of telling you how.

  Marc Brooker

I’ve read (and written) plenty of IC training guides, but this is the first time I’ve come across the concept of a “Hands-Off Update”. I’m definitely going to use that!

  Dan Slimmon

This is a really great explanation of observability from an angle I haven’t seen before.

a metric dashboard only contributes to observability if its reader can interpret the curves they’re seeing within a theory of the system under study.

  Dan Slimmon


Google Search

Did you catch the Google search outage? I’ve never seen one like it — that’s how rare they are. Google shared a tidbit of information about what went wrong — and it wasn’t the datacenter explosion folks speculated about.

