I’ll be on vacation starting next Sunday (yay!). That means the next two issues will be prepared in advance, so there won’t be an Outages section.
Should you go multi-cloud? What should you do during an incident involving a third-party dependency? What about after? Read this one for all that and more.
Lisa Karlin Curtis — incident.io
Full disclosure: Fastly, my employer, is mentioned.
An introduction to the concept of common ground breakdown, using the Uvalde shooting in the US as a case study.
The comments section is full of some pretty great advice, including questions you can ask while interviewing to suss out whether the on-call culture is going to be livable.
u/dicksoutfoeharambe (and others) — reddit
From the archives, this is an analysis of a report on the 2018 major outage at TSB Bank in the UK.
You can determine whether backoff will actually help your system, and this article does a great job of telling you how.
I’ve read (and written) plenty of IC training guides, but this is the first time I’ve come across the concept of a “Hands-Off Update”. I’m definitely going to use that!
This is a really great explanation of observability from an angle I haven’t seen before.
“a metric dashboard only contributes to observability if its reader can interpret the curves they’re seeing within a theory of the system under study.”
Did you catch the Google search outage? I’ve never seen one like it — that’s how rare they are. Google shared a tidbit of information about what went wrong — and it wasn’t the datacenter explosion folks speculated about.