SRE Weekly Issue #313

Articles

Do you need an incident commander? (Yes.) This article is about how to staff your incident command rotation through a couple of different strategies.

Ryan McDonald — FireHydrant

How Cloud Downtime Insurance Became a Thing

What an interesting idea: an insurance plan that pays out automatically when a cloud provider has an outage.

L.S. Howard — Insurance Journal
Full disclosure: Fastly, my employer, is mentioned.

Diary of a First-Time On-Call Engineer

LaunchDarkly revamped the way that their on-call system works. Learn about the experience through the eyes of a newly onboarded engineer.

Anna Baker — LaunchDarkly (via The New Stack)

2021 SRE Report

Catchpoint’s yearly SRE Report is out with four key findings. You have to fill out a form with your email address, and then the link to download the report is presented in your browser.

Catchpoint

Little’s Law, Scalability and Fault Tolerance: The OS is your bottleneck. What you can do?

This article shows why one-thread-per-request can be a bottleneck and presents alternatives.

Ron Pressler — Parallel Universe (via High Scalability)

On the Brittleness of Dashboards

And this is a truth about incidents: there are always more signals than there is attention available.

It’s so true.

Fred Hebert — Honeycomb

Incident Analysis 101: Facilitating the Learning Review

If you’ve ever even considered running a retrospective, read this article.

This is my favorite piece of advice from this article:

If you think ‘this might be a stupid question,’ ask it.

Emily Ruppe — Jeli

What Does AIOps Mean for SREs? It’s Complicated.

I’m still not sure how I feel about AIOps. Fortunately, this article takes a measured stance while providing some useful insight.

Conclusion: AI won’t replace SREs – but it can help

JJ Tang — Rootly
This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.

Outages

Google Cloud Traffic Director

Google has already posted a preliminary outage report at the link above.

Spotify

This one involved the Traffic Director outage mentioned above, as per Spotify’s outage report here.

Discord

This one was also related to the Traffic Director outage, according to the final update on their status post.

Polygon

TikTok