SRE Weekly Issue #313

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating an incident channel, Jira ticket, and Zoom meeting, paging and adding responders, building the postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly LEGO set):
https://rootly.com/demo/

Articles

Do you need an incident commander? (Yes.) This article covers a couple of different strategies for staffing your incident command rotation.

  Ryan McDonald — FireHydrant

What an interesting idea: an insurance plan that pays out automatically when a cloud provider has an outage.

  L.S. Howard — Insurance Journal
Full disclosure: Fastly, my employer, is mentioned.

LaunchDarkly revamped the way their on-call system works. Learn about the experience through the eyes of a newly onboarded engineer.

  Anna Baker — LaunchDarkly (via The New Stack)

Catchpoint’s yearly SRE Report is out with four key findings. To download it, you have to fill out a form with your email address; the download link then appears in your browser.

  Catchpoint

This article shows why one-thread-per-request can be a bottleneck and presents alternatives.

  Ron Pressler — Parallel Universe (via High Scalability)

“And this is a truth about incidents: there are always more signals than there is attention available.”

It’s so true.

  Fred Hebert — Honeycomb

If you’ve ever even considered running a retrospective, read this article.

This is my favorite piece of advice from this article:

“If you think ‘this might be a stupid question,’ ask it.”

  Emily Ruppe — Jeli

I’m still not sure how I feel about AIOps. Fortunately, this article takes a measured stance while providing some useful insight.

“Conclusion: AI won’t replace SREs – but it can help”

  JJ Tang — Rootly
This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.

Outages

Google Cloud Traffic Director

Google has already posted a preliminary outage report at the link above.

Spotify

This one involved the Traffic Director outage mentioned above, according to Spotify’s outage report.

Discord

This one was also related to the Traffic Director outage, according to the final update on their status post.

Polygon

TikTok