SRE Weekly Issue #333

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating incident channel, Jira and Zoom, paging and adding responders, postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly lego set):
https://rootly.com/demo/

Articles

They asked four people and got four answers that run the gamut.

  Jeff Martens — Metrist

How Airbnb automates incident management across a complex, rapidly evolving ensemble of microservices.

Includes an overview of their ChatOps system that would make for a great blueprint to build your own.

  Vlad Vassiliouk — Airbnb

Rigidly categorizing incidents can cause problems, according to this article.

From the customer’s viewpoint… well why would they care what kind of technical classification it is being forced into?

  Jon Stevens-Hall

Lots of great advice in this one.

If no human needs to be involved, it’s pure automation.
If it doesn’t need a response right now, it’s a report.
If the thing you’re observing isn’t a problem, it’s a dashboard.
If nothing actually needs to be done, you should delete it.

  Leon Adato — New Relic
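
For readers who like to see rules as code, the four quoted lines above could be sketched roughly as the triage function below. This is only an illustration, not something taken from the article: the `Signal` fields, the `classify` name, the order in which the checks run, and the final "alert" fallthrough are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    """Hypothetical description of something you are thinking of alerting on."""
    needs_action: bool        # does anything actually need to be done?
    is_problem: bool          # is the thing being observed a problem?
    needs_response_now: bool  # does it need a response right now?
    needs_human: bool         # does a human need to be involved?


def classify(signal: Signal) -> str:
    """Apply the four quoted rules; the check order is an assumption."""
    if not signal.needs_action:
        return "delete"
    if not signal.is_problem:
        return "dashboard"
    if not signal.needs_response_now:
        return "report"
    if not signal.needs_human:
        return "automation"
    # Assumed fallthrough: whatever survives all four rules is worth an alert.
    return "alert"
```

Under these assumptions, a signal that is a real problem, needs a response now, and can be fixed without a person comes out as "automation" rather than something that pages someone.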

Using the recent Atlassian outage as a case study, this article explains the importance of communication during an incident, then goes over best practices.

  Martha Lambert — incident.io

My favorite part about this is the advice to “lower the cost of being wrong”. Important in any case, but especially during incident response.

  Emily Arnott — Blameless

There are some interesting incidents in this issue: one involving DNS and another with an overload involving over-eager retries.

  Jakub Oleksy — GitHub

A great read both for interviewers and interviewees.

  Myra Nizami — Blameless

Their main advice is to avoid starting with a microservice architecture, and only transition to one after your monolith has matured and you have a good reason to do so.

  Tomas Fernandez and Dan Ackerson — Semaphore

Outages

Solana
Paytm
National Health Service (UK)
Bread
DreamHost
Flightradar24
Stack Exchange