SRE Weekly Issue #333

Articles

Is SRE Just Ops with a New Name?

They asked four people and got four answers that run the gamut.

Jeff Martens — Metrist

Automated Incident Management Through Slack

How Airbnb automates incident management across a complex, rapidly evolving ensemble of microservices.

Includes an overview of their ChatOps system that would make for a great blueprint to build your own.

Vlad Vassiliouk — Airbnb

Don’t overcategorise incidents

Rigidly categorizing incidents can cause problems, according to this article.

From the customer’s viewpoint… well why would they care what kind of technical classification it is being forced into?

Jon Stevens-Hall

Best Practices for Fixing Your Alerts

Lots of great advice in this one.

If no human needs to be involved, it’s pure automation.
If it doesn’t need a response right now, it’s a report.
If the thing you’re observing isn’t a problem, it’s a dashboard.
If nothing actually needs to be done, you should delete it.

Leon Adato — New Relic
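
The four rules quoted above read like a small decision procedure, so here is a minimal sketch of them in Python. The `Signal` fields and the `classify` function are hypothetical names invented for illustration, not anything from the article, and the order in which the checks run is an assumption (the quote does not say which rule wins when several apply).

```python
from dataclasses import dataclass


@dataclass
class Signal:
    # All four field names are hypothetical, chosen for this sketch.
    action_needed: bool       # does anything actually need to be done?
    needs_human: bool         # must a person be involved?
    needs_response_now: bool  # is an immediate response required?
    is_problem: bool          # is the observed thing actually a problem?


def classify(signal: Signal) -> str:
    """Map a monitoring signal to what it should become.

    Assumed precedence: "delete" is checked first, then the
    remaining rules in the order they appear in the quote.
    """
    if not signal.action_needed:
        return "delete"
    if not signal.needs_human:
        return "automation"
    if not signal.needs_response_now:
        return "report"
    if not signal.is_problem:
        return "dashboard"
    # Only a signal that passes every check is worth paging someone for.
    return "alert"
```

Under these assumptions, only a signal that needs action, needs a human, needs a response now, and reflects a real problem stays an alert; everything else is routed elsewhere or removed.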

Driving a customer-focused incident response process

Using the recent Atlassian outage as a case study, this article explains the importance of communication during an incident, then goes over best practices.

Martha Lambert — incident.io

SRE: From Theory to Practice | What’s difficult about on-call?

My favorite part about this is the advice to “lower the cost of being wrong”. Important in any case, but especially during incident response.

Emily Arnott — Blameless

GitHub Availability Report: July 2022

There are some interesting incidents in this issue: one involving DNS and another with an overload involving over-eager retries.

Jakub Oleksy — GitHub

Top SRE Interview Questions You Should Know

A great read both for interviewers and interviewees.

Myra Nizami — Blameless

When Microservices Are a Bad Idea

Their main advice is to avoid starting with a microservice architecture, and only transition to one after your monolith has matured and you have a good reason to do so.

Tomas Fernandez and Dan Ackerson — Semaphore

Outages

Solana
Paytm
National Health Service (UK)
Bread
DreamHost
Flightradar24
Stack Exchange