They asked four people and got four answers that run the gamut.
Jeff Martens — Metrist
How Airbnb automates incident management amid a complex, rapidly evolving ensemble of microservices.
Includes an overview of their ChatOps system that would make for a great blueprint to build your own.
Vlad Vassiliouk — Airbnb
Rigidly categorizing incidents can cause problems, according to this article.
From the customer’s viewpoint… well, why would they care what technical classification the incident is being forced into?
Lots of great advice in this one.
If no human needs to be involved, it’s pure automation.
If it doesn’t need a response right now, it’s a report.
If the thing you’re observing isn’t a problem, it’s a dashboard.
If nothing actually needs to be done, you should delete it.
Leon Adato — New Relic
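The four rules quoted above read like a small decision procedure, so here is a minimal sketch of them in Python. The `Signal` fields, the `classify` function, the order in which the rules are checked, and the fall-through "alert" result are all assumptions made for illustration; the article only states the four rules themselves.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    """Hypothetical description of something a monitoring system could notify on."""
    needs_human: bool         # does a person have to be involved?
    needs_response_now: bool  # does it need a response right away?
    is_problem: bool          # is the observed thing actually a problem?
    needs_action: bool        # does anything need to be done at all?


def classify(signal: Signal) -> str:
    """Apply the four rules; the check order is an assumption, not from the article."""
    if not signal.needs_action:
        return "delete"       # nothing needs to be done
    if not signal.is_problem:
        return "dashboard"    # worth observing, but not a problem
    if not signal.needs_human:
        return "automation"   # no human needs to be involved
    if not signal.needs_response_now:
        return "report"       # can wait, so summarize it later
    return "alert"            # assumed fall-through: a human must act now


# Example: a real problem that a person must handle, but not urgently.
print(classify(Signal(needs_human=True, needs_response_now=False,
                      is_problem=True, needs_action=True)))
```

Only a signal that passes all four checks ends up as an alert in this sketch, which is one way to read the advice: everything else belongs somewhere quieter.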
Using the recent Atlassian outage as a case study, this article explains the importance of communication during an incident, then goes over best practices.
Martha Lambert — incident.io
My favorite part about this is the advice to “lower the cost of being wrong”. Important in any case, but especially during incident response.
Emily Arnott — Blameless
There are some interesting incidents in this issue: one involving DNS and another with an overload involving over-eager retries.
Jakub Oleksy — GitHub
A great read both for interviewers and interviewees.
Myra Nizami — Blameless
Their main advice is to avoid starting with a microservice architecture, and only transition to one after your monolith has matured and you have a good reason to do so.
Tomas Fernandez and Dan Ackerson — Semaphore