SRE Weekly Issue #257

Articles

Sometimes alerts have inobvious reasons for existing

This one really got me thinking. Make sure you document why an alert exists, not just what it checks for.

Chris Siebenmann

Incident response from monolith to microservices

If you start with a monolith and adopt a microservice architecture, your incident response process will need to change as well.

Mya Pitzeruse — effx

Minesweeper automates root cause analysis as a first-line defense against bugs

Another one that needs a disclaimer: there’s no single “root cause” for an incident, and this article doesn’t claim to find one. It’s about using statistical software to help humans debug by looking at what different users did before they encountered a given bug.

Vijay Murali, Edward Yao, Umang Mathur, Satish Chandra — Facebook

On Not Being a Cog in the Machine

A new SRE at Honeycomb shares insight on the job and SRE attitudes in general.

Fred Hebert — Honeycomb

Slack’s Jan 2021 outage: a tale of saturation

This post considers the January 4th Slack outage as a set of cases of saturation.

Lorin Hochstein

Outages
