Articles
A primer on what makes a good runbook.
Runbooks are most effective when they are readily available, easily actionable, up to date, and accurate.
Cortex
In this article, we describe the architecture and implementation of our SRE infrastructure, how it is used, and how it was adopted.
Philipp Gündisch and Vladyslav Ukis — Siemens
After an explanation of tech debt, this article goes into a possible solution: having on-call folks fix lingering problems in between pages.
Dormain Drewitz — The New Stack
I’ve read plenty of articles about service ownership, but this one has something new: a discussion of how to divvy up a monolith into separate “services” for teams to own.
Hannah Culver — PagerDuty
The folks at Sendinblue have chronicled their journey to better incident response, and there’s a lot here to learn from.
Tanguy Antoine — Sendinblue
Incidents will always happen, but thankfully they have plenty of upsides, as this article explains.
Andre King — Rootly
This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.
You’re not getting paged. Is it because you’ve fixed all the things, or has your alerting atrophied?
Boris Cherkasky
The folks at incident.io are here with the results of their survey of on-call practices. I like the focus on compensation for being on-call.
incident.io
Outages
Twitter
Netflix briefly went down for some users after the new ‘Stranger Things’ episodes debuted Friday morning, according to outage reports.
GitHub
reddit
Zebrium
I noticed this one while trying to read one of their articles. I was getting NXDOMAIN trying to resolve zebrium.com.
SRE WEEKLY