SRE Weekly Issue #329

Articles

“Keep calm and use the runbook” – Why runbooks are the key to handling any situation effectively

A primer on what makes a good runbook.

Runbooks are most effective when they are readily available, easily actionable, and up-to-date and accurate.

Cortex

Establishing a Scalable SRE Infrastructure Using Standardization and Short Feedback Loops

In this article, we describe the architecture and implementation of our SRE infrastructure, how it is used and how it was adopted.

Philipp Gündisch and Vladyslav Ukis — Siemens

Tech Debt, Incidents and On Call

After an explanation of tech debt, this article goes into a possible solution: having on-call folks fix lingering problems in between pages.

Dormain Drewitz — The New Stack

How to Standardize Service Ownership at Scale for Improved Incident Response

I’ve read plenty of articles about service ownership, but this one has something new: a discussion of how to divvy up a monolith into separate “services” for teams to own.

Hannah Culver — PagerDuty

Streamlining our incident responses

The folks at Sendinblue have chronicled their journey to better incident response, and there’s a lot here to learn from.

Tanguy Antoine — Sendinblue

Why More Incidents Are Better

Incidents will always happen, but thankfully they have plenty of upsides, as this article explains.

Andre King — Rootly
This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.

My Alerts Don’t Work Anymore, Now What?

You’re not getting paged. Is it because you’ve fixed all the things, or has your alerting atrophied?

Boris Cherkasky

Uncovering the mysteries of on-call

The folks at incident.io are here with the results of their survey of on-call practices. I like the focus on compensation for being on-call.

incident.io

Outages

Twitter
Netflix briefly went down for some users after the new ‘Stranger Things’ episodes debuted Friday morning, according to outage reports
GitHub
Reddit
Zebrium

I noticed this one while trying to read one of their articles. I was getting NXDOMAIN trying to resolve zebrium.com.

