SRE Weekly Issue #298

Email subscribers, my apologies for the double-send last week. I upgraded WordPress and subsequently further cemented my distrust of all version upgrades ever.

I carefully tested a fix in staging before rolling it out gradually in preparation for this week’s issue. Just kidding, I hacked on it live until I got it fixed. Sorry about all those testing tweets. #testinproduction #yolo #SREWeeklydoesnotpracticeSRE

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating incident channel, Jira and Zoom, paging the right team, postmortem timeline, setting up reminders, and more. Book a demo:


This is Google’s detailed report from their outage last week. This one’s really worth a read; I promise you won’t be disappointed!


I really like this guide and template for writing incident reports. Each section comes with an explanation of what goes there, along with examples.

Lorin Hochstein developed their Reliability Collaboration Model to guide the engagement between SRE and product development teams and the responsibilities assigned to each.

  Emmanuel Goossaert —

Especially timely now, in the thick of the holiday on-call period.

  James Frost — Ably

Great tips. I hope your Black Friday / Cyber Monday is going well!

  Quentin Rousseau — Rootly

This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.

I thought it might be better to try a new approach: defining what SRE was by looking at what it’s not. Or to put it another way, what can you remove from SRE and have it still be SRE?

  Niall Murphy

Instead of asking that question, this article urges understanding what happened.

Another reason that imagining future scenarios is better than counterfactuals about past scenarios is that our system in the future is different from the one in the past.

  Lorin Hochstein


GitHub
Coinbase
