SRE Weekly Issue #297

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating an incident channel, Jira ticket, and Zoom meeting, paging the right team, building the postmortem timeline, setting up reminders, and more. Book a demo:


It’s that time of year again, but maybe it’s time to rethink that code freeze.

Robert Ross — FireHydrant

This article really gets to the heart of why I love a good incident. I mean, obviously, I want to minimize incidents. I swear.

Lisa Karlin Curtis —

This article draws on incident reports from The VOID to show how root cause analysis can be problematic.

Courtney Nash — Verica

It’s interesting to read this article after reading the previous one. In the “my car won’t start” example, I found myself immediately wondering: why was the vehicle not maintained? What factors contributed to that?

Søren Pedersen — DZone

These are the “phases”, although they stress that aiming for Visionary doesn’t make sense for all organizations.



Not the field I would have expected to look to for lessons, but it totally works!

Paul Marsicovetere — Formidable

This article introduces a 3-phased approach for safe database schema changes: Expand, Rollout, and Contract.

Alex Yates — Octopus Deploy
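
To make the three phase names concrete, here is a minimal sketch of an expand/contract style column rename, using Python's built-in sqlite3 module. It is my own illustration, not code from the linked article: the `users` table, the `name` and `full_name` columns, and the helper functions are all hypothetical, and the article may draw the line between the phases differently (for example, Rollout may refer to deploying the new application version rather than backfilling data).

```python
import sqlite3


def expand(conn):
    """Expand: add the new column next to the old one, so old code keeps working."""
    conn.execute("ALTER TABLE users ADD COLUMN full_name TEXT")
    conn.commit()


def rollout(conn):
    """Rollout (assumed meaning): backfill the new column so new code can rely on it."""
    conn.execute("UPDATE users SET full_name = name WHERE full_name IS NULL")
    conn.commit()


def contract(conn):
    """Contract: once nothing reads `name`, rebuild the table without it."""
    # A table rebuild is used instead of DROP COLUMN so the sketch also
    # runs on older SQLite versions.
    conn.executescript(
        """
        CREATE TABLE users_new (id INTEGER PRIMARY KEY, full_name TEXT NOT NULL);
        INSERT INTO users_new (id, full_name) SELECT id, full_name FROM users;
        DROP TABLE users;
        ALTER TABLE users_new RENAME TO users;
        """
    )


def column_names(conn):
    """Return the column names of the hypothetical `users` table."""
    return [row[1] for row in conn.execute("PRAGMA table_info(users)")]


def demo():
    """Run all three phases against an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.execute("INSERT INTO users (name) VALUES ('Ada Lovelace')")
    conn.commit()

    expand(conn)
    after_expand = column_names(conn)  # old and new columns coexist here
    rollout(conn)
    contract(conn)
    after_contract = column_names(conn)  # only the new column remains

    rows = conn.execute("SELECT id, full_name FROM users").fetchall()
    conn.close()
    return after_expand, after_contract, rows


if __name__ == "__main__":
    print(demo())
```

The point of the split is that each phase can be deployed and rolled back separately: between expand and contract, both the old and the new column exist, so application versions that read either one keep working.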

Try to run a program, and you get “No such file or directory”, even though the program is right there. How can this happen?

Julia Evans


Google Cloud Load Balancing

Google had a major outage that took down many sites and services. Notably, users of these sites were greeted with a Google 404 page with no branding related to the site they were attempting to access.


Tesla owners were locked out of their cars or unable to start them during the outage.

