SRE Weekly Issue #297

Articles

Avoid frostbite: Stop doing code freezes

It’s that time of year again, but maybe it’s time to rethink that code freeze.

Robert Ross — FireHydrant

5 ways incidents made me a better engineer

This article really gets to the heart of why I love a good incident. I mean, obviously, I want to minimize incidents. I swear.

Lisa Karlin Curtis — incident.io

Root Cause is for Plants, Not Software

This article draws on incident reports from The VOID to show how root cause analysis can be problematic.

Courtney Nash — Verica

How to Perform Incident Post-mortems: Identify Root Cause With “Five Whys”

It’s interesting to read this article after reading the previous one. In the “my car won’t start” example, I found myself immediately wondering: why was the vehicle not maintained? What factors contributed to that?

Søren Pedersen — DZone

The five phases of organizational reliability

The five phases are listed here, although the authors stress that aiming for Visionary doesn’t make sense for every organization.

Absent
Reactive
Proactive
Strategic
Visionary

Google

3 Combat Sports Principles that Apply to Site Reliability Engineering

Not the field I would have expected to look to for lessons, but it totally works!

Paul Marsicovetere — Formidable

Safe schema updates – Near-zero downtime database deployments

This article introduces a 3-phased approach for safe database schema changes: Expand, Rollout, and Contract.

Alex Yates — Octopus Deploy
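
As a rough illustration of the Expand, Rollout, and Contract idea, here is a minimal sketch using Python's built-in sqlite3 module. It is not taken from the article: the `users` table, the `name` and `full_name` columns, and the helper function names are all hypothetical, and a real deployment would spread these phases across separate releases rather than run them back to back.

```python
import sqlite3


def expand(conn):
    # Expand: add the new column next to the old one and backfill it.
    # Existing code that only knows about `name` keeps working.
    conn.execute("ALTER TABLE users ADD COLUMN full_name TEXT")
    conn.execute("UPDATE users SET full_name = name WHERE full_name IS NULL")


def rollout_insert(conn, full_name):
    # Rollout: the new application version writes both columns,
    # so old and new versions can run side by side.
    conn.execute(
        "INSERT INTO users (name, full_name) VALUES (?, ?)",
        (full_name, full_name),
    )


def contract(conn):
    # Contract: once nothing reads `name`, rebuild the table without it.
    # (A table rebuild is used here so the sketch does not depend on
    # DROP COLUMN support in the installed SQLite version.)
    conn.execute("CREATE TABLE users_new (id INTEGER PRIMARY KEY, full_name TEXT)")
    conn.execute("INSERT INTO users_new (id, full_name) SELECT id, full_name FROM users")
    conn.execute("DROP TABLE users")
    conn.execute("ALTER TABLE users_new RENAME TO users")


def demo():
    # Hypothetical starting schema with a single pre-existing row.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO users (name) VALUES ('Ada')")

    expand(conn)
    rollout_insert(conn, "Grace")
    contract(conn)

    cols = [row[1] for row in conn.execute("PRAGMA table_info(users)")]
    rows = conn.execute("SELECT full_name FROM users ORDER BY id").fetchall()
    conn.close()
    return cols, rows
```

Calling `demo()` returns the final column names and rows, which makes it easy to check that the data survived all three phases.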

Debugging a weird ‘file not found’ error

Try to run a program, and you get “No such file or directory”, even though the program is right there. How can this happen?

Julia Evans

Outages

Google Cloud Load Balancing

Google had a major outage that took down many sites and services. Notably, users of these sites were greeted with a Google 404 page with no branding related to the site they were attempting to access.

Grab

Tesla

Tesla owners were locked out of their cars or unable to start them during the outage.
