SRE Weekly Issue #337

Thanks for all the vacation well-wishes! It was really great and relaxing. Take vacations; they're important for reliability!

While I was out, I shipped the past two issues with content prepared in advance, and without the Outages section. This gave me a chance to really think hard about the value of the Outages section versus the time and effort I put into it.

I’ve decided to put the Outages section on hiatus for the time being. For notable outages, I’ll include them in the main section, on a case-by-case basis. Read on if you’re interested in what went into this decision.

The Outages section has always been of lower quality than the rest of the newsletter. I have no scientific process for choosing which Outages make the cut — mostly it’s just whatever shows up in my Google search alerts and seems “important”, minus a few arbitrary categories that don’t seem particularly interesting like telecoms and games. I do only a cursory review of the outage-related news articles I link to, and often they’re on poor-quality sites with a ton of intrusive ads. Gathering the list of Outages has begun taking more and more of my time, and I’d much rather spend that effort on curating quality content, so that’s what I’m going to do going forward.

10 Things I Learned From My First Incident Review

Every one of these 10 items is enough reason to read this article! This makes me want to go investigate some incidents right now.

Fischer Jemison — Jeli

Slowing Down to Speed Up – Circuit Breakers for Slack’s CI/CD

Slack shares with us in great detail why they use circuit breakers and how they rolled them out.

Frank Chen — Slack

Tips to Make Your On-Call Process Less Stressful

My favorite part of this one is the section on expectations. We need to socialize this to help reduce the pressure on folks going on call for the first time.

Prakya Vasudevan — Squadcast

Why Status Pages Are Lying to You and What To Do About It

Status pages are marketing material. Prove me wrong.

Ellen Steinke — Metrist

Using incidents to level up your teams

Incidents have unusually high information density compared with day-to-day work, and they enable you to piggy-back on the experience of others.

Lisa Karlin Curtis — incident.io

How we store and process millions of orders daily

These folks realized that they had two different use cases for the same data: real-time transactions and batch processing. Rather than try to find one DB that could support both, they fork the data into two copies.

Xi Chen and Siliang Cao — Grab

Live Your Best Life With Structured Events

It’s all about gathering enough information that you can ask new questions when something goes wrong, rather than being stuck with only answers to the questions you thought to ask in advance.

Charity Majors

How Discord Supercharges Network Disks for Extreme Low Latency

They needed the speed of local ephemeral SSDs but the reliability of network-based persistent disks. The solution: a Linux MD option to mirror the data but prefer reading from the local disks. Neat!

Glen Oakley — Discord

Operating system upgrades at LinkedIn’s scale

OS upgrades can be risky. LinkedIn developed a system to unify OS upgrade procedures and make them much less risky.

Hengyang Hu, Dinesh Dhakal, and Kalyanasundaram Somasundaram — LinkedIn
