SRE Weekly Issue #330

Thanks for all the well-wishes as I took a sick day last week. I’m feeling much better!

Articles

DNS Incidents Like Cloudflare’s Could Turn your Status Page Useless; Here is How to Prevent It

Is your status page status.yourcompany.com? If so, read this article, then get yourself a new domain.

Eduardo Messuti — Statuspal

Coming back from maternity leave and learning from incidents.

The author used my favorite technique for getting up to speed on a company: analyzing a recent incident.

Vanessa Huerta Granda — Jeli

What leading expeditions taught me about incident management

There are a number of lessons I learned guiding weeks-long backcountry leadership courses for teens that I carried with me into my roles in incident management. In this blog post, I’ll share three that stand out.

Ryan McDonald — FireHydrant

What you should and (probably) shouldn’t try from SRE

I really like these articles about interpreting SRE in a way that makes sense for your organization. SRE is still constantly evolving.

Steve Smith — Equal Experts

What I learned from leading my first incident

The author led an incident just 3 months into their tenure. Here’s what they learned.

Milly Leadley — incident.io

Notes on an Observability Team

While SRE and DevOps type job explainers have been written ad nauseam, I found there’s relatively little online about Observability Teams and roles. I figured I’d share a bit about my experience on an O11y Team.

Eric Mustin

We Learn Systems by Changing Them

I found the contrast between this one and the previous article interesting. The previous one includes a quote from Brendan Gregg:

Let me try some observability first. (Means: Let me look at the system without changing it.)

Jessica Kerr — Honeycomb

GitHub Availability Report: June 2022

In June, we experienced four incidents resulting in significant impact to multiple GitHub.com services. This report also sheds light on an incident that impacted several GitHub.com services in May.

GitHub

Day 2 Operations with the James Webb Space Telescope is about to begin

Using the Webb telescope as an example, this article describes the progression of a system toward production operation using a metaphor of 3 days.

Robert Barron — IBM

Outages

LaunchDarkly

This one was from last week when I missed an issue, but it was pretty big and interesting, so I decided to include it this week.

Twitter
Snapchat
Peloton

I see no evidence that Lizzo was the cause, despite rumors.
