SRE Weekly Issue #304

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating incident channel, Jira and Zoom, paging the right team, postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly shirt):


Ably processes a lot of messages, so when they have to redesign a core part of their architecture, it gets pretty interesting.

  Simon Woolf — Ably

If you ask any Site Reliability or DevOps engineer how they feel about a deployment plan with over 300 single points of failure, you’d see a lot of nauseous faces and an outbreak of nervous tics!

Nevertheless, that was the best design. Read on to find out why.

  Robert Barron

Slack had three separate incidents while trying to deploy DNSSEC. This article goes into deep detail on what went wrong each time and what they learned.

Yes, it was an oversight that we did not test a domain with a wildcard record before attempting — learn from our mistakes!

  Rafael Elvira and Laura Nolan — Slack

The specializations outlined in this article include:

The Educator
The SLO Guard
Infrastructure architect
Incident response leader

  Emily Arnott — Blameless

If you had to design a WhatsApp today to support its current load, how would you go about it? Here’s one possible design.

  Ankit Sirmorya — High Scalability

Yesterday I asked on Twitter why you might want to run your own DNS servers, and I got a lot of great answers that I wanted to summarize here.

  Julia Evans

In this podcast interview, find out more about why Courtney Nash created the VOID and how posting an incident report can benefit your company. Transcript available.

  Mandy Walls (with guest Courtney Nash) — Page it to the Limit

Drawing on Cynefin, this article explains why debugging by feel and guesswork won’t suffice anymore; we need to be methodical.

  Pete Hodgson — Honeycomb


