SRE Weekly Issue #304

Articles

Channel global decoupling for region discovery

Ably processes a lot of messages, so when they have to redesign a core part of their architecture, it gets pretty interesting.

Simon Woolf — Ably

The James Webb Space Telescope — making 300 points of failure reliable

If you ask any Site Reliability or DevOps engineer how they feel about a deployment plan with over 300 single points of failure, you’d see a lot of nauseous faces and an outbreak of nervous tics!

Nevertheless, that was the best design. Read on to find out why.

Robert Barron

The Case of the Recursive Resolvers

Slack had three separate incidents while trying to deploy DNSSEC for slack.com. This article goes into deep detail on what went wrong each time and what they learned.

Yes, it was an oversight that we did not test a domain with a wildcard record before attempting slack.com — learn from our mistakes!

Rafael Elvira and Laura Nolan — Slack

Building an SRE Team with Specialization

The specializations outlined in this article include:

The Educator
The SLO Guard
Infrastructure architect
Incident response leader

Emily Arnott — Blameless

Designing WhatsApp

If you had to design a WhatsApp today to support its current load, how would you go about it? Here’s one possible design.

Ankit Sirmorya — High Scalability

Why might you run your own DNS server?

Yesterday I asked on Twitter why you might want to run your own DNS servers, and I got a lot of great answers that I wanted to summarize here.

Julia Evans

The VOID with Courtney Nash

In this podcast interview, find out more about why Courtney Nash created the VOID and how posting an incident report can benefit your company. Transcript available.

Mandy Walls (with guest Courtney Nash) — Page it to the Limit

Why Intuitive Troubleshooting Has Stopped Working for You

Drawing on Cynefin, this article explains why debugging by feel and guesswork won’t suffice anymore; we need to be methodical.

Pete Hodgson — Honeycomb

Outages

Solana
Finalsite
Flipkart