Ably processes a lot of messages, so when they have to redesign a core part of their architecture, it gets pretty interesting.
Simon Woolf — Ably
If you asked any Site Reliability or DevOps engineer how they feel about a deployment plan with over 300 single points of failure, you’d see a lot of nauseous faces and an outbreak of nervous tics!
Nevertheless, that was the best design. Read on to find out why.
Slack had three separate incidents while trying to deploy DNSSEC for slack.com. This article goes into deep detail on what went wrong each time and what they learned.
Yes, it was an oversight that we did not test a domain with a wildcard record before attempting slack.com — learn from our mistakes!
Rafael Elvira and Laura Nolan — Slack
The specializations outlined in this article include the SLO guard and the incident response leader.
Emily Arnott — Blameless
If you had to design WhatsApp today to support its current load, how would you go about it? Here’s one possible design.
Ankit Sirmorya — High Scalability
Yesterday I asked on Twitter why you might want to run your own DNS servers, and I got a lot of great answers that I wanted to summarize here.
In this podcast interview, find out more about why Courtney Nash created the VOID and how posting an incident report can benefit your company. Transcript available.
Mandy Walls (with guest Courtney Nash) — Page it to the Limit
Drawing on Cynefin, this article explains why debugging by feel and guesswork won’t suffice anymore; we need to be methodical.
Pete Hodgson — Honeycomb