Articles
This talk summary explores the idea that “error” is a concept applied to an event from the outside, rather than a simple fact. What can this tell us about our post-incident investigation process?
  Fred Hebert
Here’s a deep dive into a performance degradation at Cloudflare last December that was related to missing error handling in a shell script.
  Alex Forster — Cloudflare
Atlassian is having a tough time. It seems as if they deleted a few hundred customers’ data and have to pull it out of their backups one at a time.
Here’s another article about the outage (Steven J. Vaughan-Nichols — The New Stack).
  Gergely Orosz — Pragmatic Engineer
Cool trick: their client library can fall back to a backup domain if DNS resolution for ably.io fails.
  Jo Stichbury — Ably
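
To make the fallback idea concrete, here is a minimal sketch in Python. It is not Ably's implementation: the function name `resolve_with_fallback`, the `resolver` parameter, and the backup hostname in the usage note are all assumptions made for illustration. It tries each hostname in order and moves on to the next one only when DNS resolution fails.

```python
import socket


def resolve_with_fallback(hosts, port=443, resolver=socket.getaddrinfo):
    """Return (host, addrinfo) for the first hostname that resolves.

    `hosts` is an ordered list: primary domain first, backups after it.
    `resolver` is injectable so the behaviour can be exercised without
    touching the network (an illustrative choice, not a library API).
    """
    last_error = None
    for host in hosts:
        try:
            return host, resolver(host, port)
        except socket.gaierror as exc:
            # DNS lookup failed for this name; remember why and try the next.
            last_error = exc
    raise ConnectionError(f"no hostname in {list(hosts)} resolved") from last_error
```

A caller might use `resolve_with_fallback(["ably.io", "backup.example"])`, where `backup.example` is a placeholder for a backup domain. Ideally the backup sits on a different domain and DNS provider so that both names do not fail together.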
It still wasn’t quite DNS; it was an interesting situation with the Linux kernel’s martian packet detection algorithm.
  Laurent Bernaille and David Lentz — Datadog
Aside from the terrifying risk of nuclear war, this sounds very similar to the kind of complex system failures SREs deal with routinely.
  Zia Mian, M. V. Ramana — Scientific American
Both approaches have their pros and cons. The right strategy for your company or team depends, of course, on your needs and priorities.
  Quentin Rousseau — Rootly
This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.
Outages
Insteon is down and may not be coming back
Amazon
SRE WEEKLY