SRE Weekly Issue #318

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating incident channel, Jira and Zoom, paging and adding responders, postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly lego set):
https://rootly.com/demo/

Articles

This talk summary explores the idea that “error” is a label applied to an event from the outside, rather than a simple fact. What can this tell us about our after-incident investigation process?

  Fred Hebert

Here’s a deep dive into a performance degradation in Cloudflare last December that was related to missing error handling in a shell script.

  Alex Forster — Cloudflare

Atlassian is having a tough time. It seems as if they deleted a few hundred customers’ data and have to pull it out of their backups one at a time.

Here’s another article about the outage (Steven J. Vaughan-Nichols — The New Stack).

  Gergely Orosz — Pragmatic Engineer

Cool trick: their client library can fall back to a backup domain if DNS resolution for ably.io fails.

  Jo Stichbury — Ably
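
For the curious, here is a minimal sketch of the general idea of falling back to a backup domain when the primary one will not resolve. It is not Ably's actual client code: the function name `pick_host`, its parameters, and the hostnames in the usage note are all made up for illustration.

```python
import socket


def pick_host(hosts, resolve=socket.getaddrinfo, port=443):
    """Return the first hostname in `hosts` that resolves, trying them in order.

    `hosts` is a hypothetical priority list: primary domain first, backups after.
    `resolve` is injectable so the fallback logic can be exercised without real DNS.
    """
    last_error = None
    for host in hosts:
        try:
            resolve(host, port)  # raises socket.gaierror (an OSError) on DNS failure
            return host
        except OSError as exc:
            last_error = exc  # remember the failure and try the next candidate
    raise ConnectionError(f"no host resolved: {hosts}") from last_error
```

A caller would pass something like `pick_host(["realtime.primary.example", "realtime.backup.example"])` and connect to whichever name comes back. The fallback only helps if the backup name lives on a separate domain with separate DNS, so that one DNS failure cannot take out both.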

It still wasn’t quite DNS. It was an interesting situation with the Linux kernel’s martian packet detection algorithm.

  Laurent Bernaille and David Lentz — DataDog

Aside from the terrifying risk of nuclear war, this sounds very similar to the kind of complex system failures SREs deal with routinely.

  Zia Mian, M. V. Ramana — Scientific American

Both approaches have their pros and cons. The right strategy for your company or team depends, of course, on your needs and priorities.

  Quentin Rousseau — Rootly
This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.

Outages

YouTube
Insteon

Insteon is down and may not be coming back

Amazon