SRE Weekly Issue #318

Articles

This talk summary explores the idea that “error” is a label applied to an event from the outside, rather than a simple fact. What can this tell us about our after-incident investigation process?

Fred Hebert

PIPEFAIL: How a missing shell option slowed Cloudflare down

Here’s a deep dive into a performance degradation at Cloudflare last December that was related to missing error handling in a shell script.

Alex Forster — Cloudflare

The Scoop: Inside the Longest Atlassian Outage of All Time

Atlassian is having a tough time. It seems as if they deleted a few hundred customers’ data and have to pull it out of their backups one at a time.

Here’s another article about the outage (Steven J. Vaughan-Nichols — The New Stack).

Gergely Orosz — Pragmatic Engineer

Message durability and quality of service across a large-scale distributed system

Cool trick: their client library can fall back to a backup domain if DNS for ably.io fails.

Jo Stichbury — Ably

It’s always DNS . . . except when it’s not: A deep dive through gRPC, Kubernetes, and AWS networking

It still wasn’t quite DNS; it was an interesting situation with the Linux kernel’s martian packet detection algorithm.

Laurent Bernaille and David Lentz — Datadog

India’s Inadvertent Missile Launch Underscores the Risk of Accidental Nuclear Warfare

Aside from the terrifying risk of nuclear war, this sounds very similar to the kind of complex system failures SREs deal with routinely.

Zia Mian, M. V. Ramana — Scientific American

The Pros and Cons of Embedded SREs

Both approaches have their pros and cons. The right strategy for your company or team depends, of course, on your needs and priorities.

Quentin Rousseau — Rootly
This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.

Outages

YouTube
Insteon

Insteon is down and may not be coming back

Amazon