SRE Weekly Issue #369

Articles

Missing the forest for the trees: the component substitution fallacy

If we focus only on issues with individual components, there’s so much we miss about the nature of failure in our systems.

Lorin Hochstein

The Invisible Success of Near Misses

Will Gallego says that we need to prioritize and incentivize learning from near misses, not just actual incidents.

We’ve made headway into expending energy towards learning from incidents. We’ll be even better off when we can regularly make learning from successes our regular work as well.

Will Gallego

Highway to Ruin: The crash of Southern Airways flight 242

This 1977 air crash taught us many important lessons, including surprising details about the behavior of jet engines in rain. The water ingestion testing apparatus shown in one of the photos is pretty impressive.

Admiral Cloudberg

Alerting on the User Experience

When your alerts cover systems owned by different teams, who should be on call?

Nathan Lincoln — Honeycomb
Full disclosure: Honeycomb is my employer.

Oxy: Fish/Bumblebee/Splicer subsystems to improve reliability

Cloudflare does some pretty eye-opening things with the network stack and file descriptors, as described in this amusingly named article.

Quang Luong and Chris Branch

Building a culture of incident response

While ostensibly about security incident response, this article has a lot of useful ideas for improving response to any kind of incident.

Jess Chang — Vanta (for incident.io)

Keep the monolith, but split the workloads

An argument for monoliths over microservices, but with an important caveat: take care to compartmentalize your failure domains.