SRE Weekly Issue #353

Articles

What Does the Future Hold for Role of SRE?

This article contains:

two reasons why site reliability engineers may be part of IT teams for years to come, and two reasons why site reliability engineering may turn out just to be a fad.

Christopher Tozzi — ITPro Today

Best practices for observability

This article proposes an interesting method for incident investigation: constantly try to disprove your hypotheses to avoid confirmation bias.

Ivan Merill — Fiberplane

Killed By A Machine: The Therac-25

How I’ve managed to run this newsletter for almost 7 years without a single mention of the Therac-25 incidents is beyond me. Therac-25 is an important lesson for all of us as we design systems and analyze incidents.

Adam Fabio — Hackaday

Network Errors and Data Loss 2008-01

Even though this happened 14 years ago, the cause is still very relevant today. If two bit-flips in the same TCP packet offset each other, the packet can still pass the checksum.

Poppy Linden — Linden Lab
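
To make the checksum point in the item above concrete, here is a minimal sketch of a 16-bit ones' complement checksum in the style of RFC 1071, the kind of sum TCP uses. The helper names and sample bytes are illustrative assumptions, not taken from the Linden Lab post, and the sketch sums only a small payload rather than a full TCP segment with its pseudo-header. It shows one way two bit-flips can cancel out: the same bit position in two different 16-bit words, flipped in opposite directions.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit word
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def flip_bit(data: bytes, byte_index: int, bit: int) -> bytes:
    """Return a copy of data with one bit inverted (hypothetical helper)."""
    out = bytearray(data)
    out[byte_index] ^= 1 << bit
    return bytes(out)


# Illustrative payload: three 16-bit words 0x0001, 0x0000, 0x1234.
original = bytes([0x00, 0x01, 0x00, 0x00, 0x12, 0x34])

# One flip alone changes the sum, so the checksum catches it.
single_flip = flip_bit(original, 1, 0)

# Two flips at the same bit position of different words, in opposite
# directions (1 -> 0 in word 0, 0 -> 1 in word 1), leave the sum unchanged.
corrupted = flip_bit(single_flip, 3, 0)

print(internet_checksum(original) == internet_checksum(single_flip))
print(internet_checksum(original) == internet_checksum(corrupted))
```

Not every pair of bit-flips cancels like this, but the sketch suggests why an application-level integrity check on top of TCP can be worth having.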

What makes a good alert?

This article proposes two criteria: Actionability and Investigability.

Dan Slimmon

Intermittent downtime from repeated crashes

This write-up chronicles an incident in which a poison pill message repeatedly crashed their Heroku app.

Lawrence Jones — incident.io

The Words Not Spoken: The crash of Avianca flight 052

Take this one with a grain of salt, since there’s a fair bit of counterfactual reasoning in the description. Nevertheless, there’s a lot to learn from this and from Wikipedia’s article on the same accident.

Admiral Cloudberg
