SRE Weekly Issue #386

This issue was delayed a day while I was enjoying a much-needed vacation with my family. While I’m on the subject, it’s hot take time: vacations are important for the reliability of our sociotechnical systems, so good SREs should take vacations regularly and encourage others to as well.

A message from our sponsor, Rootly:

When incidents impact your customers, failing to communicate with them effectively can erode trust even further and compound an already difficult situation. Learn the essentials of customer-facing incident communication in Rootly’s latest blog post:
https://rootly.com/blog/the-medium-is-the-message-how-to-master-the-most-essential-incident-communication-channels

Articles

If “you build it, you run it” requires mandate, knowledge, and responsibility, what happens when one of those is missing?

  Alex Ewerlöf

Slack developed an all-encompassing metric for the user experience that goes beyond a simple SLO.

  Matthew McKeen and Ryan Katkov

This whitepaper delves deep into the ways a microservice architecture changes how transactions work. It presents a method of dealing with microservice transaction failures through application-specific compensation logic.

  Frank Leymann — WSO2

Bambu is a brand of 3D printers that are primarily cloud-based. A problem in their cloud system resulted in printers running jobs unexpectedly, causing significant damage to some customers’ printers.

  Bambu Lab

An interesting confluence of fiber optic line failures resulted in loss of connectivity on what should have been a redundant link.

  Google

I know the title looks like click-bait, but this article delivers with 7 well-thought-out critiques of SLOs.

  Code Reliant

This latest entry into the awesome-* arena is a curated list of runbooks and related resources for popular software.

  Runbear

You shift from asking “what was the abnormal work?” to “how did this incident happen even though everyone was doing normal work?”

This article immediately made me think of the latest Mentour Pilot accident investigation, in which everyone acted nearly perfectly and yet still only narrowly avoided a mid-air collision.

  Lorin Hochstein
