SRE Weekly Issue #413

Sorry about the automation fail and resend! That definitely wasn’t issue #1.

This article discusses building failure management directly into our systems, using Erlang as a case study.

Jamie Allen

Cinnamon: Using Century Old Tech to Build a Mean Load Shedder

Building on their experience with a previous load-shedding library, Uber built a new one that requires no configuration.

Jakob Holdgaard Thomsen, Vladimir Gavrilenko, Jesper Lindstrom Nielsen, and Timothy Smyth — Uber

Conditional Love for AWS Metadata Enumeration

These folks found a way to get tag names and values from other people’s AWS resources. I know this is more security- than SRE-related, but the technique is just so cool!

Daniel Grzelak — Plerion

Justifying Resilience Work

How much does it cost to improve resilience? What’s the ROI? It’s fuzzy, but we still need to do it.

Will Gallego

SREday – London, UK, Sep 19-20, 2024

Check it out: it’s an entire SRE conference I was totally unaware of!

SREday

SLA vs. SLO vs. SLI: What’s the Difference?

It’s an SLI/SLO/SLA explainer, but with a twist: a pros and cons list for each of the three.

Laura Clayton — UptimeRobot

What were your worst on-call experiences?

A great reddit thread for some schadenfreude… and perhaps you’d like to share your own story?

u/New_Detective_1363 and others — reddit

End of support for repl.co & recent issues explained

What an interesting cause for an incident: the service your customers have pointed your product at decides to block your requests, effectively DoSing your systems.

Tomas Koprusak — UptimeRobot

The Role of CAP Theorem in Modern Day Distributed Systems

The CAP theorem is useful as a theory, but what does it actually mean in practice?

neda — ReadySet
