SRE Weekly Issue #396

Translating Failures into Service-Level Objectives

Using three high-profile incidents from the past year, this article explores how to define SLOs that might catch similar problems, with a special focus on keeping the SLI close to the user experience.

Adriana Villela and Ana Margarita Medina — The New Stack

The costs of microservices

Microservices can have some great benefits, but if you want to build with them, you’re going to have to solve a whole pile of new problems.

Roberto Vitillo

How distributed systems fail

To protect your application against failures, you first need to know what can go wrong. […] the most common failures you will encounter are caused by single points of failure, the network being unreliable, slow processes, and unexpected load.

Roberto Vitillo

Sofia’s Observability Odyssey: The Do’s and Don’ts for Effective Observability

I love how this article keeps things interesting by starting with a fictional (but realistic) story about the dangers of over-alerting before continuing on to give direct advice.

Adso

Retries, Backoff and Jitter

I especially enjoy the section on the potential pitfalls and challenges with retries and how you can avoid them.

CodeReliant

As an SRE, how often are you directly involved with application code / logic?

This Reddit thread is a goldmine, including this gem:

I actively avoid getting involved with software subject matter expertise, because it robs the engineering team of self-reliance, which is itself a reliability issue.

u/bv8z and others — reddit

crates.io Postmortem: Broken Crate Downloads

There’s a pretty cool “Five Whys”-style analysis that goes past “dev pushed unreviewed code with incomplete tests to production” to the sociotechnical challenges underlying it.

Tobias Bieniek — crates.io

