SRE Weekly Issue #418

The observability waters have been muddy for a while, and this article does a great job of taking a step back and building a definition and a roadmap.

Hazel Weakly

A Commentary on Defining Observability

Fred Hebert wrote this response/follow-on to Hazel’s article:

The main points I’ll try to bring here are on the topics of the difference between insights and questions, the difference between observability and data availability, reinforcing a socio-technical definition, the mess of complex systems and mapping them, and finally, a hot take on the use of models when reasoning about systems.

Fred Hebert

Service Level Agreement

What the service providers are willing to put on the table in terms of penalties is often much less than the money you lose when your service goes down.

Alex Ewerlöf

Assumptions About Debriefs That Belie Legal Risk

Fascinating legal questions come to the surface when lawyers consider the possibility for legal risk exposure from a surgical incident debriefing meeting.

Dr. Rob Poston

How to deal with alert fatigue head-on

If you approach on-call the right way, you can mitigate the impacts of alert fatigue or, better yet, avoid it altogether. Here, we’ll dive into the tactics teams can implement to address alert fatigue and its underlying causes.

incident.io

Different Ways to Aggregate Nines

How do you create an SLO that references multiple SLIs together, such as slow requests and errors?

Ross Brodbeck

SREcon24 Americas Recap

More than just a list of talks, this piece pulls out major themes from SREcon24.

Will Gallego

How much are your 9’s worth?

Making your 9’s look great by cheating.

Of course, you don’t actually want to do that, but learning how can show us that availability numbers are nuanced.

Ross Brodbeck
