The observability waters have been muddy for a while, and this article does a great job of taking a step back and building both a definition and a roadmap.
Hazel Weakly
Fred Hebert wrote this response/follow-on to Hazel’s article:
The main points I’ll try to bring here are on the topics of the difference between insights and questions, the difference between observability and data availability, reinforcing a socio-technical definition, the mess of complex systems and mapping them, and finally, a hot take on the use of models when reasoning about systems.
Fred Hebert
What the service providers are willing to put on the table in terms of penalties is often much less than the money you lose when your service goes down.
Alex Ewerlöf
Fascinating legal questions come to the surface when lawyers consider the possibility for legal risk exposure from a surgical incident debriefing meeting.
Dr. Rob Poston
If you approach on-call the right way, you can mitigate the impacts of alert fatigue or, better yet, avoid it altogether. Here, we’ll dive into the tactics teams can implement to address alert fatigue and its underlying causes.
incident.io
How do you create an SLO that references multiple SLIs together, such as slow requests and errors?
Ross Brodbeck
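One common way to approach that question, sketched here as an illustration and not necessarily the method the linked article recommends, is to count a request as "good" only if it is both error-free and fast enough, then measure a single SLI against a single target. The `Request` type, the 300 ms threshold, and the 99% target below are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code (hypothetical field)
    latency_ms: float  # time to serve the request (hypothetical field)


def is_good(req: Request, latency_threshold_ms: float = 300.0) -> bool:
    # A request is good only if it did not fail server-side AND was fast enough.
    return req.status < 500 and req.latency_ms <= latency_threshold_ms


def composite_sli(requests: list[Request], latency_threshold_ms: float = 300.0) -> float:
    # Fraction of good requests; an empty window is treated as fully compliant.
    if not requests:
        return 1.0
    good = sum(is_good(r, latency_threshold_ms) for r in requests)
    return good / len(requests)


def meets_slo(requests: list[Request], target: float = 0.99,
              latency_threshold_ms: float = 300.0) -> bool:
    # Compare the combined SLI against one target (assumed 99% here).
    return composite_sli(requests, latency_threshold_ms) >= target
```

With this shape, a slow request and a failed request each consume the same error budget, which is a design choice: it keeps the SLO simple but hides which of the two problems is burning the budget.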
More than just a list of talks, this piece pulls out major themes from SRECon24.
Will Gallego
Making your 9’s look great by cheating.
Of course, you don’t actually want to do that, but learning how can show us that availability numbers are nuanced.
Ross Brodbeck
SRE WEEKLY