Using three high-profile incidents from the past year, this article explores how to define SLOs that might catch similar problems, with a special focus on keeping the SLI close to the user experience.
Adriana Villela and Ana Margarita Medina — The New Stack
Microservices can have some great benefits, but if you want to build with them, you’re going to have to solve a whole pile of new problems.
Roberto Vitillo
To protect your application against failures, you first need to know what can go wrong. […] the most common failures you will encounter are caused by single points of failure, the network being unreliable, slow processes, and unexpected load.
Roberto Vitillo
I love how this article keeps things interesting by starting with a fictional (but realistic) story about the dangers of over-alerting before continuing on to give direct advice.
Adso
I especially enjoy the section on the potential pitfalls and challenges with retries and how you can avoid them.
CodeReliant
This Reddit thread is a goldmine, including this gem:
I actively avoid getting involved with software subject matter expertise, because it robs the engineering team of self-reliance, which is itself a reliability issue.
u/bv8z and others — reddit
There’s a pretty cool “Five Whys”-style analysis that goes past “dev pushed unreviewed code with incomplete tests to production” to the sociotechnical challenges underlying it.
Tobias Bieniek — crates.io
SRE WEEKLY