Articles
The steps are:
Know How Much Time Is Spent On Toil
Find The Toil
Determine The Root Causes Of Toil
Find And Prioritize The Low-Hanging Fruit
Promote Toil Reduction
Aater Suleman — Forbes
I like how they try to strike a balance and avoid reviewing too far in depth, while still hitting everything important.
Milan Plžík — Grafana Labs
Lots of good stuff in this one about one of my favorite topics, service ownership.
Kenneth Rose — OpsLevel
This is the intro I needed to understand Conflict-Free Replicated Data Types.
Jo Stichbury — Ably
Availability, maintainability and reliability all have distinct—if related—meanings, and they each play different roles in reliability operations.
JJ Tang — DevOps.com
The five Ps come from medicine and understanding medical accidents, but they apply equally well to analyzing incidents in IT.
Lydia Leong
I really love how it de-emphasizes hunting for action items in incident retrospectives in favor of learning.
Gergely Orosz — The Pragmatic Engineer
Outages
This week, I saw several status pages point to some kind of problem in their ability to send SMS notifications to AT&T phones. I thought this was interesting because usually I don’t learn about an outage solely from other companies’ status pages.