SRE Weekly Issue #294

Articles

Five Steps To Reduce SRE Toil And Add More Value

The steps are:

1. Know How Much Time Is Spent On Toil
2. Find The Toil
3. Determine The Root Causes Of Toil
4. Find And Prioritize The Low-Hanging Fruit
5. Promote Toil Reduction

Aater Suleman — Forbes

How we’re building a production readiness review process at Grafana Labs

I like how they try to strike a balance and avoid reviewing too far in depth, while still hitting everything important.

Milan Plžík — Grafana Labs

Seth Lochen of Groupon talks ownership and the bystander effect, platform engineering, and frogs in boiling water

Lots of good stuff in this one about one of my favorite topics, service ownership.

Kenneth Rose — OpsLevel

How do CRDTs solve distributed data consistency challenges?

This is the intro I needed to understand Conflict-Free Replicated Data Types.

Jo Stichbury — Ably

Defining Availability, Maintainability and Reliability in SRE

Availability, maintainability and reliability all have distinct—if related—meanings, and they each play different roles in reliability operations.

JJ Tang — DevOps.com

Five-P factors for root cause analysis

The five Ps come from medicine and understanding medical accidents, but they apply equally well to analyzing incidents in IT.

Lydia Leong

Incident Review and Postmortem Best Practices

I really love the focus on de-emphasizing finding action items in incident retrospectives, in favor of learning.

Gergely Orosz — The Pragmatic Engineer

Outages

AT&T SMS in the US

This week, I saw several companies' status pages report problems sending SMS notifications to AT&T phones. I thought this was interesting because I don't usually learn about an outage solely from other companies' status pages.

Google Meet
Tesco
Coinbase
Zomato
Barclays
HSBC