SRE Weekly Issue #294

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating incident channels, Jira tickets, and Zoom meetings, paging the right team, building the postmortem timeline, setting up reminders, and more. Book a demo:
https://rootly.com/?utm_source=sreweekly

Articles

The steps are:

Know How Much Time Is Spent On Toil
Find The Toil
Determine The Root Causes Of Toil
Find And Prioritize The Low-Hanging Fruit
Promote Toil Reduction

Aater Suleman — Forbes

I like how they try to strike a balance and avoid reviewing in too much depth, while still hitting everything important.

Milan Plžík — Grafana Labs

Lots of good stuff in this one about one of my favorite topics, service ownership.

Kenneth Rose — OpsLevel

This is the intro I needed to understand Conflict-Free Replicated Data Types.

Jo Stichbury — Ably

Availability, maintainability and reliability all have distinct—if related—meanings, and they each play different roles in reliability operations.

JJ Tang — DevOps.com

The five Ps come from medicine and understanding medical accidents, but they apply equally well to analyzing incidents in IT.

Lydia Leong

I really love the focus on de-emphasizing finding action items in incident retrospectives, in favor of learning.

Gergely Orosz — The Pragmatic Engineer

Outages

AT&T SMS in the US

This week, I saw several status pages report some kind of problem sending SMS notifications to AT&T phones. I thought this was interesting because I don’t usually learn about an outage solely from other companies’ status pages.

Google Meet
Tesco
Coinbase
Zomato
Barclays
HSBC