SRE Weekly Issue #282

Articles

I really need to learn bpftrace, and this article is a great place to start.

Brendan Gregg

If we expand our definition of “incident” beyond traditional engineering problems, we increase our opportunity for learning.

Stephen Whitworth — incident.io

Where Do SREs Go From Here?

This is an interview with a director at Catchpoint about their 2021 SRE Report. They discuss two results from the survey: folks report a 15% decrease in toil and slow adoption of AIOps.

Charlene O’Hanlon — devops.com

Incident Retro: Failing Comment Creation + Erroneous Push Notifications

A recurring theme in this story: it took the incident for folks to learn how the push notifications work.

Molly Struve — DEV

r/sre – Dev focused SREs do not want to take on operational tasks

In this Reddit thread, a company hired some developers as SREs and then found that they didn’t want to do operations work. Folks weigh in on why and what to do.

u/red_flock and others — reddit

Latency based SLO

How exactly do you want to phrase (and measure) an SLO about latency percentiles? Beware the subtle details.

Piyush Verma — last9
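To make one of those subtle details concrete, here is a minimal sketch (not taken from the linked article) comparing two phrasings of the same latency target. The data, the 300 ms threshold, and the helper names `percentile` and `fraction_within` are all hypothetical, and nearest-rank is only one of several common percentile definitions.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile; other definitions interpolate instead."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def fraction_within(values, threshold_ms):
    """Share of requests that finished at or under the threshold."""
    return sum(v <= threshold_ms for v in values) / len(values)

# Hypothetical latencies (ms) for two one-minute windows.
quiet_minute = [100] * 9 + [900]   # 10 requests, one of them slow
busy_minute = [100] * 990          # 990 requests, all fast

threshold_ms = 300

# Phrasing A: "the p99 of every minute stays under the threshold".
per_minute_p99 = [percentile(w, 99) for w in (quiet_minute, busy_minute)]
minutes_ok = [p99 <= threshold_ms for p99 in per_minute_p99]

# Phrasing B: "99% of all requests in the period finish under the threshold".
overall = fraction_within(quiet_minute + busy_minute, threshold_ms)

print(per_minute_p99, minutes_ok, overall)
```

With this made-up traffic, phrasing A fails for the quiet minute while phrasing B passes comfortably, because a single slow request dominates a low-traffic window but barely registers across the whole period.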

Resilience in Action E9: Vulnerability, Compassion, and Post-Incident Reviews in the Emergency Room with Dr. Al’ai Alvarez

I’m definitely going to think on the great incident response and followup wisdom in this interview. My favorite:

If I can change 1% to better that outcome, what is that 1%?

Christina Tan — Blameless

Full disclosure: Fastly, my employer, is mentioned.

Burned by ‘let it burn’

Root cause: guessed wrong in the moment

Lorin Hochstein