SRE Weekly Issue #288

Articles

Faced with a difficult hiring market for SREs, they embarked on a well-designed, carefully thought-out program to hire and train entry-level folks as SREs, and it worked!

Thomas Betts — InfoQ

The things we find hardest in incident response

No matter how good your tooling is, how experienced you are, or how much you’ve prepared, incidents can still be hard.

Five people share what they find hardest during incident response.

Chris Evans — incident.io

The Developer Experience and the Role of the SRE Are Changing, Here’s How

This one has a lot of ideas about how to guide developers toward full ownership of their services in production.

Ambassador

6 modes of system resilience

In this post, I will cover the following modes of system resilience:

Adaptive Response
Superior Monitoring
Coordinated Resilience
Heterogeneous Systems
Dynamic Repositioning
Requisite Availability

Ash P — Cruform

Useful knowledge and improvisation

Root cause of success: unpatched security vulnerability

TMW a security vulnerability allows you to break into your infrastructure, averting disaster during an incident.

Lorin Hochstein, with incident story by Eric Dobbs

Heroku Incident #2347 Follow-Up

A migration didn’t go as planned, and customer traffic lost its way.

Heroku

Transforming DevOps with Human-in-the-Loop Automation

I’m a big believer in human-in-the-loop automation. My favorite part of this article was this:

A further problem is that full automation — which aims to take the human out of the picture — requires a complete, nuanced understanding of a system and all potential outcomes, paradoxically resulting in heightened system complexity.

Tina Huang — Transposit

Outages

Google Voice
Assembled

For some users, Assembled’s styling did not render, which made the application unusable.

“Root cause”: CSS

Apple Store
United Airlines
TikTok
Slack
GCash
Solana (Cryptocurrency)

They posted details in later tweets:

thread 1
thread 2
