SRE Weekly Issue #288

A message from our sponsor, StackHawk:

Want to see what’s new with automated security tooling? Tune in on September 30 to see how StackHawk and Semgrep are making it possible to embed security testing in CI/CD.
https://sthwk.com/whats-new-webinar

Articles

Faced with a difficult hiring market for SREs, they embarked on a well-designed, carefully thought-out program to hire and train entry-level folks as SREs, and it worked!

Thomas Betts — InfoQ

No matter how good your tooling is, how experienced you are, or how much you’ve prepared, incidents can still be hard.

Five people share what they find hardest during incident response.

Chris Evans — incident.io

This one has a lot of ideas about how to guide developers toward full ownership of their services in production.

Ambassador

In this post, I will cover the following modes of system resilience:

Adaptive Response
Superior Monitoring
Coordinated Resilience
Heterogenous Systems
Dynamic Repositioning
Requisite Availability

Ash P — Cruform

Root cause of success: unpatched security vulnerability

TMW a security vulnerability allows you to break into your infrastructure, averting disaster during an incident.

Lorin Hochstein, with incident story by Eric Dobbs

A migration didn’t go as planned, and customer traffic lost its way.

Heroku

I’m a big believer in human-in-the-loop automation. My favorite part of this article was this:

A further problem is that full automation — which aims to take the human out of the picture — requires a complete, nuanced understanding of a system and all potential outcomes, paradoxically resulting in heightened system complexity.

Tina Huang — Transposit

Outages

Google Voice
Assembled

For some users, Assembled’s styling was not rendering and caused the application to be unusable.

“Root cause”: CSS

Apple Store
United Airlines
TikTok
Slack
GCash
Solana (Cryptocurrency)

They posted details in later tweets:

thread 1
thread 2
