SRE Weekly Issue #380

Articles

Amazon Prime Video’s Microservices Move Doesn’t Lead to a Monolith after All

Well, that cleared things up. (It didn’t, but the debate is interesting.)

Scott M. Fulton III — The New Stack

5 strategies to improve your incident communication

This article has five tips for great incident communication, along with a section on why this matters.

Luis Gonzalez — incident.io

SRE Engagement Models

Beyond just a list of ways SREs interface with other teams, this article also compares them and gives advantages and disadvantages of each.

Amin Astaneh — Certo Modo

Resilience requires helping each other out

Building every system to be strong enough to handle peak load can be very expensive. Can we instead build our systems to take excess load from each other cooperatively?

Lorin Hochstein — Surfing Complexity

Ensuring reliability: SLOs, on-call process, and postmortems

Another useful “how we do SRE” post, including an incident report template.

Pavel Pritchin — Dodo Engineering

Incident severity: why you need it and how to ensure it’s set

Here’s an interesting twist on the usual “incident severity 101” article: in a company where “anyone can declare an incident”, how do you make sure incident severity gets set consistently in every incident?

Mike Lacsamana — FireHydrant

Impedance Mismatch: SRE vs Dev Speed

How can we work to improve reliability when folks perceive our efforts to be counter to velocity?

Code Reliant

The Problem With Nonpunitive Safety Culture

In a blameless culture without consequences, what’s the incentive for learning to make the system more reliable? This is an incredibly thought-provoking article and I’m still not sure how I feel about it.

Robert Poston MD
