SRE Weekly Issue #260

Articles

[Increment: Reliability] Interview: Dr. David D. Woods

People throw around “resiliency” quite often when they mean “reliability” or “high availability”. Dr. Woods sets the record straight.

Ipsita Agarwal — Increment

[Increment: Reliability] The process: Implementing Yelp’s failover strategy

A key part of their strategy is to keep their service running at 50% capacity or less, allowing them to lose a datacenter without overloading the remaining datacenter.

Mathieu Frappier, Dorothy Jung, and Qui Nguyen — Increment
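
The arithmetic behind the 50% figure can be sketched roughly as follows. This is an illustration only, assuming equally sized datacenters with traffic balanced evenly across them; the function names are hypothetical and nothing here is taken from Yelp's article.

```python
def max_safe_utilization(sites: int, failures: int = 1) -> float:
    """Highest per-site utilization that still leaves room to absorb
    the traffic of `failures` lost sites (assumes equal-sized sites)."""
    if sites <= 0 or not 0 <= failures < sites:
        raise ValueError("need sites > 0 and 0 <= failures < sites")
    return (sites - failures) / sites


def utilization_after_failover(utilization: float, sites: int, failures: int = 1) -> float:
    """Per-site utilization once the survivors take over all traffic."""
    if sites <= 0 or not 0 <= failures < sites:
        raise ValueError("need sites > 0 and 0 <= failures < sites")
    return utilization * sites / (sites - failures)


if __name__ == "__main__":
    # Two datacenters, one may fail: each must stay at or below half capacity.
    print(max_safe_utilization(2))             # 0.5
    # At exactly 50%, the survivor ends up fully loaded but not overloaded.
    print(utilization_after_failover(0.5, 2))  # 1.0
    # Above 50%, losing a site pushes the survivor past its capacity.
    print(utilization_after_failover(0.6, 2) > 1.0)  # True
```

Under the same assumptions, three datacenters could each run at up to two thirds of capacity and still tolerate the loss of one.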

[Increment: Reliability] On adaptive capacity in incident response

In issue #236, I linked to an excellent paper by Dr. Richard Cook and Beth Long about engineering resilience in incident response. Now they’re back, teaming up with John Allspaw to summarize and expand on that paper!

John Allspaw, Beth Adele Long, and Dr. Richard Cook — Increment

Security Chaos Engineering: How to Security Differently

A quick s/security/reliability/g and this is an SRE article; the same principles apply to both fields.

Aaron Rinehart — Verica

SRE2AUX: How Flight Controllers were the first SREs

How can we apply the tenets and principles of NASA mission controllers to our SRE work?

Geoff White — Blameless

SRE as Organizational Transformation: Lessons from Activist Organizers

Genius idea: we can take our lead from activists as we try to win over our organization to adopt SRE principles.

Chris Hendrix — Blameless

Atlas: Our journey from a Python monolith to a managed platform

This insightful observation caught my eye:

“It’s unnecessary overhead for a product team to plan capacity, set up good alerts and multihoming (automatically running in multiple data centers) for small, simple functionality.”

Naphat Sanguansin and Utsav Shah — Dropbox

Outages

Fitbit
Netflix
Disney+
- This week was the WandaVision finale.
Fastly
- Fastly is my employer.
