SRE Weekly Issue #368

Articles

Isolating Noisy Neighbors in Distributed Systems: The Power of Shuffle-Sharding

This article uses a simulation to demonstrate the power of shuffle sharding to limit the blast radius of overload conditions.

Eugene Retunsky — DZone
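
For readers new to the idea, here is a minimal sketch of shuffle sharding. It is not taken from the linked article or its simulation; the function name `shuffle_shard`, the worker and customer names, and the pool and shard sizes are all hypothetical. Each customer is assigned a stable, pseudo-random subset of workers, so a noisy customer can only fully overlap with the few customers who happen to draw the same subset.

```python
import hashlib
import random


def shuffle_shard(customer_id, workers, shard_size):
    """Return a stable, pseudo-random subset of workers for one customer."""
    # Seed a private RNG from a hash of the customer ID so the same
    # customer always maps to the same shard.
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return sorted(rng.sample(workers, shard_size))


# Hypothetical pool of 8 workers and 100 customers, 2 workers per shard.
workers = [f"worker-{i}" for i in range(8)]
customers = [f"customer-{i}" for i in range(100)]
noisy = shuffle_shard("noisy-customer", workers, 2)

# Customers whose entire shard coincides with the noisy customer's shard.
fully_exposed = [c for c in customers if shuffle_shard(c, workers, 2) == noisy]
print(f"noisy shard: {noisy}")
print(f"{len(fully_exposed)} of {len(customers)} customers share the whole shard")
```

With 8 workers and shards of 2 there are 28 possible shards, so on average roughly 1 customer in 28 would share a complete shard with the noisy one; everyone else keeps at least one unaffected worker.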

Stress Testing Tutorial

A comprehensive look at stress testing, how it differs from load testing, how to implement it, and how to analyze the results.

LambdaTest

Building Fault Tolerance with RPC Fallbacks in DoorDash’s Microservices

Retries and high availability are great, but for critical dependencies, we can go a step further and define an alternative in case a dependency is down.

Leart Gjoni — DoorDash
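
As a rough illustration of the general pattern (not DoorDash's implementation, which the linked post describes), the sketch below retries a primary call a few times and then switches to an alternative. The helper `call_with_fallback`, the choice of `ConnectionError` as the failure signal, and the example functions are all assumptions made for this sketch.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_fallback(primary: Callable[[], T],
                       fallback: Callable[[], T],
                       attempts: int = 2) -> T:
    """Try the primary dependency a few times, then use the fallback."""
    for _ in range(attempts):
        try:
            return primary()
        except ConnectionError:
            # Treat connection failures as "dependency is down" and retry.
            continue
    # All attempts failed: serve the alternative instead of an error.
    return fallback()


def fetch_live_eta() -> int:
    # Hypothetical primary dependency that is currently unavailable.
    raise ConnectionError("eta service unreachable")


def fetch_default_eta() -> int:
    # Hypothetical fallback: a conservative static estimate, in minutes.
    return 45


eta = call_with_fallback(fetch_live_eta, fetch_default_eta)
print(f"eta minutes: {eta}")
```

The fallback should be cheaper and more reliable than the primary path, since it only helps if it is still available when the primary is not.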

DoorDash’s May 12th Outage

From the archives, here’s an incident report from a major outage at DoorDash in 2022.

Ryan Sokol — DoorDash

Holding Retrospectives in a Cloud-Native World

Amazon’s old internal “retrospective” process sounds painful and scary. Fortunately, the author took the good parts and learned some valuable lessons from the rest.

Lee Atchison — Container Journal

SREs: Stop Asking Your Product Managers for SLOs

Instead of asking PMs to “speak SRE,” span the communication gap by using the common language of user stories to build business-cogent SLOs.

Kit Merker — DevOps.com

AWS’s Anti-Competitive Move Hidden in Plain Sight

Amazon favors its own service offerings, such as RDS, by making the (normally pricey) cross-availability-zone data transfer free.

Corey Quinn — Last Week In AWS

Should Every Incident Get a Retrospective?

It’s easy to think of reasons to run a retrospective on an incident. What about the reverse? Which incidents should we skip over?

Lex Neva — The New Stack
Full disclosure: Honeycomb, my employer, is mentioned.
