SRE Weekly Issue #321

Articles

Using Fault Injection Testing to Improve DoorDash Reliability

A researcher explains how they implemented their microservice failure testing tool at DoorDash. The tool, Filibuster, automatically discovers microservice dependencies and injects faults, avoiding the need to design specific individual failure scenarios.

Christopher Meiklejohn — DoorDash

Twitter: @ReinH on Atlassian’s incident write-up

Last week, I shared Atlassian’s outage write-up. This link is a Twitter thread with a critique.

I feel like it is perhaps not a “good look” to repeatedly try to sell your product in your writeup about your product’s catastrophic outage

@ReinH

usefulness of error

“Error” serves a number of functions for an organization: as a defense against entanglement, the illusion of control, as a means for distancing, and as a marker for a failed investigation.

Eric Dobbs

Incident Report for Enom (January 15, 2022)

This is a write-up posted in January for an incident that occurred during an infrastructure migration. I feel like I can relate to every one of the learnings.

Enom (Tucows)

On-Call: Leave It Better Than You Found It

In the past two years, I’ve been participating in on-call rotations as a Site Reliability Engineer at Vinted. Here are some of the practical lessons I’ve learned about the process.

Ernestas Narmontas

How SREs analyze risks to evaluate SLOs

This article is all about finding out what risks exist that may impact your ability to meet your SLOs. Once you’ve done that, you can determine whether your SLOs are realistic.

Ayelet Sachto — Google

How we aligned 200 teams to monitor services with SLOs

When your organization chooses to implement SLOs, how do you get everyone on board? This two-part series has an in-depth look at how Klarna did it.

Andrew Cartine — Klarna

What is an SRE Product Manager?

Subtitle: And why do SRE teams need PMs?

After laying out the reasons why SREs need PMs, this article goes into detail about what a PM can bring to an SRE team.

António Araújo — detech.ai

BellJar: A new framework for testing system recoverability at scale

BellJar helps users find cyclic dependencies in their services by running totally isolated VMs and requiring users to explicitly enable every external dependency needed to bootstrap each service. It also has a really neat feature: it automatically generates runbooks based on these test cases.

Christopher Bunn and Jie Huang — Meta

Meltdown: Three Mile Island

This week, I watched Netflix’s Meltdown: Three Mile Island, a documentary about the nuclear accident in the US in 1979. It’s not exactly a post-incident write-up, but there’s a lot in there about normalization of deviance, situational awareness, and risk-taking (both in and out of incidents).

Netflix

Outages

Slack

Slack (a second incident)

Heroku

Heroku’s been dealing with a security incident since April 13. They performed a mass password reset of all accounts and their GitHub integration has been disabled for days.

Roblox