SRE Weekly Issue #276

A message from our sponsor, StackHawk:

Get ready for some GraphQL! Tune in this Tuesday, June 29 at 9 AM PT for an automated GraphQL security testing learning lab. Register:
http://sthwk.com/graphql-learning-lab

Articles

HBO accidentally sent an email to a bunch of people, and they tweeted (jokingly?) blaming their intern. This is a link to a short, thoughtful response thread.

Gergely Orosz

This is the story of the Bunny CDN outage linked below. Great read, thanks folks!

Dejan Grofelnik Pelzel — Bunny

There’s never a bad time to review the fallacies of distributed computing. This article introduces them with examples and discussion of each.

Alex Diaconu — Ably

These aren’t specific tools, but rather 7 classes of tools (with examples). They are:

Chaos engineering
Monitoring and alerting
Observability
Paging tools
SLO management
Infrastructure-as-Code (and everything-as-code)
Automated incident response

Quentin Rousseau — Rootly

Design is interpretive. We have to find common ground before we can even start to create a design, but finding that common ground is part of the design.

For example, we think of building codes as being precise, but when applied to new situations, they are ambiguous, and the engineers must make a judgment about how to apply them.

Lorin Hochstein

This starts with a really neat moment in which the interviewer asks Yiu to talk about lessons from her jewelry-making hobby that she applies to SRE.

Kurt Andersen

When GameStop’s stock shot through the roof earlier this year, Reddit’s traffic did too. This is the first article in a short series by Reddit’s SRE team on how they handled the influx.

This article is about the ways that user actions affected their systems in unexpected ways, and how they responded.

Courtney Wang — Reddit

Recently in our Site Reliability Engineering organization in Azure, we established a set of cultural values that we hold ourselves and each other accountable to.

Bill Johnson — Microsoft

Outages

Western Digital “My Book Live” hard drives
Amazon Prime Video and Alexa
PharmOutcomes

PharmOutcomes is a SaaS used by pharmacies.

Commonwealth Bank
Medium

I’ve gotten a few 500s from Medium while trying to review articles last week and this week. Maybe it’s this incident on their status page?

Bunny (CDN)
Reddit

This post on their status site says “API errors”, but I saw rumblings that suggested that Reddit itself was down.
