SRE Weekly Issue #292

Articles

Four lessons every company should learn from the back-to-back Facebook outages

The lessons:

Acknowledge human error as a given and aim to compensate for it
Conduct blameless post-mortems
Avoid the “deadly embrace”
Favor decentralized IT architectures

There have been quite a few of these “lessons learned” articles that I’ve passed over, but I feel like this one is worth reading.

Anurag Gupta — Shoreline.io

Niall Murphy

Worst Case

Could us-east-1 go away? What might you do about it? Let’s catastrophize!

I love catastrophizing!

Tim Bray

What Managed Kubernetes Service is Best for SREs?

In evaluating the options, this article focuses on reliability: both the reliability of each service itself and the features each one offers for building reliable services on top of it.

Quentin Rousseau — Rootly

This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.

SRE Toolkit: Failure Domains

This one answers the questions: what are failure domains, and how can we structure them to improve reliability?

brandon willett

SRE top interview questions to land an SRE role

It’s a great list of questions, and it covers a lot of ground. SREs wear many hats.

Opsera

How Time Series Databases Work—and Where They Don’t

I’ve always been curious about how Prometheus and similar time-series DBs compress metric data. Now I know!

Alex Vondrak — Honeycomb

An UPDATE without a WHERE, or something close to it

This one has some unconfirmed (but totally plausible!) deeper details about what might have gone wrong in the Facebook outage, sourced from rumors.

rachelbythebay

Turning Safety vs. Profits Into a Fair Fight

There’s a really intriguing discussion in here about why organizations might justify a choice of profit at the expense of safety, and how the deck is stacked.

Rob Poston

Outages

Docker Hub
Let’s Encrypt Status
Southwest Airlines
Nest