SRE Weekly Issue #292

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating incident channel, Jira and Zoom, paging the right team, postmortem timeline, setting up reminders, and more. Book a demo:
https://rootly.io/?utm_source=sreweekly

Articles

The lessons:

Acknowledge human error as a given and aim to compensate for it
Conduct blameless post-mortems
Avoid the “deadly embrace”
Favor decentralized IT architectures

There have been quite a few of these “lessons learned” articles that I’ve passed over, but I feel like this one is worth reading.

Anurag Gupta — Shoreline.io

Niall Murphy

Could us-east-1 go away? What might you do about it? Let’s catastrophize!

I love catastrophizing!

Tim Bray

When evaluating options, this article focuses on reliability, both of the service itself and the options it provides for building reliable services on it.

Quentin Rousseau — Rootly

This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.

This one answers the questions: what are failure domains, and how can we structure them to improve reliability?

brandon willett

It’s a great list of questions, and it covers a lot of ground. SREs wear many hats.

Opsera

I’ve always been curious about how Prometheus and similar time-series DBs compress metric data. Now I know!

Alex Vondrak — Honeycomb

This one has some unconfirmed (but totally plausible!) deeper details about what might have gone wrong in the Facebook outage, sourced from rumors.

rachelbythebay

There’s a really intriguing discussion in here about why organizations might justify a choice of profit at the expense of safety, and how the deck is stacked.

Rob Poston

Outages

Docker Hub
Let’s Encrypt Status
Southwest Airlines
Nest