SRE Weekly Issue #348

Articles

Building reliable services: A guide to setting SLOs

Here’s a good intro to creating SLOs, including a section on best practices.

Cortex

When they started to get complaints from customers, they knew it was time to get serious about measuring and monitoring their reliability.

arun — Reputation

@MosquitoCapital on Twitter Re: Twitter

As an SRE and sysadmin with 10+ years of industry experience, I wanted to write up a few scenarios that are real threats to the integrity of the bird site over the coming weeks.

What follows is a thread with tens of realistic failure scenarios, many of which apply not just to Twitter.

@MosquitoCapital on Twitter

What a Broken Wheel Taught Google Site Reliability Engineers

A few amusing anecdotes reveal deeper lessons in SRE.

David Cassel — The New Stack

Twitter engineer claims problems will eventually lead to a breakdown

A resilient system like Twitter isn’t likely to go down instantly just because of a few changes. It’s much more likely to slowly degrade, per this article.

Christopher Carbone — Daily Mail

Strength in Numbers: The crash of National Airlines flight 102

It’s really interesting to see where this write-up differs from a video summary of the same accident by Mentour Pilot. Given the differences, I wonder if there are even more details that both left out?

Admiral Cloudberg

Troubleshooting On A Distributed Team Without Losing Common Ground

This is a really great description of common ground breakdown, referencing Woods and Klein.

Dan Slimmon
