SRE Weekly Issue #348

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒.

Rootly automates manual tasks like creating an incident channel, Jira ticket and Zoom rooms, inviting responders, creating statuspage updates, postmortem timelines and more. Want to see why companies like Canva and Grammarly love us?

https://rootly.com/demo/

Articles

Here’s a good intro to creating SLOs, including a section on best practices.

  Cortex

When they started to get complaints from customers, they knew it was time to get serious about measuring and monitoring their reliability.

  arun — Reputation

As an SRE and sysadmin with 10+ years of industry experience, I wanted to write up a few scenarios that are real threats to the integrity of the bird site over the coming weeks.

What follows is a thread with tens of realistic failure scenarios, many of which apply not just to Twitter.

  @MosquitoCapital on Twitter

A few amusing anecdotes reveal deeper lessons in SRE.

  David Cassel — The New Stack

A resilient system like Twitter isn’t likely to go down instantly just because of a few changes. It’s much more likely to slowly degrade, per this article.

  Christopher Carbone — Daily Mail

It’s really interesting to see where this write-up differs from a video summary of the same accident by Mentour Pilot. Given the differences, I wonder if there are even more details that both left out?

  Admiral Cloudberg

This is a really great description of common ground breakdown, referencing Woods and Klein.

  Dan Slimmon
