SRE Weekly Issue #320

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating incident channel, Jira and Zoom, paging and adding responders, postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly lego set):
https://rootly.com/demo/

Articles

Slack shared this write-up of their February outage, which involved complex systems interactions and cascading failure.

  Laura Nolan — Slack

Go watch this lightning talk now! She had me hooked within the first ten seconds.

Hi, my name is Emily Ruppe, I work at Jeli.io, and I am a recovering incident commander, and I am sick of the phrase “to prevent this incident from ever happening again”.

  Emily Ruppe — DevOpsDays Rockies

This is my personal story of starting the SRE organization at Uber.

This article was written by a former Uber employee and is posted on their personal blog.

  Will Larson

This is total transparency at its finest. This write-up has all the details you could ever hope for on what went wrong, how they responded, and what comes next.

  Sri Viswanath — Atlassian

The target audience is new SREs and executive sponsors who would keep hearing these terms repeatedly but not take the time to read 1000s of words at a time.

[source: author comment on Reddit]

  Ash P. — SREPath

Dropbox wanted to be able to handle datacenter failure. To reach this goal, they moved from an active/active model to active/passive and spun up a new Disaster Readiness team to rework their failover system.

  Krishelle Hardson-Hurley, Ross Delinger, and Tong Pham — Dropbox

HelloFresh drove the implementation of SLOs in their Kubernetes-based infrastructure using Prometheus and Sloth.

  Chris Loukas — HelloFresh

A Roblox engineer outlines the way that Roblox handles reliability at scale.

  Alberto Covarrubias — Roblox

[…] let’s look at some common on call antipatterns and some simple things we can do to alleviate their common pitfalls.

  Nickolas Means — Sym

Outages

MyFitnessPal
Dyn
Apple Music and App Store
WhatsApp
1Password