SRE Weekly Issue #322

Bit of a short issue this week. This morning, I stepped on my phone, crushing it mightily beneath my bootheel. Unfortunately, a lot of my automation for reviewing articles is on there… thank goodness I have functioning backups.

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating the incident channel, Jira ticket, and Zoom meeting, paging and adding responders, building the postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly lego set):
https://rootly.com/demo/

Articles

What? Actually, it’s a pretty good analogy.

  Emily Arnott — Blameless

Mercari follows up on their previous article about their embedded SRE team, with more details on how their embedding model works.

  Taichi Nakashima — Mercari

Interesting things happen when you combine tail latency with a microservice architecture.

  Marc Brooker
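
For a rough feel for why, here's my own back-of-the-envelope sketch (not taken from the article). It assumes each backend call independently has a 1% chance of landing in its slow tail, and that a request is slow if any one of its calls is slow. The function name and the 1% figure are made up for illustration.

```python
def p_any_slow(fanout: int, p_slow: float = 0.01) -> float:
    """Probability that at least one of `fanout` independent calls is slow.

    Assumes calls are independent and each is slow with probability `p_slow`
    (an illustrative simplification; real services are rarely independent).
    """
    return 1.0 - (1.0 - p_slow) ** fanout


if __name__ == "__main__":
    # Compare a single call with requests that fan out to more services.
    for n in (1, 10, 100):
        print(f"fan-out {n:>3}: {p_any_slow(n):.1%} of requests hit a slow call")
```

Under those assumptions, the "rare" 1% tail stops being rare once a single request touches enough services.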

Their starting point was paging for every single exception raised by their application. Here’s how they tempered that a bit to get a handle on their paging volume.

  Lisa Karlin Curtis — incident.io

This article draws from the “SRE Hierarchy” in Google’s SRE book (which itself is a reference to Maslow’s hierarchy of needs). It recasts the SRE hierarchy as a path to maturity.

  Ash P. — SREPath

Google posted this summary of an incident from late April. A configuration change had the unintended effect of causing livestream view requests to fail.

  Google

Outages

Xbox

I don’t normally bother with game outages, but this one caught my eye. During the 4-day outage, customers were unable to play Xbox games that they had already purchased.

Twitter
Coinbase