SRE Weekly Issue #322

Bit of a short issue this week. This morning, I stepped on my phone, crushing it mightily beneath my bootheel. Unfortunately, a lot of my automation for reviewing articles is on there… thank goodness I have functioning backups.

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating incident channels, Jira tickets, and Zoom meetings, paging and adding responders, building postmortem timelines, setting up reminders, and more. Book a demo (+ get a snazzy Rootly lego set):


What? Actually, it’s a pretty good analogy.

  Emily Arnott — Blameless

Mercari follows up on their previous article about their embedded SRE team, with more details on how their embedding model works.

  Taichi Nakashima — Mercari

Interesting things happen when you combine tail latency with a microservice architecture.

  Marc Brooker
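
A rough back-of-the-envelope sketch of why that combination gets interesting. This is an illustration only, not code from the article: it assumes each downstream call independently has a 1% chance of landing in the slow tail, and that a request waits on all of its calls. The function name `p_any_slow` and the fan-out values are made up for the example.

```python
def p_any_slow(fanout: int, p_slow: float = 0.01) -> float:
    """Probability that at least one of `fanout` independent calls is slow.

    Assumes every call has the same, independent chance `p_slow` of
    falling in the slow tail (e.g. 0.01 for "slower than p99").
    """
    return 1.0 - (1.0 - p_slow) ** fanout


# How often does a request touch the tail as fan-out grows?
for n in (1, 10, 100):
    print(f"fan-out {n:>3}: {p_any_slow(n):.1%} of requests hit at least one slow call")
```

Under those assumptions, a request that fans out to 100 services sees at least one p99-slow call roughly 63% of the time, so a latency that is rare for any single service becomes the common case for the request as a whole. Real services are rarely independent, so treat the numbers as a rough guide.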

Their starting point was paging for every single exception raised by their application. Here’s how they tempered that a bit to get a handle on their paging volume.

  Lisa Karlin Curtis —

This article draws from the “SRE Hierarchy” in Google’s SRE book (which itself is a reference to Maslow’s hierarchy of needs). It recasts the SRE hierarchy as a path to maturity.

  Ash P. — SREPath

Google posted this summary of an incident from late April. A configuration change had the unintended effect of causing livestream view requests to fail.




I don’t normally bother with game outages, but this one caught my eye. During the 4-day outage, customers were unable to play Xbox games that they had already purchased.

