SRE Weekly Issue #340

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒.

Rootly automates manual tasks like creating an incident channel, Jira ticket, and Zoom room, inviting responders, and creating statuspage updates, postmortem timelines, and more. Want to see why companies like Canva and Grammarly love us?

https://rootly.com/demo/

Articles

This one’s from a couple years ago and covers 3 main themes the author saw at SRECon Americas 2020. Fascinating topics include providing context for newbies, learning from incidents, and rethinking the incident command system.

  Taylor Barnett — Transposit

On September 8, Honeycomb had a major outage in data ingestion, and they’ve posted this preliminary report, “pending an in-depth incident review in the upcoming weeks”.

BONUS CONTENT: Another outage report from a different outage the next day.

  Honeycomb
Full disclosure: Honeycomb is my employer.

This is neat! Someone posted a day in their life as an actual SRE, and a bunch of commenters followed suit.

  Various commenters — Reddit

Some big names in SRE got together to talk about how to know when your system is broken. Listen to the recording or read this excellent summary that goes in depth on grey failures and more.

  Emily Arnott — Blameless

To better scale our systems, our infrastructure and product teams got together and decided to make these optimizations: reduce database loads, conduct load tests and size the demand, and prioritize critical flows.

…and sharding.

  Robinhood

A major incident went poorly, and that catalyzed investment in developing a new incident response system. They worked to transition from swarming to Incident Command.

  Vikrant Saini — Razorpay

I love this part:

[…] if you have to deploy your microservices in a certain order, they’re not really microservices.

  Cortex

This one had an interesting interplay of contributing factors.

  Heroku
