SRE Weekly Issue #296

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating the incident channel, Jira ticket, and Zoom call, paging the right team, building the postmortem timeline, setting up reminders, and more. Book a demo:
https://rootly.com/?utm_source=sreweekly

Articles

WOW! This is the longest, most detailed public incident post I’ve ever seen from any company. I’ve linked to their short(er) summary, but be sure to check out the long version for all the juicy details.

If we operate too far from the edge, we lose sight of it and can’t anticipate when corrective work should be emphasized. If we operate too close to it, we are constantly in high-stakes situations and firefighting.

Fred Hebert — Honeycomb

This article goes through the actual math of creating an alert for an SLO, including how to avoid alerting for the entire sliding window even after the problem is fixed.

Ervin Barta
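
To make the sliding-window problem concrete, here is a minimal sketch of one common approach: multi-window burn-rate alerting. It is not taken from the linked article. The function names, the 99.9% SLO, the 14.4 threshold, and the 1-hour/5-minute window pairing are all assumptions chosen for illustration. The idea is that the long window proves the problem is significant, while the short window proves it is still happening, so the alert clears soon after a fix instead of lingering until the errors age out of the long window.

```python
from typing import Tuple

# (error_count, total_request_count) observed in one window.
Window = Tuple[int, int]


def burn_rate(errors: int, total: int, slo: float) -> float:
    """Return how many times faster than budgeted the error budget is burning.

    A burn rate of 1.0 means errors arrive at exactly the rate the SLO allows.
    """
    if total == 0:
        return 0.0  # no traffic, so nothing is being burned
    error_budget = 1.0 - slo  # e.g. 0.001 for a 99.9% SLO
    return (errors / total) / error_budget


def should_alert(long_window: Window, short_window: Window,
                 slo: float, threshold: float) -> bool:
    """Fire only when BOTH windows exceed the burn-rate threshold.

    Requiring the short window too is what lets the alert reset quickly
    once the underlying problem is fixed.
    """
    return (burn_rate(*long_window, slo) >= threshold
            and burn_rate(*short_window, slo) >= threshold)


# Hypothetical numbers: 99.9% SLO, threshold of 14.4x budget burn.
SLO = 0.999
THRESHOLD = 14.4

# During the incident: 2% errors over the last hour, 2.5% over the last 5 minutes.
during_incident = should_alert((2000, 100_000), (200, 8_000), SLO, THRESHOLD)

# Shortly after the fix: the hour still looks bad, the last 5 minutes are clean.
after_fix = should_alert((2000, 100_000), (0, 8_000), SLO, THRESHOLD)
```

With these made-up numbers, `during_incident` is true and `after_fix` is false, even though the one-hour window still contains every error from the incident.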

This Reddit thread doesn’t have any firm answers, but the discussion is pretty interesting.

u/faidoc and others — reddit

Good advice for writing resumes in general, with some SRE-specific tips. There are also links to example SRE resumes.

Quentin Rousseau — Rootly

This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.

Turns out they have runbooks too — or I guess you could say we have SOPs.

Hugh Brien — Transposit

What do you do about developers who just don’t want to be on call?

Charity Majors — Honeycomb

Before opening their new API up to the public, Ably walloped it with Locust.

Denis Sellu — Ably

Outages

Robinhood
Google Cloud Platform, Gmail, Google Calendar, Chat, Meet, and Groups

Linked above is their follow-up report from the perspective of GCP. There’s also a report for the G Suite side here.
