SRE Weekly Issue #312

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating incident channel, Jira and Zoom, paging the right team, postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly shirt):
https://rootly.com/demo/?utm_source=sreweekly

Articles

There’s a really great discussion of “pilot error” at the end of this air accident summary video.

  Mentour Pilot

There are some really great names and talks on the agenda for this half-day virtual conference on April 1.

  IRConf

This article is about building a framework, rather than using one off-the-shelf, to ensure that it’s tailored to the needs of your organization.

  Ethan Motion

When are you smarter than your playbooks, and when are your playbooks smarter than you?

  Andre King — Rootly
This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.

This one is about piecing together the story of how an incident unfolded. One interviewee might mention something new, and then you can ask later interviewees about it.

  Cory Watson — Jeli

All about alert fatigue: how to recognize it and how to fix it once you notice it.

  Emily Arnott — Blameless

This one includes a summary of their February 2 outage:

[…] a routine deployment failed to generate the complete set of integrity hashes needed for Subresource Integrity. The resulting output was missing values needed to securely serve Javascript assets on GitHub.com.

  Jakub Oleksy — GitHub

Following on last week’s article about the term “postmortem”, this one has even more great reasons to pick a different word.

  Blameless

This article recommends a two-stage approach to writing an incident retrospective report: a “calibration document” and then the final report.

  Thai Wood — Jeli

Outages

Tasmania
Discord
