SRE Weekly Issue #338

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒.

Rootly automates manual tasks like creating an incident channel, Jira ticket, and Zoom room; inviting responders; creating statuspage updates and postmortem timelines; and more. Want to see why companies like Canva and Grammarly love us?

https://rootly.com/demo/

Articles

This one advocates for looking beyond “root cause” when analyzing an incident, and instead finding Themes and Takeaways.

If it can be solved with a pull request it’s not a takeaway.

  Vanessa Huerta Granda — Jeli

In this juicy incident, the Incident Commander’s intimate knowledge of a similar failure mode led the response to fixate on it and away from the true cause.

  Fred Hebert — Honeycomb

[…] the more we normalize lower-impact incidents, the more confidence and experience we build for Sev1 situations.

  Dan Condomitti — The New Stack

Want to compensate folks extra for on-call work? This tool connects to PagerDuty to do all the heavy lifting for you.

  Lawrence Jones — incident.io

This Reddit post in r/sre has some really great stories in the comments.

  various users — Reddit

Along with the “why”, this article also goes into the “how”.

  Martha Lambert — incident.io

Early in my career, I had to write a raw IP packet generator to reproduce a DoS attack so that I could mitigate it. It’s fun!

  Julia Evans

In an incident in July, a cloud provider change broke provisioning for new Codespaces VMs, taking down the service.

  Jakub Oleksy — GitHub

Put Safety First and Minimize the 12 Common Causes of Mistakes in the Aviation Workplace

  FAA (US Federal Aviation Administration)
