SRE Weekly Issue #252

Articles

Their on-call started out as four 24-hour shifts per person, interspersed throughout the year. Find out how they transitioned to a new approach in a process that spanned the start of the pandemic.

Mary Moore-Simmons — GitHub

Google Cloud Issue Summary — Google Meet — 2020-12-14

A new Meet version had a higher storage usage requirement, and a backend system filled up.

Google

WTF is Alert Fatigue

This is a webinar on alert fatigue, coming up on January 14.

Sarah Wells — Financial Times

Jamie Dobson — Container Solutions

Announcing the Security Chaos Engineering Report

The chaos experiments you do for security purposes can often expose weak points in reliability as well.

Aaron Rinehart — Verica

Kelly Shortridge — Capsul8

Little Known Ways to Better Use Your Error Budgets

Here are four nifty outside-the-box ideas to use the data you may already have.

Emily Arnott — Blameless

Lessons learned in incident management

Their custom incident management tool, DropSEV, can detect incident-worthy availability drops and file an incident automatically, obviating the need for an engineer to decide on severity level on the fly.

Joey Beyda and Ross Delinger — Dropbox
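
To make the idea of automatic severity assignment concrete, here is a minimal sketch of how an availability drop might be mapped to a severity level and turned into an incident record. It is not DropSEV's actual logic: the thresholds, the severity names, and the functions `classify_drop` and `maybe_file_incident` are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical thresholds (assumptions, not Dropbox's values): if availability
# falls below the given fraction, an incident of that severity is filed.
# Ordered from most to least severe so the first match wins.
SEVERITY_THRESHOLDS = [
    (0.90, "SEV-1"),
    (0.97, "SEV-2"),
    (0.995, "SEV-3"),
]


@dataclass
class Incident:
    service: str
    severity: str
    availability: float


def classify_drop(availability: float) -> Optional[str]:
    """Return the severity for an availability level, or None if it is healthy."""
    for threshold, severity in SEVERITY_THRESHOLDS:
        if availability < threshold:
            return severity
    return None


def maybe_file_incident(service: str, successes: int, total: int) -> Optional[Incident]:
    """Build an incident record if the success ratio is incident-worthy."""
    if total == 0:
        # No traffic in the window: nothing to measure, so nothing to file.
        return None
    availability = successes / total
    severity = classify_drop(availability)
    if severity is None:
        return None
    return Incident(service=service, severity=severity, availability=availability)
```

With a rule like this, the severity is decided by data rather than by an engineer under pressure; a real tool would also need to account for things like how long the drop lasts and how critical the affected service is.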

GitHub Availability Report: December 2020

This one has some additional detail on a November outage involving MySQL replication lag.

Keith Ballinger — GitHub

Outages

Slack
- My first couple hours of work this year were oddly quiet…
Heroku
Google Meet
- This is different from the one above.
FanDuel
Twitch
Coinbase
Archive of Our Own
