SRE Weekly Issue #252

A message from our sponsor, StackHawk:

Interested in how you can automate application security testing with GitHub Actions? Check out this on-demand webinar from StackHawk and Snyk and see how simple it is to get started.
https://sthwk.com/stackhawk-snyk

Articles

Their on-call started out as four 24-hour shifts per person, interspersed throughout the year. Find out how they transitioned to a new approach in a process that spanned the start of the pandemic.

Mary Moore-Simmons — GitHub

A new Meet version had a higher storage usage requirement, and a backend system filled up.

Google

This is a webinar on alert fatigue, coming up on January 14.

Sarah Wells — Financial Times

Jamie Dobson — Container Solutions

The chaos experiments you do for security purposes can often expose weak points in reliability as well.

Aaron Rinehart — Verica

Kelly Shortridge — Capsul8

Here are four nifty outside-the-box ideas to use the data you may already have.

Emily Arnott — Blameless

Their custom incident management tool, DropSEV, can detect incident-worthy availability drops and file an incident automatically, obviating the need for an engineer to decide on severity level on the fly.

Joey Beyda and Ross Delinger — Dropbox

This one has some additional detail on a November outage involving MySQL replication lag.

Keith Ballinger — GitHub

Outages
