SRE Weekly Issue #277

A message from our sponsor, StackHawk:

Planelty saved weeks of work by implementing StackHawk instead of building an internal ZAP service. See how:
https://sthwk.com/planetly-stackhawk

Articles

Remember all those Robinhood outages? The US financial regulatory agency is making Robinhood repay folks for the losses they sustained as a result and also fining them for other reasons.

Michelle Ong, Ray Pellecchia, Angelita Plemmer Williams, and Andrew DeSouza — FINRA

This is brilliant and I wish I’d thought of it years ago:

One of the things we’ve previously seen during database incidents is that a set of impacted tables can provide a unique fingerprint to identify a feature that’s triggering issues.

Courtney Wang — Reddit

The suggested root cause involves consolidation in cloud providers and the importance of DNS.

Alban Kwan — CircleID

Full disclosure: Fastly, my employer, is mentioned.

This paper is about recognizing normalization of deviance and techniques for dealing with it. This tidbit really made me think:

[…] they might have been taught a system deviation without realizing that it was so […]

Bus Horiz

Blameless incident analysis is often at odds with a desire to “hold people accountable”. This article explores that conflict and techniques for managing the needs involved.

Christina Tan and Emily Arnott — Blameless

What can you do if you’re out of error budget but you still want to deliver new features? Get creative.

Paul Osman — Honeycomb

I am going to go through the variation we use to up skill our on-call engineers we called “The Kobayashi Maru”, the name we borrowed from the Star Trek training exercise to test the character of Starfleet cadets.

Bruce Dominguez

Outages

Slack
Zimbabwe Shared Services (financial services)
Snapchat
Facebook
YouTube
Twitter
Nest
GCash
SRE WEEKLY

Published
Categorized as SRE