Articles
Remember all those Robinhood outages? FINRA, a US financial industry regulator, is ordering Robinhood to repay customers for the losses they sustained as a result, and is also fining the company for other reasons.
Michelle Ong, Ray Pellecchia, Angelita Plemmer Williams, and Andrew DeSouza — FINRA
This is brilliant and I wish I’d thought of it years ago:
One of the things we’ve previously seen during database incidents is that a set of impacted tables can provide a unique fingerprint to identify a feature that’s triggering issues.
Courtney Wang — Reddit
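To make the fingerprint idea concrete, here is a minimal sketch, not Reddit's actual tooling. It assumes you already maintain a mapping from each feature to the tables it touches, then ranks features by how closely their table set overlaps the set of tables impacted during an incident. The feature names, table names, and the choice of Jaccard similarity as the score are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap between two table sets: size of intersection over size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_features(
    impacted: Set[str], feature_tables: Dict[str, Set[str]]
) -> List[Tuple[str, float]]:
    """Rank features by how closely their tables match the impacted set."""
    scores = [
        (name, jaccard(impacted, tables))
        for name, tables in feature_tables.items()
    ]
    # Highest similarity first; ties keep their original order.
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical feature-to-table mapping, invented for this example.
FEATURE_TABLES = {
    "comment_voting": {"comments", "votes", "users"},
    "private_messages": {"messages", "users"},
    "awards": {"awards", "comments", "users"},
}

# Hypothetical set of tables showing trouble during an incident.
impacted_tables = {"comments", "votes"}

ranking = rank_features(impacted_tables, FEATURE_TABLES)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

The top-ranked feature is only a lead for investigation, since several features can share the same tables.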
The suggested root cause involves consolidation in cloud providers and the importance of DNS.
Alban Kwan — CircleID
Full disclosure: Fastly, my employer, is mentioned.
This paper is about recognizing normalization of deviance and techniques for dealing with it. This tidbit really made me think:
[…] they might have been taught a system deviation without realizing that it was so […]
Business Horizons
Blameless incident analysis is often at odds with a desire to “hold people accountable”. This article explores that conflict and techniques for managing the needs involved.
Christina Tan and Emily Arnott — Blameless
What can you do if you’re out of error budget but you still want to deliver new features? Get creative.
Paul Osman — Honeycomb
I am going to go through the variation we use to upskill our on-call engineers, which we called “The Kobayashi Maru”, a name we borrowed from the Star Trek training exercise to test the character of Starfleet cadets.
Bruce Dominguez
Outages
Slack
Zimbabwe Shared Services (financial services)
Snapchat
Facebook
YouTube
Twitter
Nest
GCash
SRE WEEKLY