SRE Weekly Issue #287

A message from our sponsor, StackHawk:

Trying to figure out how to keep your APIs secure? You’re not the only one. See how DataRobot is automating API security testing with StackHawk.
https://sthwk.com/DataRobot

Articles

Lots of details about how Slack does incident response in this one.

Stephen Whitworth — incident.io

This list also gives an interesting insight into the way this company does SRE.

Mayank Gupta and Merlyn Shelley — Squadcast

Oh BGP, you rascally little routing protocol.

Alessandro Improta and Luca Sani — Catchpoint

A comprehensive definition of SREs and Site Reliability Engineering, including what SREs do and what makes SREs different from other roles.

The article covers various facets of SRE and acknowledges that SREs can perform many roles.

JJ Tang — Rootly

Another really excellent air accident story with lots of great talk about mental models and confirmation bias. The crew saw many disparate indications, none of which pointed to anything in particular or was a huge problem on its own. That, coupled with confirmation bias, led them to miss what might seem obvious in hindsight.

Mentour Pilot

Outages

Coinbase, Kraken, and Gemini (cryptocurrency exchanges)
reddit