I’m having some serious FOMO from having missed out on the Learning From Incidents conference. If you post or see any write-ups, please send them my way!
Articles
An in-depth explanation of health checking, including the importance of failing open to avoid a metastable cascading failure.
Srinavas — eightnoteight
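A minimal sketch of the fail-open idea, not taken from the linked article: when too few backends pass their health checks, the checker itself is treated as suspect and traffic goes to every backend, rather than piling onto the few that still look healthy. The function name `routable_backends` and the `min_healthy_fraction` threshold are hypothetical, chosen only for illustration.

```python
def routable_backends(health, min_healthy_fraction=0.5):
    """Pick backends to route to, failing open when too few look healthy.

    health: dict mapping backend name -> bool (latest health check result).
    min_healthy_fraction: assumed threshold, not a value from the article.
    """
    if not health:
        return []
    healthy = [name for name, ok in health.items() if ok]
    if len(healthy) / len(health) < min_healthy_fraction:
        # Fail open: ignore the health checks and spread load across
        # everything, so the survivors are not overloaded in turn.
        return list(health)
    return healthy
```

With one of three backends failing, only the two healthy ones are returned. With two of three failing, all three are returned.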
SQS (Amazon’s Simple Queue Service) is hugely scalable, but you must design your system with its limitations and behaviors in mind.
Satrajit Basu — DZone
What if your SSO provider is down? This article describes a scheme for falling back to HTTP Basic Authentication in an emergency.
Chris Siebenmann
Etsy scaled their database by transitioning to a sharding strategy using Vitess. The journey was long and involved some tricky gotchas, as explained in this 3-part series.
River Rainne and Amy Ciavolino — Etsy
An in-depth explanation of consistent hashing with a special focus on building a case for why other sharding mechanisms fall short.
Nk — High Scalability
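For readers who want the gist before clicking through, here is a small sketch of a consistent hash ring with virtual nodes. It is an illustration under my own assumptions, not the article's code: the `HashRing` class, the `vnodes` count, and the use of MD5 as a stable hash are all hypothetical choices.

```python
import bisect
import hashlib


def _point(key: str) -> int:
    # Stable 64-bit position on the ring. Python's built-in hash() is
    # salted per process, so it is unsuitable here.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes  # virtual nodes per physical node (assumed value)
        self._ring = []       # sorted list of (point, node) pairs
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_point(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(p, n) for p, n in self._ring if n != node]

    def lookup(self, key):
        if not self._ring:
            raise LookupError("ring is empty")
        # Index of the first ring point at or after the key's position;
        # the modulo wraps around past the end of the ring.
        idx = bisect.bisect_left(self._ring, (_point(key),))
        return self._ring[idx % len(self._ring)][1]
```

The property that matters for sharding: after adding a node, a key either stays where it was or moves to the new node. With plain `hash(key) % n` sharding, changing `n` reshuffles most keys.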
LinkedIn chronicles their recent improvements to HODOR (Holistic Overload Detection and Overload Remediation), including new kinds of overload detectors.
Abhishek Gilra, Nizar Mankulangara, Salil Kanitkar, and Vivek Deshpande — LinkedIn
An airline that gave monetary rewards for early arrivals and a steep cockpit authority gradient were just two of the factors that contributed to this crash.
Admiral Cloudberg
SRE WEEKLY