SRE Weekly Issue #361

I’m having some serious FOMO from having missed out on the Learning From Incidents conference. If you post or see any write-ups, please send them my way!

Articles

Health Checking

An in-depth explanation of health checking, including the importance of failing open to avoid a metastable cascading failure.

Srinavas — eightnoteight
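
As a rough illustration of what "failing open" can mean, here is a minimal sketch. It is not taken from the linked article: the function name `routable_backends`, the `min_healthy_fraction` parameter, and the 50% default are all assumptions made for this example. The idea is that when too many backends appear unhealthy at once, the health checker itself is the more likely culprit, so traffic goes to every backend rather than piling onto the few that still pass.

```python
def routable_backends(health, min_healthy_fraction=0.5):
    """Pick the backends to send traffic to.

    `health` maps a backend name to the result of its latest health check.
    The name, parameter, and default threshold are hypothetical.
    """
    if not health:
        return []

    healthy = [name for name, ok in health.items() if ok]

    # Fail open: if too few backends look healthy, distrust the checks and
    # route to everything instead of overloading the remaining survivors.
    if len(healthy) / len(health) < min_healthy_fraction:
        return list(health)

    return healthy
```

With one backend failing out of three, only the two healthy ones are returned; with two failing out of three, all three are returned.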

Exploring the Architecture of Amazon SQS

SQS (Amazon’s Simple Queue Service) is hugely scalable, but you must design your system with its limitations and behaviors in mind.

Satrajit Basu — DZone

Chris’s Wiki :: blog/web/SingleSignOnVsAvailability

What if your SSO provider is down? This article describes a scheme for falling back to HTTP Basic Authentication in an emergency.

Chris Siebenmann

Scaling Etsy Payments with Vitess: Part 1 – The Data Model

Etsy scaled their database by transitioning to a sharding strategy using Vitess. The journey was long and involved some tricky gotchas, as explained in this 3-part series.

River Rainne and Amy Ciavolino — Etsy

Consistent hashing algorithm

An in-depth explanation of consistent hashing with a special focus on building a case for why other sharding mechanisms fall short.

Nk — High Scalability
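
For readers who have not met consistent hashing before, the following is a minimal sketch of a hash ring with virtual nodes. It is an illustration only, not code from the linked article: the class name `HashRing`, the `vnodes` default of 100, and the choice of MD5 as the hash function are assumptions made for this example.

```python
import bisect
import hashlib


class HashRing:
    """A small consistent-hash ring (hypothetical names, illustrative only)."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes   # virtual nodes per physical node (assumed default)
        self._keys = []        # sorted ring positions
        self._owners = {}      # ring position -> node name
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(value):
        # MD5 is used only to spread positions evenly, not for security.
        return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

    def add(self, node):
        for i in range(self.vnodes):
            pos = self._hash(f"{node}#{i}")
            if pos in self._owners:
                continue  # extremely unlikely collision: keep the first owner
            bisect.insort(self._keys, pos)
            self._owners[pos] = node

    def remove(self, node):
        for i in range(self.vnodes):
            pos = self._hash(f"{node}#{i}")
            if self._owners.get(pos) == node:
                del self._owners[pos]
                del self._keys[bisect.bisect_left(self._keys, pos)]

    def lookup(self, key):
        if not self._keys:
            raise LookupError("ring is empty")
        # The first position clockwise from the key's hash owns the key.
        idx = bisect.bisect_right(self._keys, self._hash(key)) % len(self._keys)
        return self._owners[self._keys[idx]]
```

The property that makes this attractive for sharding is that adding a node only moves keys onto the new node; every other key keeps its previous owner. With naive `hash(key) % n` sharding, changing `n` would remap most keys.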

Hodor: Overload scenarios and the evolution of their detection and handling

LinkedIn chronicles their recent improvements to Hodor (Holistic Overload Detection and Overload Remediation), including new kinds of overload detectors.

Abhishek Gilra, Nizar Mankulangara, Salil Kanitkar, and Vivek Deshpande — LinkedIn

The Captain’s Gambit: The crash of Allegheny Airlines flight 485

An airline that gave monetary rewards for early arrivals and a steep cockpit authority gradient were just two of the factors that contributed to this crash.

Admiral Cloudberg
