SRE Weekly Issue #392

A message from our sponsor, Rootly:

Rootly is proud to have been recognized by G2 as a High Performer and Enterprise Leader in Incident Management for the sixth consecutive quarter! In total, we received nine G2 awards in the Summer Report. As a thank-you to our community, we’re giving away some awesome Rootly swag. Read our CEO’s blog post and pick up some free swag here:
https://rootly.com/blog/celebrating-our-nine-new-g2-awards

In the midst of industry discussions about productivity and automation, it’s all too easy to overlook the importance of properly reckoning with complexity.

There’s a cool bit in there about redistributing complexity rather than simply getting rid of it, using microservices as an example.

  Ken Mugrage — Thoughtworks — MIT Technology Review

Interesting idea: if we go too far toward making incident investigations comfortable and routine, we can make learning less likely.

  Dane Hillard — Jeli

A problem with P99 is that 1% of your customers have a worse experience, and P99 doesn’t capture how much worse.

  Cynthia Dunlop — The New Stack
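
To make the P99 point concrete, here is a minimal sketch. It is not taken from the linked article: the two services, their latency numbers, and the `percentile` and `tail_summary` helpers are all invented for illustration. Both services report the same P99, yet the slowest 1% of requests have very different experiences.

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, to avoid floating-point rounding.
    rank = (p * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def tail_summary(samples, p=99):
    """Return (percentile value, mean of slower requests, worst request)."""
    cutoff = percentile(samples, p)
    tail = [s for s in samples if s > cutoff]
    if not tail:
        return cutoff, None, max(samples)
    return cutoff, sum(tail) / len(tail), max(tail)


# Hypothetical latencies in milliseconds, 1000 requests per service.
# 990 requests take 100 ms in both; only the slowest 10 differ.
service_a = [100] * 990 + [120] * 10
service_b = [100] * 990 + [5000] * 10

for name, samples in (("A", service_a), ("B", service_b)):
    p99, tail_mean, worst = tail_summary(samples)
    print(f"service {name}: P99={p99} ms, tail mean={tail_mean} ms, worst={worst} ms")
```

Under these made-up numbers, a dashboard showing only P99 would rate the two services identically, which is why it can help to track a tail mean or maximum alongside the percentile.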

Lambda isn’t “NoOps”, it’s just a different flavor of ops.

  Ernesto Marquez — Concurrency Labs

Salesforce had a major outage earlier this month, and now they’ve posted this follow-up analysis.

  Salesforce

This sysadmin story is a lesson in understanding the full context before passing judgement.

  rachelbythebay

Things get interesting toward the end, where they warn that focusing too narrowly on learning from incidents can cause problems.

  Luis Gonzalez — incident.io

The fail fast pattern is highly relevant for building reliable distributed systems. Rapid error detection and failure propagation prevent localized issues from cascading across system components.

  Code Reliant
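
As a rough illustration of failing fast, the sketch below checks everything that could doom a request before doing any work, and raises immediately instead of retrying or waiting. It is not the linked article's code: `FailFastError`, `handle_request`, and the parameters are hypothetical names chosen for this example.

```python
import time


class FailFastError(Exception):
    """Raised as soon as a request is known to be unable to succeed."""


def handle_request(payload, deadline, dependency_healthy, now=time.monotonic):
    """Process a request, rejecting it up front if it cannot succeed.

    deadline is a time.monotonic() timestamp; dependency_healthy is a
    callable returning True when the downstream service is usable.
    """
    # Reject malformed input before doing any work.
    if "user_id" not in payload:
        raise FailFastError("missing user_id")
    # Do not start work that cannot finish in time.
    if now() >= deadline:
        raise FailFastError("deadline already passed")
    # Do not queue behind a dependency that is known to be down.
    if not dependency_healthy():
        raise FailFastError("dependency unavailable")
    # The real work would go here; this placeholder just echoes the input.
    return {"user_id": payload["user_id"], "status": "ok"}


if __name__ == "__main__":
    deadline = time.monotonic() + 1.0
    print(handle_request({"user_id": 7}, deadline, lambda: True))
    try:
        handle_request({"user_id": 7}, deadline, lambda: False)
    except FailFastError as err:
        print(f"rejected immediately: {err}")
```

The caller gets an error right away and can shed load or fall back, rather than holding a connection open while a doomed request times out.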
