SRE Weekly Issue #395

A message from our sponsor, FireHydrant:

Incident management platform FireHydrant is combining alerting and incident response in one ring-to-retro tool. Sign up for the early access waitlist and be the first to experience the power of alerting + incident response in one platform at last.
https://firehydrant.com/signals/

This article gives an overview of database consistency models and introduces the PACELC Theorem.

  Roberto Vitillo

A primer on memory and resource leaks, including some lesser-known causes.

  Code Reliant

How can you troubleshoot a broken pod when it’s built FROM scratch and you can’t even run a shell in it?

  Mike Terhar
  Full disclosure: Honeycomb is my employer.

This article explains why reliability isn’t a one-off project that you can bolt on and then move on from.

  Gavin Cahill — Gremlin

DoorDash wanted consistent observability across their infrastructure that didn’t depend on instrumenting each application. To solve this, they developed BPFAgent, and this article explains how.

  Patrick Rogers — DoorDash

Mean time to innocence is the average elapsed time between when a system problem is detected and any given team’s ability to say the team or part of its system is not the root cause of the problem.

This article, of course, is about not having a culture like that.

  John Burke — TechTarget

It was the DB — more specifically, it was a DB migration with unintended locking.

  Casey Huang — Pulumi

The incident stemmed from a control plane change that worked in some regions but caused OOMs in others.

  Google
