SRE Weekly Issue #395

What every developer should know about database consistency

This article gives an overview of database consistency models and introduces the PACELC Theorem.

Roberto Vitillo

What is a Memory leak? Causes | Detection | Tools | Golang

A primer on memory and resource leaks, including some lesser-known causes.

Code Reliant

Rescue Struggling Pods from Scratch

How can you troubleshoot a broken pod when it’s built FROM scratch and you can’t even run a shell in it?

Mike Terhar
Full disclosure: Honeycomb is my employer.

Five mindset shifts for effective reliability programs

This article explains why reliability isn’t a one-off project that you can bolt on and then move on from.

Gavin Cahill — Gremlin

BPFAgent: eBPF for Monitoring at DoorDash

DoorDash wanted consistent observability across their infrastructure that didn’t depend on instrumenting each application. To solve this, they developed BPFAgent, and this article explains how.

Patrick Rogers — DoorDash

What is Mean Time to Innocence?

Mean time to innocence is the average elapsed time between when a system problem is detected and any given team’s ability to say the team or part of its system is not the root cause of the problem.

This article, of course, is about not having a culture like that.

John Burke — TechTarget

Details of the Pulumi Outage on October 6, 2023

It was the DB — more specifically, it was a DB migration with unintended locking.

Casey Huang — Pulumi

Google Cloud Networking Incident Report (2023-10-05)

The incident stemmed from a control plane change that worked in some regions but caused OOMs in others.

Google
