SRE Weekly Issue #374

A message from our sponsor, Rootly:

Rootly is hiring a Sr. Developer Relations Advocate to continue helping more world-class companies like Figma, NVIDIA, and Squarespace accelerate their incident management journey. Rootly is looking for former on-call engineers with a passion for making the world a more reliable place. Learn more:


A fascinating PostgreSQL debugging story that hinges on code comments, of all things.

  Christopher White — Prefect

If you’re a distributed systems nerd, this one’s a real treat. It’s a detailed breakdown of the results of a Jepsen test.

  Denis Rystsov — Redpanda

An investigation into a kernel bug that caused excessive TCP memory usage in certain situations.

  Mike Freemon — Cloudflare

Let’s unpack what scaling a team is all about: what the indicators are, what steps you can take, and how you know if you’re done.

  Biju Chacko — Squadcast

Here’s another guide on running incident retrospectives and building a repeatable retrospective process.

  Amin Astaneh — Certo Modo

Here’s a fun little tool that lets you inspect how data in a C program is represented in memory.

  Julia Evans

This two-part series explores some shortcomings in Kubernetes’s CronJob system and the ways that Lyft fixed and worked around them.

  Kevin Yang — Lyft

And here’s a case where someone ran into the Kubernetes CronJob bug described in the previous article.

  Vallery Lancey

