SRE Weekly Issue #431

Cloudflare incident on June 20, 2024

This is a really thorny one. As individual subprocesses started infinitely looping, their system shifted load to other datacenters, masking the problem. A coinciding failure in the load shifting system made things even more interesting.

Lloyd Wallis, Julien Desgats, and Manish Arora — Cloudflare

Are dashboards dead? Not quite. They just haven’t evolved

A great discussion of where dashboards fall short and what we should look for instead.

Adam Kinniburgh — SquaredUp

How we improved push processing on GitHub

Read how we have significantly improved the ability of our monolith to correctly and fully process pushes from our users.

Will Haltom — GitHub

Can you run in a tight loop and still be well-behaved?

Timing things to happen at specific intervals is yet another way that we collectively find out that dealing with time is a hard problem.

This article illustrates the subtle but important pitfalls in trying to create a system that does something on a strict interval.

rachelbythebay
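For a feel of the kind of pitfall involved, here is a minimal sketch of my own (not taken from the article). It assumes the goal is to run a task on a fixed grid of ticks. Sleeping for the full interval after each run lets the schedule drift by however long the task took, so this version computes each deadline from the original start time instead. The names `next_deadline` and `run_every` are hypothetical.

```python
import time


def next_deadline(start: float, interval: float, now: float) -> float:
    """Return the first tick on the grid start + k*interval (k >= 1) that is strictly after `now`."""
    elapsed = now - start
    # Number of whole intervals that have already passed since `start`.
    ticks_done = int(elapsed // interval) if elapsed >= 0 else 0
    return start + (ticks_done + 1) * interval


def run_every(interval, task, iterations, clock=time.monotonic, sleep=time.sleep):
    """Run `task` `iterations` times, aligned to a grid anchored at the first run.

    Deadlines are derived from the fixed start time, so the time spent inside
    `task` does not accumulate as drift. If `task` overruns one or more ticks,
    those ticks are skipped rather than fired back to back.
    """
    start = clock()  # monotonic by default, so wall-clock adjustments do not affect it
    for _ in range(iterations):
        task()
        deadline = next_deadline(start, interval, clock())
        delay = deadline - clock()
        if delay > 0:
            sleep(delay)
```

Calling `run_every(10.0, do_work, 3)` with a task that takes about 3 seconds would start the runs roughly 0, 10, and 20 seconds after the first call, rather than at 0, 13, and 26. Skipping missed ticks is one design choice among several; whether to skip, catch up, or alert on an overrun depends on the system.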

Using LLMs to Generate Terraform Code

This article reads more like a case study. The author gave a prompt to three different LLMs and actually tested the Terraform config each one produced.

Mike Vanbuskirk — Terrateam

How the Pusher team built subscription counting at scale

When your pub/sub system can have a million subscribers, even something as mundane as notifying about subscriber counts requires careful thought.

Ashmeet Singh — Pusher

Quick and Dirty vs. Polished and Perfect: The Two Sides of Engineering

To me, this concept comes up over and over in SRE, and it’s a core part of SLOs.

Juraj Masar — BetterStack

Feature flag vs feature management

In this blog post, we’ll dive deep into the technical aspects of feature flags and feature management, exploring how they can be leveraged by SREs to enable progressive delivery, improve system resilience, and optimize the user experience.

Hope Lynch — CloudBees

Pilots Unable to PULL UP!! Air Transat flight 211

This week’s Mentour Pilot video covers an accident that involved an inaccurate flight simulator. I wasn’t familiar with the term “negative training” before, but now I’m going to be keeping an eye out for it in the systems I manage!

Mentour Pilot

