SRE Weekly Issue #373

Articles

2023-03-08 Incident: Infrastructure connectivity issue affecting multiple regions

Datadog posted a report on their major outage in March, and it’s a doozy. An unattended updates system that they didn’t even want, need, or know about triggered across all hosts in multiple clouds nearly simultaneously, causing a regression.

Alexis Lê-Quôc — Datadog

Addressing GitHub’s recent availability issues

GitHub has had a string of apparently unrelated outages recently, and they’ve posted this description.

Mike Hanley — GitHub

StanzaSystems/awesome-load-management

Oh look, another awesome-* repo relevant to our interests!

A repo of links to articles, papers, conference talks, and tooling related to load management in software services: loadshedding, circuitbreaking, quota management and throttling. PRs welcome.

Laura Nolan and Niall Murphy — Stanza Systems

SRE Story with Matthew Iselin

This interview covers a lot of ground, including looking beyond just “up or down” when considering reliability.

Prathamesh Sonpatki — SRE Stories

Debugging a FUSE deadlock in the Linux kernel

If you’re in the mood for a deep systems debugging story, you’re in for a treat. The author takes you along for the ride with a wealth of detailed code snippets.

Tycho Andersen — Netflix

Why `fsync()`: Losing Unsynced Data

Regardless of the replication mechanism you must fsync() your data to prevent global data loss in non-Byzantine protocols.

Denis Rystsov and Alexander Gallego — Redpanda
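
To make the point concrete, here is a minimal sketch of what "fsync() your data" looks like for a single append. It is not taken from the Redpanda article; the helper name `durable_append` and the file layout are assumptions made purely for illustration.

```python
import os


def durable_append(path: str, record: bytes) -> None:
    """Append `record` to `path`, returning only after fsync() succeeds.

    Hypothetical helper for illustration; not the article's code.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        # os.write may write fewer bytes than requested, so loop until done.
        view = memoryview(record)
        while len(view) > 0:
            written = os.write(fd, view)
            view = view[written:]
        # Without this call the bytes may sit only in the OS page cache,
        # where a crash or power loss can discard them.
        os.fsync(fd)
    finally:
        os.close(fd)
```

A node that acknowledges a write before `os.fsync` returns is acknowledging data it may still lose on a crash. Note that for a newly created file, making the directory entry durable may also require an fsync on the containing directory; the sketch leaves that out.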

Emotional Intelligence

Emotional intelligence is a critical skill for SREs, especially when we interact with other teams in fraught situations.

Amin Astaneh — Certo Modo

Fleet Management at Spotify (Part 3): Fleet-wide Refactoring

Wow! Spotify created a set of tools to perform automated refactoring of thousands of repositories at once. This includes the ability to run tests, automatically merge pull requests without human review, and roll refactorings out gradually.

Matt Brown — Spotify

Teach me how to Howie!

Jeli has published a one-page cheat sheet for their highly detailed Howie guide for running incident retrospectives.

Jeli
