This article covers five skills:
Ability to Lead
Taking Charge in Critical Situations
Expressing Opinions in a Non-Conflicting Way
Leading Initiatives for Continuous Improvement
Building and Maintaining Relationships
Prabesh
I was pretty dubious most of the way through this article — until I realized it was a story about why this solution didn’t work for them. Now it’s an interesting read about Python and exercising restraint in complexity.
Jean-Mark Wright
Meta is training an LLM to suggest commits that may have caused a given incident, and its suggestions are right 42% of the time.
Diana Hsu, Michael Neu, Mohamed Farrag, and Rahul Kindi — Meta
Percentiles, because when your math(s) teacher told you you’d use math all the time when you grew up, they were right! This article does a great job of explaining percentiles if you’re having trouble wrapping your mind around them.
Alex Ewerlöf
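If percentiles still feel abstract, a small worked example may help. The sketch below uses the nearest-rank definition, which is only one of several common ways to compute a percentile and is not necessarily the one the article uses. The function name and the sample latencies are made up for illustration.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    # 1-based rank of the sample that covers p percent of the data.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds, with two slow outliers.
latencies_ms = [12, 15, 11, 14, 13, 250, 12, 16, 13, 900]

p50 = percentile(latencies_ms, 50)
p90 = percentile(latencies_ms, 90)
p99 = percentile(latencies_ms, 99)
mean = sum(latencies_ms) / len(latencies_ms)
```

With these numbers the median is 13 ms while the mean is 125.6 ms, which is why latency is usually reported in percentiles rather than averages: two slow requests dominate the mean but barely move p50.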
Netflix designed their load shedding system to efficiently drop the requests that don’t matter as much and prioritize what users really care about.
Anirudh Mendiratta, Kevin Wang, Joey Lynch, Javier Fernandez-Ivern, and Benjamin Fedorka — Netflix
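To make the idea of prioritized load shedding concrete, here is a minimal sketch. It is not Netflix's implementation: the priority names, the utilization thresholds, and the `should_shed` function are all assumptions chosen for illustration. The only idea it demonstrates is that less important traffic gets dropped earlier as the server gets busier.

```python
from enum import IntEnum


class Priority(IntEnum):
    """Hypothetical request classes, most important first."""
    CRITICAL = 0      # e.g. requests the user is actively waiting on
    DEGRADED = 1      # failure worsens the experience but does not break it
    BEST_EFFORT = 2   # e.g. background or prefetch traffic


# Assumed thresholds: shed a class once utilization reaches this fraction.
SHED_AT = {
    Priority.BEST_EFFORT: 0.60,
    Priority.DEGRADED: 0.80,
    Priority.CRITICAL: 0.95,
}


def should_shed(priority, utilization):
    """Return True if a request of this priority should be rejected
    at the given utilization (0.0 to 1.0)."""
    return utilization >= SHED_AT[priority]


# At 70% utilization only best-effort traffic is dropped.
shed_now = [p.name for p in Priority if should_shed(p, 0.70)]
```

A real system would also need a trustworthy utilization signal and a cheap rejection path, since a shed request should cost far less than a served one.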
This article illustrates cascading delays in microservices and describes three techniques for dealing with them: timeouts, retries, and circuit breakers.
Jean-Mark Wright
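For readers who want to see how retries and a circuit breaker fit together, here is a rough sketch. It is not taken from the article. The class and function names, the thresholds, and the injectable `clock` and `sleep` parameters are assumptions made for illustration. The wrapped function is assumed to enforce its own timeout (for example via a socket or HTTP client timeout) and to raise an exception when it expires.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to stay open before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def call_with_retries(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call func, retrying failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise  # retrying an open circuit only adds load
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

The two pieces compose by passing `lambda: breaker.call(fetch)` to `call_with_retries`, where `fetch` is whatever remote call is being protected. Production versions usually add jitter to the backoff and limit the half-open state to a single in-flight trial.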
Cloudflare’s public DNS resolver had an outage due to a (probably accidental?) BGP hijack. 1.1.1.1 is a common address used internally for testing routing, so it’s easy to understand how an accidental route leak happened.
Bryton Herdes, Mingwei Zhang, and Tanner Ryan — Cloudflare
Here’s a new post about durability and write-ahead logs. Write-ahead logs are used almost everywhere. But to build an intuition for why, it is helpful to imagine what you would do without a WAL.
Phil Eaton
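As a rough illustration of the idea (not code from the post), here is a toy key-value store that appends every write to a log and syncs it to disk before updating its in-memory state, then replays the log on startup. The `WalKV` class name and the JSON-lines record format are assumptions made for this sketch.

```python
import json
import os


class WalKV:
    """Toy key-value store whose only durable state is an append-only log."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        self._replay()
        self.log = open(path, "a", encoding="utf-8")

    def _replay(self):
        """Rebuild in-memory state by re-applying every logged write."""
        if not os.path.exists(self.path):
            return
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    # Likely a torn final write from a crash; stop here.
                    # A real WAL would use checksums and truncate the tail.
                    break
                self.data[record["key"]] = record["value"]

    def set(self, key, value):
        # Log first, and force it to disk, before acknowledging the write.
        self.log.write(json.dumps({"key": key, "value": value}) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)

    def close(self):
        self.log.close()
```

Without the log, a crash between two updates to the in-memory dictionary would lose both. With it, reopening the same path recovers every write that `set` returned from.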
SRE WEEKLY