This one’s full of great advice about making sure alerts are actionable, including alerting on flows that actually matter to customers.
Nočnica Mellifera — Checkly
Here is a collection of things I learned after getting back into Magic: The Gathering over the past 10 years or so. They apply to both the MTG scene and your friendly neighborhood incident response process.
Ross Brodbeck
It was a classic application of technical debt: they chose to focus on customer-facing features and let k8s updates slide. Here’s how they caught back up safely.
Jeff Wolski
This article presents an interesting hypothesis, and from it draws some nifty conclusions about reasoning about failure in systems.
We cannot know for sure whether or not software is going to be incident-free. It might well be, but we can’t ever prove it.
Niall Murphy
For teams to solve incidents quickly and effectively, responders need to be able to trust each other and stakeholders have to trust the responders. This level of trust is hard to cultivate if your organization doesn’t have a significant amount of psychological safety.
Mandi Walls — PagerDuty
More than just an interview, this article outlines a multi-year transformation from disorganized, haphazard incident investigation to a smooth and efficient incident response process.
Eric Silberstein — Klaviyo
In this article, you will learn how to prevent broken connections when a Pod starts or shuts down. You will also learn how to shut down long-running tasks and connections gracefully.
Daniele Polencic — Learnk8s
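For a rough idea of what graceful shutdown looks like in application code, here is a minimal sketch rather than anything taken from the article. It relies on the fact that Kubernetes sends SIGTERM to a container when its Pod is terminated and follows up with SIGKILL once the grace period runs out. The names `run_worker`, `process_one`, `poll_interval`, and `drain_delay` are hypothetical, chosen only for illustration.

```python
import signal
import threading
import time

# Set when the process has been asked to stop (for example by SIGTERM).
shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Signal handler: record the request and let the main loop wind down."""
    shutdown_requested.set()


def install_handlers():
    """Register the handler; must be called from the main thread."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def run_worker(process_one, poll_interval=0.1, drain_delay=0.0):
    """Call process_one() repeatedly until shutdown is requested.

    process_one is a hypothetical callable that handles one unit of work.
    drain_delay is an assumed pause before exiting, giving load balancers
    time to stop routing new traffic to this Pod.
    Returns the number of units processed.
    """
    processed = 0
    while not shutdown_requested.is_set():
        process_one()  # finish the current unit of work before checking again
        processed += 1
        # Wake up early if shutdown is requested during the pause.
        shutdown_requested.wait(poll_interval)
    time.sleep(drain_delay)
    return processed


if __name__ == "__main__":
    install_handlers()
    count = run_worker(lambda: time.sleep(0.05), drain_delay=1.0)
    print(f"processed {count} units, exiting cleanly")
```

The in-flight unit of work always completes before the loop checks the flag, so a termination request never cuts a task off halfway. Any drain delay plus the longest task has to fit within the Pod's `terminationGracePeriodSeconds`, or the process will be killed before it finishes.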
It turns out that an S3 bucket owner pays for failed requests to that bucket, even if they’re unauthenticated, so anyone can run up your AWS bill if they know your bucket’s name. Oops.
Oh, and they can get the bucket name from Certificate Transparency (CT) logs. (Thanks, Corey Quinn!)
Maciej Pocwierz
SRE WEEKLY