SRE Weekly Issue #355

Articles

I’m trying something new: I’m looking for input from you, dear readers!

This link is a Google Form where I’m asking for ideas that I might turn into a blog post or conference talk. If you’re game, I’d love to hear what you think.

Join Jeli and Honeycomb for an Incident Response and Analysis Discussion

Here’s the panel for this webinar:

Vanessa Huerta Granda (Jeli)
Emily Ruppe (Jeli)
Liz Fong-Jones (Honeycomb)
Fred Hebert (Honeycomb)

Honestly, with that set of names, I’d listen even if they were just discussing the weather.
Full disclosure: Honeycomb, my employer, is mentioned.

The near crash of Air Canada flight 759

This week saw an outage of the NOTAM system, which disseminates important information to aircraft pilots in the US. As a result, all flights in the US were grounded.

There’s not much in the way of interesting detail available yet, but I did see a mention of this air incident in which NOTAMs played a significant part. Mentour Pilot also covered this one.

Admiral Cloudberg

A New Definition of Reliability

In essence, this new reliability is:

The health of your system
Weighed based on customer expectations and happiness
Prioritized based on your current capabilities

This article focuses on the sociotechnical aspects of reliability.

Jim Gochee — The New Stack

When to Alert on What?

Here are some guidelines for what kind of alerting works best for services at various stages of maturity.

Ali Sattari

Creating Safety is Dangerous Work

The actions we take to avert a potential problem can introduce their own risks.

Will Gallego