SRE Weekly Issue #404

For every 9 you add to SLO, you’re making the system 10x more reliable but also 10x more expensive.

Alex Ewerlöf
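
To make the "every 9" arithmetic concrete, here is a minimal sketch that converts a number of nines into a downtime budget. The function name and the 30-day window are assumptions made for illustration, not something taken from the linked article. It covers only the reliability side of the claim; the 10x cost figure is the author's.

```python
def allowed_downtime_minutes(nines: int, period_days: int = 30) -> float:
    """Downtime budget in minutes for an availability SLO with `nines` nines.

    Hypothetical helper: 2 nines means 99%, 3 nines means 99.9%, and so on.
    """
    error_budget = 10 ** -nines  # fraction of the period allowed to fail
    return period_days * 24 * 60 * error_budget


if __name__ == "__main__":
    for n in range(1, 6):
        slo_percent = 100 * (1 - 10 ** -n)
        budget = allowed_downtime_minutes(n)
        print(f"{n} nines ({slo_percent:.3f}%): {budget:.2f} minutes per 30 days")
```

Each additional nine divides the budget by ten: roughly 432 minutes per 30 days at 99%, about 43 minutes at 99.9%, and a little over 4 minutes at 99.99%.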

Patching around a C++ crash with a little bit of Lua

In this incident story, the feature flags were served by the main application server. When a new feature caused the server to crash, there was no way to flip the flag back off to stop the crashes.

rachelbythebay

Set Taxonomies to Neutral

The author of a classification system for human error reflects 20 years later on the harm that such systems can cause by using deficit-based language.

Dr. Steven Shorrock

Post Mortem on VOID Report: Cloudflare Control Plane and Analytics Outage

Here’s Fred Hebert’s analysis of Cloudflare’s write-up of their incident on November 2.

I’m hoping they’re going to do a more in-depth review.

Fred Hebert — VOID

Integrating manual with automatic instrumentation

In this post, we introduce a hybrid approach that seamlessly combines the precision of manual instrumentation with the comfort, efficiency, and performance of automatic instrumentation.

Ron Federman — Odigos

The Swedbank Outage shows that Change Controls don’t work

Change is not the problem. It’s unaddressed risk.

Bruce Johnston — High Scalability

Production Postmortem: The Spawn of Denial of Service

A shell script with a loop running a DB client can fill up your ephemeral ports in a hurry.

Oren Eini — RavenDB
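
A rough back-of-the-envelope sketch of why this happens "in a hurry": each short-lived client connection holds a local port for a while after it closes. The defaults below (the Linux ephemeral range 32768 to 60999 and a 60-second TIME_WAIT) are assumptions about a typical Linux client, not figures from the linked postmortem, and the function name is hypothetical.

```python
def max_sustained_connection_rate(
    port_low: int = 32768,        # assumed default start of the ephemeral range
    port_high: int = 60999,       # assumed default end of the ephemeral range
    time_wait_seconds: int = 60,  # assumed time a closed connection holds its port
) -> float:
    """Connections per second to one destination before ports run out.

    Assumes every connection is closed by the client, so its local port
    sits in TIME_WAIT for `time_wait_seconds` before it can be reused.
    """
    usable_ports = port_high - port_low + 1
    return usable_ports / time_wait_seconds


if __name__ == "__main__":
    rate = max_sustained_connection_rate()
    print(f"Roughly {rate:.0f} new connections per second can be sustained")
```

Under these assumptions, a loop that opens and closes more than a few hundred connections per second to the same server address and port would run out of local ports within about a minute.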

Writing Code is the Same Thing as Writing Prose

When you get right down to it, it’s all human communication, even assembly code. It’s human factors all the way down.

Michael Hart

