SRE Weekly Issue #405

Using the Swedish word “Lagom” as a jumping-off point, this article explains the importance of choosing an SLO that is just right: not too lax and not too strict.

Alex Ewerlöf

Our Journey Migrating to AWS IMDSv2

A simple security change like ceasing to use IMDSv1 can involve profound risk and necessitate a major migration process.

Archie Gunasekara — Slack

Why People Should Be at the Heart of Operational Resilience

It can be all too easy to let a subset of your IT organization “handle” resiliency. If resilience is about an ability to adapt and respond to change, then it needs broad buy-in.

Richard Gall — The New Stack

Any change can break us, but we can’t treat every change the same

If any seemingly innocuous change can break our systems, what should we do?

Lorin Hochstein

Human Performance in the Spotlight: ‘Human Error’ and ‘Honest Mistakes’

What exactly is “human error”?

Steven Shorrock — Humanistic Systems

Zero downtime Postgres upgrades

We recently upgraded from Postgres 11.9 to 15.3 with zero downtime by using logical replication, a suite of support scripts, and tools in Elixir & Erlang’s BEAM virtual machine.

They share a ton of details about how they did it.

Brent Anderson — Knock

RHIP, doctors, and pagers

Why do doctors still use antiquated pagers? There’s a lot here that speaks to what it’s really like to operate in an on-call environment, and how to evaluate new tools.

Fred Hebert

Beyond Murphy’s Law

This article riffs on Murphy’s law, using anecdotes to explore the various ways things go wrong.

Bertrand Florat
