SRE Weekly Issue #405

A message from our sponsor, FireHydrant:

In this episode of FireHydrant’s Gimme 5 video series, Asaf Gaon, Director of Technical Support for automated grocery fulfillment solution Takeoff Technologies, talks about how to handle third-party downtime in a collaborative – and automated – way.
https://firehydrant.com/blog/gimme-5-with-takeoff-technologies-asaf-gaon/

Using the Swedish word “Lagom” as a jumping-off point, this article explains the importance of choosing an SLO that is just right: not too lax and not too strict.

  Alex Ewerlöf

A simple security change like ceasing to use IMDSv1 can involve profound risk and necessitate a major migration process.

  Archie Gunasekara — Slack

It can be all too easy to let a subset of your IT organization “handle” resiliency. If resilience is about an ability to adapt and respond to change, then it needs broad buy-in.

  Richard Gall — The New Stack

If any seemingly innocuous change can break our systems, what should we do?

  Lorin Hochstein

What exactly is “human error”?

  Steven Shorrock — Humanistic Systems

“We recently upgraded from Postgres 11.9 to 15.3 with zero downtime by using logical replication, a suite of support scripts, and tools in Elixir & Erlang’s BEAM virtual machine.”

The authors share a ton of details about how they did it.

  Brent Anderson — Knock

Why do doctors still use antiquated pagers? There’s a lot here that speaks to what it’s really like to operate in an on-call environment, and how to evaluate new tools.

  Fred Hebert

This article riffs on Murphy’s law, exploring various aspects of how things go wrong using anecdotes.

  Bertrand Florat
