SRE Weekly Issue #414

A message from our sponsor, FireHydrant:

91% of engineering leaders say they want a better alerting tool. The other 9% couldn’t take the survey on their Blackberry. Meet Signals: a new standard in alerting and on call, now available.
https://firehydrant.com/blog/alerting-and-on-call-scheduling-for-how-you-actually-work/

This year’s VOID Report is out, and it’s well worth a read. The subtitle is “Exploring the Unintended Consequences of Automation in Software”, which is a really good way to get me to read something!

  Courtney Nash — The VOID

A Terraform change deleted a critical resource, and reviewers missed it because the plan was so big. Now they use Atlantis and Open Policy Agent to avoid accidental deletions of critical resources.

  Lin Du — InfoQ
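
To give a feel for the kind of check described above, here is a minimal sketch that scans a plan exported with `terraform show -json` and lists the protected resources it would delete. This is not the Atlantis and Open Policy Agent setup from the article; the `PROTECTED_TYPES` set and the `find_protected_deletions` name are hypothetical, chosen only for illustration.

```python
import json

# Hypothetical set of resource types that should never be deleted without extra review.
PROTECTED_TYPES = {"aws_db_instance", "aws_s3_bucket"}


def find_protected_deletions(plan_json: str) -> list[str]:
    """Return addresses of protected resources the plan would delete.

    Expects the JSON produced by `terraform show -json <planfile>`, which lists
    planned changes under the top-level "resource_changes" key.
    """
    plan = json.loads(plan_json)
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        # A replacement appears as both "delete" and "create", so it is flagged too.
        if "delete" in actions and rc.get("type") in PROTECTED_TYPES:
            flagged.append(rc["address"])
    return flagged
```

A CI step could fail the build whenever this returns a non-empty list, so a deletion buried in a very large plan still gets a human's attention.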

When analyzing an incident, what can we learn when we assume that everyone did everything as well as possible?

  Lorin Hochstein

onsite technicians performing this planned network maintenance inadvertently unplugged several fibers that were adjacent to those in the work order, but still in use for production traffic

  Google

There’s a huge difference between four and five nines. This article includes an especially interesting quote suggesting that Google doesn’t think five nines is attainable in a commercial service.

  Diana Bocco — UptimeRobot
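
To put numbers on that difference, here is a small sketch of the usual downtime-budget arithmetic. It assumes a 365-day year and counts any unavailability against the budget; the function name is made up for this example and does not come from the article.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability of N nines (4 means 99.99%)."""
    unavailability = 10 ** -nines  # e.g. 4 nines -> 0.0001
    return MINUTES_PER_YEAR * unavailability


for n in (3, 4, 5):
    print(f"{n} nines: {downtime_minutes_per_year(n):.2f} minutes of downtime per year")
```

Under these assumptions, four nines allows roughly 52.6 minutes of downtime a year, while five nines allows only about 5.3 minutes.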

Here’s an interview with three SREs about what it’s like to be an SRE at IBM.

  IBM

I’ve been hearing about Observability 2.0 but didn’t know what it was all about. This article explains what it is and how it can help with cost.

  Charity Majors — Honeycomb
  Full disclosure: Honeycomb is my employer.

A cute little video pep talk for SREs. The site is actually real, too!

  Krazam

Like a mini Y2K, leap day came around again and left some technical glitches in its wake, as chronicled in this article.

  Gergely Orosz — The Pragmatic Engineer
