SRE Weekly Issue #325

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating incident channel, Jira and Zoom, paging and adding responders, postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly lego set):
https://rootly.com/demo/

Articles

This article really upends the concept of “human error”, in an intriguing way.

  Lorin Hochstein

A key part of building reliable systems is often overlooked: continuously learning.

In the highly dynamic CI/CD environment, engineers with stale or outdated knowledge of the system are less able to detect, diagnose or repair anomalous behavior in their systems […]

  Laura Maguire (jeli.io) — The New Stack

This is the story of how an organization transitioned from a single NOC-like on-call team to individualized alerts routed to the relevant team.

  Boris Cherkasky

This guide has a set of key factors you should consider when building a new SRE team in order to increase the likelihood of success.

  Ash P — SREPath

My favorite pitfall discussed in this article: avoid committing to every possible remediation action from every incident.

  incident.io

This article, written by one of the authors of the Google SRE book, is a critical look at the state of SRE and what the future holds.

Today, I believe we cannot successfully answer several key questions about SRE.

  Niall Murphy

This interview goes into the thorny challenges around building a reliable app based on third-party services. It delves into the lack of reliable reporting we commonly see from cloud service providers and what ideal reporting would look like.

  Jeff Martens (interviewing Jeff Smith) — Metrist

Outages

Spotify

TLS certificate expiration.

Solana
Square
Australian Taxation Office
Google Cloud Platform us-east1