SRE Weekly Issue #325

Articles

This article really upends the concept of “human error”, in an intriguing way.

Lorin Hochstein

Readiness to Learn: Safely and Reliably Deploy to the Cloud

A key part of building reliable systems is often overlooked: continuous learning.

In the highly dynamic CI/CD environment, engineers with stale or outdated knowledge of the system are less able to detect, diagnose or repair anomalous behavior in their systems […]

Laura Maguire (jeli.io) — The New Stack

My Big Fat Monolithic Alert

This is the story of how an organization transitioned from a single NOC-like on-call team to individualized alerts routed to the relevant team.

Boris Cherkasky

Starting an SRE team from scratch [Quick Guide]

This guide lays out the key factors to consider when building a new SRE team, to improve its chances of success.

Ash P — SREPath

Incident postmortem pitfalls

My favorite pitfall discussed in this article: avoid committing to every possible remediation action from every incident.

incident.io

What SRE could be

This article, written by one of the authors of the Google SRE book, is a critical look at the state of SRE and what the future holds.

Today, I believe we cannot successfully answer several key questions about SRE.

Niall Murphy

DevOps Expert Jeff Smith On Third-Party Reliability Data

This interview goes into the thorny challenges around building a reliable app based on third-party services. It delves into the lack of reliable reporting we commonly see from cloud service providers and what ideal reporting would look like.

Jeff Martens (interviewing Jeff Smith) — Metrist

Outages

Spotify

TLS certificate expiration.

Solana
Square
Australian Taxation Office
Google Cloud Platform us-east1