SRE Weekly Issue #328

Articles

Cloudflare outage on June 21, 2022

Less than 12 hours after their outage, Cloudflare posted this detailed rundown of what happened.

Tom Strick and Jeremy Hartman — Cloudflare

Metastable failures occur in open systems with an uncontrolled source of load where a trigger causes the system to enter a bad state that persists even when the trigger is removed.

Marc Brooker
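
To make the definition above concrete, here is a toy simulation of a metastable failure. It is only a sketch under invented assumptions, not a model taken from the article: every failed request is retried on the next tick (the uncontrolled load), and an overloaded server wastes work so its goodput shrinks as offered load grows. The function name, parameters, and numbers are all hypothetical.

```python
def simulate(ticks=40, base_load=80.0, capacity=100.0,
             trigger_start=5, trigger_end=10, degraded_capacity=50.0):
    """Toy model: return a list of (tick, offered_load, served) tuples."""
    retries = 0.0
    history = []
    for t in range(ticks):
        # The trigger: capacity drops for a few ticks, then fully recovers.
        in_trigger = trigger_start <= t < trigger_end
        cap = degraded_capacity if in_trigger else capacity
        offered = base_load + retries
        if offered <= cap:
            served = offered
        else:
            # Assumption: overload wastes work, so goodput falls as load rises.
            served = cap * cap / offered
        # Assumption: every failed request is retried on the next tick.
        retries = offered - served
        history.append((t, offered, served))
    return history


if __name__ == "__main__":
    for t, offered, served in simulate():
        print(f"t={t:2d} offered={offered:8.1f} served={served:6.1f}")
```

In this toy model the system serves all 80 requests per tick before the trigger, but after capacity returns to 100 the accumulated retries keep offered load far above capacity, so the bad state persists. Calling it with `trigger_start=0, trigger_end=0` (no trigger) stays healthy throughout.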

The Good, The Bad, And Alerting on Derivatives

By “derivatives”, the author means rate-of-change, like Prometheus’s irate(). Derivatives have their place, but this article has good reasons to reconsider using them for alerts.

Boris Cherkasky
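
For readers unfamiliar with rate-of-change functions, the sketch below approximates what an irate()-style calculation does: it looks only at the last two samples of a counter. This is an illustrative approximation, not Prometheus's implementation, and the function name and sample data are made up.

```python
def instant_rate(samples):
    """Per-second rate from the last two (timestamp_seconds, counter) samples."""
    if len(samples) < 2:
        return None
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    if t1 <= t0:
        return None
    # On a counter reset, treat the new value as the increase since the reset.
    delta = v1 - v0 if v1 >= v0 else v1
    return delta / (t1 - t0)


if __name__ == "__main__":
    # Hypothetical counter scraped every 15 seconds, with one late burst.
    samples = [(0, 0), (15, 150), (30, 300), (45, 1200)]
    print(instant_rate(samples))                   # uses only the last two samples
    print(samples[-1][1] / samples[-1][0])         # average over the whole window
```

Because only the final pair of samples counts, a single burst dominates the result, while the average over the whole window moves far less.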

How to Adopt an SRE Practice (When You’re not Google)

In this article, I’ll dive into what it takes to get into site reliability engineering, how to adopt it within your own organization and some of the core principles and best practices you’ll need to keep in mind as you move forward in your SRE maturity journey.

Jemiah Sius — devops.com

Tech Interview Questions: Typing “google.com” in a browser

I have given and received this question in many SRE interviews, and it’s famously used by Google in their interviews. This article dissects the question and its merits and downsides for the benefit of both interviewers and interviewees.

Will Gallego

Outages

Cloudflare

Cloudflare had a major outage, taking many sites and services with it.
