SRE Weekly Issue #328

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating incident channel, Jira and Zoom, paging and adding responders, postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly lego set):
https://rootly.com/demo/

Articles

Less than 12 hours after their outage, Cloudflare posted this detailed run-down of what happened.

  Tom Strick and Jeremy Hartman — Cloudflare

Metastable failures occur in open systems with an uncontrolled source of load where a trigger causes the system to enter a bad state that persists even when the trigger is removed.

  Marc Brooker

By “derivatives”, the author means rate-of-change, like Prometheus’s irate(). Derivatives have their place, but this article has good reasons to reconsider using them for alerts.

  Boris Cherkasky

In this article, I’ll dive into what it takes to get into site reliability engineering, how to adopt it within your own organization and some of the core principles and best practices you’ll need to keep in mind as you move forward in your SRE maturity journey.

  Jemiah Sius — devops.com

I have given and received this question in many SRE interviews, and it’s famously used by Google in their interviews. This article dissects the question and its merits and downsides for the benefit of both interviewers and interviewees.

  Will Gallego

Outages

Cloudflare

Cloudflare had a major outage, taking many sites and services with it.
