SRE Weekly Issue #420

A message from our sponsor, FireHydrant:

FireHydrant is now AI-powered for faster, smarter incidents! Power up your incidents with auto-generated real-time summaries, retrospectives, and status page updates.
https://firehydrant.com/blog/ai-for-incident-management-is-here/

The game Last Epoch launched in February, and they had a rocky start. This huge retrospective post tells the story of what happened and how they fixed it.

  EHG_Kain — Last Epoch

Cloudflare’s Phoenix system can find and recover failed servers, reducing toil.

  Jet Mariscal, Aakash Shah, and Yilin Xiong — Cloudflare

More than just another glossary of SL*s, this one also has examples and best practices.

  Sara Miteva — Checkly

Spurred by a question in the SRECon attendee survey, this one really gets you thinking: how does the current “generation” of SREs differ from those that came before?

  Paige — PagerDuty

This one’s about finding out what execs need in incidents and figuring out how to get everyone’s needs met.

  Chris Evans — incident.io

This post explains how Cloudflare gathers information about their alerts and improves them to benefit reliability and on-call health.

  Monika Singh — Cloudflare

This one contains formulas for calculating compound SLOs when downstream dependencies are parallel or serial.

  Alex Ewerlöf
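
For a rough feel of the math involved, here is a minimal sketch using the textbook availability formulas. It assumes dependencies fail independently, and that "parallel" means redundant (any one dependency being up is enough). The article's own formulas may differ, and the function names are made up for illustration.

```python
from math import prod


def serial_availability(availabilities):
    """Every dependency must be up, so the availabilities multiply.

    Assumes failures are independent (an illustrative simplification).
    """
    return prod(availabilities)


def parallel_availability(availabilities):
    """Redundant dependencies: the whole is down only if all are down.

    Assumes failures are independent (an illustrative simplification).
    """
    return 1 - prod(1 - a for a in availabilities)


if __name__ == "__main__":
    # Two 99.9% services in series give a lower compound number.
    print(f"serial:   {serial_availability([0.999, 0.999]):.6f}")
    # Two 99% replicas in parallel give a higher compound number.
    print(f"parallel: {parallel_availability([0.99, 0.99]):.6f}")
```

The takeaway under these assumptions: chaining dependencies in series pushes the achievable SLO below that of the weakest link, while redundancy pushes it above that of the strongest one.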
