SRE Weekly Issue #382

A message from our sponsor, Rootly:

Eliminate the anxiety around declaring an incident for nebulous problems by introducing a triage phase into your incident management process. Our latest blog post dives into why the triage phase is so important, and how you can automate yours with Rootly.

Read more on the Rootly blog:


The Linux OOM killer can already be a bugbear, and things only get more complicated when you add containers to the mix.

  Rafał Korepta — RedPanda

This post explores how to align platform and product engineering teams by implementing business value proxy metrics and using incidents to inform them.

The same metrics that we use to measure other initiatives against business priorities may be able to show us whether our incident response process is effective.

  Gonzalo Maldonado — FireHydrant

Here’s another take on DevOps vs. SRE, using the metaphor of organizing a party.

  Diogo Souza

How do you balance taking advantage of the acceleration and innovation of AI while not compromising reliability and losing users?

  Jim Gochee — The New Stack

My favorite part is the bit about the risks of automation and keeping humans in the loop.

  Dr. Mica Endsley — Business News This Week

It’s about reliability: IaC changes carry just as much risk to reliability as product code changes, if not more. How can we bring feature flags to IaC?

  Josephine E. Justin, Srikanth Murali, and Norton Stanley S A — DZone

Oh, the tangled web we weave when we send automated emails.

  Amin Astaneh — Certo Modo

Here are four things we learned while scaling up Presto to Meta scale, and some advice if you’re interested in running your own queries at scale.

  High Scalability

