SRE Weekly Issue #382

Articles

Solving challenges caused by Out Of Memory (OOM) Killer in Linux

The Linux OOM killer can already be a bugbear, and things only get more complicated when you add containers to the mix.

Rafał Korepta — RedPanda

Align platform and product engineering teams over incidents

This post explores how to align platform and product engineering teams by implementing business value proxy metrics and using incidents to inform them.

The same metrics that we use to measure other initiatives against business priorities may be able to show us whether our incident response process is effective.

Gonzalo Maldonado — FireHydrant

DevOps vs SRE: Is it a party?

Here’s another take on DevOps vs SRE, using the metaphor of organizing a party.

Diogo Souza

Embrace AI Acceleration by Investing in Reliability

How do you balance taking advantage of the acceleration and innovation of AI while not compromising reliability and losing users?

Jim Gochee — The New Stack

“Human Error” is the Scapegoat for Systemic and Organizational Failures

My favorite part is the bit about the risks of automation and keeping humans in the loop.

Dr. Mica Endsley — Business News This Week

Revolutionizing Infrastructure Management: The Power of Feature Flags in IaC

It’s about reliability: IaC changes carry just as much risk to reliability as product code changes, if not more. How can we bring feature flags to IaC?

Josephine E. Justin, Srikanth Murali, and Norton Stanley S A — DZone

On-Call Stories: Flying Blind

Oh, the tangled web we weave when we send automated emails.

Amin Astaneh — Certo Modo

Lessons Learned Running Presto at Meta Scale

Here are four things we learned while scaling up Presto to Meta scale, and some advice if you’re interested in running your own queries at scale.

High Scalability
