SRE Weekly Issue #311

I’m dedicating this issue to the people of Ukraine, and also those in Russia that are protesting the invasion.

Articles

Easing Into Incident Command With Iris Carrera

This episode of the Page it to the Limit podcast covers learning how to be an incident commander.

There was major AWS outage and the second day I was incident command.

Kat Gaines, with guest Iris Carrera — Page it to the Limit

This article discusses three aspects of fully owning your systems: mandate, knowledge, and responsibility. After defining those terms, it goes on to discuss what happens if one of the three is missing.

Alex Ewerlöf

Rapid Event Notification System at Netflix

I really like the “Managing High RPS” section, especially the part about ignoring events if they’re too old to be relevant any longer.

Ankush Gulati and David Gevorkyan — Netflix
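The idea of ignoring events that are too old to matter can be sketched in a few lines. This is only an illustration, not Netflix's implementation: the Event type, the drop_stale helper, and the 60-second cutoff are all hypothetical names and values chosen for the example.

```python
import time
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical staleness cutoff; a real system would tune this per event type.
MAX_AGE_SECONDS = 60.0


@dataclass
class Event:
    name: str
    created_at: float  # Unix timestamp in seconds


def is_stale(event: Event, now: float, max_age: float = MAX_AGE_SECONDS) -> bool:
    """Return True if the event is older than max_age seconds at time `now`."""
    return now - event.created_at > max_age


def drop_stale(events: List[Event], now: Optional[float] = None,
               max_age: float = MAX_AGE_SECONDS) -> List[Event]:
    """Keep only events that are still fresh enough to be worth processing."""
    if now is None:
        now = time.time()
    return [e for e in events if not is_stale(e, now, max_age)]
```

Passing the current time in explicitly (the now parameter) keeps the filter easy to test; a consumer would typically call drop_stale on each batch before doing any expensive work.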

Hodor: Detecting and addressing overload in LinkedIn microservices

Cool idea! When a process is overloaded, the system drops requests based on heuristics until the overload condition has passed.

Bryan Barkley — LinkedIn
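To make the load-shedding idea concrete, here is a minimal sketch. It does not reproduce Hodor's heuristics; it swaps in the simplest possible one, a fixed cap on in-flight requests, and the LoadShedder class and its method names are hypothetical.

```python
class LoadShedder:
    """Reject new requests while too many are already in flight.

    Assumption: "overload" is approximated by a fixed concurrency cap.
    Real systems typically use richer signals than this.
    """

    def __init__(self, max_in_flight: int) -> None:
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.dropped = 0

    def try_acquire(self) -> bool:
        """Admit a request if under the cap; otherwise count it as dropped."""
        if self.in_flight >= self.max_in_flight:
            self.dropped += 1
            return False
        self.in_flight += 1
        return True

    def release(self) -> None:
        """Mark one admitted request as finished."""
        if self.in_flight > 0:
            self.in_flight -= 1
```

A request handler would call try_acquire before doing work, return an error such as HTTP 503 when it fails, and call release when an admitted request completes. Once load falls below the cap, requests are admitted again without any manual reset. This sketch is not thread-safe; a concurrent server would need a lock around the counters.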

Incident severity and priority 101

Here’s another take on incident severity and priority levels. The two terms are different and mean specific things.

Robert Ross — FireHydrant

Renaming SRE outage “post-mortems” for psychological safety

Can we please agree to stop calling them “postmortems”?

Ash P — Cruform Newsletter

The origin of Service Level Objectives

The term “service level” goes back to the US highway system maintenance procedures, among others.

Akshay Chugh and Piyush Verma — Last9

The Truth About “MEH-TRICS”

Charity Majors has railed against metrics for years. Now, her company Honeycomb has a metrics product offering. How does she square it?

Charity Majors — Honeycomb

Resilience: Cloudy without a chance of meatballs

Despite the December AWS outage, folks aren’t fleeing AWS, and multi-cloud designs for reliability still don’t make sense, according to this cloud consultant. The media angle is fascinating.

Lydia Leong — Cloud Pundit

Incident Analysis 101: Interviewing – How to determine who to interview

This article has a great list of ideas for who to interview, plus a section on how to prioritize when you’re short on time.

Daniela Hurtado — Jeli

Outages

Slack

They posted a followup with details on what happened.

A configuration change inadvertently led to a sudden increase in activity on our database infrastructure.

crates.io (Rust package repository)
British Airways
Truth Social
Peloton
Truth Social

Due to the overwhelming demand at launch, we are currently rate-limited on onboarding new users to the platform.
