SRE Weekly Issue #311

I’m dedicating this issue to the people of Ukraine, and also those in Russia that are protesting the invasion.

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating incident channel, Jira and Zoom, paging the right team, postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly shirt):
https://rootly.com/demo/?utm_source=sreweekly

Articles

In this episode of the podcast Page it to the Limit, they discuss learning how to be an incident commander.

There was a major AWS outage and the second day I was incident command.

  Kat Gaines, with guest Iris Carrera — Page it to the Limit

This article discusses three aspects of fully owning your systems: mandate, knowledge, and responsibility. After defining those terms, it goes on to discuss what happens if one of the three is missing.

  Alex Ewerlöf

I really like the “Managing High RPS” section, especially the part about ignoring events if they’re too old to be relevant any longer.

  Ankush Gulati and David Gevorkyan — Netflix

Cool idea! When a process is overloaded, the system drops requests based on heuristics until the overload condition has passed.

  Bryan Barkley — LinkedIn

Here’s another take on incident severity and priority levels. The two terms are different and mean specific things.

  Robert Ross — FireHydrant

Can we please agree to stop calling them “postmortems”?

  Ash P — Cruform Newsletter

The term “service level” goes back to the US highway system maintenance procedures, among others.

  Akshay Chugh and Piyush Verma — Last9

Charity Majors has railed against metrics for years. Now, her company Honeycomb has a metrics product offering. How does she square it?

  Charity Majors — Honeycomb

Despite the December AWS outage, folks aren’t fleeing AWS, and multi-cloud designs for reliability still don’t make sense, according to this cloud consultant. The media angle is fascinating.

  Lydia Leong — Cloud Pundit

This article has a great list of ideas of who to talk to, plus a section on how to prioritize when you’re short on time.

  Daniela Hurtado — Jeli

Outages

Slack

They posted a followup with details on what happened.

A configuration change inadvertently led to a sudden increase in activity on our database infrastructure.

crates.io (Rust package repository)
British Airways
Truth Social
Peloton
Truth Social

Due to the overwhelming demand at launch, we are currently rate-limited on onboarding new users to the platform.
