SRE Weekly Issue #415

A message from our sponsor, FireHydrant:

Join FireHydrant and talk shop with your DevOps peers on March 28! You’ll gain a better understanding of what makes a fatigue-free on-call culture and how to implement practices to improve yours at this free, virtual roundtable.
https://app.livestorm.co/firehydrant/better-incidents-spring-bonfire-secrets-to-fatigue-free-on-call-in-2024

[…] it must be said that the intent of these metrics was always to give an indicator of how well your team was delivering software, not a high-stakes metric that should be used, for example, to hire and fire team leads.

  Nočnica Mellifera — The New Stack

A primer on the problems with N+1 database queries and how this pattern can sneak into your code whether you realize it or not.

  neda — ReadySet
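
For readers new to the pattern, here is a minimal sketch of what an N+1 query looks like next to a single-query alternative. It uses Python's built-in sqlite3 module, and the authors/posts schema, the sample rows, and the function names are all hypothetical, made up for illustration rather than taken from the linked article.

```python
import sqlite3


def build_db():
    """Create a small in-memory database (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
        INSERT INTO authors VALUES (1, 'ana'), (2, 'bo'), (3, 'cy');
        INSERT INTO posts VALUES (1, 1, 'a1'), (2, 1, 'a2'), (3, 2, 'b1');
        """
    )
    return conn


def titles_n_plus_one(conn):
    """One query for the authors, then one more query per author (N+1)."""
    queries = 0
    authors = conn.execute("SELECT id, name FROM authors ORDER BY id").fetchall()
    queries += 1
    result = {}
    for author_id, name in authors:
        rows = conn.execute(
            "SELECT title FROM posts WHERE author_id = ? ORDER BY id",
            (author_id,),
        ).fetchall()
        queries += 1
        result[name] = [title for (title,) in rows]
    return result, queries


def titles_single_query(conn):
    """Fetch the same data with a single LEFT JOIN."""
    rows = conn.execute(
        """
        SELECT a.name, p.title
        FROM authors a
        LEFT JOIN posts p ON p.author_id = a.id
        ORDER BY a.id, p.id
        """
    ).fetchall()
    result = {}
    for name, title in rows:
        result.setdefault(name, [])
        if title is not None:  # authors without posts yield a NULL title
            result[name].append(title)
    return result, 1
```

Both functions return the same mapping of author to post titles. The difference is that the first one issues a query count that grows with the number of authors, which is how the pattern tends to hide behind an innocent-looking loop or ORM attribute access.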

A great explainer on choosing the right SLIs, starting with the Golden Signals and branching out.

  Tyler Treat

My favorite part about this is the “latency budget” question — which team’s code gets to spend how much time doing its part to serve a request?

  Alex Ewerlöf
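
To make the "latency budget" idea concrete, here is a small arithmetic sketch. It assumes a hypothetical end-to-end target and per-component allocations in milliseconds. The component names and the two helper functions are illustrative only and do not come from the linked article.

```python
def unallocated_budget(total_ms, allocations_ms):
    """Return the part of the end-to-end budget not yet assigned to any component."""
    return total_ms - sum(allocations_ms.values())


def over_budget(allocations_ms, observed_ms):
    """Return the sorted names of components whose observed latency exceeds their allocation."""
    return sorted(
        name
        for name, used in observed_ms.items()
        if used > allocations_ms.get(name, 0)
    )
```

For example, with a 300 ms target split as edge 20, api 80, and db 150, 50 ms remain unallocated, and an observed api latency of 95 ms would flag that component as over its share.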

Changes in two programs outside the container made Ceph suddenly grind to a halt, as detailed in this troubleshooting story.

  Vladimir Guryanov — Palark

The word “one” is the key here, as the author argues for getting rid of “warning” alerts entirely in favor of using only “critical”.

  Gauthier François

They wrote a Slack bot to summarize open PagerDuty incidents every day.

  Matt Weingarten

The problems I’ll explore in this blog—from the SRE perspective—are about time pressures (when to ship the investigation) and the type of report people expect.

  Fred Hebert — Honeycomb

  Full disclosure: Honeycomb is my employer.

In order to reduce the noise, first they had to define noisy alerts and the KPIs they were looking to improve.

  Gauthier François — Doctolib
