SRE Weekly Issue #376

A message from our sponsor, Rootly:

Curious how companies like Figma, Tripadvisor, and 100s of others leverage Rootly to manage incidents in Slack and unlock instant best practices? Check out this lightning demo:
https://www.loom.com/share/051c4be0425a436e888dc0c3690855ad

Articles

With 100 workstreams and over 500 engineers engaged, this was the biggest incident response I’ve read about in years.

We had to force ourselves to identify the facts on the ground instead of “what ought to be,” and overrule our instincts to look for data in the places we normally looked (since our own monitoring was impacted).

  Laura de Vesine — Datadog

When you unify these three “pillars” into one cohesive approach, new ways of understanding the full state of your system emerge.

  Danyel Fisher — The New Stack
  Full disclosure: Honeycomb, my employer, is mentioned.

This report details the 10-hour incident response following the accidental deletion of live databases (rather than their snapshots, as intended).

  Eric Mattingly — Azure

Neat trick: write your alerts in English and get GPT to convert them to real alert configurations.

  Shahar and Tal — Keep (via Hacker News)
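To make the idea concrete, here’s a minimal sketch of the general approach — not Keep’s actual implementation — using the OpenAI Python SDK to turn a plain-English description into a Prometheus-style alerting rule. The model name, prompt, and example alert are placeholders of my own.

# Minimal sketch: plain-English alert description -> Prometheus-style rule via an LLM.
# Not Keep's implementation; assumes the `openai` SDK is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You translate plain-English alert descriptions into Prometheus alerting rules. "
    "Respond with valid YAML only: a single rule with `alert`, `expr`, `for`, "
    "`labels`, and `annotations` fields."
)

def english_to_alert_rule(description: str) -> str:
    """Return the model's YAML rendition of the described alert."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": description},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(english_to_alert_rule(
        "Page the on-call if p99 checkout latency stays above 2 seconds for 10 minutes."
    ))

In practice you’d want to validate the returned YAML (and the PromQL expression inside it) before loading it into your alerting stack.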

If your DNS resolver is responsible for handling queries for both internal and external domains, what happens when external DNS requests fail? Can internal ones still proceed?

  Chris Siebenmann
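One rough way to poke at this question — my own sketch, not from Chris’s post — is to query the same resolver for an internal-only name and an external name with a short timeout and see whether they fail together. This assumes the dnspython package; the resolver address and hostnames are placeholders.

# Rough probe (not from the article): does the shared resolver still answer
# internal queries when external resolution is failing?
# Assumes `dnspython`; resolver IP and hostnames are placeholders.
import dns.resolver
import dns.exception

resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = ["10.0.0.53"]   # the shared internal/external resolver
resolver.lifetime = 2.0                # fail fast instead of hanging

def probe(name: str) -> str:
    try:
        answer = resolver.resolve(name, "A")
        return f"{name}: OK ({answer[0].address})"
    except dns.exception.DNSException as exc:
        return f"{name}: FAILED ({type(exc).__name__})"

if __name__ == "__main__":
    print(probe("intranet.corp.example"))   # internal-only zone
    print(probe("www.example.com"))         # requires external resolution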

This article explains the potential pitfalls and downsides of observability tools, the ways vendors may try to steer you toward them, and tips for avoiding the traps.

  David Caudill

Too often, we dismiss the anomaly we just faced in an incident as a weird, one-off occurrence. And while that specific failure mode likely will be a one-off, we’ll be faced with new anomalies in the future.

  Lorin Hochstein — Surfing Complexity
