SRE Weekly Issue #285

Articles

What’s so great about this incident write-up is the way it shows how entrenched mental models hampered the incident response. There’s so much to learn here.

Ray Ashman — Mailchimp

Rethinking Best Practices

The parallels between this and the Mailchimp article are striking.

Will Gallego

How to Improve Upon Google’s Four Golden Signals of Monitoring

This includes a review of the four golden signals and presents three areas to go further.

JJ Tang — Rootly

Root cause of failure, root cause of success

This one thoughtfully discusses why “root cause” is a flawed concept, approaching the idea from multiple directions.

Lorin Hochstein

IBM PREVAIL Conference: October 19–21, 2021

Check it out, a new SRE conference! This one’s virtual and the CFP is open until October 1.

Robert Barron — IBM

Notes on the Perfidy of Dashboards

To be clear, this article is about static dashboards that just contain pre-set graphs of specific metrics.

every dashboard is an answer to some long-forgotten question

Charity Majors

What makes public posts about incidents different from analysis write-ups

Public incident posts give us useful insight into how companies analyze their incidents, but it’s important to remember that they’re almost never the same as internal incident write-ups.

John Allspaw — Adaptive Capacity Labs

Heroku Incident #2300 Follow-Up

In this incident from July 7, front-line routing hosts exceeded their file descriptor limits, causing requests to be delayed and dropped.

Heroku

TLDs — Putting the ‘.fun’ in the top of the DNS

.io, assigned to the British Indian Ocean Territory, is almost exclusively used by annoying startups for content completely unrelated to the islands.

Remember, it’s all fun and games until the random country you’ve attached your business to has an outage in their TLD DNS infrastructure.

Jan Schaumann

Why Observability Requires a Distributed Column Store

If, like me, you were curious about just what a columnar data store is, this article is a good introduction.

Alex Vondrak — Honeycomb

Outages

Google Voice
Heroku
National Health Service (UK)
Boston Public Library