SRE Weekly Issue #295

Articles

I love this crystal clear argument based on statistics and research. MTTR as a metric is simply meaningless.

Courtney Nash — Verica

Five steps to better customer communication

Their steps for better communication during an outage:

Provide context to minimise speculation
Explain what you’re doing to demonstrate you’re ‘on it’
Set some expectations for when things will return to normal
Tell people what they should do
Let folks know when you’ll be updating them next

Chris Evans — incident.io

Heroku Incident 2365 Follow-Up

Despite checking in advance to be sure their systems would support the new Let’s Encrypt certificate chain, they ran into trouble.

[…] we discovered that several HTTP client libraries our systems use were using their own vendored root certificates.

Heroku

Multicloud failover is almost always a terrible idea

This is the best case I’ve seen yet against multi-cloud infrastructure. I really like the airline analogy.

Lydia Leong

An Update on Our Outage – Roblox

Roblox had a major, several-day outage starting on October 28. I don’t usually include game outages in the Outages section since they’re so common and there’s not usually much to learn from them, but I sure do like a good post-incident report. Thanks, folks!

David Baszucki — Roblox

40 Ms Bug

When you’re sending small TCP packets, two optimizations can conspire to introduce an artificial 40 millisecond (not megasecond…) delay.

Vorner
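
For readers who want to see what a fix can look like: the blurb above doesn't name the two optimizations, so this sketch assumes they are Nagle's algorithm on the sender and delayed ACKs on the receiver, the usual pairing behind this kind of delay. Under that assumption, one common mitigation is to disable Nagle's algorithm with the standard TCP_NODELAY socket option. The helper name make_low_latency_socket is made up for illustration.

```python
import socket


def make_low_latency_socket() -> socket.socket:
    """Create a TCP socket with Nagle's algorithm disabled (hypothetical helper)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY=1 tells the kernel to send small writes immediately instead
    # of holding them back until earlier data has been acknowledged.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


sock = make_low_latency_socket()
# Read the option back to confirm the kernel accepted it (non-zero means on).
nodelay_enabled = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
sock.close()
```

The trade-off is more, smaller packets on the wire, so batching writes in the application is often the better fix when it's practical.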

Google Incident report — Meet

Here’s Google’s follow-up report for their October 25-26 Meet outage.

/r/sre — How to deal with retries in SLIs

Should you count failed requests toward your SLI if the client retries and succeeds? A good argument can be made on either side.

u/Sufficient_Tree4275 and other Reddit users
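
To make the two sides concrete, here's a small sketch that computes an availability SLI both ways from the same request log. The Request record, the operation_id field that ties retries to the original attempt, and the sample data are all assumptions for illustration, not anything from the thread.

```python
from dataclasses import dataclass


@dataclass
class Request:
    operation_id: str  # hypothetical key grouping an attempt with its retries
    succeeded: bool


def per_attempt_sli(requests):
    """Every attempt counts, so a failure that was later retried still hurts."""
    return sum(r.succeeded for r in requests) / len(requests)


def per_operation_sli(requests):
    """An operation is good if any of its attempts eventually succeeded."""
    outcome = {}
    for r in requests:
        outcome[r.operation_id] = outcome.get(r.operation_id, False) or r.succeeded
    return sum(outcome.values()) / len(outcome)


log = [
    Request("a", True),
    Request("b", False), Request("b", True),   # failed once, retry succeeded
    Request("c", False), Request("c", False),  # never succeeded
    Request("d", True),
]

attempt_sli = per_attempt_sli(log)      # 3 good attempts out of 6
operation_sli = per_operation_sli(log)  # 3 good operations out of 4
```

With this sample the per-attempt number is 0.5 and the per-operation number is 0.75. The first reflects what the server did; the second is closer to what the user experienced, at the cost of hiding the added latency of retries.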

What the SRE team wants to achieve with the development team

Mercari restructured its SRE team, moving toward an embedded model to adapt to its growing microservice architecture.

ShibuyaMitsuhiro — Mercari

Episode 1: Honeycomb and the Kafka Migration – The VOID

There’s a really great discussion in this episode about leaving slack in the system in the form of bits of capacity and inefficiency that can be drawn upon to buy time during an outage.

Courtney Nash, with guests Liz Fong-Jones and Fred Hebert — Verica

Why a ‘Reliability Mindset’ Must Be Adopted Beyond SRE

Here’s how non-SREs can use SRE principles to improve their systems.

Laurel Frazier — Transposit