SRE Weekly Issue #331

Articles

DisasterCast – A podcast about scary things and how to stop them happening

I’ve been listening to this podcast this week and I love it! Each episode covers a disaster, safety theory, and other topics — with no ads. Their site is down right now, but the podcast is available on the usual platforms.

Drew Rae — DisasterCast

An 8 Step Guide to Go From a Clueless to a Production-aware Software Engineer

If we want to get folks to own their code in production, we need to teach them how to think like an SRE.

Boris Cherkasky

3 mistakes I’ve made at the beginning of an incident (and how not to make them)

Let’s look at three mistakes I’ve made in the stressful first moments of an incident, and discuss how you can avoid making them.

The mistakes are:

Mistake 1: We didn’t have a plan.
Mistake 2: We weren’t production ready.
Mistake 3: We fell down a cognitive tunnel.

Robert Ross — FireHydrant

When to kill the canary

At what point does your canary test indicate failure? Should the criteria be the same as your normal production alerting?

Øystein Blixhavn

On Counting Alerts

This is a followup to a previous article about on-call health. In this one, the author shares metrics about the number of alerts and discusses whether this number is useful.

Fred Hebert — Honeycomb

High Availability on Razorpay Payments Dashboard

Their dashboard crashed for 50% of user sessions, so they had a lot of work ahead of them. Find out how they got crash-free sessions to 99.9% and improved their time to respond to incidents.

Sandesh Damkondwar — Razorpay

@atoonk on Twitter summarizing the Rogers Communications outage

Rogers Communications, a major telecom in Canada, had a country-wide outage earlier this month. I don’t normally include telecom outages in the Outages section because they rarely share information that we can learn from.

This time, Rogers released a (redacted) report on their outage, and this Twitter thread summarizes the key points.

@atoonk on Twitter

Outages

Microsoft Teams and Office 365
Microsoft blames storage error for Teams outage
Google Cloud Storage
Google Cloud europe-west2 region

Preliminary root cause has been identified as multiple concurrent failures to our redundant cooling systems within one of the buildings that hosts the europe-west2-a zone for the europe-west2 region.
