SRE Weekly Issue #314

Articles

Slight Reliability Episode 1 – What the heck is SRE anyway?

The first episode of this new podcast answers the question in three ways: what Google says SRE is, what the podcast host thinks it is, and how people seem to be practicing SRE.

Stephen Townsend — Slight Reliability

WHY this Aircraft CRASHED Down a RAVINE in India | Air India Express flight 1344

This aircraft accident report puts heavy emphasis on the deeper contributing factors rather than a seemingly obvious single root cause.

Mentour Pilot

Incident affecting Google Cloud Networking — Incident Report

Google posted an incident report for the March 8 incident involving Traffic Director.

Google

Fixing retries with token buckets and circuit breakers

This one includes some neat graphs showing load and theoretical success rates for various strategies: no retries, N retries, token buckets, and circuit breakers.

Marc Brooker
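
For readers who have not met the token bucket idea, here is a minimal sketch of a client-side retry budget. It is not taken from the linked article. The names RetryBudget and call_with_retries, and the capacity and refill values, are assumptions made up for illustration. The first attempt is always allowed, each retry spends one token, and each success earns back a fraction of a token.

```python
class RetryBudget:
    """Token bucket that limits retries only; first attempts are never blocked."""

    def __init__(self, capacity=10.0, refill_per_success=0.1):
        # Both defaults are illustrative assumptions, not recommended values.
        self.capacity = capacity
        self.refill_per_success = refill_per_success
        self.tokens = capacity

    def record_success(self):
        # Each successful call earns back a fraction of a token, up to capacity.
        self.tokens = min(self.capacity, self.tokens + self.refill_per_success)

    def try_acquire_retry(self):
        # A retry costs one whole token; refuse when the bucket is too low.
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def call_with_retries(operation, budget, max_attempts=3):
    """Run operation, retrying on any exception while the budget allows it."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(max_attempts):
        # Attempt 0 is free; every later attempt must pay from the bucket.
        if attempt > 0 and not budget.try_acquire_retry():
            break
        try:
            result = operation()
        except Exception as exc:
            last_error = exc
            continue
        budget.record_success()
        return result
    raise last_error
```

When the downstream service is healthy, the bucket stays full and retries behave as usual. During a sustained outage the bucket drains, so the extra load from retries drops toward zero instead of multiplying every request by N.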

Who watches the watchers?

What if your alerting system goes down? These folks set up a dead-switch to handle that situation.

Miedwar Meshbesher — Nanit
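
The general dead man's switch pattern can be sketched in a few lines. This is an assumption-laden illustration, not Nanit's implementation: the class name DeadMansSwitch, the timeout value, and the injectable clock are all hypothetical. The alerting system sends a heartbeat on a schedule, and an independent watcher raises the alarm when heartbeats stop arriving.

```python
import time


class DeadMansSwitch:
    """Trips when no heartbeat has arrived within timeout_seconds."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        # The clock is injectable so the behavior can be tested without sleeping.
        self.timeout_seconds = timeout_seconds
        self._clock = clock
        self._last_heartbeat = clock()

    def heartbeat(self):
        # Called by the monitored alerting system while it is healthy.
        self._last_heartbeat = self._clock()

    def is_tripped(self):
        # Polled by a separate watcher; True means the heartbeats went silent.
        return self._clock() - self._last_heartbeat > self.timeout_seconds
```

The important design choice is that the watcher runs somewhere that does not share failure modes with the alerting system it is watching, otherwise both can go down together.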

Incident Communication Best Practices: Defining Your On-Call Vocabulary

Strategies for creating concise, efficient communication between teams during incidents and operational surprises

[…] communications must be precise and descriptive to minimize confusion and accelerate a responder’s ability to assess and remedy the situation.

Steve Stevens — Transposit

Detecting silent errors in the wild: Combining two novel approaches to quickly detect silent data corruptions at scale

I really love these articles about hardware errors. They’re more common than we tend to realize.

Harish Dattatraya Dixit — Facebook

Outages

GitHub
ASX (Australian Securities Exchange)
Google Maps