SRE Weekly Issue #394

A warm welcome to my new sponsor, FireHydrant!

Creating Checklists for High Stakes Changes

This article gives an example checklist for a database version upgrade in RDS and explains why checklists can be so useful for changes like this.

Nick Janetakis

The balancing act of reliability and availability

This article distinguishes availability (responding at all) from reliability (responding correctly), and notes that different techniques solve for each.

incident.io

What every developer should know about TCP

Latency and throughput are inextricably linked in TCP, and this article explains why with a primer on congestion windows and handshakes.

Roberto Vitillo
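
To make the latency/throughput link in the TCP article concrete, here is a rough sketch of my own (not taken from the article). It assumes the simplest possible model: a sender can have at most one window of data in flight per round trip, so throughput is capped at window size divided by round-trip time. The function name and the example numbers are mine.

```python
def max_throughput_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on throughput in bits per second.

    Simplified model (an assumption): at most one full window of
    data is delivered per round trip, ignoring loss and slow start.
    """
    if rtt_seconds <= 0:
        raise ValueError("rtt_seconds must be positive")
    return window_bytes * 8 / rtt_seconds


# Same 65,535-byte window, two different round-trip times.
near = max_throughput_bps(65_535, 0.010)  # 10 ms RTT
far = max_throughput_bps(65_535, 0.100)   # 100 ms RTT

print(f"10 ms RTT:  {near / 1e6:.1f} Mbit/s")
print(f"100 ms RTT: {far / 1e6:.1f} Mbit/s")
```

Under this model, a tenfold increase in round-trip time cuts the ceiling on throughput by the same factor, no matter how much bandwidth the link has.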

Why you should measure tail latencies

Tail latency has a huge impact on throughput and on the overall user experience. Measuring average latency just won’t cut it.

Roberto Vitillo
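
A quick illustration of why the average hides the tail, using made-up numbers of my own rather than anything from the article. The `percentile` helper is a hypothetical nearest-rank implementation; a real monitoring system would likely use histograms or a streaming estimate instead.

```python
import math
from statistics import mean


def percentile(samples, p):
    """Nearest-rank percentile for p in (0, 100]."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Synthetic data: 98 fast requests and 2 very slow ones.
latencies_ms = [10] * 98 + [1000] * 2

avg = mean(latencies_ms)
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)

print(f"mean={avg} ms, p50={p50} ms, p99={p99} ms")
```

With this data the mean comes out just under 30 ms, which looks healthy, while the 99th percentile is a full second.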

A Brief, Incomplete and Mostly Wrong Devops Glossary

Is it really wrong though? Is it?

Adam Gordon Bell — Earthly

Part One: Exploring Aviation’s Human Factors ‘Dirty Dozen’

I’ve shared the FAA’s infographic of the Dirty Dozen here previously, but here’s a more in-depth look at the first six items.

Dr. Omar Memon — Simple Flying

More than five whys and “layer eight” problems

It’s often necessary to go through far more than five whys to understand what’s really going on in a sociotechnical system.

rachelbythebay

SRE Story with Michael Hausenblas

I found the bit about the AWS Incident/Communication Manager on-call role pretty interesting.

Prathamesh Sonpatki — SRE Stories

