SRE Weekly Issue #379

Articles

In case you weren’t familiar with the Saga pattern (I wasn’t), it’s basically a pseudo-transaction across multiple microservices. Here’s why it might not be a great idea.

Sergiy Yevtushenko
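
For anyone else new to the pattern, here is a minimal sketch of the general idea, not code from the linked article. It assumes each step is paired with a compensating "undo" action. The names `run_saga`, `Step`, and the log format are hypothetical, and plain in-process function calls stand in for calls to separate services.

```python
from typing import Callable, List, Tuple

# Each step is (name, action, compensation). All three are illustrative
# stand-ins for calls to separate microservices.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step], log: List[str]) -> bool:
    """Run steps in order; if one fails, undo the completed ones in reverse."""
    completed: List[Tuple[str, Callable[[], None]]] = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            # Roll back by running the compensations, newest first.
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"undone:{done_name}")
            return False
        log.append(f"done:{name}")
        completed.append((name, compensate))
    return True
```

Unlike a real database transaction, nothing here is atomic or isolated: other services can observe the intermediate state, and a compensation can itself fail, which this sketch does not handle.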

The Story Behind Last Week’s Let’s Encrypt Downtime

During a rolling deploy, for a very brief period of time, different parts of the infrastructure had old or new code running, with unexpected results.

Andrew Ayer

Generating sequential numbers in a distributed manner

On its face, we have a simple requirement:

Generate sequential numbers
Ensure that there can be no gaps
Do that in a distributed manner

It’s never simple with distributed systems.

Lost in transit: debugging dropped packets from negative header lengths

In classic Cloudflare style, here’s an ultra-deep dive into the kernel to find the source of trouble-making packet loss.

Terin Stock — Cloudflare

There Are No Repeat Incidents

Even with a “duplicate” incident, there’s always at least one thing that’s different: the fact that it’s happened before. That changes things. In practice, a lot more will be different too.

Fred Hebert — Honeycomb
Full disclosure: Honeycomb is my employer.

Why So Many Companies Run in AWS us-east-1

There are definitely pros and cons to being in the most popular (and most oft-maligned) AWS region.

Jeff Martens — Metrist

What Is That Change Which Is The Source Of All Instability?

Changes are frequent causes of incidents, but what exactly counts as a change? This article delves into that with examples.

Boris Cherkasky

Collision with the Terminal: The crash of RwandAir flight 205

This crash is a great reminder that we have to look past “human error” to the systems around the humans that set them up for failure (or don’t set them up for success).

Admiral Cloudberg
