SRE Weekly Issue #377

Articles

Why did AWS Support fail with US-EAST-1 again?

AWS had a major Lambda outage in us-east-1, and it took out many customer systems and quite a few other AWS systems, including their support portal.

The Stack

How I went from Operations Manager to Site Reliability Engineer In 6 Months!

This person had a fascinating path to SRE, starting their career as a generator repair technician and transitioning through DevOps to SRE.

Brian Hellinger — Towards AWS

Migrating Critical Traffic At Scale with No Downtime — Part 2

In part 1, they outlined how they replay real traffic to test a new system before deploying it. In this article, they build on that with three additional techniques: sticky canaries, A/B testing, and gradually shifting traffic to the new system in production.

Shyam Gala, Javier Fernandez-Ivern, Anup Rokkam Pratap, and Devang Shah — Netflix
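To make "sticky" and "gradually shifting" concrete, here is a minimal sketch of one common approach: hash each caller into a stable bucket and route a configurable percentage of buckets to the new system. This is not taken from the Netflix article. The names `bucket_for` and `route`, the use of a customer ID as the key, and the 100-bucket scheme are all assumptions made for illustration.

```python
import hashlib


def bucket_for(customer_id: str) -> int:
    """Map a customer ID to a stable bucket in the range 0..99 (hypothetical scheme)."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def route(customer_id: str, percent_to_new: int) -> str:
    """Return "new" or "old" for this customer at the given rollout percentage.

    The bucket depends only on the ID, so a customer keeps the same
    assignment between requests (sticky), and raising the percentage only
    ever moves customers from "old" to "new", never back.
    """
    return "new" if bucket_for(customer_id) < percent_to_new else "old"


if __name__ == "__main__":
    for percent in (0, 10, 50, 100):
        print(percent, route("customer-42", percent))
```

Rolling back in this sketch is just lowering `percent_to_new`; the same customers return to the old system in reverse order.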

The Data Behind Delayed Status Page Updates

By comparing status page posting to their independent monitoring of services, Metrist is able to produce statistics about how long companies take to post to their status pages when they have an outage.

Jeff Martens — Metrist

When there’s no plan for this scenario, you’ve got to improvise

Improvising during an incident isn’t just a one-off occurrence, and we should plan for it.

Lorin Hochstein — Surfing Complexity

Heroku Incident 2558 Followup

A foreign key column had a smaller integer data type than the key that it referenced, and it failed when the referenced key went too high.

Heroku
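The failure mode described here can be reproduced in miniature. The sketch below assumes a 32-bit signed foreign key column referencing a 64-bit key; the summary above does not state the actual column types, so those sizes and the helper names are illustrative assumptions. It uses Python's `struct` module to stand in for fixed-width storage.

```python
import struct

INT32_MAX = 2**31 - 1  # largest value a signed 32-bit column can hold


def store_pk_int64(new_id: int) -> bytes:
    """Pack an ID into 8 bytes, as a 64-bit key column would (assumed type)."""
    return struct.pack("<q", new_id)


def store_fk_int32(referenced_id: int) -> bytes:
    """Pack an ID into 4 bytes, as a 32-bit foreign key column would (assumed type)."""
    return struct.pack("<i", referenced_id)


def fits_int32(value: int) -> bool:
    """Report whether the value can be stored in the narrower column."""
    try:
        store_fk_int32(value)
        return True
    except struct.error:
        return False


if __name__ == "__main__":
    for key in (INT32_MAX, INT32_MAX + 1):
        store_pk_int64(key)  # the wide key column accepts both values
        print(key, "fits in the 32-bit column:", fits_int32(key))
```

Everything works until the referenced key passes the narrower type's maximum, which is why this kind of mismatch can sit unnoticed for a long time.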

Scalable chat app architecture: How to get it right the first time

Here, we’ll look at the key considerations you need to make when it comes to the architecture of your chat app, the structure and components of that architecture, and some of the technology options that can help support you in building a reliable chat experience.

Ably

A Watery Surprise: The crash of National Airlines flight 193

A departure from the normal air traffic control procedure allowed the pilots to lose situational awareness. A commonly held myth about flotation equipment contributed to three deaths in a quite survivable accident.

Admiral Cloudberg

It’s not always DNS — unless it is

They kept finding what they thought was the problem, and their fixes helped, but the problem kept coming back.

Tanat Paul Lokejaroenlarb — Adevinta
