SRE Weekly Issue #354

Articles

This episode of DisasterCast discusses what happens when attempts to make things safer backfire.

by trying to suppress small problems, we create a reservoir of danger waiting to burst out

Drew Rae

These images offer a glimpse into the visual patterns that appear in our variables and time-series, and the beauty that emerges from chaos. Some of the images in these galleries appeared during difficult rollouts, and some even during production incidents. All come from graphs generated by Google’s monitoring systems.

Google

The Scientific Method for Testing System Resilience

The popular slogan says “test in production”, but what if your business simply doesn’t allow it?

For any scenario where I expect to be causing client impact, I’d rather test in non-production than not test at all, since production is clearly off the table.

Christina Yakomin — InfoQ

The Engineers Are Bloggers Now

There’s been a trend toward narrating our engineering work on company blogs, without which this newsletter probably wouldn’t exist.

Jordan Teicher — New York Times

Moving to cloud: How to do Migrations the wrong way

My team recently moved databases from local files in the codebase to an online database.

It didn’t go quite as planned, but they got there in the end.

Kaustubh Hiware — Mercari

Developing a data-driven tool to estimate the cost of incidents

In Product Analytics we wanted to support our colleagues in SRE, so we created a model to predict the monetary costs of incidents affecting our conversion funnel.

Enrique Hernani Ros — HelloFresh

Air traffic system outage in Philippines strands 65,000

There’s some interesting detail here about multiple failed UPSes and an accidental voltage mismatch exacerbating the situation.

Laura Dobberstein — The Register
