SRE Weekly Issue #357

Articles

3 tips for reducing stress during incident response efforts

Panic takes time and energy away from swift incident response, leading to second-guessing, a higher likelihood of mistakes, and analysis paralysis. Here are three tips to minimize it.

Malcolm Preston — incident.io

The FAA outage: On public incident reports and seeking second stories

A great explanation of why we need to wait for more details on the FAA NOTAM outage. My favorite part is the list of clues as to whether an incident report might be useful: Time, Artifacts, Jargon, and Narrative.

Thai Wood — Resilience Roundup

Rundown of LinkedIn’s SRE practices

Lots of juicy details about a large SRE organization and how they work.

Ash Patel — SREPath

Cloudflare incident on January 24, 2023

A deploy accidentally wiped authentication tokens for some internal Cloudflare services, causing an outage for those services.

Kenny Johnson and Sam Rhea — Cloudflare

The Staging Dichotomy: Part One

eBay thought about adopting “test in production” and eliminating staging, but they determined that their use case really does require a staging environment. They carefully selected and anonymized real production data to use as test cases in staging.

Senthil Padmanabhan — eBay

Can We Stop With Those Horrible “System Overview” Dashboards Already?

This article has a really great section explaining the pitfalls of full system dashboards.

Boris Cherkasky

5 Exciting Predictions for SRE in 2023

The first one is my favorite:

“Economic factors will force companies to look for more efficient ways of managing reliability”

I’m not sure if that will happen, but it’s an interesting theory.

Emily Arnott

Remote First Incident Response

The author shares what they learned while adapting to running incidents remotely once the pandemic hit.

Emily Ruppe — Jeli
