SRE Weekly Issue #389

Articles

Here are four of the lessons I learned that should help you build a successful SRE organization.

Focus on Developer Training
Focus on the Right Abstractions
Focus on Self Service
Automate Yourself Out of a Job

Sven Hans Knecht

Exploring distributed vs centralized incident command models

In this blog post, we’ll talk about two incident management structure models, distributed and centralized, including the pros and cons of each and examples of what each structure looks like in our community.

Robert Ross — FireHydrant

Understanding the Rasmussen model for failures

The Rasmussen model conceptualizes the limits of a system along 3 boundaries: Cost, System Performance, and Human Capacity.

Nishant Modak — Last9

Accelerator Report: Leak repaired, cooling in progress

Wow, this is a really interesting incident. It has all the hallmarks of a nightmare sev1: time pressure, an unknown problem, new procedures invented on the spot, and multiple teams and specialties having to work together.

Jorg Wenninger — CERN

Scheduling Oncall considering Sabbath and other frequent recurring conflicts

What do you do when many engineers all need to take the same day off each week for religious reasons?

TimeWeSp

Concerning the production order system malfunction

Toyota recently halted production in their factories due to a problem in their order system, about which they shared some interesting details.

Toyota

Being The First SRE

Here’s a guidebook on how to handle being the first SRE at a company.

Sven Hans Knecht
