SRE Weekly Issue #302

Happy holidays to those who celebrate! I put this issue together in advance, so there is no Outages section this week.

Articles

Zero Downtime, Instant Deployment and Rollback

This is another great deep-dive into strategies for zero-downtime deploys.

Suresh Mathew — eBay

Making Your On-call and Incident Management Program Stick

How do you make sure your incident management process survives the growth of your team? This article has a useful list of things to cover as you train new team members.

David Caudill — Rootly
This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.

Upcoming trends in DevOps and SRE

The trends in this article are:

AIOps and self-healing platforms
Service Meshes
Low-code DevOps
GitOps
DevSecOps

Biju Chacko — squadcast

Rundown of Netflix’s SRE practice

I can’t get enough of these. Please write one about your company!

Ash Patel

Some reasons to measure

My favorite part is the discussion of Kyle Kingsbury’s work on Jepsen. Would distributed systems have even more problems if Kingsbury had not shed light on them?

Dan Luu

Has the firefighting stopped? The effect of COVID-19 on on-call engineers

PagerDuty analyzed usage data for their platform in order to draw inferences about how the pandemic has affected incident response.

PagerDuty

Cardiac Surgery with a Robot: It Benefits Patients, but can Harm Surgeons

There’s a ton of interesting stuff in here about confirmation bias and fear in adopting a new, objectively less risky process.

Robert Poston, MD
