SRE Weekly Issue #408

This is either a set of SRE interview topics or the squares for the SRE bingo card.

Lorin Hochstein

Blame awareness only works if you practice it with all incidents, not just the ones that affect you.

Will Gallego

Rebuilding Netflix Video Processing Pipeline with Microservices

A brief history of our pipeline and the platforms, why the rebuilding was necessary, what these new services look like, and how they are being used for Netflix businesses.

Liwei Guo, Anush Moorthy, Li-Heng Chen, Vinicius Carvalho, Aditya Mavlankar, Agata Opalach, Adithya Prakash, Kyle Swanson, Jessica Tweneboah, Subbu Venkatrav, Lishan Zhu — Netflix

Best practices to prevent alert fatigue

Here are five concrete tips to fix your alerts and reduce alert fatigue.

Candace Shamieh, Daljeet Sandu, and Nicolas Narbais — Datadog

SRE Governance

This article contains guidelines for many kinds of reviews and activities SRE can do to improve reliability, such as SLO reviews, dependency reviews, and more.

Jamie Allen

Alerts Are Fundamentally Messy

However, the reality of alerting in a socio-technical system must cater not only to the mess around the signal, but also to the longer term interpretation of alerts by people and automation acting on them. This post will expand on this messiness and why Honeycomb favors an iterative approach to setting our alerts.

Fred Hebert — Honeycomb
Full disclosure: Honeycomb is my employer.

#23 – The Danger of Unreliable Platforms (with Jade Rubick)

This far-ranging conversation covers many aspects of developing a reliable platform for engineering. There’s a text summary if audio’s not your thing.

Ash Patel — SREPath

Slack’s Migration to a Cellular Architecture

Spurred by a single-AZ outage that took down their service, Slack set out to break their system into isolated segments so that an AZ can be drained of traffic quickly and without impacting customers.

Cooper Bethea — Slack
