SRE Weekly Issue #274

A message from our sponsor, StackHawk:

Join the GraphQL Security Testing Learning Lab on June 29 at 9 AM PT. Learn how to run automated security testing against your GraphQL APIs so you can find and fix vulnerabilities fast.
http://sthwk.com/graphql-learning-lab

Articles

The last section suggests selling SLOs to executives by likening them to OKRs or KPIs.

Austin Parker — Devops.com

Lowe’s is a home improvement retailer in North America. I find it fascinating to learn that a company not usually seen as part of the tech sector has a robust SRE practice.

Vivek Balivada and Rahul Mohan Kola Kandy — Lowe’s

The hallmark of sociological storytelling is if it can encourage us to put ourselves in the place of any character, not just the main hero/heroine, and imagine ourselves making similar choices.

Lorin Hochstein

This is brilliant: they apply DevOps and SRE practices to the challenging work of raising two autistic children.

Zac Nickens — USENIX ;login:

I especially like how their bot automatically pages reinforcements after folks have been on an incident for long enough to become fatigued.

Daniella Niyonkuru

Rather than measuring Mean Time To Recovery for incidents, let’s track our Mean Time To Retrospective.

Robert Ross — FireHydrant

Outages

Fastly

Fastly had a global outage of their CDN service, with many 5xx errors for around 40 minutes, followed by diminished cache hit ratios. Many Fastly customers experienced degradation, including Amazon, Reddit, and GitHub.

Fastly posted a summary shortly after the incident, describing a latent bug that was triggered by a customer’s (valid) configuration change.

Full disclosure: Fastly is my employer.

Salesforce

Facebook, Instagram, and WhatsApp