SRE Weekly Issue #265

Articles

Insights into a Product SRE team at LinkedIn

Here’s a great look into how LinkedIn’s embedded SREs work.

[…] the mission for Product SRE is to “engineer and drive product reliability by influencing architecture, providing tools, and enhancing observability.”

Zaina Afoulki and Lakshmi Namboori — LinkedIn

DNS propagation does not exist

It’s all just other people’s caches.

Ruurtjan Pul

Advice for someone moving from SRE to backend engineering

Recently there was a Reddit post asking for advice about moving from Site Reliability Engineering to Backend Eng. I started writing a response to it, the response got long, and so I turned it into a blog post.

Charles Cary — Shoreline

The Mightiest Monolith

This is the first in a series about lessons SREs can learn from the space shuttle program. The author likens earlier spacecraft to microservices and the Shuttle to a monolith.

Robert Barron

The 5 characteristics of high reliability organizations

This article is ostensibly about Emergency Medical Services (EMS), but as is so often the case, it’s directly applicable to SRE. The 5 characteristics are enlightening, and so is the fictitious anecdote about an EMT rattled by a previous incident.

EMS1

How we scaled the GitHub API with a sharded, replicated rate limiter in Redis

Simple solution meets reality. I like how we get to see what they did when things didn’t quite work out as they were hoping.

Robert Mosolgo — GitHub

GitHub Availability Report: March 2021

They did the work to convert a database column to a 64-bit integer before it was too late. Unfortunately, one of their library dependencies didn’t use 64-bit integers.

Keith Ballinger — GitHub

Learning from incidents: getting Sidekiq ready to serve a billion jobs

In this post, I’ll walk you through one of our first ever Sidekiq incidents and how we improved our Sidekiq implementation as a result of this incident.

Nakul Pathak — Scribd

Outages

Let’s Encrypt
Uber
Multiple Airlines’ Online Booking Sites
- An error in Google’s flight information service caused problems at multiple sites that consume it.
Tinder
BBC Website
Facebook, Instagram, and WhatsApp
Stellar.org (cryptocurrency)
WazirX (cryptocurrency exchange)
Microsoft Azure and other services
- Azure DNS servers experienced an anomalous surge in DNS queries from across the globe targeting a set of domains hosted on Azure.
