SRE Weekly Issue #381

Articles

The pyramid introduced in this article has three levels of monitoring: Operational, Data Validation, and Business Assumptions. These roughly correspond to three questions: Is the system up? Is the right amount of data flowing through it? Is that data correct?

Karel Vanden Bussche — DEV

Incident Review for Site-wide Outage for GitLab.com – Stale Terraform Pipeline #15997 (#15999)

Extremely powerful tools, Terraform for example, can become extremely powerful footguns.

Dave Smith — GitLab

latency: a primer

Sure, you know what latency is, but do you really know what a percentile is? A histogram? A heatmap?

igor

CDN Observability

If you’re using a CDN, you need to keep an eye on it. Here’s a primer on what to watch for.

Or Hillel — DZone

Principles of Reliable Software Design

This article series covers 12 aspects important in the design of reliable systems. Some of the aspects, such as modularity, loose coupling, graceful degradation, and redundancy, are covered in depth.

Code Reliant

GitHub Availability Report: June 2023

A couple of weeks back, GitHub was hard down, at times including even its status page. This report covers the outage in detail, and the cause is pretty interesting.

Jakub Oleksy – GitHub

Failover

An in-depth look at different kinds of failover, including each kind’s methodology and purposes.

Alex Ewerlöf

Finding Fault: The crash of Korean Air Cargo flight 8509

This one is especially interesting for the controversial and baseless conclusions popularized in the media about a supposed cause rooted in Korean culture. It’s a good reminder that we need to be careful to ensure the validity of the lessons we learn from incidents.

Admiral Cloudberg
