Articles
Here are some things that make SREs a unique breed in software work:
The one about Scrum caught my eye, and I followed the links through to the Stack Overflow post about SRE and Scrum.
Ash P — Cruform
An in-depth explainer on the Linux page cache, full of details and experiments.
Viacheslav Biriukov
There’s some great advice in this reddit thread… and maybe some tongue-in-cheek advice too.
Take production down the first day they give access — then it’s nothing but up from there!
Various — reddit
Using two real-world case studies, this article explains how developer self-service can go wrong, and then discusses how to avoid these pitfalls.
Kaspar von Grünberg — humanitec
What a great idea! I found it especially interesting that only 34% of SRE job postings mention defining SLIs/SLOs/error budgets.
Pruthvi — Spike.sh
For the first time, we’ve created the State of Digital Operations Report which is based on PagerDuty platform data.
[…]
we will walk through some of these findings and share 10 questions teams can ask themselves to improve their incident response.
Hannah Culver — PagerDuty
Incident response so often gets mired in assumptions that need to be re-evaluated. This article uses an incident as a case study.
Lawrence Jones — incident.io
This one lays out clear definitions of SRE and DevOps and compares and contrasts them.
Mateus Gurgel — Rootly
This week, Salesforce released Merlion, a Python library for time series machine learning and anomaly detection. Linked is an in-depth research paper on Merlion, explaining its theory of operation and experimental results.
Bhatnagar et al. — Salesforce
Outages
reddit
Atlassian Statuspage.io
Giant Pay
Trello
Trello had major outages on two consecutive days (here’s the other).
SRE WEEKLY