Here are some things that make SREs a unique breed in software work:
The one about Scrum caught my eye, and I followed the links through to the Stack Overflow post about SRE and Scrum.
Ash P — Cruform
An in-depth explainer on the Linux page cache, full of details and experiments.
There’s some great advice in this reddit thread… and maybe some tongue-in-cheek advice too.
Take production down the first day they give access — then it’s nothing but up from there!
Various — reddit
Using two real-world case studies, this article explains how developer self-service can go wrong, and then discusses how to avoid these pitfalls.
Kaspar von Grünberg — humanitec
What a great idea! I found it especially interesting that only 34% of SRE job postings mention defining SLIs/SLOs/error budgets.
Pruthvi — Spike.sh
For the first time, we’ve created the State of Digital Operations Report which is based on PagerDuty platform data.
We will walk through some of these findings and share 10 questions teams can ask themselves to improve their incident response.
Hannah Culver — PagerDuty
Incident response so often gets mired in assumptions that need to be re-evaluated. This article uses an incident as a case study.
Lawrence Jones — incident.io
This one lays out clear definitions of SRE and DevOps and compares and contrasts them.
Mateus Gurgel — Rootly
This week, Salesforce released Merlion, a Python library for time series machine learning and anomaly detection. Linked is an in-depth research paper on Merlion, explaining its theory of operation and experimental results.
Bhatnagar et al. — Salesforce
Trello had major outages on two consecutive days (here’s the other).