Articles
We built Edgar to ease this burden, by empowering our users to troubleshoot distributed systems efficiently with the help of a summarized presentation of request tracing, logs, analysis, and metadata.
Kevin Lew, Maulik Pandey, Narayanan Arunachalam, Dustin Haffner, Andrei Ushakov, Seth Katz, Greg Burrell, Ram Vaithilingam, Mike Smith and Elizabeth Carretto — Netflix
The PDF covers 5 main areas:
Availability
Performance
Monitoring
Incident Response
Preparation
You can download the PDF without creating an account or filling out a form.
Splunk/VictorOps
This one’s especially interesting for the section about what MTTx metrics aren’t good for, and the following section on how to improve them.
Emily Arnott — Blameless
If you’re interested in deploying Kafka in a multi-region configuration, eBay has put quite a bit of thought into this and has a lot to share.
Engin Yoeyen — eBay
Straight from someone who was there from the start. The “what chaos engineering is not” section is especially enlightening.
Casey Rosenthal — Verica
The last paragraph regarding “unknown unknowns” is noteworthy.
Heroku
There are some great questions in here on blamelessness and full service ownership.
James Thigpen — Gremlin
Outages
Google Cloud Platform us-west2 region
They posted a detailed follow-up at the above link.
TikTok
Network Solutions and Register.com
Singapore Exchange (SGX)
reddit
Parler
SRE WEEKLY