Articles
This quarter’s Increment issue is about Reliability, and I haven’t had this much fun since their first issue about on-call. I’ll include a few of the articles here and more in later issues as I have a chance to review them.
Stripe
Accepting that imperfect things still work is fundamental to preventing failures from becoming catastrophes.
Understanding that no system is without errors is critical to building resilient systems.
Heidi Waterhouse
The very first sentence sets the tone, and I love it:
Resilience is a process: something you must actively perform, not something you check off a list once.
Ryn Daniels
Most of all, having an incident commander only works if everyone believes in the role. Someone stepping in to address a crisis and saying “I’m Batman” doesn’t help unless people have bought into the idea of Batman.
The next time I’m incident commander, I am totally going to jump in and say, “I’m Batman!”
This article is a great primer on what an IC is and how to adopt incident command at your organization.
Tanya Reilly
After reading this blog post, you will have an understanding of the retry pattern used in microservices architecture, why it should be used, a few considerations while using the retry pattern, and how to use it in Python.
I love the W. C. Fields quote.
Anand Prashant
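For readers who want a feel for the pattern before clicking through, here is a minimal sketch of a retry helper with exponential backoff and jitter. It is not taken from the article: the function name `retry`, its parameters, and the default values are all hypothetical choices made for illustration.

```python
import random
import time


def retry(func, max_attempts=3, base_delay=0.1, max_delay=2.0,
          retry_on=(Exception,), sleep=time.sleep):
    """Call func() until it succeeds or max_attempts calls have failed.

    Hypothetical helper for illustration only; the names and defaults
    are assumptions, not the API from the linked article.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            # Out of attempts: re-raise the last error to the caller.
            if attempt == max_attempts:
                raise
            # Exponential backoff, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: sleep a random amount up to the computed delay
            # so that many clients do not retry in lockstep.
            sleep(random.uniform(0, delay))


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky():
        # Fails twice with a transient error, then succeeds.
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionError("transient failure")
        return "ok"

    print(retry(flaky, max_attempts=5, retry_on=(ConnectionError,)))
```

Two considerations worth keeping in mind with any helper like this: only retry operations that are safe to repeat, and narrow `retry_on` to errors that are plausibly transient, so that a permanent failure surfaces immediately instead of being retried.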
It’s that time again! Be sure to fill out the survey, not only so they can gather useful data, but also because Catchpoint will donate $5 to charity.
DevOps Institute, Catchpoint, and VMware Tanzu
When considering the value of a QA test, SLIs can provide very valuable context.
SRE and QA can work hand in hand.
Emily Arnott — Blameless
This kind of thing keeps me up at night. Silent data corruption can destroy your reliability just as quickly as a backhoe on a non-redundant link.
Harish Dattatraya Dixit — Facebook
Etsy experienced years of growth practically overnight in 2020 as quarantines set in. Here’s how they handled it.
Mike Adler — Etsy
Outages
- Let’s Encrypt
- Google Voice
- This is Google’s analysis of the incident on February 16, caused by a TLS certificate management mishap.
- India’s National Stock Exchange (NSE)
- US Federal Reserve
- The US Fed’s computer system was down, preventing transfers between banks from going through.
- Venmo
- Facebook and Instagram
- Discord
SRE WEEKLY