Articles
This well-researched article caught me by surprise. It’s shocking that Ably received advice from AWS to stay under 400,000 simultaneous connections, despite Amazon’s own documentation stating support for “millions of connections per second”.
Paddy Byers — Ably
This blog is about how a group of hard-working individuals, with unique skills and working methods, managed to create a successful SRE team.
There’s a lot of detail about what their SREs do and how they communicate, with 3 projects as case studies.
Sergio Galvan — Algolia
This is a followup on an incident at Deno earlier this year. Their CDN saw their heavy use of .ts files (TypeScript, a JavaScript variant) and mistakenly assumed they were MPEG transport stream segments, a violation of the CDN’s ToS. Oops.
Luca Casonato — Deno
Wait, there are 9 now?
Marc Hornbeek — Container Journal
There’s a nice little discussion of why “human error” is not a good enough answer for why a deviation (from standard operating procedure) happened.
Susan J. Schniepp and Steven J. Lynn — Pharmaceutical Technology
They deployed an optimization that skipped sending some requests to the backend… and the backend metrics got worse. Why? Hint: aggregate metrics.
Dominik Sandjaja — Trivago
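To make the hint concrete, here is a minimal sketch of one way an aggregate metric can get worse while the system gets better. The numbers and names are made up for illustration and are not taken from the article; it simply assumes the skipped requests were the cheap ones.

```python
from statistics import mean

# Hypothetical latencies in milliseconds (illustrative, not from the article).
fast_requests = [5] * 80     # cheap requests the optimization now skips
slow_requests = [200] * 20   # expensive requests that still reach the backend

# Before the optimization, the backend serves every request.
before = fast_requests + slow_requests
# After it, only the expensive requests reach the backend.
after = slow_requests

mean_before = mean(before)
mean_after = mean(after)
total_before = sum(before)
total_after = sum(after)

print(f"mean backend latency: {mean_before} ms -> {mean_after} ms")
print(f"total backend time:   {total_before} ms -> {total_after} ms")
```

With these assumed numbers, mean backend latency rises from 44 ms to 200 ms even though total backend time drops from 4400 ms to 4000 ms. The backend is doing less work; the average just lost the cheap requests that used to pull it down.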
Outages
- National Weather Service (US)
- Squarespace.com (the Squarespace.com site itself, but not user sites)
- Microsoft 365
SRE WEEKLY