Articles
Chaos Engineering isn’t adding chaos to your systems—it’s seeing the chaos that already exists in your systems.
Along with four prerequisites, this article also covers three myths about chaos engineering that might be making you hesitant to start.
Courtney Nash — Verica
This one’s from May of last year. Almost a year on, it’s interesting to see which of these we’ve already implemented.
Ashley Roof — Transposit
An amusing parable illustrating why you shouldn’t try to be too reliable.
Andrew Ford — Indeed
In the Outages section of last week’s issue, you’ll find two unrelated events referenced in this article: one about Russian internet censorship gone awry and another about a major datacenter fire.
Eric Johansson — Verdict
Along with what’s in the title, this article also covers the difference between an RCA and a contributing factors analysis.
Emily Arnott — Blameless
Lots of detail on how LinkedIn is improving their traffic forecasts. Warning/enticement: math contained within.
Deepanshu Mehndiratta — LinkedIn
Everyone is testing in production; some organizations admit it and plan for it.
How to do it right, what can happen if it goes wrong, and how to limit the blast radius.
Heidi Waterhouse — LaunchDarkly
Remember when GitHub logged you out? Ah, I remember it like it was last week. I mean, the week before. Here’s GitHub’s troubleshooting story about what went wrong.
Dirkjan Bussink — GitHub
Outages
- Google Cloud Platform
  - GCP had a major multi-region networking issue due to a routing glitch. Click through for their follow-up post.
- US National Oceanic and Atmospheric Administration (NOAA)
  - This outage impaired NOAA’s tsunami early warning system.
- Facebook, Instagram, and WhatsApp
- TikTok
- Elevated error rates
- Microsoft Teams and other services
  - Click through for a highly detailed description of what went wrong. I can’t link directly to the incident in question, so you’ll have to scroll down to 3/15.
SRE WEEKLY