Articles
This one was brought to my attention by Dr. Richard Cook, who also pointed me to the AAIB incident report.
Dr. Cook went on to share these insights with me, which I’ve copied here with permission:
Note:
- the subtle interactions allowed the manual correction to be lost during the interval between recognizing the software problem and having the corrected software functionally ‘catch’ the Ms/Miss title mixup;
- the incident is attributed to “a simple flaw in the programming of the IT system” rather than failure of the workarounds that were put in place after the problem was recognized;
- the report is careful to demonstrate that the flaws in the system made only a slight difference to the flight parameters;
- the report does not describe any IT process changes whatsoever!
The report has the effect of making the incident appear to be an unfortunate series of occurrences rather than being emblematic of the way that these sorts of processes are vulnerable.
Last year’s SRE From Home event was awesome, and this year’s iteration looks to be just as great.
Catchpoint
This is fun! Try your hand at troubleshooting a connection issue in this gamified role-play scenario.
BONUS CONTENT: Read about the author’s motivations, design decisions, and plans here.
Julia Evans
Do we need to have some kind of Pillars Registry? Note, these are more like pillars of high availability than resilience engineering.
Hector Aguilar — Okta
I love this idea that we’re trying to get deep incident analysis done even though that may not be the actual goal of the organization.
As LFI analysts, we’re exploiting this desire for closure to justify spending time examining how work is really done inside of the system.
Lorin Hochstein
This is well worth a read if only for the on-call scenario at the start. Yup, been there. We miss you, Harry.
Harry Hull — Blameless
What’s the difference? Click through to learn about the distinction they’re drawing.
Amir Kazemi — effx
The New York Times’s Operations Engineering group developed an Operational Maturity Assessment and uses it to have collaborative conversations with teams about their systems.
Author: The NYT Open Team — New York Times
Outages
- G-Suite
  - Google posted this “Mini Incident Report while full Incident Report is prepared.”
- Slack
- Docker Hub
- Robinhood
- Elevated CDN Errors
- Heroku
SRE WEEKLY