How, Not Why: An Alternative to the Five Whys for Post-Mortems

When I got into the DevOps field, I was exposed to The Five Whys, a popular analytical method used in incident postmortems. The Five Whys is one type of root cause analysis (RCA): “The primary goal of the technique is to determine the root cause of a defect or problem by repeating the question ‘Why?.’ Each…

Categorized as Technology

SRE Weekly Issue #258

A message from our sponsor, StackHawk: On February 25 at 10 am PT we are going to show you how easy it is to add application security testing to a #GitLab pipeline. Save your spot for our live session: http://sthwk.com/gitlab-stackhawk-automation
Articles
Practiced Humility in Retrospectives
When acting as a retrospective facilitator, there’s…

Categorized as SRE

Mitigating the effects of silent data corruption at scale

What the research is: Silent data corruption, or data errors that go undetected by the larger system, is a widespread problem for large-scale infrastructure systems. This type of corruption can propagate across the stack and manifest as application-level problems. It can also result in data loss and require months to debug and resolve. This work…

Categorized as Technology

SRE Weekly Issue #259

A message from our sponsor, StackHawk: Mark your calendars! The first conference for OWASP ZAP users is taking place March 9. Get your free ticket to connect with other ZAP users and learn about the project’s roadmap: http://sthwk.com/zapcon-sreweekly
Articles
Increment: Reliability
This quarter’s Increment issue is about Reliability, and I haven’t had…

Categorized as SRE

Boosting the performance of virtual machines with Jump-Start

What the research is: Jump-Start is a new approach for improving the performance of virtual machines at scale. Virtual machines are a popular way to implement the programming languages used to build applications, including large-scale websites like Facebook and Instagram. However, virtual machines incur well-known performance overhead in terms of the amount…

Categorized as Technology

Notary: A Certificate Lifecycle Management Controller for Kubernetes

Authors: Vaishnavi Galgali, Savithru Lokanath, Arpeet Kale
Introduction
All services in the Einstein Vision and Language Platform use TLS/SSL certificates to encrypt communication between microservices. The certificates are generated in AWS Certificate Manager (ACM) and stored in the AWS Secrets Manager in the form of keystores and truststores (private and public keys). Certificate creation can be…

Categorized as Technology

SRE Weekly Issue #260

A message from our sponsor, StackHawk: Check out this guide to modern dynamic application security testing to learn how it works and what to look for in tooling. http://sthwk.com/dynamic-appsec-overview
Articles
[Increment: Reliability] Interview: Dr. David D. Woods
People throw around “resiliency” quite often when they mean “reliability” or “high availability”. Dr. Woods…

Categorized as SRE