Minesweeper automates root cause analysis as a first-line defense against bugs

Root cause analysis (RCA) is an important part of fixing any bug. After all, you can’t solve a problem without getting to the heart of it. But RCA isn’t always simple, especially at a scale like Facebook’s. When billions of people are using an app on a variety of platforms and devices, a single bug…

Categorized as Technology

Zero Downtime Node Patching in a Kubernetes Cluster

Authors: Vaishnavi Galgali, Arpeet Kale, Robert Xue. Introduction: The Salesforce Einstein Vision and Language services are deployed in an AWS Elastic Kubernetes Service (EKS) cluster. One of the primary security and compliance requirements is operating system patching. The cluster nodes that the services are deployed on need to have regular operating system updates. Operating system patching…

Categorized as Technology

SRE Weekly Issue #257

A message from our sponsor, StackHawk: Keeping your APIs secure requires thoughtful design and testing. Learn how to protect your REST, SOAP and GraphQL APIs from security vulnerabilities with StackHawk: http://sthwk.com/api-protection Articles: “Sometimes alerts have inobvious reasons for existing.” This one really got me thinking. Make sure you document why an alert…

Categorized as SRE

Native Scrolling in Salesforce mobile app

The Salesforce mobile app is a native app with hybrid functionality, available on both iOS and Android. A hybrid app combines the best of both worlds, pairing native experiences with the rich web customizations that the Salesforce platform provides via Flexipages, Lightning Web Components, Aura, and Visualforce. UI Scroller was…

Categorized as Technology

Faster, more efficient systems for finding and fixing regressions

Every workday, Facebook engineers commit thousands of diffs (a diff is a change consisting of one or more files) into production. This code velocity allows us to rapidly ship new features, deliver bug fixes and optimizations, and run experiments. However, a natural downside to moving quickly in any industry is the risk of inadvertently causing regressions…

Categorized as Technology

How, Not Why: An Alternative to the Five Whys for Post-Mortems

When I got into the DevOps field, I was exposed to The Five Whys — a popular analytical method used in incident postmortems. The Five Whys is one type of root cause analysis (RCA): “The primary goal of the technique is to determine the root cause of a defect or problem by repeating the question ‘Why?.’ Each…

Categorized as Technology

SRE Weekly Issue #258

A message from our sponsor, StackHawk: On February 25 at 10 am PT we are going to show you how easy it is to add application security testing to a #GitLab pipeline. Save your spot for our live session: http://sthwk.com/gitlab-stackhawk-automation Articles: “Practiced Humility in Retrospectives.” When acting as a retrospective facilitator, there’s…

Categorized as SRE

Mitigating the effects of silent data corruption at scale

What the research is: Silent data corruption, or data errors that go undetected by the larger system, is a widespread problem for large-scale infrastructure systems. This type of corruption can propagate across the stack and manifest as application-level problems. It can also result in data loss and require months to debug and resolve. This work…

Categorized as Technology

SRE Weekly Issue #259

A message from our sponsor, StackHawk: Mark your calendars! The first conference for OWASP ZAP users is taking place March 9. Get your free ticket to connect with other ZAP users and learn about the project’s roadmap: http://sthwk.com/zapcon-sreweekly Articles: “Increment: Reliability.” This quarter’s Increment issue is about Reliability, and I haven’t had…

Categorized as SRE