{"id":883,"date":"2024-06-19T16:00:46","date_gmt":"2024-06-19T16:00:46","guid":{"rendered":"https:\/\/fde.cat\/index.php\/2024\/06\/19\/pvf-a-novel-metric-for-understanding-ai-systems-vulnerability-against-sdcs-in-model-parameters\/"},"modified":"2024-06-19T16:00:46","modified_gmt":"2024-06-19T16:00:46","slug":"pvf-a-novel-metric-for-understanding-ai-systems-vulnerability-against-sdcs-in-model-parameters","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2024\/06\/19\/pvf-a-novel-metric-for-understanding-ai-systems-vulnerability-against-sdcs-in-model-parameters\/","title":{"rendered":"PVF: A novel metric for understanding AI systems\u2019 vulnerability against SDCs in model parameters"},"content":{"rendered":"<p><span>We\u2019re introducing<\/span><a href=\"https:\/\/arxiv.org\/pdf\/2405.01741\"> <span>parameter vulnerability factor (PVF)<\/span><\/a><span>, a novel metric for understanding and measuring AI systems\u2019 vulnerability against silent data corruptions (SDCs) in model parameters.<\/span><br \/>\n<span>PVF can be tailored to different AI models and tasks, adapted to different hardware faults, and even extended to the training phase of AI models.<\/span><br \/>\n<span>We\u2019re sharing results of our own case studies using PVF to measure the impact of SDCs in model parameters, as well as potential methods of identifying SDCs in model parameters.<\/span><\/p>\n<p><span>Reliability is an important aspect of any successful AI implementation. But the growing complexity and diversity of AI hardware systems also brings an increased risk of hardware faults such as bit flips. 
Manufacturing defects, aging components, or environmental factors can lead to data corruptions \u2013 errors or alterations in data that can occur during storage, transmission, or processing and result in unintended changes in information.<\/span><\/p>\n<p><a href=\"https:\/\/engineering.fb.com\/2022\/03\/17\/production-engineering\/silent-errors\/\"><span>Silent data corruptions<\/span><\/a><span> (SDCs), where an undetected hardware fault results in erroneous application behavior, have become increasingly prevalent and difficult to detect. Within AI systems, an SDC can create what is referred to as parameter corruption, where AI model parameters are corrupted and their original values are altered.<\/span><\/p>\n<p><span>When this occurs during AI inference\/servicing, it can potentially lead to incorrect or degraded model output for users, ultimately affecting the quality and reliability of AI services.<\/span><\/p>\n<p><span>Figure 1 shows an example of this, where a single bit flip can drastically alter the output of a ResNet model.\u00a0<\/span><\/p>\n<p>Figure 1: Flipping a random bit of one parameter in the 1st convolution (conv) layer in ResNet-18 drastically alters the model\u2019s output.<\/p>\n<p><span>With this escalating threat in mind, there are two important questions: How vulnerable are AI models to parameter corruptions? And how do different parts (such as modules and layers) of the models exhibit different vulnerability levels to parameter corruptions?<\/span><\/p>\n<p><span>Answering these questions is an important part of delivering reliable AI systems and services and offers valuable insights for guiding AI hardware system design, such as when assigning AI model parameters or software variables to hardware blocks with differing fault protection capabilities. 
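The single-bit-flip corruption illustrated in Figure 1 can be made concrete with a minimal sketch (ours, not from the paper): flipping one bit of a float32 weight, as a hardware fault might, can change its value by dozens of orders of magnitude.

```python
import struct

def flip_bit(value, bit):
    """Flip a single bit (0-31) of a float32 and return the new value."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", bits ^ (1 << bit)))
    return flipped

w = 0.15625  # a weight value exactly representable in float32
print(flip_bit(w, 31))  # sign bit flipped: -0.15625
print(flip_bit(w, 30))  # top exponent bit flipped: ~5.3e37
```

Flips in the mantissa bits perturb a weight only slightly, while a flip in a high exponent bit (as in the second call) blows it up catastrophically, which is why single-bit faults can drastically alter model output.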
Additionally, it can provide important information for formulating strategies to detect and mitigate SDCs in AI systems efficiently and effectively.<\/span><\/p>\n<p><a href=\"https:\/\/arxiv.org\/pdf\/2405.01741\"><span>Parameter vulnerability factor (PVF)<\/span><\/a><span> is a novel metric we\u2019ve introduced with the aim of standardizing the quantification of AI model vulnerability against parameter corruptions. PVF is a versatile metric that can be tailored to different AI models\/tasks and is also adaptable to different hardware fault models. Furthermore, PVF can be extended to the training phase to evaluate the effects of parameter corruptions on the model\u2019s convergence capability.<\/span><\/p>\n<h2><span>What is PVF?<\/span><\/h2>\n<p><span>PVF is inspired by the architectural vulnerability factor (AVF) metric used within the computer architecture community. We define a model parameter\u2019s PVF as the probability that a corruption in that specific model parameter will lead to an incorrect output. Similar to AVF, this statistical quantity can be derived from extensive, statistically meaningful fault injection (FI) experiments.\u00a0<\/span><\/p>\n<p><span>PVF has several features:<\/span><\/p>\n<h3><span>Parameter-level quantitative assessment<\/span><\/h3>\n<p><span>As a quantitative metric, PVF concentrates on parameter-level vulnerability, calculating the likelihood that a corruption in a specific model parameter will lead to an incorrect model output. 
This \u201cparameter\u201d can be defined at different scales and granularities, such as an individual parameter or a group of parameters.<\/span><\/p>\n<h3><span>Scalability across AI models\/tasks<\/span><\/h3>\n<p><span>PVF is scalable and applicable across a wide range of AI models, tasks, and hardware fault models.<\/span><\/p>\n<h3><span>Provides insights for guiding AI system design<\/span><\/h3>\n<p><span>PVF can provide valuable insights for AI system designers, guiding them in making informed decisions about balancing fault protection with performance and efficiency. For example, engineers might leverage PVF to map more vulnerable parameters to better-protected hardware blocks and explore tradeoffs among latency, power, and reliability by enabling a surgical approach to fault tolerance at selective locations instead of a catch-all\/none approach.\u00a0<\/span><\/p>\n<h3><span>Can be used as a standard metric for AI vulnerability\/resilience evaluation<\/span><\/h3>\n<p><span>PVF has the potential to unify and standardize such practices, making it easier to compare the reliability of different AI systems\/parameters and fostering open collaboration and progress in the industry and research community.<\/span><\/p>\n<h2><span>How PVF works<\/span><\/h2>\n<p><span>Like AVF, PVF is a statistical concept and needs to be derived through a large number of statistically meaningful FI experiments. Figure 2 shows the overall flow to compute PVF through an FI process. 
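As a rough illustration of that FI flow, here is a minimal, hypothetical sketch (the function names, the toy model, and the sign-flip fault model are ours, not from the paper): a parameter group's PVF is estimated as the fraction of fault-injection trials in which the corrupted model's output differs from the fault-free, "golden" output.

```python
import random

def estimate_pvf(params, run_inference, inputs, flip, trials=100):
    """Estimate the PVF of each parameter group by fault injection (FI):
    the fraction of trials in which a corruption changes the model output."""
    golden = [run_inference(params, x) for x in inputs]  # fault-free outputs
    scores = {}
    for name, values in params.items():
        incorrect = 0
        for _ in range(trials):
            # Corrupt one randomly chosen parameter in this group.
            corrupted = {k: list(v) for k, v in params.items()}
            idx = random.randrange(len(values))
            corrupted[name][idx] = flip(values[idx])
            outputs = [run_inference(corrupted, x) for x in inputs]
            if outputs != golden:
                incorrect += 1
        scores[name] = incorrect / trials
    return scores

# Toy model (ours, not DLRM): the output depends only on the "w" group.
def run_inference(params, x):
    return 1 if params["w"][0] * x > 0 else 0

params = {"w": [2.0], "unused": [5.0]}
scores = estimate_pvf(params, run_inference, inputs=[1.0, -3.0],
                      flip=lambda v: -v, trials=50)  # fault model: sign flip
print(scores)  # "unused" never affects the output, so its PVF is 0.0
```

In a real study the fault model would flip random bits of the stored parameter encoding rather than negate values, and the number of trials must be large enough for the estimate to be statistically meaningful.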
We\u2019ve presented a case study on the<\/span><a href=\"https:\/\/github.com\/facebookresearch\/dlrm\"> <span>open-source DLRM inference<\/span><\/a><span> with more details and example case studies in our<\/span><a href=\"https:\/\/arxiv.org\/pdf\/2405.01741\"> <span>paper<\/span><\/a><span>.<\/span><\/p>\n<p>Figure 2: Computing PVF through FI.<\/p>\n<p><span>Figure 3 illustrates the PVF of three DLRM parameter components (embedding table, bot-MLP, and top-MLP) under 1, 2, 4, 8, 16, 32, 64, and 128 bit flips during each inference. We observe different vulnerability levels across different parts of DLRM. For example, under a single bit flip, the embedding table has relatively low PVF; this is because embedding tables are highly sparse, so a parameter corruption only propagates when the corrupted parameter is activated by the corresponding sparse feature. However, top-MLP can have a PVF of 0.4% even under a single bit flip. This is significant \u2013 for every 1,000 inferences, four on average will be incorrect. This highlights the importance of protecting specific vulnerable parameters for a given model based on the PVF measurement.\u00a0<\/span><\/p>\n<p>Figure 3: The PVF of DLRM parameters under random bit flips.<\/p>\n<p><span>We observe that with 128 bit flips during each inference, PVF increases to 40% for the top-MLP component and 10% for the bot-MLP component, and we also observe multiple NaN outputs. The top-MLP component has a higher PVF than bot-MLP because it is closer to the final model output, so its corruptions have less chance of being masked by the inherent error-masking behavior of the neural layers.\u00a0<\/span><\/p>\n<h2><span>The applicability of PVF<\/span><\/h2>\n<p><span>PVF is a versatile metric where the definition of an \u201cincorrect output\u201d (which will vary based on the model\/task) can be adapted to suit user requirements. 
To adapt PVF to various hardware fault models, the method of calculating PVF remains consistent, as depicted in Figure 2. The only modification required is the manner in which the fault is injected, based on the assumed fault model.\u00a0<\/span><\/p>\n<p><span>Furthermore, PVF can be extended to the training phase to evaluate the effects of parameter corruptions on the model\u2019s convergence capability. During training, the model\u2019s parameters are iteratively updated to minimize a loss function. A corruption in a parameter could potentially disrupt this learning process, preventing the model from converging to an optimal solution. By applying the PVF concept during training, we could quantify the probability that a corruption in each parameter would result in such a convergence failure.<\/span><\/p>\n<h2><span>Dr. DNA and further exploration avenues for PVF<\/span><\/h2>\n<p><span>The logical progression after understanding AI vulnerability to SDCs is to identify and lessen their impact on AI systems. To initiate this, we\u2019ve introduced<\/span><a href=\"https:\/\/dl.acm.org\/doi\/10.1145\/3620666.3651349\"> <span>Dr. DNA<\/span><\/a><span>, a method designed to detect and mitigate SDCs that occur during deep learning model inference. Specifically, we formulate and extract a set of unique SDC signatures from the distribution of neuron activations (DNA), based on which we propose early-stage detection and mitigation of SDCs during DNN inference.\u00a0<\/span><\/p>\n<p><span>We perform an extensive evaluation across 10 representative DNN models used in three common tasks (vision, GenAI, and segmentation), including ResNet, Vision Transformer, EfficientNet, and YOLO, under four different error models. Results show that Dr. DNA achieves a 100% SDC detection rate for most cases, a 95% detection rate on average, and a &gt;90% detection rate across all cases, representing a 20-70% improvement over baselines. Dr. 
DNA can also mitigate the impact of SDCs by effectively recovering DNN model performance with &lt;1% memory overhead and &lt;2.5% latency overhead.\u00a0<\/span><\/p>\n<h2><span>Read the research papers<\/span><\/h2>\n<p><a href=\"https:\/\/arxiv.org\/pdf\/2405.01741\"><span>PVF (Parameter Vulnerability Factor): A Novel Metric for Understanding AI Vulnerability Against SDCs in Model Parameters <\/span><\/a><\/p>\n<p><a href=\"https:\/\/dl.acm.org\/doi\/10.1145\/3620666.3651349\"><span>Dr. DNA: Combating Silent Data Corruptions in Deep Learning using Distribution of Neuron Activations<\/span><\/a><\/p>\n<p>The post <a href=\"https:\/\/engineering.fb.com\/2024\/06\/19\/data-infrastructure\/parameter-vulnerability-factor-pvf-ai-silent-data-corruption\/\">PVF: A novel metric for understanding AI systems\u2019 vulnerability against SDCs in model parameters<\/a> appeared first on <a href=\"https:\/\/engineering.fb.com\/\">Engineering at Meta<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>We\u2019re introducing parameter vulnerability factor (PVF), a novel metric for understanding and measuring AI systems\u2019 vulnerability against silent data corruptions (SDCs) in model parameters. PVF can be tailored to different AI models and tasks, adapted to different hardware faults, and even extended to the training phase of AI models. 
We\u2019re sharing results of our own&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2024\/06\/19\/pvf-a-novel-metric-for-understanding-ai-systems-vulnerability-against-sdcs-in-model-parameters\/\">Continue reading <span class=\"screen-reader-text\">PVF: A novel metric for understanding AI systems\u2019 vulnerability against SDCs in model parameters<\/span><\/a><\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-883","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/883","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=883"}],"version-history":[{"count":0,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/883\/revisions"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=883"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=883"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=883"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}