{"id":275,"date":"2021-08-31T14:40:23","date_gmt":"2021-08-31T14:40:23","guid":{"rendered":"https:\/\/fde.cat\/?p=275"},"modified":"2021-08-31T14:40:23","modified_gmt":"2021-08-31T14:40:23","slug":"mitigating-the-effects-of-silent-data-corruption-at-scale","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/mitigating-the-effects-of-silent-data-corruption-at-scale\/","title":{"rendered":"Mitigating the effects of silent data corruption at scale"},"content":{"rendered":"<h2><span>What the research is:\u00a0<\/span><\/h2>\n<p><span>Silent data corruption, or data errors that go undetected by the larger system, is a <\/span><a href=\"https:\/\/indico.cern.ch\/event\/13797\/contributions\/1362288\/attachments\/115080\/163419\/Data_integrity_v3.pdf\"><span>widespread problem<\/span><\/a><span> for large-scale infrastructure systems. This type of corruption can propagate across the stack and manifest as application-level problems. It can also result in data loss and require months to debug and resolve. This work describes the best practices for detecting and remediating silent data corruptions on a scale of hundreds of thousands of machines.\u00a0<\/span><\/p>\n<p><span>In our <a href=\"https:\/\/arxiv.org\/abs\/2102.11245\" target=\"_blank\" rel=\"noopener\">paper<\/a>, we research common defect types observed in CPUs, using a real-world example of silent data corruption within a data center application, leading to missing rows in the database. We map out a high-level debug flow to determine root cause and triage the missing data.\u00a0\u00a0<\/span><\/p>\n<p><span>We determine that reducing silent data corruption requires not only hardware resiliency and production detection mechanisms, but also robust fault-tolerant software architectures. <\/span><\/p>\n<h2><span>How it works:\u00a0<\/span><\/h2>\n<p><span>Silent errors can happen during any set of functions within a data center CPU. 
We describe one example in detail to illustrate the debug methodology and our approach to tackling this problem in our large fleet. In a large-scale infrastructure, files are usually compressed when they are not being read, and decompressed when a request is made to read the file. Millions of these operations are performed every day. In this example, we mainly focus on the decompression aspect.\u00a0<\/span><\/p>\n<p><span>Before decompression is performed, the file size is checked to confirm that it is &gt; 0. A valid compressed file with contents would have a nonzero size. In our example, a file with a nonzero size was provided as input to the decompression algorithm, but the file size computation returned a value of 0. Since the result of the file size computation was 0, the file was not written into the decompressed output database.<\/span><\/p>\n<p><span>As a result, in some seemingly random cases, the decompression step was skipped even though the file size was nonzero. Consequently, the database that relied on the actual content of the file had missing files. These files with blank contents and\/or incorrect size propagate to the application. An application that keeps a list of key-value store mappings for compressed files immediately observes that some files that are compressed are no longer recoverable. This chain of dependencies causes the application to fail, and eventually the querying infrastructure reports data loss after decompression. 
The complexity is magnified because this happens only occasionally when engineers schedule the same workload on a cluster of machines.<\/span><\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-17286\" src=\"https:\/\/i0.wp.com\/engineering.fb.com\/wp-content\/uploads\/2021\/02\/SDC1.jpg?resize=750%2C422&#038;ssl=1\" alt=\"Example of silent data corruption\" width=\"750\" height=\"422\"  data-recalc-dims=\"1\"><\/p>\n<p><span>Detecting and reproducing this scenario in a large-scale environment is very complex. In this case, the reproducer at a multi-machine querying infrastructure level was reduced to a single-machine workload. From the single-machine workload, we identified that the failures were truly sporadic in nature. The workload was multi-threaded, and once it was run single-threaded, the failure was no longer sporadic but consistent for a certain subset of data values on one particular core of the machine. The sporadic nature associated with multi-threading was eliminated, but the sporadic nature associated with the data values persisted. After a few iterations, it became obvious that the computation of<\/span><\/p>\n<p><em>Int<\/em> (1.1<sup>53<\/sup>) = 0<\/p>\n<p><span>as an input to the <code>math.pow<\/code> function in <\/span><a href=\"https:\/\/www.scala-lang.org\/\"><span>Scala<\/span><\/a><span> always produced a result of 0 on Core 59 of the CPU. However, if the computation was changed to<\/span><\/p>\n<p><em>Int<\/em> (1.1<sup>52<\/sup>) = 142<\/p>\n<p><span>the result was accurate.<\/span><\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-17287\" src=\"https:\/\/i0.wp.com\/engineering.fb.com\/wp-content\/uploads\/2021\/02\/SDC2.jpg?resize=750%2C422&#038;ssl=1\" alt=\"Diagram documenting the root-cause flow for silent data corruption \" width=\"750\" height=\"422\"  data-recalc-dims=\"1\"><\/p>\n<p><span>The above diagram documents the root-cause flow. 
The corruption also affects computations whose results are nonzero. For example, the following incorrect computations were performed on the machine that was identified as defective. We found that the corruption affected both positive and negative powers for specific data values, and in some cases the result was nonzero when it should have been zero. Incorrect values were obtained with varying degrees of precision.<\/span><\/p>\n<p>Example errors:<\/p>\n<p><em>Int<\/em> [(1.1)<sup>3<\/sup>] = 0 , <em>expected<\/em> = 1<\/p>\n<p><em>Int<\/em> [(1.1)<sup>107<\/sup>] = 32809 , <em>expected<\/em> = 26854<\/p>\n<p><em>Int<\/em> [(1.1)<sup>-3<\/sup>] = 1 , <em>expected<\/em> = 0<\/p>\n<p><span>For an application, this results in decompressed files that are incorrect in size and incorrectly truncated without an end-of-file (EoF) terminator. This leads to dangling file nodes, missing data, and no traceability of corruption within an application. The intrinsic dependency on both the core and the data inputs makes these types of problems computationally hard to detect and root-cause without a targeted reproducer. This is especially challenging when there are hundreds of thousands of machines performing a few million computations every second.\u00a0<\/span><\/p>\n<p><span>After integrating the reproducer script into our detection mechanisms, additional machines were flagged for failing the reproducer. Multiple software and hardware resiliency mechanisms were integrated as a result of these investigations.<\/span><\/p>\n<h2><span>Why it matters:\u00a0<\/span><\/h2>\n<p><span>Silent data corruptions are becoming a more common phenomenon in data centers than previously observed. In our <a href=\"https:\/\/arxiv.org\/abs\/2102.11245\" target=\"_blank\" rel=\"noopener\">paper<\/a>, we present an example that illustrates one of the many scenarios that can be encountered in dealing with data-dependent, reclusive, and hard-to-debug errors. 
Multiple strategies of detection and mitigation bring additional complexity to large-scale infrastructure. A better understanding of these corruptions helps us increase the fault tolerance and resilience of our software architecture. Together, these strategies help us build the next generation of infrastructure computing to be more reliable.<\/span><\/p>\n<h2><span>Read the full paper:<\/span><\/h2>\n<p><a href=\"https:\/\/arxiv.org\/abs\/2102.11245\" target=\"_blank\" rel=\"noopener\"><span>Silent data corruptions at scale<\/span><\/a><\/p>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/engineering.fb.com\/2021\/02\/23\/data-infrastructure\/silent-data-corruption\/\">Mitigating the effects of silent data corruption at scale<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/engineering.fb.com\/\">Facebook Engineering<\/a>.<\/p>\n<p><a href=\"https:\/\/engineering.fb.com\/2021\/02\/23\/data-infrastructure\/silent-data-corruption\/\" target=\"_blank\" rel=\"noopener\">Read More<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>What the research is:\u00a0 Silent data corruption, or data errors that go undetected by the larger system, is a widespread problem for large-scale infrastructure systems. This type of corruption can propagate across the stack and manifest as application-level problems. It can also result in data loss and require months to debug and resolve. 
This work&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2021\/08\/31\/mitigating-the-effects-of-silent-data-corruption-at-scale\/\">Continue reading <span class=\"screen-reader-text\">Mitigating the effects of silent data corruption at scale<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-275","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":555,"url":"https:\/\/fde.cat\/index.php\/2022\/03\/17\/detecting-silent-errors-in-the-wild-combining-two-novel-approaches-to-quickly-detect-silent-data-corruptions-at-scale\/","url_meta":{"origin":275,"position":0},"title":"Detecting silent errors in the wild: Combining two novel approaches to quickly detect silent data corruptions at scale","date":"March 17, 2022","format":false,"excerpt":"Silent data corruptions (SDCs), data errors that go undetected by the larger system, are a widespread problem for large-scale infrastructure systems. Left undetected, these types of corruptions can cause data loss and propagate across the stack and manifest as application-level problems. Silent data corruptions (SDC) in hardware impact computational integrity\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":276,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/sre-weekly-issue-259\/","url_meta":{"origin":275,"position":1},"title":"SRE Weekly Issue #259","date":"August 31, 2021","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, StackHawk: Mark your calendars! The first conference for OWASP ZAP users is taking place March 9. 
Get your free ticket to connect with other ZAP users and learn about the project\u2019s roadmap http:\/\/sthwk.com\/zapcon-sreweekly Articles Increment: Reliability This quarter\u2019s Increment issue is about\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":883,"url":"https:\/\/fde.cat\/index.php\/2024\/06\/19\/pvf-a-novel-metric-for-understanding-ai-systems-vulnerability-against-sdcs-in-model-parameters\/","url_meta":{"origin":275,"position":2},"title":"PVF: A novel metric for understanding AI systems\u2019 vulnerability against SDCs in model parameters","date":"June 19, 2024","format":false,"excerpt":"We\u2019re introducing parameter vulnerability factor (PVF), a novel metric for understanding and measuring AI systems\u2019 vulnerability against silent data corruptions (SDCs) in model parameters. PVF can be tailored to different AI models and tasks, adapted to different hardware faults, and even extended to the training phase of AI models. We\u2019re\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":669,"url":"https:\/\/fde.cat\/index.php\/2023\/01\/26\/tulip-modernizing-metas-data-platform\/","url_meta":{"origin":275,"position":3},"title":"Tulip: Modernizing Meta\u2019s data platform","date":"January 26, 2023","format":false,"excerpt":"The technical journey discusses the motivations, challenges, and technical solutions employed for warehouse schematization, especially a change to the wire serialization format employed in Meta\u2019s data platform for data interchange related to Warehouse Analytics Logging. 
Here, we discuss the engineering, scaling, and nontechnical challenges of modernizing\u00a0 Meta\u2019s exabyte-scale data platform\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":307,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/running-border-gateway-protocol-in-large-scale-data-centers\/","url_meta":{"origin":275,"position":4},"title":"Running Border Gateway Protocol in large-scale data centers","date":"August 31, 2021","format":false,"excerpt":"What the research is: A first-of-its-kind study that details the scalable design, software implementation, and operations of Facebook\u2019s data center routing design, based on Border Gateway Protocol (BGP). BGP was originally designed to interconnect autonomous internet service providers (ISPs) on the global internet. Highly scalable and widely acknowledged as an\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":301,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/reverse-debugging-at-scale\/","url_meta":{"origin":275,"position":5},"title":"Reverse debugging at scale","date":"August 31, 2021","format":false,"excerpt":"Say you receive an email notification that a service is crashing just after your last code change deploys. The crash happens in only 0.1 percent of the servers where it runs. 
But you\u2019re at a large-scale company, so 0.1 percent equals thousands of servers \u2014 and\u00a0this issue is going to\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/275","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=275"}],"version-history":[{"count":1,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/275\/revisions"}],"predecessor-version":[{"id":435,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/275\/revisions\/435"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=275"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=275"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=275"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}