{"id":313,"date":"2021-08-31T14:39:51","date_gmt":"2021-08-31T14:39:51","guid":{"rendered":"https:\/\/fde.cat\/?p=313"},"modified":"2021-08-31T14:39:51","modified_gmt":"2021-08-31T14:39:51","slug":"how-facebook-deals-with-pcie-faults-to-keep-our-data-centers-running-reliably","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/how-facebook-deals-with-pcie-faults-to-keep-our-data-centers-running-reliably\/","title":{"rendered":"How Facebook deals with PCIe faults to keep our data centers running reliably"},"content":{"rendered":"<p><span>Peripheral component interconnect express (PCIe) hardware continues to push the boundaries of computing thanks to advances in transfer speeds, the number of available lanes for simultaneous data delivery, and a comparatively small footprint on motherboards. Today, PCIe connectivity-based hardware delivers faster data transfers and is one of the de facto methods to connect components to servers.<\/span><\/p>\n<p><span>Our data centers contain millions of PCIe-based hardware components \u2014 including ASIC-based accelerators for video and inference, GPUs, NICs, and SSDs \u2014 connected either directly into a PCI slot on a server\u2019s motherboard or through a PCIe switch like a carrier card.<\/span><\/p>\n<p><span>As with any hardware, PCIe-based components are susceptible to different types of hardware-, firmware-, or software-related failures and performance degradation. The variety of components and vendors, array of failures, and the challenges of scale make monitoring, collecting data, and performing fault isolation for PCIe-based components challenging.<\/span><\/p>\n<p><span>We\u2019ve developed a solution to detect, diagnose, remediate, and repair these issues. Since we\u2019ve implemented it, this methodology has helped make our hardware fleet more reliable, resilient, and performant. 
And we believe the wider industry can benefit from the same information and strategies, and help build industry standards around this common problem.<\/span><\/p>\n<p>Facebook\u2019s data centers employ a range of PCIe-based hardware components.<\/p>\n<h2><span>Our tools for addressing PCIe faults<\/span><\/h2>\n<p><span>First, let\u2019s outline the tools we use:<\/span><\/p>\n<p><a href=\"https:\/\/engineering.fb.com\/2020\/08\/05\/open-source\/pcicrawler\/\">PCIcrawler<\/a>:<span> An open source, Python-based command line interface tool that can be used to display, filter, and export information about PCI or PCIe buses and devices, including PCI topology and PCIe Advanced Error Reporting (AER) errors. This tool produces visually appealing, treelike output for easy debugging as well as machine-parsable JSON output that can be consumed by tools for deployment at scale.<\/span><\/p>\n<p><a href=\"https:\/\/engineering.fb.com\/2020\/12\/09\/data-center-engineering\/how-facebook-keeps-its-large-scale-infrastructure-hardware-up-and-running\/\">MachineChecker<\/a>:<span> An in-house tool for quickly evaluating the production worthiness of servers from a hardware standpoint. MachineChecker helps detect and diagnose hardware problems. It can be run as a command line tool, and it also lives as a library and a service.<\/span><\/p>\n<p><span>An in-house tool for taking a snapshot of the target host\u2019s hardware configuration along with hardware modeling.<\/span><\/p>\n<p><span>The PCIe Error Logging Service: An in-house utility service used to parse the custom <\/span><a href=\"https:\/\/man7.org\/linux\/man-pages\/man1\/dmesg.1.html\"><span>dmesg<\/span><\/a><span> and SELs to detect and report PCIe errors on millions of servers. This tool parses the logs on the server at regular intervals and records the rate of correctable errors in a file on the corresponding server. The rate is recorded per 10 minutes, per 30 minutes, per hour, per six hours, and per day. 
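As a rough sketch of the windowed rate bookkeeping described above: the real service parses dmesg and SEL entries; here we only track event timestamps, and all names (the class, its methods, the window labels) are illustrative, not Facebook's actual API.

```python
from collections import deque
import time

# Windows from the text: 10 min, 30 min, 1 h, 6 h, 1 day (in seconds).
WINDOWS = {"10m": 600, "30m": 1800, "1h": 3600, "6h": 21600, "1d": 86400}

class CorrectedErrorRates:
    """Track corrected-error counts per trailing window.

    A sketch only: in production the events would come from parsing
    dmesg/SEL logs at regular intervals; here we just record the
    timestamp of each corrected error and count per window.
    """

    def __init__(self):
        self.events = deque()

    def record(self, timestamp):
        """Record one corrected-error event at the given epoch time."""
        self.events.append(timestamp)

    def rates(self, now=None):
        """Return the event count inside each trailing window."""
        now = time.time() if now is None else now
        # Drop events older than the largest window; they can never count.
        horizon = now - max(WINDOWS.values())
        while self.events and self.events[0] < horizon:
            self.events.popleft()
        return {name: sum(1 for t in self.events if t >= now - span)
                for name, span in WINDOWS.items()}
```

A service like this would serialize the `rates()` result to a file per server, which a checker can later compare against per-platform thresholds.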
This rate is used to decide which servers have exceeded the configured tolerable PCIe-corrected error rate threshold depending on the platform and the service.<\/span><\/p>\n<p><a href=\"https:\/\/github.com\/ipmitool\/\">IPMI Tool<\/a>:<span> An open source utility for managing and configuring devices that support the Intelligent Platform Management Interface (IPMI). IPMI is an open standard for monitoring, logging, recovery, inventory, and control of hardware that is implemented independent of the main CPU, BIOS, and OS. It\u2019s mainly used to manually extract <\/span><a href=\"https:\/\/www.intel.com\/content\/dam\/support\/us\/en\/documents\/server-products\/SEL_TroubleshootingGuide.pdf\"><span>System Event Logs<\/span><\/a><span> (SELs) for inspection, debugging, and study.<\/span><\/p>\n<p><a href=\"https:\/\/github.com\/openbmc\/\">The OpenBMC Project<\/a>: <span>A Linux distribution for embedded devices that have a baseboard management controller (BMC).<\/span><\/p>\n<p>Facebook auto remediation (<a href=\"https:\/\/engineering.fb.com\/2016\/07\/11\/production-engineering\/making-facebook-self-healing-automating-proactive-rack-maintenance\/\">FBAR<\/a>):<span> A system and a set of daemons that execute code automatically in response to detected software and hardware signals on individual servers. Every day, without human intervention, FBAR takes faulty servers out of production and sends requests to our data center teams to perform physical hardware repairs, making isolated failures a nonissue.<\/span><\/p>\n<p><a href=\"https:\/\/research.fb.com\/wp-content\/uploads\/2016\/11\/scuba-diving-into-data-at-facebook.pdf\">Scuba<\/a>:<span> A fast, scalable, distributed, in-memory database built at Facebook. It is the data management system we use for most of our real-time analysis.<\/span><\/p>\n<h2><span>How we studied PCIe faults<\/span><\/h2>\n<p><span>The sheer variety of PCIe hardware components (ASICs, NICs, SSDs, etc.) 
makes studying PCIe issues a daunting task. These components can come from different vendors, run different firmware versions, and host different applications. On top of this, the applications themselves might have different compute and storage needs, usage profiles, and tolerances.<\/span><\/p>\n<p><span>By leveraging the tools listed above, we\u2019ve been conducting studies to address these challenges and ascertain the root causes of PCIe hardware failures and performance degradation.<\/span><\/p>\n<p><span>Some of the issues were obvious. PCIe fatal uncorrected errors, for example, are definitely bad, even if there is only one instance on a particular server. MachineChecker can detect this and mark the hardware as faulty (ultimately leading to it being replaced).<\/span><\/p>\n<p><span>Depending on the error conditions, uncorrectable errors are further classified into nonfatal errors and fatal errors. Nonfatal errors are ones that cause a particular transaction to be unreliable while the PCIe link itself remains fully functional. Fatal errors, on the other hand, cause the link itself to be unreliable. Based on our experience, we\u2019ve found that for any uncorrected PCIe error, swapping the hardware component (and sometimes the motherboard) is the most effective action.<\/span><\/p>\n<p><span>Other issues can seem innocuous at first. PCIe-corrected errors, for example, are correctable by definition and are mostly corrected well in practice. Correctable errors are supposed to have no impact on the functionality of the interface. However, the rate at which correctable errors occur matters, and if that rate exceeds a particular threshold, it leads to a degradation in performance that is not acceptable for certain applications.<\/span><\/p>\n<p><span>We conducted an in-depth study to correlate performance degradation and system stalls with PCIe-corrected error rates. 
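The policy described above (any uncorrected error, fatal or nonfatal, warrants a component swap; corrected errors matter only once their rate crosses a threshold) can be sketched as a small decision function. The function and its names are illustrative assumptions, not Facebook's actual code:

```python
def recommended_action(error_kind, corrected_rate=0.0, threshold=float("inf")):
    """Map a PCIe error observation to a repair action.

    `error_kind` is one of "fatal", "nonfatal", or "corrected"
    (hypothetical labels). Per the policy in the text:
    - any uncorrected error (fatal or nonfatal) => swap the component;
    - corrected errors => swap only when their rate exceeds the
      platform-specific threshold, otherwise no action.
    """
    if error_kind in ("fatal", "nonfatal"):  # both are uncorrected errors
        return "swap_component"
    if error_kind == "corrected" and corrected_rate > threshold:
        return "swap_component"
    return "no_action"
```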
Determining the threshold is another challenge, since different platforms and different applications have different profiles and needs. We rolled out the PCIe Error Logging Service, observed the failures in the Scuba tables, and correlated events, system stalls, and PCIe faults to determine the thresholds for each platform. We\u2019ve found that swapping hardware is the most effective solution when PCIe-corrected error rates cross a particular threshold.<\/span><\/p>\n<p><span>PCIe defines two error-reporting paradigms: The baseline capability and the AER capability. The baseline capability is required of all PCIe components and provides a minimum defined set of error reporting requirements. The AER capability is implemented with a PCIe AER extended capability structure and provides more robust error reporting. The PCIe AER driver provides the infrastructure to support PCIe AER capability and we leveraged PCIcrawler to take advantage of this.<\/span><\/p>\n<p><span>We recommend that every vendor adopt the PCIe AER functionality and PCIcrawler rather than relying on custom vendor tools, which lack generality. Custom tools are hard to parse and even harder to maintain. Moreover, integrating new vendors, new kernel versions, or new types of hardware requires a lot of time and effort.<\/span><\/p>\n<p><span>Bad (down-negotiated) link speed (usually running at 1\/2 or 1\/4 of the expected speed) and bad (down-negotiated) link width (running at 1\/2, 1\/4, or even 1\/8 of the expected link width) were other concerning PCIe faults. These faults can be difficult to detect without some sort of automated tool because the hardware is working, just not as optimally as it could.<\/span><\/p>\n<p><span>Based on our study at scale, we found that most of these faults could be corrected by reseating hardware components. 
This is why we try this first before marking the hardware as faulty.<\/span><\/p>\n<p><span>Since a small minority of these faults can be fixed by a reboot, we also record historical repair actions. We have special rules to identify repeat offenders. For example, if the same hardware component on the same server fails a predefined number of times in a predetermined time interval, after a predefined number of reseats, we automatically mark it as faulty and swap it out. In cases where the component swap does not fix it, we will have to resort to a motherboard swap.<\/span><\/p>\n<p><span>We also keep an eye on the repair trend to identify nontypical failure rates. For example, in one case, by using data from custom Scuba tables and their illustrative graphs and timelines, we root-caused a down-negotiation issue to a specific firmware release from a specific vendor. We then worked with the vendor to roll out new firmware that fixed the issue.<\/span><\/p>\n<p><span>It\u2019s also important to rate-limit remediations and repairs as a safety net to prevent bugs in the code from mass draining and unprovisioning, which can result in service outages if not handled properly.<\/span><\/p>\n<p><span>Using this overall methodology, we\u2019ve been able to add hardware health coverage and fix several thousand servers and server components. Every week, we\u2019ve been able to detect, diagnose, remediate, and repair various PCIe faults on hundreds of servers.<\/span><\/p>\n<h2><span>Our PCIe fault workflow<\/span><\/h2>\n<p><span>Here\u2019s a step-by-step breakdown of our process for identifying and fixing PCIe faults:<\/span><\/p>\n<p><span>MachineChecker runs periodically as a service on the millions of hardware servers and switches in our production fleet. 
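The repeat-offender rule described above (the same component failing a predefined number of times in a predetermined interval, after a predefined number of reseats, is marked faulty outright) can be sketched as follows. All parameter names are illustrative:

```python
def is_repeat_offender(failure_times, now, max_failures,
                       window_seconds, reseats_done, max_reseats):
    """Return True when a component should be marked faulty outright.

    A sketch of the rule in the text: the component has already been
    reseated the allowed number of times, and has still failed at
    least `max_failures` times within the trailing window.
    `failure_times` is a list of epoch timestamps of past failures.
    """
    recent = [t for t in failure_times if now - t <= window_seconds]
    return reseats_done >= max_reseats and len(recent) >= max_failures
```

In practice the failure history would come from the recorded historical repair actions mentioned above, rather than an in-memory list.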
Some of the checks include PCIe link speed, PCIe link width, and PCIe-uncorrected and PCIe-corrected error rates.<\/span><br \/>\n<span>For a particular PCIe endpoint, we find its parent (the upstream device) using PCIcrawler\u2019s PCIe topology information. We consider both ends of a PCIe link.<\/span><span><br \/>\n<\/span><br \/>\n<span>We leverage PCIcrawler\u2019s output, which in turn depends on the generic registers LnkSta, LnkSta2, LnkCtl, and LnkCtl2.<\/span><br \/>\n<span>We calculate expected speed as:<\/span><span><br \/>\n<\/span><span>expected_speed = min(upstream_target_speed, endpoint_capable_speed, upstream_capable_speed).<\/span><br \/>\n<span>We calculate current_speed as:<\/span><span><br \/>\n<\/span><span>current_speed = min(endpoint_current_speed, upstream_current_speed).<\/span><br \/>\n<span>current_speed must be equal to expected_speed.<br \/>\n<\/span>In other words, the current speed at either end should equal the minimum of the endpoint-capable speed, the upstream-capable speed, and the upstream target speed.<br \/>\n<span>For PCIe link width, we calculate expected_width as:<\/span><span><br \/>\n<\/span><span>expected_width = min(pcie_upstream_device_capable_width, pcie_endpoint_device_capable_width).<\/span><br \/>\n<span>If the current link width of the upstream is less than expected_width, we flag this as a bad link.<\/span><br \/>\n<span>The PCIe Error Logging Service runs independently on our hardware servers and records the rates of corrected and uncorrectable errors in a predetermined format (JSON).<\/span><br \/>\n<span>MachineChecker checks for uncorrected errors. 
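The link speed and width formulas above reduce to a few `min()` comparisons. A minimal sketch, assuming dicts with hypothetical field names (in production these values come from PCIcrawler's parsing of the LnkSta/LnkSta2/LnkCtl/LnkCtl2 registers):

```python
def check_link(endpoint, upstream):
    """Flag down-negotiated PCIe link speed and width.

    `endpoint` and `upstream` are dicts with illustrative keys:
    capable_speed/current_speed (GT/s), capable_width/current_width
    (lanes), and target_speed on the upstream side. Returns a pair
    (bad_speed, bad_width).
    """
    # expected_speed = min(upstream target, endpoint capable, upstream capable)
    expected_speed = min(upstream["target_speed"],
                         endpoint["capable_speed"],
                         upstream["capable_speed"])
    # current_speed = min of the two ends' negotiated speeds
    current_speed = min(endpoint["current_speed"], upstream["current_speed"])

    # expected_width = min of the two ends' capable widths
    expected_width = min(upstream["capable_width"], endpoint["capable_width"])

    bad_speed = current_speed != expected_speed
    bad_width = upstream["current_width"] < expected_width
    return bad_speed, bad_width
```

For example, a Gen4-capable x16 link negotiated down to 8 GT/s and x8 would be flagged on both counts.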
Even a single uncorrected error event qualifies a server as faulty.<\/span><br \/>\n<span>During its periodic run, MachineChecker also looks up the generated files on the servers and checks them against a prerecorded source of truth in <\/span><a href=\"https:\/\/research.fb.com\/publications\/holistic-configuration-management-at-facebook\/\"><span>Configerator<\/span><\/a><span> (our configuration management system) for a threshold per platform. If the rate exceeds a preset threshold, the hardware is marked as faulty. These thresholds are easily adjustable per platform.<\/span><br \/>\n<span>We also leverage PCIcrawler, which is also preinstalled on all our hardware servers, to check for PCIe AER issues.<\/span><br \/>\n<span>We leverage our in-house tool\u2019s knowledge of hardware configuration to associate a PCIe address to a given hardware part.<\/span><br \/>\n<span>MachineChecker uses PCIcrawler (for link width, link speed, and AER information) and the PCIe Error Parsing Service (which in turn uses SEL and dmesg) to identify hardware issues and create alerts or alarms. MachineChecker leverages information from our in-house tool to identify the hardware components associated with the PCIe addresses and assists data center operators (who may need to swap out the hardware) by supplying additional information, such as the component\u2019s location, model information, and vendor name.<\/span><br \/>\n<span>Application production engineers can subscribe to these alerts or alarms and customize workflows for monitoring, alerting, remediation, and custom repair.<\/span><br \/>\n<span>A subset of all the alerts can undergo a particular remediation. 
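The per-platform threshold lookup described above can be sketched as follows, with a plain dict standing in for Configerator. The platform names and numbers are made up for illustration:

```python
# A plain dict stands in for the Configerator-backed source of truth;
# platform names and limits here are invented for the example.
THRESHOLDS = {
    "platform_a": {"1h": 5, "1d": 20},
    "platform_b": {"1h": 2, "1d": 10},
}

def exceeds_threshold(platform, observed_rates, thresholds=THRESHOLDS):
    """Return the windows in which the observed corrected-error rate
    exceeds the platform's configured limit.

    `observed_rates` maps window labels (e.g. "1h") to counts read
    from the per-server rate file; unknown platforms yield no flags.
    """
    limits = thresholds.get(platform, {})
    return [window for window, limit in limits.items()
            if observed_rates.get(window, 0) > limit]
```

A non-empty result would mark the hardware as faulty; adjusting a platform's limits is then just a config change.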
We can also fine-tune the remediation and add special casing, restricting the remediation to, for example, a firmware upgrade if a particular case is well known.<\/span><br \/>\n<span>If the remediation repeatedly fails, a hardware repair ticket is automatically created so that the data center operators can swap the bad hardware component or server with a tested good one.<\/span><br \/>\n<span>We have rate limiting in several places as a safety net to prevent bugs in the code from causing mass draining and unprovisioning, which can result in service outages if not handled properly.<\/span><\/p>\n<p><span>We\u2019ve added hardware health coverage and fixed several thousand servers and server components with this methodology. We continue to detect, diagnose, remediate, and repair hundreds of servers every week with various PCIe faults. This has made our hardware fleet more reliable, resilient, and performant.<\/span><\/p>\n<p><span>We\u2019d like to thank Aaron Miller, Aleksander Ksi\u0105\u017cek, Chris Chen, Deomid Ryabkov, Wren Turkal, and many others who contributed to this work in different ways.<\/span><\/p>\n<p>The post <a href=\"https:\/\/engineering.fb.com\/2021\/06\/02\/data-center-engineering\/how-facebook-deals-with-pcie-faults-to-keep-our-data-centers-running-reliably\/\">How Facebook deals with PCIe faults to keep our data centers running reliably<\/a> appeared first on <a href=\"https:\/\/engineering.fb.com\/\">Facebook Engineering<\/a>.<\/p>\n<p><a href=\"https:\/\/engineering.fb.com\/2021\/06\/02\/data-center-engineering\/how-facebook-deals-with-pcie-faults-to-keep-our-data-centers-running-reliably\/\">Read More<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Peripheral component interconnect express (PCIe) hardware continues to push the boundaries of computing thanks to advances in transfer speeds, the number of available lanes for simultaneous data delivery, and a comparatively small footprint on motherboards. 
Today, PCIe connectivity-based hardware delivers faster data transfers and is one of the de facto methods to connect components to&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2021\/08\/31\/how-facebook-deals-with-pcie-faults-to-keep-our-data-centers-running-reliably\/\">Continue reading <span class=\"screen-reader-text\">How Facebook deals with PCIe faults to keep our data centers running reliably<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-313","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":346,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/open-sourcing-a-more-precise-time-appliance\/","url_meta":{"origin":313,"position":0},"title":"Open-sourcing a more precise time appliance","date":"August 31, 2021","format":false,"excerpt":"Facebook engineers have built and open-sourced an Open Compute Time Appliance, an important component of the modern timing infrastructure. To make this possible, we came up with the Time Card \u2014 a PCI Express (PCIe) card that can turn almost any commodity server into a time appliance. With the help\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":172,"url":"https:\/\/fde.cat\/index.php\/2020\/12\/09\/how-facebook-keeps-its-large-scale-infrastructure-hardware-up-and-running\/","url_meta":{"origin":313,"position":1},"title":"How Facebook keeps its large-scale infrastructure hardware up and running","date":"December 9, 2020","format":false,"excerpt":"Facebook\u2019s services rely on fleets of servers in data centers all over the globe \u2014 all running applications and delivering the performance our services need. 
This is why we need to make sure our server hardware is reliable and that we can manage server hardware failures at our scale with\u2026","rel":"","context":"In &quot;External&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":501,"url":"https:\/\/fde.cat\/index.php\/2021\/11\/09\/ocp-summit-2021-open-networking-hardware-lays-the-groundwork-for-the-metaverse\/","url_meta":{"origin":313,"position":2},"title":"OCP Summit 2021: Open networking hardware lays the groundwork for the metaverse","date":"November 9, 2021","format":false,"excerpt":"Open infrastructure technologies and networking hardware will play an important role as we build new technologies for the metaverse, where billions of people will someday come together in virtual spaces. As we head toward the next major computing platform with a continued spirit of embracing openness and disaggregation, we\u2019re announcing\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":879,"url":"https:\/\/fde.cat\/index.php\/2024\/06\/12\/how-meta-trains-large-language-models-at-scale\/","url_meta":{"origin":313,"position":3},"title":"How Meta trains large language models at scale","date":"June 12, 2024","format":false,"excerpt":"As we continue to focus our AI research and development on solving increasingly complex problems, one of the most significant and challenging shifts we\u2019ve experienced is the sheer scale of computation required to train large language models (LLMs). 
Traditionally, our AI model training has involved a training massive number of\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":314,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/sre-weekly-issue-273\/","url_meta":{"origin":313,"position":4},"title":"SRE Weekly Issue #273","date":"August 31, 2021","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, StackHawk: StackHawk is helping One Medical equip developers with automated security testing and self-service remediations. See how: http:\/\/sthwk.com\/onemedical Articles Incident Management vs. Incident Response What indeed? It depends on who you ask. Quentin Rousseau \u2014 Rootly Cores that don\u2019t count This academic\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":883,"url":"https:\/\/fde.cat\/index.php\/2024\/06\/19\/pvf-a-novel-metric-for-understanding-ai-systems-vulnerability-against-sdcs-in-model-parameters\/","url_meta":{"origin":313,"position":5},"title":"PVF: A novel metric for understanding AI systems\u2019 vulnerability against SDCs in model parameters","date":"June 19, 2024","format":false,"excerpt":"We\u2019re introducing parameter vulnerability factor (PVF), a novel metric for understanding and measuring AI systems\u2019 vulnerability against silent data corruptions (SDCs) in model parameters. PVF can be tailored to different AI models and tasks, adapted to different hardware faults, and even extended to the training phase of AI models. 
We\u2019re\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/313","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=313"}],"version-history":[{"count":1,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/313\/revisions"}],"predecessor-version":[{"id":397,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/313\/revisions\/397"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=313"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=313"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=313"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}