{"id":344,"date":"2021-08-31T14:39:28","date_gmt":"2021-08-31T14:39:28","guid":{"rendered":"https:\/\/fde.cat\/?p=344"},"modified":"2021-08-31T14:39:28","modified_gmt":"2021-08-31T14:39:28","slug":"risk-driven-backbone-management-during-covid-19-and-beyond","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/risk-driven-backbone-management-during-covid-19-and-beyond\/","title":{"rendered":"Risk-driven backbone management during COVID-19 and beyond"},"content":{"rendered":"<h2><span>What the research is:\u00a0<\/span><\/h2>\n<p><span>A first-of-its-kind study detailing our backbone management strategy to ensure high service performance throughout the COVID-19 pandemic. The pandemic moved most social interactions online and caused an unprecedented stress test on our global network infrastructure with tens of data center regions. At this scale, failures such as fiber cuts, router misconfigurations, and power outages are a frequent occurrence.<\/span><\/p>\n<p><span>We ran a simulation system that identifies possible failures and quantifies their potential severity with a set of metrics that measure network risk. The risk metrics, in turn, guided operational decisions for capacity deployment. Coupled with traffic priority management and proactive capacity enhancement, our backbone resiliently withstood the COVID-19 stress test while achieving high service availability and low latency, and efficiently handled traffic surges.<\/span><\/p>\n<h2><span>How it works:\u00a0<\/span><\/h2>\n<p><span>To satisfy the network\u2019s service-level objectives (SLO), we started by defining a set of risk metrics around demand loss, availability, and latency stretch. All these metrics are computed with respect to possible failure scenarios in the network, which can be enumerated by going through all the components making up the network. 
The goal of failure modeling is to estimate both the likelihood of a failure scenario and the duration of the failure event. Each component failure is characterized by its mean time between failures and mean time to repair. These statistics are estimated from a combination of historical data and clustering followed by Bayesian regression modeling on common features such as vendor, ownership, and geographical region.<\/span><\/p>\n<p><span>Our risk simulation system periodically computes these risk metrics. As input, it takes a fresh snapshot of the network topology and demand, together with the set of failure scenarios to consider and their failure characteristics. Because the number of failure scenarios is large, the scenarios are sharded across a number of worker jobs that run the same code as our <\/span><a href=\"https:\/\/engineering.fb.com\/2017\/05\/01\/data-center-engineering\/building-express-backbone-facebook-s-new-long-haul-network\/\"><span>SD-WAN controller<\/span><\/a><span> to compute the traffic engineering decision for each failure scenario. The decisions are aggregated to derive the risk metrics and then logged for continuous monitoring.<\/span><\/p>\n<p><span>During the onset of COVID-19, the risk metrics reported a significant increase in demand loss (which captures the highest traffic loss across all simulated failure scenarios), a decrease in availability, and an increase in latency for all quality of service (QoS) classes. The risk metrics guided us to the possible failure scenarios that, were they to occur, would degrade the network operating conditions for certain regions. Capacity was proactively deployed to mitigate these risks. Another helpful technique was to look at the traffic flows from the regions at risk, differentiate the traffic by criticality, and then downgrade less critical traffic to a lower-priority QoS class. 
The QoS classes are, in order of importance, infrastructure control (class 1), user traffic (class 2), internal applications (class 3), and bulk data transfer (class 4). We downgraded a large amount of latency-insensitive traffic from class 3 to class 4. As a result, less capacity is needed to guarantee the same SLO.<\/span><\/p>\n<h2><span>Why it matters:\u00a0<\/span><\/h2>\n<p><span>Building up capacity for backbone networks has a long lead time of months to years. As such, network operators typically procure capacity based on estimated traffic growth. When COVID-19 hit, there was a significant unplanned ramp-up in traffic within a short period of time, stressing backbone infrastructure all across the world.\u00a0<\/span><\/p>\n<p><span>Facebook was able to react swiftly thanks to its risk-driven backbone management strategy. Leveraging the risk metrics computed by our simulation system, we quickly identified the operational pain points and prioritized capacity enhancements to bring the network back to normal. Our experience has shown that a metrics-centric approach to backbone management can adapt to rare, adverse external shocks. We hope our research can help operators looking to build a more resilient network. 
We would like to thank Ying Zhang, Guanqing Yan, Satyajeet Singh Ahuja, Alexander Nikolaidis, Soshant Bali, Bob Kamma, and Gaya Nagarajan for their work on this project.\u00a0<\/span><\/p>\n<p><a href=\"https:\/\/www.usenix.org\/conference\/nsdi21\/presentation\/xia\">To learn more, watch our presentation at NSDI 2021<\/a>.<\/p>\n<h2><span>Read the full paper:<\/span><\/h2>\n<p><a href=\"https:\/\/research.fb.com\/publications\/a-social-network-under-social-distancing-risk-driven-backbone-management-during-covid-19-and-beyond\/\"><span>A social network under social distancing: Risk-driven backbone management during COVID-19 and beyond<\/span><\/a><\/p>\n<p>The post <a href=\"https:\/\/engineering.fb.com\/2021\/08\/09\/connectivity\/backbone-management\/\">Risk-driven backbone management during COVID-19 and beyond<\/a> appeared first on <a href=\"https:\/\/engineering.fb.com\/\">Facebook Engineering<\/a>.<\/p>\n<p><a href=\"https:\/\/engineering.fb.com\/2021\/08\/09\/connectivity\/backbone-management\/\">Read More<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>What the research is:\u00a0 A first-of-its-kind study detailing our backbone management strategy to ensure high service performance throughout the COVID-19 pandemic. The pandemic moved most social interactions online and caused an unprecedented stress test on our global network infrastructure with tens of data center regions. 
At this scale, failures such as fiber cuts, router misconfigurations,&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2021\/08\/31\/risk-driven-backbone-management-during-covid-19-and-beyond\/\">Continue reading <span class=\"screen-reader-text\">Risk-driven backbone management during COVID-19 and beyond<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-344","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":484,"url":"https:\/\/fde.cat\/index.php\/2021\/10\/05\/more-details-about-the-october-4-outage\/","url_meta":{"origin":344,"position":0},"title":"More details about the October 4 outage","date":"October 5, 2021","format":false,"excerpt":"Now that our platforms are up and running as usual after yesterday\u2019s outage, I thought it would be worth sharing a little more detail on what happened and why \u2014 and most importantly, how we\u2019re learning from it.\u00a0 This outage was triggered by the system that manages our global backbone\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":630,"url":"https:\/\/fde.cat\/index.php\/2022\/09\/07\/network-entitlement-a-contract-based-network-sharing-solution\/","url_meta":{"origin":344,"position":1},"title":"Network Entitlement: A contract-based network sharing solution","date":"September 7, 2022","format":false,"excerpt":"Meta\u2019s overall network usage and traffic volume has increased as we\u2019ve continued to add new services. 
Due to the scarcity of fiber resources, we\u2019re developing an explicit resource reservation framework to effectively plan, manage, and operate the shared consumption of network bandwidth, which will help us keep up with demand\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":670,"url":"https:\/\/fde.cat\/index.php\/2023\/01\/27\/watch-metas-engineers-discuss-optimizing-large-scale-networks\/","url_meta":{"origin":344,"position":2},"title":"Watch Meta\u2019s engineers discuss optimizing large-scale networks","date":"January 27, 2023","format":false,"excerpt":"Managing network solutions amidst a growing scale inherently brings challenges around performance, deployment, and operational complexities.\u00a0 At Meta, we\u2019ve found that these challenges broadly fall into three themes: 1.) \u00a0 Data center networking: Over the past decade, on the physical front, we have seen a rise in vendor-specific hardware that\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":319,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/network-hose-managing-uncertain-network-demand-with-model-simplicity\/","url_meta":{"origin":344,"position":3},"title":"Network hose: Managing uncertain network demand with model simplicity","date":"August 31, 2021","format":false,"excerpt":"Our production backbone network connects our data centers and delivers content to our users. The network supports a vast number of different services, distributed across a multitude of data centers. 
Traffic patterns shift over time from one data center to another due to the introduction of new services, service architecture\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":604,"url":"https:\/\/fde.cat\/index.php\/2022\/07\/06\/watch-metas-engineers-discuss-quic-and-tcp-innovations-for-our-network\/","url_meta":{"origin":344,"position":4},"title":"Watch Meta\u2019s engineers discuss QUIC and TCP innovations for our network","date":"July 6, 2022","format":false,"excerpt":"With more than 75 percent of our internet traffic set to use QUIC and HTTP\/3 together, QUIC is slowly moving to become the de facto protocol used for internet communication at Meta. For Meta\u2019s data center network, TCP remains the primary network transport protocol that supports thousands of services on\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":166,"url":"https:\/\/fde.cat\/index.php\/2020\/12\/30\/2020-year-in-review-connectivity-innovations-faster-apps-and-progress-toward-net-zero\/","url_meta":{"origin":344,"position":5},"title":"2020 year in review: Connectivity innovations, faster apps, and progress toward net zero","date":"December 30, 2020","format":false,"excerpt":"It goes without saying that 2020 has been a challenging year, to put it lightly. But if anything, the COVID-19 pandemic has shined a light on our need to connect as people. For Facebook, that meant our work has become more important than ever. 
Whether it was finding new and\u2026","rel":"","context":"In &quot;External&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/344","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=344"}],"version-history":[{"count":1,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/344\/revisions"}],"predecessor-version":[{"id":366,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/344\/revisions\/366"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=344"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=344"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=344"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}