{"id":484,"date":"2021-10-05T17:26:45","date_gmt":"2021-10-05T17:26:45","guid":{"rendered":"https:\/\/fde.cat\/index.php\/2021\/10\/05\/more-details-about-the-october-4-outage\/"},"modified":"2021-10-05T17:26:45","modified_gmt":"2021-10-05T17:26:45","slug":"more-details-about-the-october-4-outage","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2021\/10\/05\/more-details-about-the-october-4-outage\/","title":{"rendered":"More details about the October 4 outage"},"content":{"rendered":"<p><span>Now that our platforms are up and running as usual after yesterday\u2019s outage, I thought it would be worth sharing a little more detail on what happened and why \u2014 and most importantly, how we\u2019re learning from it.\u00a0<\/span><\/p>\n<p><span>This outage was triggered by the system that manages our global backbone network capacity. The backbone is the network Facebook has built to connect all our computing facilities together, which consists of tens of thousands of miles of fiber-optic cables crossing the globe and linking all our data centers.<\/span><\/p>\n<p><span>Those data centers come in different forms. Some are massive buildings that house millions of machines that store data and run the heavy computational loads that keep our platforms running, and others are smaller facilities that connect our backbone network to the broader internet and the people using our platforms.\u00a0<\/span><\/p>\n<p><span>When you open one of our apps and load up your feed or messages, the app\u2019s request for data travels from your device to the nearest facility, which then communicates directly over our backbone network to a larger data center. That\u2019s where the information needed by your app gets retrieved and processed, and sent back over the network to your phone.<\/span><\/p>\n<p><span>The data traffic between all these computing facilities is managed by routers, which figure out where to send all the incoming and outgoing data. 
And in the extensive day-to-day work of maintaining this infrastructure, our engineers often need to take part of the backbone offline for maintenance \u2014 perhaps repairing a fiber line, adding more capacity, or updating the software on the router itself.<\/span><\/p>\n<p><span>This was the source of yesterday\u2019s outage. During one of these routine maintenance jobs, a command was issued to assess the availability of global backbone capacity, and it unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally. Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool didn\u2019t properly stop the command.\u00a0<\/span><\/p>\n<p><span>This change completely severed the connections between our data centers and the internet. And that total loss of connection caused a second issue that made things worse.\u00a0\u00a0<\/span><\/p>\n<p><span>One of the jobs performed by our smaller facilities is to respond to DNS queries. DNS is the address book of the internet, enabling the simple web names we type into browsers to be translated into specific server IP addresses. Those translation queries are answered by our authoritative name servers that occupy well-known IP addresses themselves, which in turn are advertised to the rest of the internet via another protocol called the Border Gateway Protocol (BGP).\u00a0<\/span><\/p>\n<p><span>To ensure reliable operation, our DNS servers disable those BGP advertisements if they themselves cannot speak to our data centers, since this is an indication of an unhealthy network connection. In the recent outage, the entire backbone was removed from operation, making these locations declare themselves unhealthy and withdraw those BGP advertisements. The end result was that our DNS servers became unreachable even though they were still operational. 
This made it impossible for the rest of the internet to find our servers.\u00a0<\/span><\/p>\n<p><span>All of this happened very fast. And as our engineers worked to figure out what was happening and why, they faced two large obstacles: first, it was not possible to access our data centers through our normal means because their networks were down, and second, the total loss of DNS broke many of the internal tools we\u2019d normally use to investigate and resolve outages like this.\u00a0<\/span><\/p>\n<p><span>Our primary and out-of-band network access was down, so we sent engineers onsite to the data centers to have them debug the issue and restart the systems. But this took time, because these facilities are designed with high levels of physical and system security in mind. They\u2019re hard to get into, and once you\u2019re inside, the hardware and routers are designed to be difficult to modify even when you have physical access to them. So it took extra time to activate the secure access protocols needed to get people onsite and able to work on the servers. Only then could we confirm the issue and bring our backbone back online.\u00a0<\/span><\/p>\n<p><span>Once our backbone network connectivity was restored across our data center regions, everything came back up with it. But the problem was not over \u2014 we knew that flipping our services back on all at once could potentially cause a new round of crashes due to a surge in traffic. Individual data centers were reporting dips in power usage in the range of tens of megawatts, and suddenly reversing such a dip in power consumption could put everything from electrical systems to caches at risk.\u00a0\u00a0\u00a0<\/span><\/p>\n<p><span>Helpfully, this is an event we\u2019re well prepared for thanks to the \u201cstorm\u201d drills we\u2019ve been running for a long time now. 
In a storm exercise, we simulate a major system failure by taking a service, data center, or entire region offline, stress testing all the infrastructure and software involved. These drills gave us the confidence and experience to bring things back online and carefully manage the increasing loads. In the end, our services came back up relatively quickly without any further systemwide failures. And while we\u2019ve never previously run a storm that simulated our global backbone being taken offline, we\u2019ll certainly be looking for ways to simulate events like this moving forward.\u00a0\u00a0<\/span><\/p>\n<p><span>Every failure like this is an opportunity to learn and get better, and there\u2019s plenty for us to learn from this one. After every issue, small and large, we conduct an extensive review to understand how we can make our systems more resilient. That review is already underway.\u00a0\u00a0<\/span><\/p>\n<p><span>We\u2019ve done extensive work hardening our systems to prevent unauthorized access, and it was interesting to see how that hardening slowed us down as we tried to recover from an outage caused not by malicious activity, but by an error of our own making. I believe a tradeoff like this is worth it \u2014 greatly increased day-to-day security vs. a slower recovery from a hopefully rare event like this. 
From here on out, our job is to strengthen our testing, drills, and overall resilience to make sure events like this happen as rarely as possible.<\/span><\/p>\n<p>The post <a href=\"https:\/\/engineering.fb.com\/2021\/10\/05\/networking-traffic\/outage-details\/\">More details about the October 4 outage<\/a> appeared first on <a href=\"https:\/\/engineering.fb.com\/\">Facebook Engineering<\/a>.<\/p>\n<p>Facebook Engineering<\/p>","protected":false},"excerpt":{"rendered":"<p>Now that our platforms are up and running as usual after yesterday\u2019s outage, I thought it would be worth sharing a little more detail on what happened and why \u2014 and most importantly, how we\u2019re learning from it.\u00a0 This outage was triggered by the system that manages our global backbone network capacity. The backbone is&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2021\/10\/05\/more-details-about-the-october-4-outage\/\">Continue reading <span class=\"screen-reader-text\">More details about the October 4 outage<\/span><\/a><\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-484","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":482,"url":"https:\/\/fde.cat\/index.php\/2021\/10\/05\/update-about-the-october-4th-outage\/","url_meta":{"origin":484,"position":0},"title":"Update about the October 4th outage","date":"October 5, 2021","format":false,"excerpt":"To all the people and businesses around the world who depend on us, we are sorry for the inconvenience caused by today\u2019s outage across our platforms. We\u2019ve been working as hard as we can to restore access, and our systems are now back up and running. 
The underlying cause of\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":344,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/risk-driven-backbone-management-during-covid-19-and-beyond\/","url_meta":{"origin":484,"position":1},"title":"Risk-driven backbone management during COVID-19 and beyond","date":"August 31, 2021","format":false,"excerpt":"What the research is:\u00a0 A first-of-its-kind study detailing our backbone management strategy to ensure high service performance throughout the COVID-19 pandemic. The pandemic moved most social interactions online and caused an unprecedented stress test on our global network infrastructure with tens of data center regions. At this scale, failures such\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":670,"url":"https:\/\/fde.cat\/index.php\/2023\/01\/27\/watch-metas-engineers-discuss-optimizing-large-scale-networks\/","url_meta":{"origin":484,"position":2},"title":"Watch Meta\u2019s engineers discuss optimizing large-scale networks","date":"January 27, 2023","format":false,"excerpt":"Managing network solutions amidst a growing scale inherently brings challenges around performance, deployment, and operational complexities.\u00a0 At Meta, we\u2019ve found that these challenges broadly fall into three themes: 1.) 
\u00a0 Data center networking: Over the past decade, on the physical front, we have seen a rise in vendor-specific hardware that\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":604,"url":"https:\/\/fde.cat\/index.php\/2022\/07\/06\/watch-metas-engineers-discuss-quic-and-tcp-innovations-for-our-network\/","url_meta":{"origin":484,"position":3},"title":"Watch Meta\u2019s engineers discuss QUIC and TCP innovations for our network","date":"July 6, 2022","format":false,"excerpt":"With more than 75 percent of our internet traffic set to use QUIC and HTTP\/3 together, QUIC is slowly moving to become the de facto protocol used for internet communication at Meta. For Meta\u2019s data center network, TCP remains the primary network transport protocol that supports thousands of services on\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":630,"url":"https:\/\/fde.cat\/index.php\/2022\/09\/07\/network-entitlement-a-contract-based-network-sharing-solution\/","url_meta":{"origin":484,"position":4},"title":"Network Entitlement: A contract-based network sharing solution","date":"September 7, 2022","format":false,"excerpt":"Meta\u2019s overall network usage and traffic volume has increased as we\u2019ve continued to add new services. 
Due to the scarcity of fiber resources, we\u2019re developing an explicit resource reservation framework to effectively plan, manage, and operate the shared consumption of network bandwidth, which will help us keep up with demand\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":319,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/network-hose-managing-uncertain-network-demand-with-model-simplicity\/","url_meta":{"origin":484,"position":5},"title":"Network hose: Managing uncertain network demand with model simplicity","date":"August 31, 2021","format":false,"excerpt":"Our production backbone network connects our data centers and delivers content to our users. The network supports a vast number of different services, distributed across a multitude of data centers. Traffic patterns shift over time from one data center to another due to the introduction of new services, service architecture\u2026","rel":"","context":"In 
&quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/484","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=484"}],"version-history":[{"count":0,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/484\/revisions"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=484"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=484"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=484"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}