{"id":307,"date":"2021-08-31T14:40:03","date_gmt":"2021-08-31T14:40:03","guid":{"rendered":"https:\/\/fde.cat\/?p=307"},"modified":"2021-08-31T14:40:03","modified_gmt":"2021-08-31T14:40:03","slug":"running-border-gateway-protocol-in-large-scale-data-centers","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/running-border-gateway-protocol-in-large-scale-data-centers\/","title":{"rendered":"Running Border Gateway Protocol in large-scale data centers"},"content":{"rendered":"<h2><span>What the research is:<\/span><\/h2>\n<p><span>A first-of-its-kind study that details the scalable design, software implementation, and operations of Facebook\u2019s data center routing design, based on <\/span><a href=\"https:\/\/en.wikipedia.org\/wiki\/Border_Gateway_Protocol\"><span>Border Gateway Protocol (BGP)<\/span><\/a><span>. BGP was originally designed to interconnect autonomous internet service providers (ISPs) on the global internet. Highly scalable and widely acknowledged as an attractive choice for routing, BGP is the routing protocol that connects the entire internet. Similar to online map services for streets, BGP directs data packets, helping determine the most efficient route through a network.\u00a0<\/span><\/p>\n<p><span>Based on our experience implementing it in our data centers, BGP can form a robust routing foundation, but it requires tight codesign with the data center topology, configuration, switch software, and data center\u2013wide operational pipeline. We devised this routing design for our data centers to build our network quickly and provide high availability for our services, while keeping the design itself scalable. Failures happen in any large-scale system, so our routing design aims to minimize the impact of any potential failure.<\/span><\/p>\n<h2><span>How it works:<\/span><\/h2>\n<p><span>To achieve the goals we\u2019d set, we had to go beyond using BGP as a mere routing protocol. 
The resulting design creates a baseline connectivity configuration on top of our existing scalable network topology. We employ a uniform autonomous system (AS) numbering scheme that is reused across different <\/span><a href=\"https:\/\/engineering.fb.com\/2014\/11\/14\/production-engineering\/introducing-data-center-fabric-the-next-generation-facebook-data-center-network\/\"><span>data center fabrics<\/span><\/a><span>, simplifying AS number (ASN) management across data centers. We use hierarchical route summarization on all levels of the topology to scale to our data center sizes while keeping hardware forwarding tables small.\u00a0<\/span><\/p>\n<p><span>Our policy configuration is tightly integrated with our baseline connectivity configuration. Our policies ensure reliable communication using route propagation scopes and predefined backup paths for failures. They also allow us to maintain the network by gracefully diverting traffic away from problematic or faulty devices. Finally, they ensure that services remain reachable even when an instance of the service is added, removed, or migrated.<\/span><\/p>\n<p>Data center fabric architecture, which consists of server pods and spine planes, supports growing compute and network demands.<br \/>\nThe BGP confederation and AS numbering scheme for server pods and spine planes in the data center are reusable across all data centers.<\/p>\n<p><span>To support the growing scale and evolving routing requirements, our switch-level BGP agent needs periodic updates to add new features, optimizations, and bug fixes. To keep these updates fast and frequent, and to support good route processing performance, we implemented an in-house BGP agent. 
We keep the codebase simple and implement only the protocol features required in our data center, but we do not deviate from the BGP specifications.<\/span><\/p>\n<p><span>To minimize impact on production traffic while achieving high release velocity for the BGP agent, we built our own testing and incremental deployment framework, consisting of unit testing, emulation, and canary testing. We use a multi-phase deployment pipeline to push changes to agents.<\/span><\/p>\n<p>Testing and deployment pipeline.<\/p>\n<h2>Why it matters:<\/h2>\n<p><span>BGP has made serious inroads into data centers thanks to its scalability, extensive policy control, and proven track record of running the internet for a few decades. Data center operators are known to use BGP for routing, often in different ways. Yet, because data center requirements are very different from those of the internet, using BGP to achieve effective data center routing is much more complex.<\/span><\/p>\n<p><span>Facebook\u2019s BGP-based data center routing design marries the stringent requirements of data centers with BGP\u2019s functionality. This design provides us with flexible control over routing and keeps the network reliable. Our in-house BGP software implementation and its testing and deployment pipelines allow us to treat BGP like any other software component, enabling fast incremental updates. Finally, our operational experience running BGP for more than two years across our data center fleet has influenced our current and ongoing routing design and operation. Our experience with BGP has shown that it\u2019s an effective option for large-scale data centers. 
We hope sharing this research helps others who are looking for a similar solution.<\/span><\/p>\n<p><a href=\"https:\/\/www.usenix.org\/conference\/nsdi21\/presentation\/abhashkumar\">To learn more, watch our presentation from NSDI 2021.<\/a><\/p>\n<h2><span>Read the full paper:<\/span><\/h2>\n<p><a href=\"https:\/\/research.fb.com\/publications\/running-bgp-in-data-centers-at-scale\/\"><span>BGP in Facebook data centers<\/span><\/a><\/p>\n<p>The post <a href=\"https:\/\/engineering.fb.com\/2021\/05\/13\/data-center-engineering\/bgp\/\">Running Border Gateway Protocol in large-scale data centers<\/a> appeared first on <a href=\"https:\/\/engineering.fb.com\/\">Facebook Engineering<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>What the research is: A first-of-its-kind study that details the scalable design, software implementation, and operations of Facebook\u2019s data center routing design, based on Border Gateway Protocol (BGP). BGP was originally designed to interconnect autonomous internet service providers (ISPs) on the global internet. 
Highly scalable and widely acknowledged as an attractive choice for routing, BGP&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2021\/08\/31\/running-border-gateway-protocol-in-large-scale-data-centers\/\">Continue reading <span class=\"screen-reader-text\">Running Border Gateway Protocol in large-scale data centers<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-307","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":484,"url":"https:\/\/fde.cat\/index.php\/2021\/10\/05\/more-details-about-the-october-4-outage\/","url_meta":{"origin":307,"position":0},"title":"More details about the October 4 outage","date":"October 5, 2021","format":false,"excerpt":"Now that our platforms are up and running as usual after yesterday\u2019s outage, I thought it would be worth sharing a little more detail on what happened and why \u2014 and most importantly, how we\u2019re learning from it.\u00a0 This outage was triggered by the system that manages our global backbone\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":463,"url":"https:\/\/fde.cat\/index.php\/2021\/09\/20\/sre-weekly-issue-287\/","url_meta":{"origin":307,"position":1},"title":"SRE Weekly Issue #287","date":"September 20, 2021","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, StackHawk: Trying to figure out how to keep your APIs secure? You\u2019re not the only one. See how DataRobot is automating API security testing with StackHawk. 
https:\/\/sthwk.com\/DataRobot Articles Industry Interviews: Colm Doyle, Incident Commander at Slack Lots of details about how Slack\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":300,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/sre-weekly-issue-267\/","url_meta":{"origin":307,"position":2},"title":"SRE Weekly Issue #267","date":"August 31, 2021","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, StackHawk: Serverless doesn\u2019t mean secure. Use modern security testing tools to assess serverless applications for vulnerabilities during development. http:\/\/sthwk.com\/serverless Articles SRE Case Study: Mysterious Traffic Imbalance Yet more proof that DNS behavior varies way more than is obvious at first glance. Who\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":310,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/peering-automation-at-facebook\/","url_meta":{"origin":307,"position":3},"title":"Peering automation at Facebook","date":"August 31, 2021","format":false,"excerpt":"Traffic on the internet travels across many different kinds of links. A fast and reliable way to exchange traffic between different networks and service providers is through peering. Initially, we managed peering via a time-intensive manual process. Reliable peering is essential for Facebook and for everyone\u2019s internet use. But there\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":487,"url":"https:\/\/fde.cat\/index.php\/2021\/10\/11\/sre-weekly-issue-291\/","url_meta":{"origin":307,"position":4},"title":"SRE Weekly Issue #291","date":"October 11, 2021","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, Rootly: Manage incidents directly from Slack with Rootly \ud83d\ude92. 
Automate manual admin tasks like creating incident channel, Jira and Zoom, paging the right team, postmortem timeline, setting up reminders, and more. Book a demo: https:\/\/rootly.io\/?utm_source=sreweekly Articles Understanding How Facebook Disappeared from the\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":895,"url":"https:\/\/fde.cat\/index.php\/2024\/07\/15\/sre-weekly-issue-433\/","url_meta":{"origin":307,"position":5},"title":"SRE Weekly Issue #433","date":"July 15, 2024","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, FireHydrant: We\u2019ve gone all out on our new integration with Microsoft Teams. If you\u2019re a MS Teams user, FireHydrant now supports the most comprehensive integration for incident management. Run the entire IM process without ever leaving the chat. https:\/\/firehydrant.com\/blog\/introducing-a-brand-new-microsoft-teams-integration\/ 5 Non-Technical Skills\u2026","rel":"","context":"In 
&quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/307","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=307"}],"version-history":[{"count":1,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/307\/revisions"}],"predecessor-version":[{"id":404,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/307\/revisions\/404"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=307"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=307"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=307"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}