{"id":319,"date":"2021-08-31T14:39:51","date_gmt":"2021-08-31T14:39:51","guid":{"rendered":"https:\/\/fde.cat\/?p=319"},"modified":"2021-08-31T14:39:51","modified_gmt":"2021-08-31T14:39:51","slug":"network-hose-managing-uncertain-network-demand-with-model-simplicity","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/network-hose-managing-uncertain-network-demand-with-model-simplicity\/","title":{"rendered":"Network hose: Managing uncertain network demand with model simplicity"},"content":{"rendered":"<p><span>Our production <\/span><a href=\"https:\/\/engineering.fb.com\/2017\/05\/01\/data-center-engineering\/building-express-backbone-facebook-s-new-long-haul-network\/\"><span>backbone network<\/span><\/a><span> connects our data centers and delivers content to our users. The network supports a vast number of different services, distributed across a multitude of data centers. Traffic patterns shift over time from one data center to another due to the introduction of new services, service architecture changes, changes in user behavior, new data centers, etc. As a result, we\u2019ve seen exponential and highly variable traffic demands for many years.<\/span><\/p>\n<p><span>To meet service bandwidth expectations we need an accurate long-term demand forecast. However, because of the nature of our services, the fluidity of the workloads, and the difficulty in predicting future service behavior, it is difficult to predict future traffic between each data center pair (i.e., the traffic matrix). To account for this traffic uncertainty, we\u2019ve made design methodology changes that will eliminate our dependence on predicting the future traffic matrix. We achieve this by planning the production network for an aggregate traffic originating from or terminating toward a data center, referred to as network hose. 
By planning for network hose, we reduce the forecast complexity by an order of magnitude.<\/span><\/p>\n<h2><span>The traditional approach to network planning<\/span><\/h2>\n<p><span>The traditional approach to network planning is to size the topology to accommodate a given traffic matrix under a set of potential failures that we define using a failure protection policy. In this approach:<\/span><\/p>\n<p>Traffic matrix <span>is the volume of the traffic we forecast between any two data centers (pairwise demands).<\/span><br \/>\nFailure protection policy <span>is a set of failures that are commonly observed in any large network, such as single fiber cuts or dual submarine link failures, or a set of failures that have been encountered multiple times in the past (e.g., dual aerial fiber failure outside a data center).\u00a0<\/span><br \/>\n<span>We use a cost optimization model to calculate the network capacity plan. Essentially, the well-known integer linear programming formulation ensures availability of capacity to serve the traffic matrix under all failures defined in our policy.<br \/>\n<\/span><\/p>\n<h2><span>What is the problem?<\/span><\/h2>\n<p><span>The following reasons made us rethink the classical approach:<\/span><\/p>\n<p>Lack of long-term fidelity: <span>Backbone network capacity turnup requires long lead times, typically on the order of months or even years when procuring or building a terrestrial fiber route or building a submarine cable across the Atlantic Ocean. Given our services\u2019 past growth and dynamic nature, it\u2019s challenging to forecast service behavior for a time frame beyond six months. Our original approach was to handle traffic uncertainties by dimensioning the network for worst-case assumptions and sizing for a higher percentile, say P95.<\/span><\/p>\n<p><span>Asking every service owner to provide a traffic estimate per data center pair is hardly manageable. 
With the traditional approach, a service owner needs to give an explicit demand spec. That is daunting because not only do we see changes in current service behavior, but we also don\u2019t know what new services will be introduced and consume our network in a time frame of a year or more. Forecasting exact traffic is even more difficult in the long term because upcoming data centers are not yet in production when the forecast is requested.<\/span><\/p>\n<p>Abstracting network as a consumable resource:<span> A service typically requires compute, storage, and network resources to operate from a data center. Each data center has a known compute and storage resource that can be distributed to different services. A service owner can reason about the short-term and long-term requirements for these resources and can consider them a consumable entity per data center. However, this is not true for the network, because the network is a shared resource. It is desirable to create a planning method that abstracts the network\u2019s complexity and presents the network to services like any other consumable entity per data center.<\/span><br \/>\nOperational churn: <span>Tracking every service\u2019s traffic surge, identifying its root cause, and estimating its potential impact is becoming increasingly difficult. Most of these surges are harmless because not all services surge simultaneously. Nonetheless, they still create operational overhead for tracking many false alarms.<\/span><\/p>\n<h2><span>Network hose planning model<\/span><\/h2>\n<p><span>Instead of forecasting traffic for each (source, destination) pair, we forecast the total egress and total ingress traffic per data center, i.e., the <\/span><span>network hose<\/span><span>. Instead of asking how much traffic a service would generate from X to Y, we ask how much ingress and egress traffic a service is expected to generate at X. Thus, we replace the O(N^2) data points per service with O(N). 
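To make the O(N^2)-to-O(N) reduction concrete, here is a minimal sketch (the data center names and Gbps figures are illustrative, not Meta's tooling) that collapses a pairwise traffic forecast into per-data-center hose totals:

```python
# Sketch: collapsing a pairwise traffic matrix into per-data-center
# "hose" forecasts. traffic[(i, j)] is the forecast demand from data
# center i to data center j, in Gbps (hypothetical numbers).
traffic = {
    ("dc_a", "dc_b"): 40.0,
    ("dc_a", "dc_c"): 10.0,
    ("dc_b", "dc_c"): 25.0,
    ("dc_c", "dc_a"): 5.0,
}

def hose_from_matrix(traffic):
    """Return per-node (egress, ingress) totals: O(N) numbers
    instead of the O(N^2) pairwise entries."""
    egress, ingress = {}, {}
    for (src, dst), gbps in traffic.items():
        egress[src] = egress.get(src, 0.0) + gbps
        ingress[dst] = ingress.get(dst, 0.0) + gbps
    return egress, ingress

egress, ingress = hose_from_matrix(traffic)
print(egress["dc_a"])   # dc_a egresses 50.0 Gbps in total
```

Service owners then only need to estimate the two aggregate numbers per site, and the planner never has to see the pairwise breakdown.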
When planning for aggregated traffic, we naturally bake statistical multiplexing into the forecast.<\/span><\/p>\n<p><span>The figure above reflects the change in input for the planning problem. Instead of a classical traffic matrix, we now have only a traffic hose forecast, and we generate a network plan that supports it under all failures defined by the failure policy.<\/span><\/p>\n<h2><span>Solving the planning challenge<\/span><\/h2>\n<p><span>While the network hose model captures the end-to-end demand uncertainty concisely, it creates a different challenge: dealing with the many demand sets that realize the hose constraints. In other words, the demand sets that satisfy the hose constraint form a convex polytope, and we must deal with the continuous space inside it. Typically, this would be useful for an optimization problem, as we could leverage linear programming techniques to solve it effectively. However, this model\u2019s key difference is that each point inside the convex polytope is a single traffic matrix. The long-term network build plan has to satisfy all such demand sets if it is to fulfill the hose constraint. That creates an enormous computational challenge, as designing a cross-layer global production network is already an intensive optimization problem for a single demand set.<\/span><\/p>\n<p><span>The above reasons drive the need to intelligently identify a few demand sets from this convex polytope that can serve as reference demand sets for the network design problem. 
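The polytope view can be made concrete with a small membership test. This is a sketch under toy budgets (the helper name and numbers are ours, not from the post): a traffic matrix lies inside the hose polytope exactly when each node's total outgoing traffic fits its egress budget and its total incoming traffic fits its ingress budget.

```python
# Sketch: membership test for the hose polytope. Budgets and node
# names are illustrative toy values.
def is_hose_feasible(matrix, egress, ingress):
    """True if the pairwise matrix respects every node's hose budget."""
    nodes = set(egress) | set(ingress)
    for n in nodes:
        out_total = sum(v for (s, d), v in matrix.items() if s == n)
        in_total = sum(v for (s, d), v in matrix.items() if d == n)
        if out_total > egress.get(n, 0.0) or in_total > ingress.get(n, 0.0):
            return False
    return True

egress = {"x": 10.0, "y": 5.0}
ingress = {"x": 5.0, "y": 10.0}
# Two different matrices inside the same hose polytope -- the build
# plan must support both (and every point in between):
assert is_hose_feasible({("x", "y"): 10.0, ("y", "x"): 5.0}, egress, ingress)
assert is_hose_feasible({("x", "y"): 3.0, ("y", "x"): 2.0}, egress, ingress)
# Exceeding x's egress budget falls outside the polytope:
assert not is_hose_feasible({("x", "y"): 12.0}, egress, ingress)
```

The design difficulty is that this feasible region is continuous: no finite enumeration of its points exists, which is why a small set of reference matrices must be chosen instead.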
In finding these reference demand sets, we are interested in a few fundamental properties that they should satisfy:<\/span><\/p>\n<p><span>These are the demand sets that are likely to drive the need for additional resources on the production network (fiber and equipment).<\/span><br \/>\n<span>If we design the network explicitly for this subset, then we would like to guarantee with high probability that we cover the remaining demand sets.<\/span><br \/>\n<span>The number of reference demand sets should be as small as possible to reduce the state space of the cross-layer network design problem.<\/span><\/p>\n<p><span>To identify these reference demand sets, we exploit the cuts in the topology and the locations (latitude, longitude) of the data centers to gain insights into the maximum flow that can cross a network cut. See the example below for a network cut in an example topology. This network cut partitions the topology into two sets of nodes, (1,2,3,4) and (5,6,7,8). To size the links on this network cut, we need only one traffic matrix: the one that generates maximum traffic over the graph cut. All other traffic matrices with lower or equal traffic over this cut get admitted with no additional bandwidth requirement over the graph cut.<\/span><\/p>\n<p><span>Note that, in a topology with N nodes, we can create on the order of 2^N network cuts and have one traffic matrix per cut. However, the geographical nature of these cuts is essential, given the planar nature of our topology. It turns out that simple cuts (typically a straight-line cut) are more critical for dimensioning the topology than more complicated cuts. As shown in the figure below, a traffic matrix for each of the simple cuts is more meaningful than a traffic matrix for the \u201ccomplicated\u201d cuts: A complicated cut is already taken into account by a set of simple cuts.<\/span><\/p>\n<p><span>By focusing on the simple cuts, we reduce the number of reference demand sets to a small set of traffic matrices. 
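As a sketch of how the worst case over a cut follows from the hose numbers alone (toy Gbps budgets and an assumed helper, not the production planner): hose-feasible traffic crossing a cut in one direction can never exceed the smaller of the sending side's total egress and the receiving side's total ingress, since every byte crossing the cut must leave some node on one side and enter some node on the other.

```python
# Sketch: worst-case hose-feasible load over the cut that separates
# nodes (1,2,3,4) from (5,6,7,8), as in the example topology.
# Egress/ingress budgets (Gbps) are hypothetical.
egress = {1: 10, 2: 10, 3: 5, 4: 5, 5: 8, 6: 8, 7: 6, 8: 4}
ingress = {1: 6, 2: 6, 3: 9, 4: 9, 5: 7, 6: 7, 7: 5, 8: 5}

def worst_case_cut_load(S, T, egress, ingress):
    """Upper bounds on hose-feasible traffic crossing the cut (S, T),
    one value per direction."""
    forward = min(sum(egress[n] for n in S), sum(ingress[n] for n in T))
    backward = min(sum(egress[n] for n in T), sum(ingress[n] for n in S))
    return forward, backward

S, T = {1, 2, 3, 4}, {5, 6, 7, 8}
fwd, bwd = worst_case_cut_load(S, T, egress, ingress)
# forward  = min(30, 24) = 24 Gbps; backward = min(26, 30) = 26 Gbps
```

Sizing the cut for these two bounds admits every other hose-feasible matrix over that cut, which is why one reference matrix per simple cut suffices.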
We solve for these traffic matrices using a cost optimization model and produce a network plan supporting all possible traffic matrices. Based on simulations, we observe that, given the nature of our topology, the additional capacity required to support the hose-based traffic matrices is not significant, while the hose model provides powerful simplicity in our planning and operational workflows.<\/span><\/p>\n<p><span>By adopting a hose-based network capacity planning method, we have reduced the forecast complexity by an order of magnitude, enabled services to reason about the network like any other consumable entity, and simplified operations by eliminating a significant number of traffic surge\u2013related alarms, because we now track traffic surges in aggregate per data center rather than between every two data centers.<\/span><\/p>\n<p><span>Many people contributed to this project, but we\u2019d particularly like to thank Tyler Price, Alexander Gilgur, Hao Zhong, Ying Zhang, Alexander Nikolaidis, Gaya Nagarajan, Steve Politis, Biao Lu, and Abhinav Triguna for being instrumental in making this project happen.<\/span><\/p>\n<p>The post <a href=\"https:\/\/engineering.fb.com\/2021\/06\/15\/data-infrastructure\/network-hose\/\">Network hose: Managing uncertain network demand with model simplicity<\/a> appeared first on <a href=\"https:\/\/engineering.fb.com\/\">Facebook Engineering<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Our production backbone network connects our data centers and delivers content to our users. The network supports a vast number of different services, distributed across a multitude of data centers. 
Traffic patterns shift over time from one data center to another due to the introduction of new services, service architecture changes, changes in user behavior,&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2021\/08\/31\/network-hose-managing-uncertain-network-demand-with-model-simplicity\/\">Continue reading <span class=\"screen-reader-text\">Network hose: Managing uncertain network demand with model simplicity<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-319","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":630,"url":"https:\/\/fde.cat\/index.php\/2022\/09\/07\/network-entitlement-a-contract-based-network-sharing-solution\/","url_meta":{"origin":319,"position":0},"title":"Network Entitlement: A contract-based network sharing solution","date":"September 7, 2022","format":false,"excerpt":"Meta\u2019s overall network usage and traffic volume has increased as we\u2019ve continued to add new services. 
Due to the scarcity of fiber resources, we\u2019re developing an explicit resource reservation framework to effectively plan, manage, and operate the shared consumption of network bandwidth, which will help us keep up with demand\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":307,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/running-border-gateway-protocol-in-large-scale-data-centers\/","url_meta":{"origin":319,"position":1},"title":"Running Border Gateway Protocol in large-scale data centers","date":"August 31, 2021","format":false,"excerpt":"What the research is: A first-of-its-kind study that details the scalable design, software implementation, and operations of Facebook\u2019s data center routing design, based on Border Gateway Protocol (BGP). BGP was originally designed to interconnect autonomous internet service providers (ISPs) on the global internet. Highly scalable and widely acknowledged as an\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":604,"url":"https:\/\/fde.cat\/index.php\/2022\/07\/06\/watch-metas-engineers-discuss-quic-and-tcp-innovations-for-our-network\/","url_meta":{"origin":319,"position":2},"title":"Watch Meta\u2019s engineers discuss QUIC and TCP innovations for our network","date":"July 6, 2022","format":false,"excerpt":"With more than 75 percent of our internet traffic set to use QUIC and HTTP\/3 together, QUIC is slowly moving to become the de facto protocol used for internet communication at Meta. 
For Meta\u2019s data center network, TCP remains the primary network transport protocol that supports thousands of services on\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":670,"url":"https:\/\/fde.cat\/index.php\/2023\/01\/27\/watch-metas-engineers-discuss-optimizing-large-scale-networks\/","url_meta":{"origin":319,"position":3},"title":"Watch Meta\u2019s engineers discuss optimizing large-scale networks","date":"January 27, 2023","format":false,"excerpt":"Managing network solutions amidst a growing scale inherently brings challenges around performance, deployment, and operational complexities.\u00a0 At Meta, we\u2019ve found that these challenges broadly fall into three themes: 1.) \u00a0 Data center networking: Over the past decade, on the physical front, we have seen a rise in vendor-specific hardware that\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":482,"url":"https:\/\/fde.cat\/index.php\/2021\/10\/05\/update-about-the-october-4th-outage\/","url_meta":{"origin":319,"position":4},"title":"Update about the October 4th outage","date":"October 5, 2021","format":false,"excerpt":"To all the people and businesses around the world who depend on us, we are sorry for the inconvenience caused by today\u2019s outage across our platforms. We\u2019ve been working as hard as we can to restore access, and our systems are now back up and running. 
The underlying cause of\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":484,"url":"https:\/\/fde.cat\/index.php\/2021\/10\/05\/more-details-about-the-october-4-outage\/","url_meta":{"origin":319,"position":5},"title":"More details about the October 4 outage","date":"October 5, 2021","format":false,"excerpt":"Now that our platforms are up and running as usual after yesterday\u2019s outage, I thought it would be worth sharing a little more detail on what happened and why \u2014 and most importantly, how we\u2019re learning from it.\u00a0 This outage was triggered by the system that manages our global backbone\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/319","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=319"}],"version-history":[{"count":1,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/319\/revisions"}],"predecessor-version":[{"id":391,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/319\/revisions\/391"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=319"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=319"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=319"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}