{"id":628,"date":"2022-09-06T16:00:53","date_gmt":"2022-09-06T16:00:53","guid":{"rendered":"https:\/\/fde.cat\/index.php\/2022\/09\/06\/viewing-the-world-as-a-computer-global-capacity-management\/"},"modified":"2022-09-06T16:00:53","modified_gmt":"2022-09-06T16:00:53","slug":"viewing-the-world-as-a-computer-global-capacity-management","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2022\/09\/06\/viewing-the-world-as-a-computer-global-capacity-management\/","title":{"rendered":"Viewing the world as a computer: Global capacity management"},"content":{"rendered":"<p><span>Meta currently operates 14 data centers around the world. This rapidly expanding global data center footprint poses new challenges for service owners and for our infrastructure management systems. Systems like<\/span><a href=\"https:\/\/engineering.fb.com\/2019\/06\/06\/data-center-engineering\/twine\/\" target=\"_blank\" rel=\"noopener\"> <span>Twine<\/span><\/a><span>, which we use to scale cluster management, and<\/span><a href=\"https:\/\/atscaleconference.com\/videos\/ras-a-resource-allowance-system-for-perpetual-region-wide-resource-allocation\/\" target=\"_blank\" rel=\"noopener\"> <span>RAS<\/span><\/a><span>, which handles perpetual region-wide resource allocation, have provided the abstractions and automation necessary for service owners to be machine-agnostic within a region. 
However, as we expand our number of data center regions, we need new approaches to global service and capacity management.<\/span><\/p>\n<p><span>That\u2019s why we\u2019ve created new systems, called Global Reservations Service and Regional Fluidity, that determine the best placement for a service based on intent, needs, and current congestion.<\/span><\/p>\n<p><span>Twine and RAS allowed service owners to be machine-agnostic within a region and to view<\/span><a href=\"https:\/\/www.morganclaypool.com\/doi\/10.2200\/S00874ED3V01Y201809CAC046\"> <span>the data center as a computer<\/span><\/a><span>. But we want to take this concept to the next level \u2014 beyond the data center. Our goal is for service owners to be region-agnostic, which means they can manage any service at any data center. Once they\u2019ve become region-agnostic, service owners can operate with a computer\u2019s level of abstraction \u2014 making it possible to view the entire world as a computer.\u00a0<\/span><\/p>\n<div class=\"fb-video\"><\/div>\n<h2><span>Designing a new approach to capacity management<\/span><\/h2>\n<p><span>As we prepare for a growing number of regions with different failure characteristics, we have to transform our approach to capacity management. The solution starts with continuously evolving our disaster-readiness strategy and changing how we plan for disaster-readiness buffer capacity.<\/span><\/p>\n<p><span>Having more regions also means redistributing capacity more often to increase volume in new regions, as well as more frequent hardware decommissions and refreshes. Many of today\u2019s systems provide only regional abstractions, which limits an infrastructure\u2019s ability to automate movements across regions. Additionally, many of today\u2019s service owners hard-code specific regions and manually compute the capacity required to be disaster-ready.\u00a0<\/span><\/p>\n<p><span>We often don\u2019t know service owners\u2019 intentions for using specific regions. 
As a result, our infrastructure provides less flexibility for shifting capacity across regions to optimize for different goals or to be more efficient<\/span><span>. At Meta, we realized we needed to start thinking more holistically about the solution and develop a longer-term vision of transparent, automated global-capacity management.<\/span><\/p>\n<h2><span>Latency-tolerant vs. latency-sensitive services<\/span><\/h2>\n<p><span>When scaling our global capacity, Meta initially scoped approaches to two common service types:\u00a0<\/span><\/p>\n<p>Stateless and latency-sensitive <span>services power products that demand a fast response time, such as viewing photos\/videos on Facebook and Instagram mobile apps.<\/span><br \/>\nLatency-tolerant <span>services power products in which a slight delay in fulfilling requests is tolerable, such as uploading a large file or seeing a friend\u2019s comments on a video.<\/span><\/p>\n<p><span>When considering placement of a latency-sensitive service, we must also take into account the placement of the service\u2019s upstream and downstream dependencies. Collectively, our infrastructure needs to place these services near one another to minimize latency. However, for latency-tolerant services, we do not have this constraint.<\/span><\/p>\n<p><span>By using infrastructure that has the flexibility to change the placement of a service, we can improve the performance of the products. Our Global Reservations Service simplifies capacity management for latency-tolerant services, while the Regional Fluidity system simplifies capacity management for latency-sensitive services.<\/span><\/p>\n<h2><span>Global Reservations for latency-tolerant services<\/span><\/h2>\n<p><span>Global Reservations help simplify reasoning about fault tolerance for service owners. 
When service owners request the capacity needed to serve their workloads, Global Reservations automatically adds and resizes disaster-readiness buffers over time to provide fault tolerance. This process works well for latency-tolerant services that are easily movable in isolation.<\/span><\/p>\n<h3><span>Building on top of regional capacity management foundations<\/span><\/h3>\n<p><span>We built Global Reservations on top of the regional capacity management foundation provided by RAS, which we presented last year at Systems @Scale. When service owners create regional reservations to acquire capacity for their workloads, the reservations provide strong guarantees about the placement of capacity. To determine the best placement for latency needs, RAS then performs continuous region-wide optimizations to improve the placement.\u00a0<\/span><\/p>\n\n<p><span>Previously, strong capacity management guarantees existed only at the regional level. Service owners had to monitor how much capacity to provision in each region for fault tolerance. As a result, the fault tolerance and capacity reasoning for each region became increasingly complex. As our scale continued to increase, we realized that this level of guarantee no longer provided the performance that our products required and that people who used them expected.\u00a0<\/span><\/p>\n<p><span>When we performed periodic hardware decommissions and refreshes, we observed that a subset of these services often did not have latency constraints. We then created Global Reservations to give service owners a simpler way to reason about their capacity \u2014 globally and with guarantees.\u00a0<\/span><\/p>\n<p><span>Instead of creating individual reservations within each region, the service owner now creates a single global reservation. Then, the Global Reservation Service determines how much capacity is needed in each region. 
Because these services are latency-tolerant, we can easily shift them across regions, which enables continuous global optimization to find a placement that provides the necessary performance. After developing the Global Reservations Service, we modified Twine to run at a global scope to allocate containers on global reservations.<\/span><\/p>\n<p><span>To illustrate how the service owner interacts with Global Reservations, the figure below shows a hypothetical service owner\u2019s request for 300 servers in the North America continental locality. In this example, the service owner is amenable to any region in North America that provides sufficient servers to meet their demand and tolerate the loss of any region.<\/span><\/p>\n\n<p><span>To meet that need, the Global Reservations Service provisions 150 servers each in three different regions, for a total of 450 servers. If we lose any region, the extra 150 machines serve as the disaster-readiness buffer to provide sufficient capacity.<\/span><\/p>\n<p><span>As we increase the number of regions, the Global Reservations Service automatically spreads capacity over more of them. For example, we can allocate 30 servers in each of 11 regions, for a total of 330. Spreading servers across more regions creates a more efficient disaster-readiness buffer.<\/span><\/p>\n<h2><span>Global Reservations as an assignment problem<\/span><\/h2>\n\n<p><span>At the core of the Global Reservations Service, the Solver assigns objects to bins; the objects are machines and the bins are global reservations. This formulation enables the encoding of various constraints and objectives.<\/span><\/p>\n<p><span>In the assignment table, rows correspond to machines and columns to global reservations. Each entry contains an assignment variable X (either 0 or 1), which indicates whether that machine is assigned to that global reservation. Because each machine must be assigned to exactly one global reservation, the variables in a machine\u2019s row must add up to 1. 
Additionally, we designate one special placeholder reservation as \u201cunassigned\u201d to hold all unassigned servers.<\/span><\/p>\n\n<p><span>Because we must assign enough machines to fulfill the request, the variables in Global Reservation 0\u2019s column must sum to at least the amount it requested.<\/span><\/p>\n\n<p><span>Since we must provide fault tolerance if any one region becomes unavailable, the total number of machines assigned to a global reservation, minus the amount in the largest region for that reservation, must be at least the amount requested by the global reservation.<\/span><\/p>\n\n<p><span>One objective, for example, minimizes the total number of allocated servers in our fleet.<\/span><\/p>\n\n<p><span>The Solver translates these and other types of constraints and objectives into a mixed-integer linear programming problem. Any assignment of the variables that satisfies all the constraints is feasible, and the Solver optimizes among feasible assignments for the various objectives. For example, if Global Reservation 0 has two machines assigned in Region 1 and one machine assigned in Region 2, the Solver generates the regional reservations for RAS in each region and continuously works to adjust the placement.<\/span><\/p>\n<p><span>Overall, global reservations help simplify global service management. The Global Reservations Service performs continuous optimization and improves service placement over time, making it possible to optimize the infrastructure for other objectives. For example, if we remove capacity in a region to facilitate a hardware refresh, global reservations help us automatically redistribute capacity to other regions.<\/span><\/p>\n<p><span>Because global reservations are declarative and intent-based, service owners encode their intent in global reservations instead of hard-coding specific regions. 
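As a simplified illustration of the fault-tolerance constraint above (total assigned minus the largest single region must be at least the requested amount), the sketch below computes an even spread across interchangeable regions. The function name and region labels are hypothetical; Meta's actual Solver solves a general mixed-integer linear program rather than this closed-form special case.

```python
import math

def plan_even_spread(requested: int, num_regions: int) -> dict:
    """Spread a global reservation evenly across regions so that losing
    any single region still leaves at least `requested` servers.

    Even spreading over N regions needs ceil(requested / (N - 1)) per region.
    """
    if num_regions < 2:
        raise ValueError("surviving a region loss requires at least two regions")
    per_region = math.ceil(requested / (num_regions - 1))
    plan = {f"region-{i}": per_region for i in range(num_regions)}
    # Fault-tolerance constraint: total minus the largest region >= requested.
    assert sum(plan.values()) - max(plan.values()) >= requested
    return plan

# 300 servers over 3 regions: 150 per region, 450 total (150-server buffer).
print(plan_even_spread(300, 3))
# 300 servers over 11 regions: 30 per region, 330 total (smaller buffer).
print(plan_even_spread(300, 11))
```

With more regions the same request needs a smaller total (330 vs. 450 servers), matching the point above that spreading capacity yields a more efficient disaster-readiness buffer.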
The infrastructure now has the flexibility to improve the placement over time while still satisfying the service owner\u2019s intent.<\/span><\/p>\n<p><span>However, this approach does not work for latency-sensitive services, for which we must consider placement of upstream and downstream dependencies as well.<\/span><\/p>\n<h2><span>Regional Fluidity for latency-sensitive services<\/span><\/h2>\n<p><span>To scale global capacity management for latency-sensitive services, we must be able to place and move them safely. This requires us to understand latency and geographic distribution requirements by attributing demand for the services. After modeling services and understanding demand, we can use regional fluidity to rebalance the services by redistributing the demand source. This system leverages the capabilities of infrastructure to safely move services toward a globally optimized state.<\/span><\/p>\n<h3><span>Understanding demand sources of traffic<\/span><\/h3>\n<p><span>Most latency-sensitive services have placement requirements to reduce and\/or eliminate cross-region network requests (see the figure below).<\/span><\/p>\n\n<p><span>This diagram illustrates how service capacity requirements relate to two demand sources: users of the Facebook app and users of the Instagram app. To minimize user latency for these demand sources, the news feed services must be located in the same regions as the application front-end services that handle requests from the demand sources. This means service capacity cannot be allocated arbitrarily. 
The capacity must be allocated in proportion to the set of front-end drivers of the demand sources.<\/span><\/p>\n<p><span>Attribution of demand sources involves the following:\u00a0<\/span><\/p>\n<p><span>We leverage a<\/span><a href=\"https:\/\/research.facebook.com\/publications\/canopy-end-to-end-performance-tracing-at-scale\/\" target=\"_blank\" rel=\"noopener\"> <span>distributed tracing framework<\/span><\/a><span> to quantify the demand source attribution. For example, a request from the Instagram demand source may call the IG service, which then calls the Foo service. Our distributed tracing framework can determine that the request from IG to Foo originated from the Instagram demand source.<\/span><br \/>\n<span>Each demand source can be shifted independently, which means we can control the distribution of requests to each region from the Facebook demand source independently of traffic from the Instagram demand source. This operation can be performed through global traffic load balancers.<\/span><br \/>\n<span>Once a request from a demand source enters a region, the request typically must finish within that region or enter another independently shiftable demand source. For example, once we have distributed a request from the Instagram demand source to the IG service, the services it transitively calls must stay within the region.<\/span><\/p>\n<h3><span>Redistributing capacity globally by shifting demand sources<\/span><\/h3>\n<p><span>Once we understand the demand sources for our services, we can shift the demand across regions to redistribute our capacity globally. In the figure below, we shift demand for a single demand source (the Facebook demand source) from Region A to Region B. In the initial state, we observe how much of each service\u2019s demand is attributed to the various demand sources. 
In this example, Facebook drives 80 percent of the first service\u2019s traffic in Region A and 75 percent in Region B, with enough capacity in both regions to handle the load.<\/span><\/p>\n\n<p><span>If we want to reclaim capacity in Region A and have a surplus of supply in Region B, we start by distributing additional capacity in Region B. Next, we shift some of the Facebook demand source traffic from Region A to Region B. The capacity must be distributed before the traffic shift to prevent overloading the service in that region. Once the traffic demand in Region A has decreased, we can reclaim the excess capacity.<\/span><\/p>\n<h3><span>Making regional capacity shift plans at scale<\/span><\/h3>\n<p><span>As the above example illustrated, this process is complex even for a simple two-region setup. When there are many regions, we need an automated solution for planning capacity. Similar to Global Reservations, the process of deriving a placement plan can be modeled as an assignment problem in which we assign capacity to services in specific regions.<\/span><\/p>\n<p><span>The Solver produces a placement plan for various demand sources by using the service dependencies and traffic ratios as constraints to determine how much capacity can be shifted safely. Then the Solver considers the constraints, optimization functions, demand attribution, and supply, and generates a feasible plan. 
Finally, the Solver performs global optimizations, allowing more efficient operation at a global scope.<\/span><\/p>\n\n<p><span>Regional Fluidity introduces the following new elements to provide first-class fluidity:<\/span><\/p>\n<p><span>Modeling to understand service latency and geographic distribution requirements<\/span><br \/>\n<span>Solvers to address global placement (bin-packing assignment) problems, honoring constraints and optimizing global objectives<\/span><br \/>\n<span>Automation to rebalance services across regions and redistribute the demand driving these services<\/span><\/p>\n<p><span>\u00a0After the Solver generates a plan, the Orchestrator executes that plan.<\/span><\/p>\n\n<p><span>The Orchestrator drives the execution of the plan from end to end, sequencing various actions to ensure the safety of all services at all times. Automation drives the actions as much as possible, but some require human intervention.<\/span><\/p>\n<p><span>For example, to shift a demand source from Region A to Region B, we must:\u00a0<\/span><\/p>\n<p><span>Add additional capacity in Region B<\/span><br \/>\n<span>Add additional replicas of the Twine jobs for the affected services in Region B<\/span><br \/>\n<span>Increase traffic in Region B and decrease traffic in Region A<\/span><br \/>\n<span>Decrease the replicas of the Twine jobs for the affected services in Region A<\/span><br \/>\n<span>Reclaim capacity in Region A<\/span><\/p>\n<p><span>These steps ensure that both regions maintain a safe state during the shift. 
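The five steps above can be sketched as an ordered sequence. All of the names here (the `shift_demand_source` function and the infrastructure operations it calls) are hypothetical stand-ins, not Meta's actual Orchestrator APIs; the point is the safety-preserving ordering, which a recording test double can verify.

```python
class RecordingInfra:
    """Test double that records every infrastructure operation in order."""
    def __init__(self):
        self.log = []

    def __getattr__(self, name):
        def op(*args):
            self.log.append((name, *args))
        return op

def shift_demand_source(infra, source, src_region, dst_region):
    """Shift a demand source so both regions stay safe at all times:
    the destination grows before traffic moves, and the source region
    shrinks only after its load has dropped."""
    infra.add_capacity(dst_region)                       # 1. add capacity in B
    infra.upsize_twine_jobs(dst_region)                  # 2. add replicas in B
    infra.shift_traffic(source, src_region, dst_region)  # 3. move the demand source
    infra.downsize_twine_jobs(src_region)                # 4. remove replicas in A
    infra.reclaim_capacity(src_region)                   # 5. reclaim capacity in A

infra = RecordingInfra()
shift_demand_source(infra, "facebook", "region-A", "region-B")
print(infra.log)
```

Reversing any adjacent pair of steps (for example, shifting traffic before upsizing the destination's jobs) would risk overloading one of the regions, which is why the Orchestrator sequences the actions.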
During a shift, the service briefly occupies capacity in both regions, so we temporarily upsize a double-occupancy buffer by requesting the needed capacity at a temporarily increased cost; this buffer allows for safe, automatic regional capacity shifts.<\/span><\/p>\n<p><span>Overall, by managing global capacity for the infrastructure, Regional Fluidity provides the following benefits:\u00a0<\/span><\/p>\n<p>Homogenization of the hardware footprint toward a standard set of region types, such as compute and storage.<span> As the regions become less specialized, hardware capacity becomes increasingly fungible. Because the services move around more easily, we can reduce the amount of stranded power caused by mismatches between service needs and regional hardware footprints, improving utilization.<\/span><br \/>\nSafe redistribution of service capacity across our many regions.<span> After redistribution, service capacity management is decoupled from regional capacity supply planning. Similarly, we provide a capacity abstraction for service owners, eliminating the need to reason manually about regional capacity distribution.<\/span><br \/>\nGlobal optimization, trading off region-local inefficiencies for better global outcomes.<span> For example, we can place services in a way that uses underlying resources most effectively, which improves overall global efficiency.<\/span><\/p>\n<h2><span>Toward the world as a computer<\/span><\/h2>\n<p><span>Meta\u2019s solutions currently address two points in the service spectrum: latency-tolerant services and latency-sensitive services. However, additional types of services with different requirements lie along that spectrum. For example, AI training workloads are more fluid than storage services, but they require compute to be colocated with the training data. Additionally, stateful or storage systems, such as databases and caches, are typically multitenant services. However, copying data often increases the time required to shift these services. 
To manage these services effectively, we need to provide global capacity management for more types of services.<\/span><\/p>\n<p><span>At Meta, we want to develop fully automated, transparent, global capacity management for all services and to empower service owners to reason about capacity globally. By providing the abstractions necessary to model and understand service intent, we can provide the infrastructure with enough flexibility to improve placement over time. We\u2019ve made significant progress for many types of services, but more work lies ahead.<\/span><\/p>\n\n<p><span>The first step is to update the expectations service owners have for infrastructure. Instead of reasoning about regional Twine jobs and reservations, service owners can now operate on the global level. The infrastructure then automatically decides the regional placement of their workloads. Global reservations also capture more intent from service owners, such as latency constraints and sharding constraints.\u00a0<\/span><\/p>\n<p><span>A capacity regionalizer component, which runs continuously to improve placement, determines the exact regional capacity breakdown and then safely orchestrates the necessary changes across the various global abstractions. The capacity orchestrator also supports services that require more planning or orchestration.<\/span><\/p>\n<p><span>For example, we can integrate a global shard placement system into global capacity management much like we orchestrate regional shifts. However, we can now add extra steps in a specific sequence to support globally sharded systems. The capacity orchestrator instructs the global shard management system to build additional shard replicas in the new region after the Twine job has been upsized and before the traffic shift. Similarly, the capacity orchestrator instructs the global shard management system to drop the shard replicas in the old region before downsizing the Twine job. 
This allows for safer, fully automated regional shifts for globally sharded services.\u00a0<\/span><\/p>\n<p><span>Much of this work is in progress or still being designed, and we have many exciting challenges ahead:<\/span><\/p>\n<p><span>How do we find the global abstractions that best capture service intent while providing the infrastructure with enough flexibility to improve placement over time?<\/span><br \/>\n<span>How do we safely orchestrate regional shifts for all types of services?<\/span><br \/>\n<span>As the number of regions with different failure characteristics continues to grow, how do we model stateful and multitenant services and safely perform automatic regional shifts?<\/span><br \/>\n<span>How do we continue to evolve our disaster-readiness strategy while keeping it simple for service owners?<\/span><\/p>\n<p><span>We\u2019ve taken the important first step toward seeing the world as a computer with global capacity management. But our journey is only 1 percent finished. At Meta, we are excited for the future, and we look forward to seeing our vision become reality.\u00a0<\/span><\/p>\n<p>The post <a href=\"https:\/\/engineering.fb.com\/2022\/09\/06\/data-center-engineering\/viewing-the-world-as-a-computer-global-capacity-management\/\">Viewing the world as a computer: Global capacity management<\/a> appeared first on <a href=\"https:\/\/engineering.fb.com\/\">Engineering at Meta<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Meta currently operates 14 data centers around the world. This rapidly expanding global data center footprint poses new challenges for service owners and for our infrastructure management systems. 
Systems like Twine, which we use to scale cluster management, and RAS, which handles perpetual region-wide resource allocation, have provided the abstractions and automation necessary for service&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2022\/09\/06\/viewing-the-world-as-a-computer-global-capacity-management\/\">Continue reading <span class=\"screen-reader-text\">Viewing the world as a computer: Global capacity management<\/span><\/a><\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-628","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":344,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/risk-driven-backbone-management-during-covid-19-and-beyond\/","url_meta":{"origin":628,"position":0},"title":"Risk-driven backbone management during COVID-19 and beyond","date":"August 31, 2021","format":false,"excerpt":"What the research is:\u00a0 A first-of-its-kind study detailing our backbone management strategy to ensure high service performance throughout the COVID-19 pandemic. The pandemic moved most social interactions online and caused an unprecedented stress test on our global network infrastructure with tens of data center regions. 
At this scale, failures such\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":533,"url":"https:\/\/fde.cat\/index.php\/2022\/01\/18\/foqs-making-a-distributed-priority-queue-disaster-ready\/","url_meta":{"origin":628,"position":1},"title":"FOQS: Making a distributed priority queue disaster-ready","date":"January 18, 2022","format":false,"excerpt":"Facebook Ordered Queueing Service (FOQS) is a fully managed, distributed priority queueing service used for reliable message delivery among many services. FOQS has evolved from a regional deployment into a geo-distributed, global deployment to ensure that data stored within logical queues is highly available, even through large-scale disaster scenarios. Migrating\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":630,"url":"https:\/\/fde.cat\/index.php\/2022\/09\/07\/network-entitlement-a-contract-based-network-sharing-solution\/","url_meta":{"origin":628,"position":2},"title":"Network Entitlement: A contract-based network sharing solution","date":"September 7, 2022","format":false,"excerpt":"Meta\u2019s overall network usage and traffic volume has increased as we\u2019ve continued to add new services. 
Due to the scarcity of fiber resources, we\u2019re developing an explicit resource reservation framework to effectively plan, manage, and operate the shared consumption of network bandwidth, which will help us keep up with demand\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":348,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/apricot-subsea-cable-will-boost-internet-capacity-speeds-in-the-asia-pacific-region\/","url_meta":{"origin":628,"position":3},"title":"Apricot subsea cable will boost internet capacity, speeds in the Asia-Pacific region","date":"August 31, 2021","format":false,"excerpt":"We are excited to announce our participation in the Apricot subsea cable system, together with leading regional and global partners. When completed, the project (which is still subject to regulatory approvals) will deliver much-needed internet capacity, redundancy, and reliability to expand connections in the Asia-Pacific region. The 12,000-kilometer-long cable will\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":515,"url":"https:\/\/fde.cat\/index.php\/2021\/12\/13\/slick-adopting-slos-for-improved-reliability\/","url_meta":{"origin":628,"position":4},"title":"SLICK: Adopting SLOs for improved reliability","date":"December 13, 2021","format":false,"excerpt":"To support the people and communities who use our apps and products, we need to stay in constant contact with them. We want to provide the experiences we offer reliably. We also need to establish trust with the larger community we support. 
This can be especially challenging in a large-scale,\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":484,"url":"https:\/\/fde.cat\/index.php\/2021\/10\/05\/more-details-about-the-october-4-outage\/","url_meta":{"origin":628,"position":5},"title":"More details about the October 4 outage","date":"October 5, 2021","format":false,"excerpt":"Now that our platforms are up and running as usual after yesterday\u2019s outage, I thought it would be worth sharing a little more detail on what happened and why \u2014 and most importantly, how we\u2019re learning from it.\u00a0 This outage was triggered by the system that manages our global backbone\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/628","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=628"}],"version-history":[{"count":0,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/628\/revisions"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=628"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=628"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=628"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}