{"id":670,"date":"2023-01-27T14:00:28","date_gmt":"2023-01-27T14:00:28","guid":{"rendered":"https:\/\/fde.cat\/index.php\/2023\/01\/27\/watch-metas-engineers-discuss-optimizing-large-scale-networks\/"},"modified":"2023-01-27T14:00:28","modified_gmt":"2023-01-27T14:00:28","slug":"watch-metas-engineers-discuss-optimizing-large-scale-networks","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2023\/01\/27\/watch-metas-engineers-discuss-optimizing-large-scale-networks\/","title":{"rendered":"Watch Meta\u2019s engineers discuss optimizing large-scale networks"},"content":{"rendered":"<p><span>Managing network solutions amidst a growing scale inherently brings challenges around performance, deployment, and operational complexities.\u00a0<\/span><\/p>\n<p><span>At Meta, we\u2019ve found that these challenges broadly fall into three themes:<\/span><\/p>\n<p>1.)<span> \u00a0 <\/span>Data center networking:<span> Over the past decade, on the physical front, we have seen a rise in vendor-specific hardware that comes with heterogeneous feature and architecture sets (e.g., non-blocking architecture). On the software side, there has been a massive increase in scale and capacity demand (in the order of magnitude of MWs per physical building) to manage hyperscale architectures such as ours. 
Also, the pivot to <a href=\"https:\/\/tech.facebook.com\/ideas\/2022\/5\/making-the-metaverse\/\" target=\"_blank\" rel=\"noopener\">metaverse<\/a> has led to a significant increase in AI, HPC, and <a href=\"https:\/\/engineering.fb.com\/2022\/09\/19\/ml-applications\/data-ingestion-machine-learning-training-meta\/\" target=\"_blank\" rel=\"noopener\">machine learning workloads<\/a> that demand huge networking bandwidth and compute capacity and pose challenges around safe co-existence of existing web, legacy and modern workloads.<\/span><\/p>\n<p>2.)<span> \u00a0 <\/span>WAN optimizations:<span> Over the last few years, there has been a rapid increase in content creation fueled by a growing creator economy and hybrid and remote work, that has led to huge capacity and network bandwidth demands on the backbone networks.<\/span><\/p>\n<p>3.)<span> \u00a0 <\/span>Operational Efficiency and Metrics Improvements:<span> Traditional network metrics such as packet loss and jitter are too specific to the network\/host and do not provide correlation between the application behavior and network performance.<\/span><\/p>\n<p><span>At the recent <a href=\"https:\/\/atscaleconference.com\/\" target=\"_blank\" rel=\"noopener\">Networking@Scale<\/a> virtual conference in November 2022, engineers from Meta discussed these challenges and presented solutions <\/span><span>across these themes that help <\/span><span>bring better network performance than ever to people using our family of apps<\/span><span>:\u00a0<\/span><\/p>\n<h2><span>Developing, deploying, operating in-house network switches at a massive scale<\/span><\/h2>\n<p><span>Shrikrishna Khare, Software Engineer, Meta<br \/>\n<\/span><span>Srikrishna Gopu, Software Engineer, Meta<\/span><\/p>\n<div class=\"fb-video\"><\/div>\n<p><a href=\"https:\/\/engineering.fb.com\/2015\/03\/10\/data-center-engineering\/facebook-open-switching-system-fboss-and-wedge-in-the-open\/\" target=\"_blank\" 
rel=\"noopener\"><span>FBOSS<\/span><\/a><span> is one of the largest services in Meta and powers Meta\u2019s network. The presenters, Shrikrishna Khare and Srikrishna Gopu, talk about their experience designing, developing, and operating FBOSS, in-house software built to manage and support a set of features required for data center switches of a large-scale Internet content provider. They present key ideas underpinning the FBOSS model that helped them build a stable and scalable network.<\/span><\/p>\n<p><span>The presentation also introduced the Switch Abstraction Interface (<\/span><a href=\"https:\/\/www.opencompute.org\/projects\/sai\" target=\"_blank\" rel=\"noopener\"><span>SAI<\/span><\/a><span>) layer that defines a vendor-independent API for programming the forwarding ASIC. The new FBOSS implementation was deployed at massive scale in a brownfield environment and was also leveraged to onboard a new switch vendor into the Meta infrastructure.\u00a0<\/span><\/p>\n<h2><span>Wiring the planet: Scaling Meta\u2019s global optical network<\/span><\/h2>\n<p><span>Stephen Grubb, Optical Engineer, Meta<br \/>\n<\/span><span>Joseph Kakande, Network Engineer, Meta<\/span><\/p>\n<div class=\"fb-video\"><\/div>\n<p><span>Stephen Grubb and Joseph Kakande talk about the expansive global fiber network that is being built and managed by BBE (Backbone Engineering, which plans, designs, builds, and supports the global network that interconnects Meta\u2019s data centers (DCs) and points-of-presence (POPs) to the internet), with special highlights on the submarine fiber optic systems that are being built to connect the globe.<\/span><\/p>\n<p><span>This talk showcases<\/span><a href=\"https:\/\/engineering.fb.com\/2021\/03\/28\/connectivity\/echo-bifrost\/\" target=\"_blank\" rel=\"noopener\"> <span>Bifrost and Echo<\/span><\/a><span>, which are the first networks to directly connect the US and Singapore and will support SGA, Meta\u2019s first APAC data center. 
They also discussed the vast<\/span><a href=\"https:\/\/engineering.fb.com\/2021\/09\/28\/connectivity\/2africa-pearls\/\"> <span>2Africa<\/span><\/a><span> project, which is the world\u2019s largest submarine cable network and has the potential to connect the largest number of people, 3 billion. The talk also covers the connection of our submarine networks to our terrestrial backbone and describes how Meta designs and builds the hierarchies of the optical transport layer built on top of those fiber paths. They also discuss in-house software system suites, solutions for distributed provisioning and monitoring of this global fleet of hardware, and approaches to diagnosis and remediation of network failures.<\/span><\/p>\n<h2><span>Millisampler: Fine-grained network traffic analysis<\/span><\/h2>\n<p><span>Yimeng Zhao, Research Scientist, Meta<\/span><\/p>\n<div class=\"fb-video\"><\/div>\n<p><span>Yimeng Zhao talks about radically improving the visibility, monitoring, and diagnosis of Meta\u2019s planet-scale production network via innovations in traffic measurement tools.<\/span><\/p>\n<p><span>Managing data center networks with low loss requires understanding traffic patterns, especially the burstiness of the traffic, at fine time granularity. Yet monitoring traffic at millisecond granularity fleet-wide is challenging. To gain more visibility into our production network, we built Millisampler, a BPF-based, lightweight traffic measurement tool that operates at fine-grained timescales, and deployed it on every server in the fleet at Meta for continual monitoring.<\/span><\/p>\n<p><span>Millisampler data allows us to characterize microbursts at millisecond or even microsecond granularity. Simultaneous data collection also enables analysis of how synchronized bursts interact in rack buffers. 
This talk covers the design, implementation, and production experience with Millisampler, as well as some interesting observations collected from the Millisampler data.<\/span><\/p>\n<h2><span>Network SLOs: Knowing when the network is the barrier to application performance<\/span><\/h2>\n<p><span>Brandon Schlinker, Research Scientist, Meta<br \/>\n<\/span><span>Sharad Jaiswal, Optimization Engineer, Meta<\/span><\/p>\n<div class=\"fb-video\"><\/div>\n<p><span>At Meta, we need to be able to readily determine if network conditions are responsible for instances of poor quality of experience (QoE) such as images loading slowly or video stalling during playback. Brandon Schlinker and Sharad Jaiswal from Meta\u2019s Traffic Engineering team introduced the concept of Network SLOs, which can be thought of as a product\u2019s \u201cminimum network requirements\u201d for good QoE. They describe how they derive Network SLOs using a combination of statistical tools and how they operationalize them. They also described approaches to evaluate Network SLO compliance and highlighted case studies where these SLOs helped triage regressions in QoE, identify gaps in Meta\u2019s edge network capacity, and surface inefficiencies in how products utilize the network.<\/span><\/p>\n<h2><span>Improving L4 routing consistency at Meta<\/span><\/h2>\n<p><span>Aman Sharma, Software Engineer, Meta<br \/>\n<\/span><span>Andrii Vasylevskyi, Software Engineer, Meta<\/span><\/p>\n<div class=\"fb-video\"><\/div>\n<p><span>Aman Sharma and Andrii Vasylevskyi talk about the design, development, and use cases of Shiv, a tool built to improve Layer 4 load balancing. 
When a large number of backends are added or removed, remappings in the network routing tables occur, resulting in broken end-to-end connections and impacted user experience (e.g., stalled videos).<\/span><\/p>\n<p><span>Shiv routes packets to backends using a consistent hash of the 5-tuple of the packet (namely, the source IP, destination IP, source port, destination port, and protocol). Shiv\u2019s objective is to route packets for a connection (which all have the same 5-tuple) to the same backend for the duration of the connection and avoid connection breakage.<\/span><\/p>\n<p>The post <a href=\"https:\/\/engineering.fb.com\/2023\/01\/27\/networking-traffic\/watch-metas-engineers-discuss-optimizing-large-scale-networks\/\">Watch Meta\u2019s engineers discuss optimizing large-scale networks<\/a> appeared first on <a href=\"https:\/\/engineering.fb.com\/\">Engineering at Meta<\/a>.<\/p>\n<p>Engineering at Meta<\/p>","protected":false},"excerpt":{"rendered":"<p>Managing network solutions amidst a growing scale inherently brings challenges around performance, deployment, and operational complexities.\u00a0 At Meta, we\u2019ve found that these challenges broadly fall into three themes: 1.) 
\u00a0 Data center networking: Over the past decade, on the physical front, we have seen a rise in vendor-specific hardware that comes with heterogeneous feature and&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2023\/01\/27\/watch-metas-engineers-discuss-optimizing-large-scale-networks\/\">Continue reading <span class=\"screen-reader-text\">Watch Meta\u2019s engineers discuss optimizing large-scale networks<\/span><\/a><\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-670","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":533,"url":"https:\/\/fde.cat\/index.php\/2022\/01\/18\/foqs-making-a-distributed-priority-queue-disaster-ready\/","url_meta":{"origin":670,"position":0},"title":"FOQS: Making a distributed priority queue disaster-ready","date":"January 18, 2022","format":false,"excerpt":"Facebook Ordered Queueing Service (FOQS) is a fully managed, distributed priority queueing service used for reliable message delivery among many services. FOQS has evolved from a regional deployment into a geo-distributed, global deployment to ensure that data stored within logical queues is highly available, even through large-scale disaster scenarios. 
Migrating\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":834,"url":"https:\/\/fde.cat\/index.php\/2024\/03\/06\/how-the-new-einstein-1-platform-manages-massive-data-and-ai-workloads-at-scale\/","url_meta":{"origin":670,"position":1},"title":"How the New Einstein 1 Platform Manages Massive Data and AI Workloads at Scale","date":"March 6, 2024","format":false,"excerpt":"In our \u201cEngineering Energizers\u201d Q&A series, we feature Leo Tran, Chief Architect of Platform Engineering at Salesforce. With over 15 years of engineering leadership experience, Leo is instrumental in developing the Einstein 1 Platform. This platform integrates generative AI, data management, CRM capabilities, and trusted systems to provide businesses with\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":322,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/consolidating-facebook-storage-infrastructure-with-tectonic-file-system\/","url_meta":{"origin":670,"position":2},"title":"Consolidating Facebook storage infrastructure with Tectonic file system","date":"August 31, 2021","format":false,"excerpt":"What the research is:\u00a0 Tectonic, our data center scale distributed file system, enables better resource utilization, promotes simpler services, and requires less operational complexity than our previous approach. Our previous storage infrastructure consisted of a set of use-case specific storage systems. 
Clusters, or instances of these storage systems, used to\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":599,"url":"https:\/\/fde.cat\/index.php\/2022\/06\/14\/applying-federated-learning-to-protect-data-on-mobile-devices\/","url_meta":{"origin":670,"position":3},"title":"Applying federated learning to protect data on mobile devices","date":"June 14, 2022","format":false,"excerpt":"What the research is: Federated learning with differential privacy (FL-DP) is one of the latest privacy-enhancing technologies being evaluated at Meta as we constantly work to enhance user privacy and further safeguard users\u2019 data in the products we design, build, and maintain. FL-DP enhances privacy in two important ways: It\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":892,"url":"https:\/\/fde.cat\/index.php\/2024\/07\/08\/unlocking-data-clouds-secret-for-scaling-massive-data-volumes-and-slashing-processing-bottlenecks\/","url_meta":{"origin":670,"position":4},"title":"Unlocking Data Cloud\u2019s Secret for Scaling Massive Data Volumes and Slashing Processing Bottlenecks","date":"July 8, 2024","format":false,"excerpt":"In our Engineering Energizers Q&A series, we explore engineers who have pioneered advancements in their fields. Today, we meet Rahul Singh, Vice President of Software Engineering, leading the India-based Data Cloud team. 
His team is focused on delivering a robust, scalable, and efficient Data Cloud platform that consolidates customer data\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":538,"url":"https:\/\/fde.cat\/index.php\/2022\/02\/01\/behind-the-scenes-of-hyperforce-salesforces-infrastructure-for-the-public-cloud\/","url_meta":{"origin":670,"position":5},"title":"Behind the Scenes of Hyperforce: Salesforce\u2019s Infrastructure for the Public Cloud","date":"February 1, 2022","format":false,"excerpt":"Salesforce has been running cloud infrastructure for over two decades, bringing companies and their customers together. When Salesforce first started out in 1999, the world was very different; back then, the only practical way to provide our brand of Software-As-A-Service was to run everything yourself\u200a\u2014\u200anot just the software, but the\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/670","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=670"}],"version-history":[{"count":0,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/670\/revisions"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=670"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=670"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=670"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}