{"id":604,"date":"2022-07-06T16:02:16","date_gmt":"2022-07-06T16:02:16","guid":{"rendered":"https:\/\/fde.cat\/index.php\/2022\/07\/06\/watch-metas-engineers-discuss-quic-and-tcp-innovations-for-our-network\/"},"modified":"2022-07-06T16:02:16","modified_gmt":"2022-07-06T16:02:16","slug":"watch-metas-engineers-discuss-quic-and-tcp-innovations-for-our-network","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2022\/07\/06\/watch-metas-engineers-discuss-quic-and-tcp-innovations-for-our-network\/","title":{"rendered":"Watch Meta\u2019s engineers discuss QUIC and TCP innovations for our network"},"content":{"rendered":"<p><span>With more than 75 percent of our internet traffic set to use <\/span><a href=\"https:\/\/engineering.fb.com\/2020\/10\/21\/networking-traffic\/how-facebook-is-bringing-quic-to-billions\/\"><span>QUIC and HTTP\/3 together<\/span><\/a><span>, QUIC is slowly moving to become the de facto protocol used for internet communication at Meta. For Meta\u2019s data center network, TCP remains the primary network transport protocol that supports thousands of services on top of it. As our network continues to expand, our engineers are continually looking for ways to make our data centers even more efficient and reliable. Engineers at Meta have been working to bring better network performance than ever to people using our family of apps. 
Solutions we\u2019ve deployed in production via QUIC and TCP innovations have helped improve performance, congestion management, and platform extensibility across the entire breadth of our network (CDN, edge, backbone, WAN, and data center layers) at Meta.\u00a0<\/span><\/p>\n<p><span>At the recently held <\/span><a href=\"https:\/\/atscaleconference.com\/events\/networking-scale-summer-2022\/\"><span>Networking @Scale 2022<\/span><\/a><span> virtual conference, themed around transport innovation, engineers from Meta discussed the challenges faced in our network around efficiency, reliability, and deployment at scale.\u00a0<\/span><\/p>\n<p><span>Here is some of the latest work being done at Meta to enhance network performance at scale:<\/span><\/p>\n<h2><span>Quick cache DSR<\/span><\/h2>\n<p><span>Matt Joras, Software Engineer, Meta<br \/>\n<\/span><span>Yair Gottdenker, Production Engineer, Meta<\/span><\/p>\n<div class=\"fb-video\"><\/div>\n<p><span>Matt Joras and Yair Gottdenker present a unique solution utilizing QUIC\u2019s properties at the CDN layer to implement a form of direct server return (DSR) from the caching layer directly to the client. This solution helps bypass most intracluster communication in a typical CDN architecture when serving cached content and avoids streaming content through multiple hops, resulting in significant savings in CPU cycles and intracluster network bandwidth. Their talk covers the implementation details, performance improvements, and future applications.<\/span><\/p>\n<h2><span>Improving transfer times in the backbone network using QUIC Jump Start<\/span><\/h2>\n<p><span>Joseph Beshay, Research Scientist, Meta<\/span><\/p>\n<div class=\"fb-video\"><\/div>\n<p><span>Transfers over high bandwidth-delay product (BDP) links incur a startup delay while congestion control probes the bandwidth of the underlying link. 
The impact of this delay is inversely proportional to the size of the transfer since small transfers may repeatedly spend all their transfer time probing for the available bandwidth and never reach it or utilize it. Joseph Beshay presents an application of QUIC in Meta\u2019s backbone network. He shows how the congestion control state can be cached in QUIC and how this state can be used to \u201cjump-start\u201d new connections to significantly reduce startup delays in high-BDP links.<\/span><\/p>\n<h2><span>Tackling data center congestion and bursts<\/span><\/h2>\n<p><span>Abhishek Dhamija, Production Engineer, Meta<br \/>\n<\/span><span>Balasubramanian Madhavan, Software Engineer, Meta<\/span><\/p>\n<div class=\"fb-video\"><\/div>\n<p><span>With Meta\u2019s increasing user base, its data center (DC) network is growing fast. It is critical to ensure that the network delivers the highest levels of reliability and performance. Abhishek Dhamija and Balasubramanian Madhavan discuss two specific DC transport tuning initiatives that allow (a) handling sustained congestion in the network using DCTCP, which uses ECN-based congestion signals, and (b) tackling bursts in the network using receiver window tuning. The talk covers the motivation, implementation overview, handling the coexistence of multiple congestion control mechanisms in the DC using BPF-based enablement knobs, wins, and lessons learned for these initiatives.<\/span><\/p>\n<h2><span>NetEdit: Fine-grained network tuning at scale<\/span><\/h2>\n<p><span>Prashanth Kannan, Software Engineer, Meta<br \/>\n<\/span><span>Prankur Gupta, Software Engineer, Meta<\/span><\/p>\n<div class=\"fb-video\"><\/div>\n<p><span>Large-scale network changes must be executed without compromising production traffic, making it essential for every change to be thoroughly developed, validated, and tested before deployment. 
Prashanth Kannan and Prankur Gupta share the design, implementation, and production experience of a highly extensible, stateless, and modular BPF-based network feature platform called NetEdit that was developed with monitoring and observability at its core to effectively tune the network transport across millions of servers at Meta.\u00a0<\/span><\/p>\n<h2><span>Network entitlement: From hose-based approval to host-based admission<\/span><\/h2>\n<p><span>Guanqing Yan, Software Engineer, Meta<br \/>\n<\/span><span>Manikandan Somasundaram, Software Engineer, Meta<\/span><\/p>\n<div class=\"fb-video\"><\/div>\n<p><span>Meta\u2019s wide area network (WAN) connects many data center (DC) regions and hundreds of points of presence (POPs). WAN capacity is shared by several services with high network demand. The network must be built for peak demand and account for failure scenarios to reduce the impact on Meta products. However, building a resilient, overprovisioned network for all service peak demands at our current growth rates is practically infeasible due to fiber sourcing, deployment constraints, and the costs involved.\u00a0<\/span><\/p>\n<p><span>This talk by Guanqing Yan and Manikandan Somasundaram presents Meta\u2019s production traffic classification and WAN entitlement solution currently used by Meta\u2019s services to share the network safely and efficiently. The network entitlement framework aims to provide a simple, stable, operations-friendly network abstraction for sharing the backbone. 
The framework includes two key parts: (1) a hose-based entitlement granting system that establishes an agile contract while achieving network efficiency and meeting long-term SLO guarantees, and (2) a flexible large-scale distributed host-based traffic admission system that enforces the contract on the production traffic.<\/span><\/p>\n<p>The post <a href=\"https:\/\/engineering.fb.com\/2022\/07\/06\/networking-traffic\/watch-metas-engineers-discuss-quic-and-tcp-innovations-for-our-network\/\">Watch Meta\u2019s engineers discuss QUIC and TCP innovations for our network<\/a> appeared first on <a href=\"https:\/\/engineering.fb.com\/\">Engineering at Meta<\/a>.<\/p>\n<p>Engineering at Meta<\/p>","protected":false},"excerpt":{"rendered":"<p>With more than 75 percent of our internet traffic set to use QUIC and HTTP\/3 together, QUIC is slowly moving to become the de facto protocol used for internet communication at Meta. For Meta\u2019s data center network, TCP remains the primary network transport protocol that supports thousands of services on top of it. 
As our&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2022\/07\/06\/watch-metas-engineers-discuss-quic-and-tcp-innovations-for-our-network\/\">Continue reading <span class=\"screen-reader-text\">Watch Meta\u2019s engineers discuss QUIC and TCP innovations for our network<\/span><\/a><\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-604","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":166,"url":"https:\/\/fde.cat\/index.php\/2020\/12\/30\/2020-year-in-review-connectivity-innovations-faster-apps-and-progress-toward-net-zero\/","url_meta":{"origin":604,"position":0},"title":"2020 year in review: Connectivity innovations, faster apps, and progress toward net zero","date":"December 30, 2020","format":false,"excerpt":"It goes without saying that 2020 has been a challenging year, to put it lightly. But if anything, the COVID-19 pandemic has shined a light on our need to connect as people. For Facebook, that meant our work has become more important than ever. Whether it was finding new and\u2026","rel":"","context":"In &quot;External&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":670,"url":"https:\/\/fde.cat\/index.php\/2023\/01\/27\/watch-metas-engineers-discuss-optimizing-large-scale-networks\/","url_meta":{"origin":604,"position":1},"title":"Watch Meta\u2019s engineers discuss optimizing large-scale networks","date":"January 27, 2023","format":false,"excerpt":"Managing network solutions amidst a growing scale inherently brings challenges around performance, deployment, and operational complexities.\u00a0 At Meta, we\u2019ve found that these challenges broadly fall into three themes: 1.) 
\u00a0 Data center networking: Over the past decade, on the physical front, we have seen a rise in vendor-specific hardware that\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":787,"url":"https:\/\/fde.cat\/index.php\/2023\/11\/15\/watch-metas-engineers-on-building-network-infrastructure-for-ai\/","url_meta":{"origin":604,"position":2},"title":"Watch: Meta\u2019s engineers on building network infrastructure for AI","date":"November 15, 2023","format":false,"excerpt":"Meta is building for the future of AI at every level \u2013 from hardware like MTIA v1, Meta\u2019s first-generation AI inference accelerator to publicly released models like Llama 2, Meta\u2019s next-generation large language model, as well as new generative AI (GenAI) tools like Code Llama. Delivering next-generation AI products and\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":868,"url":"https:\/\/fde.cat\/index.php\/2024\/05\/22\/post-quantum-readiness-for-tls-at-meta\/","url_meta":{"origin":604,"position":3},"title":"Post-quantum readiness for TLS at Meta","date":"May 22, 2024","format":false,"excerpt":"Today, the internet (like most digital infrastructure in general) relies heavily on the security offered by public-key cryptosystems such as RSA, Diffie-Hellman (DH), and elliptic curve cryptography (ECC). But the advent of quantum computers has raised real questions about the long-term privacy of data exchanged over the internet. 
In the\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":501,"url":"https:\/\/fde.cat\/index.php\/2021\/11\/09\/ocp-summit-2021-open-networking-hardware-lays-the-groundwork-for-the-metaverse\/","url_meta":{"origin":604,"position":4},"title":"OCP Summit 2021: Open networking hardware lays the groundwork for the metaverse","date":"November 9, 2021","format":false,"excerpt":"Open infrastructure technologies and networking hardware will play an important role as we build new technologies for the metaverse, where billions of people will someday come together in virtual spaces. As we head toward the next major computing platform with a continued spirit of embracing openness and disaggregation, we\u2019re announcing\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":836,"url":"https:\/\/fde.cat\/index.php\/2024\/03\/12\/building-metas-genai-infrastructure\/","url_meta":{"origin":604,"position":5},"title":"Building Meta\u2019s GenAI Infrastructure","date":"March 12, 2024","format":false,"excerpt":"Marking a major investment in Meta\u2019s AI future, we are announcing two 24k GPU clusters. We are sharing details on the hardware, network, storage, design, performance, and software that help us extract high throughput and reliability for various AI workloads. We use this cluster design for Llama 3 training. 
We\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/604","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=604"}],"version-history":[{"count":0,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/604\/revisions"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=604"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=604"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=604"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}