{"id":805,"date":"2023-12-19T17:01:57","date_gmt":"2023-12-19T17:01:57","guid":{"rendered":"https:\/\/fde.cat\/index.php\/2023\/12\/19\/how-meta-built-the-infrastructure-for-threads\/"},"modified":"2023-12-19T17:01:57","modified_gmt":"2023-12-19T17:01:57","slug":"how-meta-built-the-infrastructure-for-threads","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2023\/12\/19\/how-meta-built-the-infrastructure-for-threads\/","title":{"rendered":"How Meta built the infrastructure for Threads"},"content":{"rendered":"<p><span>On July 5, 2023, Meta launched Threads, the newest product in our family of apps, to an unprecedented success that saw it garner <\/span><a href=\"https:\/\/www.threads.net\/@zuck\/post\/CuhOXGmr74R\"><span>over 100 million sign ups<\/span><\/a><span> in its first five days.<\/span><\/p>\n<p><span>A small,<\/span><a href=\"https:\/\/engineering.fb.com\/2023\/09\/07\/culture\/threads-inside-story-metas-newest-social-app\/\"> <span>nimble team of engineers built Threads<\/span><\/a><span> over the course of only five months of technical work. While the app\u2019s production launch had been under consideration for some time, the business finally made the decision and informed the infrastructure teams to prepare for its launch with only two days\u2019 advance notice. The decision was made with full confidence that Meta\u2019s infrastructure teams can deliver based on their past track record and the maturity of the infrastructure. Despite the daunting challenges with minimal lead time, the infrastructure teams supported the app\u2019s rapid growth exceptionally well.<\/span><\/p>\n<p><span>The seamless scale that people experienced as they signed up by the millions came on the shoulders of over a decade of infrastructure and product development. This was not infrastructure purposely built for Threads, but that had been built over the course of Meta\u2019s lifetime for many products. 
It had already been built for scale, growth, performance, and reliability, and it managed to exceed our expectations as Threads grew at a pace that no one could have predicted.<\/span><\/p>\n<p><span>A huge amount of infrastructure goes into serving Threads. But, because of space limitations, we will only give examples of two existing components that played an important role:<\/span> <a href=\"https:\/\/engineering.fb.com\/2021\/08\/06\/core-infra\/zippydb\/\" target=\"_blank\" rel=\"noopener\"><span>ZippyDB<\/span><\/a><span>, our distributed key\/value datastore, and<\/span> <a href=\"https:\/\/engineering.fb.com\/2020\/08\/17\/production-engineering\/async\/\" target=\"_blank\" rel=\"noopener\"><span>Async<\/span><\/a><span>, our aptly named asynchronous serverless function platform.<\/span><\/p>\n<h2><span>ZippyDB: Scaling keyspaces for Threads<\/span><\/h2>\n<p><span>Let\u2019s zoom in on the storage layer, where we leveraged ZippyDB, a distributed key\/value database that is run as a fully managed service for engineers to build on. It is built from the ground up to leverage Meta\u2019s infrastructure, and keyspaces hosted on it can be flexibly placed across any number of data centers. Any keyspace hosted on it can be scaled up and down with relative ease. When it comes to building applications that can serve everyone on the planet, ZippyDB is the natural choice to meet the scale demands of its internal state.\u00a0<\/span><\/p>\n\n<p><span>The speed at which we can scale the capacity of a keyspace is made possible by two key features: First, the service runs on a common pool of hardware and is plugged into Meta\u2019s overall capacity management framework. Once new capacity is allocated to the service, the machines are automatically added to the service\u2019s pool and the load balancer kicks in to move data to the new machines. We can absorb thousands of new machines in a matter of a few hours once they are added to the service. 
While this is great, it is not enough, since the end-to-end time for approving capacity, possibly draining it from other services, and adding it to ZippyDB can still be on the order of a couple of days. We also need to be able to absorb a surge on shorter notice.<\/span><\/p>\n<p><span>Second, to enable this immediate absorption, we rely on the service architecture\u2019s multi-tenancy and its strong isolation features. This allows different keyspaces, potentially with complementary load demands, to share the underlying hosts without worrying about their service level getting impacted when other workloads run hot. There is also slack in the host pool due to unused capacity of individual keyspaces as well as buffers for handling disaster recovery events. We can pull levers that shift unused allocations between keyspaces \u2013 dipping into any existing slack and letting the hosts run at a higher utilization level \u2013 to let a keyspace ramp up almost immediately and sustain that level over a short interval (a couple of days). All of these are simple config changes with tools and automation built around them, as they are fairly routine for day-to-day operations.<\/span><\/p>\n<p><span>The combined effects of strong multi-tenancy and the ability to absorb new hardware make it possible for the service to scale more or less seamlessly, even in the face of a sudden large new demand.<\/span><\/p>\n<h2><span>Optimizing ZippyDB for a product launch<\/span><\/h2>\n<p><span>ZippyDB\u2019s resharding protocol allows us to quickly and transparently increase the sharding factor (i.e., horizontal scaling factor) of a ZippyDB use case with zero downtime for clients, all while maintaining full consistency and correctness guarantees. 
This allows us to rapidly scale out use cases on the critical path of new product launches with zero interruptions, even if a use case\u2019s load increases by 100x.<\/span><\/p>\n<p><span>We achieve this by having clients hash their keys to logical shards, which are then mapped to a set of physical shards. When a use case grows and requires resharding, we provision a new set of physical shards and install a new logical-to-physical shard mapping in our clients through live configuration changes, without downtime. Using hidden access keys on the server itself, and smart data migration logic in our resharding workers, we are then able to atomically move a logical shard from the original mapping to the new mapping. Once all logical shards have been migrated, resharding is complete and we remove the original mapping.<\/span><\/p>\n<p><span>Because scaling up use cases is a critical operation for new product launches, we have invested heavily in our resharding stack to ensure ZippyDB scaling does not block them. Specifically, we have designed the resharding stack in a coordinator-worker model so it is horizontally scalable, allowing us to increase resharding speeds when needed, such as during the Threads launch. Additionally, we have developed a set of emergency operator tools to deal with sudden use case growth.\u00a0<\/span><\/p>\n<p><span>The combination of these capabilities allowed the ZippyDB team to respond effectively to the rapid growth of Threads. When creating new use cases in ZippyDB, we often start small and then reshard as growth requires. This approach prevents overprovisioning and promotes efficient capacity usage. As the viral growth of Threads began, it became evident that we needed to prepare Threads for 100x growth by proactively resharding. 
With the help of automation tools developed in the past, we completed the resharding just in time as the Threads team opened up the floodgates to traffic at midnight UK time. This enabled delightful user experiences with Threads, even as its user base soared.<\/span><\/p>\n<h2><span>Async: Scaling workload execution for Threads<\/span><\/h2>\n<p><span>Async (also known as <\/span><a href=\"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3600006.3613155\"><span>XFaaS<\/span><\/a><span>) is a serverless function platform capable of deferring computing to off-peak hours, allowing engineers at Meta to reduce their time from solution conception to production deployment. Async currently processes trillions of function calls per day on more than 100,000 servers and can support <\/span><a href=\"https:\/\/engineering.fb.com\/2022\/07\/27\/developer-tools\/programming-languages-endorsed-for-server-side-use-at-meta\/\"><span>multiple programming languages<\/span><\/a><span>, including HackLang, Python, Haskell, and Erlang.\u00a0<\/span><\/p>\n\n<p><span>The platform abstracts away the details of deployment, queueing, scheduling, scaling, and disaster recovery and readiness, so that developers can focus on their core business logic and offload the rest of the heavy lifting to Async. Code onboarded to the platform automatically inherits hyperscale attributes. Scalability is not Async\u2019s only key feature: code uploaded to the platform also inherits execution guarantees, with configurable retries, time for delivery, rate limits, and capacity accountability.<\/span><\/p>\n<p><span>The workloads commonly executed on Async are those that do not require blocking an active user\u2019s experience with a product and can be performed anywhere from a few seconds to several hours after a user\u2019s action. Async played a critical role in letting users build community quickly by choosing to follow the people on Threads that they already follow on Instagram. 
Specifically, when a new user joins Threads and chooses to follow the same set of people they follow on Instagram, the computationally expensive operation of following the same social graph in Threads is executed via Async in a scalable manner, which avoids blocking or negatively impacting the user\u2019s onboarding experience.\u00a0<\/span><\/p>\n<p><span>Doing this for 100 million users in five days required significant processing power. Moreover, many celebrities joined Threads, and when that happened millions of people could be queued up to follow them. Both this operation and the corresponding notifications also ran on Async, enabling scalable operations in the face of a large number of users.<\/span><\/p>\n<p><span>While the volume of Async jobs generated by the rapid Threads user onboarding was several orders of magnitude higher than our initial expectations, Async gracefully absorbed the increased load and queued the jobs for controlled execution. <\/span><span>Specifically, execution was managed within rate limits, which ensured that we were sending notifications and allowing people to make connections in a timely manner without overloading the downstream services that receive traffic from these Async jobs. Async automatically adjusted the flow of execution to match its own capacity as well as the capacity of dependent services, such as the social graph database, all without manual intervention from either Threads engineers or infrastructure engineers.<\/span><\/p>\n<h2><span>Where infrastructure and culture meet<\/span><\/h2>\n<p><span>Threads\u2019 swift development within a mere five months of technical work underscores the strengths of Meta\u2019s infrastructure and engineering culture. Meta\u2019s products leverage a shared infrastructure that has withstood the test of time, empowering product teams to move fast and rapidly scale successful products. 
The infrastructure boasts a high level of automation, ensuring that, except for efforts to secure capacity on short notice, the automatic redistribution, load balancing, and scaling up of workloads occurred smoothly and transparently.\u00a0<\/span><span>Meta thrives on a move-fast engineering culture, wherein engineers take strong ownership and collaborate seamlessly to accomplish a large shared goal, with efficient processes that would take a typical organization months to coordinate. As an example, our<\/span><a href=\"https:\/\/atscaleconference.com\/videos\/metas-sev-culture-how-todays-sevs-create-tomorrows-reliability\/\"> <span>SEV incident-management culture<\/span><\/a><span> has been an important tool in getting the right visibility, focus, and action in places where we all need to coordinate and move fast. Overall, these factors combined to ensure the success of the Threads launch.<\/span><\/p>\n<p>The post <a href=\"https:\/\/engineering.fb.com\/2023\/12\/19\/core-infra\/how-meta-built-the-infrastructure-for-threads\/\">How Meta built the infrastructure for Threads<\/a> appeared first on <a href=\"https:\/\/engineering.fb.com\/\">Engineering at Meta<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>On July 5, 2023, Meta launched Threads, the newest product in our family of apps, to an unprecedented success that saw it garner over 100 million sign ups in its first five days. A small, nimble team of engineers built Threads over the course of only five months of technical work. 
While the app\u2019s production&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2023\/12\/19\/how-meta-built-the-infrastructure-for-threads\/\">Continue reading <span class=\"screen-reader-text\">How Meta built the infrastructure for Threads<\/span><\/a><\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-805","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":757,"url":"https:\/\/fde.cat\/index.php\/2023\/09\/07\/threads-the-inside-story-of-metas-newest-social-app\/","url_meta":{"origin":805,"position":0},"title":"Threads: The inside story of Meta\u2019s newest social app","date":"September 7, 2023","format":false,"excerpt":"Earlier this year, a small team of engineers at Meta started working on an idea for a new app. It would have all the features people expect from a text-based conversations app, but with one very key, distinctive goal \u2013 being an app that would allow people to share their\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":782,"url":"https:\/\/fde.cat\/index.php\/2023\/11\/06\/how-meta-built-threads-in-5-months\/","url_meta":{"origin":805,"position":1},"title":"How Meta built Threads in 5 months","date":"November 6, 2023","format":false,"excerpt":"In about five short months, a small team of engineers at Meta took Threads, the new text-based conversations app, from from an idea to the most successful app launch of all time, pulling in over 100M users in its first five days. 
But this achievement wouldn\u2019t have been possible without\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":809,"url":"https:\/\/fde.cat\/index.php\/2024\/01\/08\/sre-weekly-issue-406\/","url_meta":{"origin":805,"position":2},"title":"SRE Weekly Issue #406","date":"January 8, 2024","format":false,"excerpt":"View on sreweekly.com A message from our sponsor, FireHydrant: Signals is now available in beta. Sign up to experience alerting for modern DevOps teams: Page teams, not services. Ingest inputs from any source. Bucket pricing based on usage. And one platform \u2014 ring to retro \u2014 finally. https:\/\/firehydrant.com\/blog\/signals-beta-live\/ How to\u2026","rel":"","context":"In &quot;SRE&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":865,"url":"https:\/\/fde.cat\/index.php\/2024\/05\/14\/behind-the-scenes-of-threads-for-web\/","url_meta":{"origin":805,"position":3},"title":"Behind the scenes of Threads for web","date":"May 14, 2024","format":false,"excerpt":"When Threads first launched one of the top feature requests was for a web client. In this episode of the Meta Tech Podcast, Pascal Hartig (@passy) sits down with Ally C. and Kevin C., two engineers on the Threads Web Team that delivered the basic version of Threads for web\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":843,"url":"https:\/\/fde.cat\/index.php\/2024\/03\/21\/threads-has-entered-the-fediverse\/","url_meta":{"origin":805,"position":4},"title":"Threads has entered the fediverse","date":"March 21, 2024","format":false,"excerpt":"Threads has entered the fediverse! As part of our beta experience, now available in a few countries, Threads users aged 18+ with public profiles can now choose to share their Threads posts to other ActivityPub-compliant servers. 
People on those servers can now follow federated Threads profiles and see, like, reply\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":674,"url":"https:\/\/fde.cat\/index.php\/2023\/02\/06\/the-evolution-of-facebooks-ios-app-architecture\/","url_meta":{"origin":805,"position":5},"title":"The evolution of Facebook\u2019s iOS app architecture","date":"February 6, 2023","format":false,"excerpt":"Facebook for iOS (FBiOS) is the oldest mobile codebase at Meta. Since the app was rewritten in 2012, it has been worked on by thousands of engineers and shipped to billions of users, and it can support hundreds of engineers iterating on it at a time. After years of iteration,\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/805","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=805"}],"version-history":[{"count":0,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/805\/revisions"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=805"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=805"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=805"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}