{"id":322,"date":"2021-08-31T14:39:51","date_gmt":"2021-08-31T14:39:51","guid":{"rendered":"https:\/\/fde.cat\/?p=322"},"modified":"2021-08-31T14:39:51","modified_gmt":"2021-08-31T14:39:51","slug":"consolidating-facebook-storage-infrastructure-with-tectonic-file-system","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/consolidating-facebook-storage-infrastructure-with-tectonic-file-system\/","title":{"rendered":"Consolidating Facebook storage infrastructure with Tectonic file system"},"content":{"rendered":"<h2><span>What the research is:\u00a0<\/span><\/h2>\n<p><span>Tectonic, our data center scale distributed file system, enables better resource utilization, promotes simpler services, and requires less operational complexity than our previous approach. Our previous storage infrastructure consisted of a set of use-case specific storage systems. Clusters, or instances of these storage systems, used to scale to tens of petabytes. As Facebook\u2019s scale grew, this constellation of storage system architecture became increasingly resource inefficient and operationally complex. <\/span><\/p>\n<p><span>Each Tectonic cluster scales to exabytes and serves storage needs for an entire data center. With Tectonic, our consolidated storage architecture promotes resource efficiency by harvesting resources that were otherwise stranded in smaller clusters. This consolidation has also significantly simplified our storage operations because we now have a single system and fewer clusters to manage. <\/span><\/p>\n<h2><span>How it works:\u00a0<\/span><\/h2>\n<p><span>In building this system we simultaneously solved three high-level challenges: supporting exabyte-scale, isolating performance between tenants, and enabling tenant-specific optimizations. <\/span><\/p>\n<p><span>Exabyte-scale clusters are important for operational simplicity and resource sharing. 
Tectonic disaggregates the file system metadata into independently scalable layers and hash-partitions each metadata layer into a scalable, shared key-value store. Combined with a linearly scalable storage node layer, this disaggregated metadata allows the system to meet the storage needs of an entire data center. <\/span><\/p>\n<p><span>Tectonic simplifies performance isolation by solving the isolation problem within each tenant, grouping applications with similar traffic patterns and latency requirements. Instead of managing resources among hundreds of applications, Tectonic only manages resources among dozens of traffic groups. <\/span><\/p>\n<p><span>Tectonic uses tenant-specific optimizations to match the performance of specialized storage systems. These optimizations are enabled by a client-driven microservice architecture that includes a rich set of client-side configurations for controlling how tenants interact with Tectonic.<\/span><\/p>\n<h2><span>Why it matters:\u00a0<\/span><\/h2>\n<p><span>Most large-scale cloud services depend on storage. As cloud services become more popular, the need for data storage and processing is growing rapidly.<\/span> <span>Distributed storage systems must scale and evolve to store and process this data efficiently. For example, as storage footprints grow, the scalability of individual storage clusters can become a bottleneck. <\/span><\/p>\n<p><span>Adopting Tectonic has helped our storage scale and yielded many operational and efficiency improvements. By moving our data warehouse onto Tectonic, we\u2019ve reduced the number of data warehouse clusters by 10x, simplifying operations and unstranding resources. 
Tectonic achieves these efficiency improvements while providing performance that\u2019s comparable to or better than that of our previous specialized storage systems.<\/span><\/p>\n<h2><span>Read the full paper:<\/span><\/h2>\n<p><a href=\"https:\/\/www.usenix.org\/system\/files\/fast21-pan.pdf\"><span>Facebook\u2019s Tectonic Filesystem: Efficiency from Exascale<\/span><\/a><\/p>\n<p>The post <a href=\"https:\/\/engineering.fb.com\/2021\/06\/21\/data-infrastructure\/tectonic-file-system\/\">Consolidating Facebook storage infrastructure with Tectonic file system<\/a> appeared first on <a href=\"https:\/\/engineering.fb.com\/\">Facebook Engineering<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>What the research is:\u00a0 Tectonic, our data-center-scale distributed file system, enables better resource utilization, promotes simpler services, and requires less operational complexity than our previous approach. Our previous storage infrastructure consisted of a set of use-case-specific storage systems. 
Clusters, or instances of these storage systems, used to scale to tens of petabytes.&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2021\/08\/31\/consolidating-facebook-storage-infrastructure-with-tectonic-file-system\/\">Continue reading <span class=\"screen-reader-text\">Consolidating Facebook storage infrastructure with Tectonic file system<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-322","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":836,"url":"https:\/\/fde.cat\/index.php\/2024\/03\/12\/building-metas-genai-infrastructure\/","url_meta":{"origin":322,"position":0},"title":"Building Meta\u2019s GenAI Infrastructure","date":"March 12, 2024","format":false,"excerpt":"Marking a major investment in Meta\u2019s AI future, we are announcing two 24k GPU clusters. We are sharing details on the hardware, network, storage, design, performance, and software that help us extract high throughput and reliability for various AI workloads. We use this cluster design for Llama 3 training. We\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":618,"url":"https:\/\/fde.cat\/index.php\/2022\/08\/10\/scaling-data-ingestion-for-machine-learning-training-at-meta\/","url_meta":{"origin":322,"position":1},"title":"Scaling data ingestion for machine learning training at Meta","date":"August 10, 2022","format":false,"excerpt":"Many of Meta\u2019s products, such as search and language translations, utilize AI models to continuously improve user experiences. 
As the performance of hardware we use to support training infrastructure increases, we need to scale our data ingestion infrastructure accordingly to handle workloads more efficiently. GPUs, which are used for training\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":518,"url":"https:\/\/fde.cat\/index.php\/2021\/12\/16\/power-loss-siren-making-meta-resilient-to-power-loss-events\/","url_meta":{"origin":322,"position":2},"title":"Power Loss Siren: Making Meta resilient to power loss events","date":"December 16, 2021","format":false,"excerpt":"There are thousands of distributed services running on millions of servers in Meta\u2019s data centers. Part of ensuring the reliability of those services means making them resilient to power loss events as our data center fleet grows. To help increase resiliency, we built the Power Loss Siren (PLS) \u2014 a\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":759,"url":"https:\/\/fde.cat\/index.php\/2023\/09\/07\/arcadia-an-end-to-end-ai-system-performance-simulator\/","url_meta":{"origin":322,"position":3},"title":"Arcadia: An end-to-end AI system performance simulator","date":"September 7, 2023","format":false,"excerpt":"We\u2019re introducing Arcadia, Meta\u2019s unified system that simulates the compute, memory, and network performance of AI training clusters. Extracting maximum performance from an AI cluster and increasing overall efficiency warrants a multi-input system that accounts for various hardware and software parameters across compute, storage, and network collectively. 
Arcadia gives Meta\u2019s\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":879,"url":"https:\/\/fde.cat\/index.php\/2024\/06\/12\/how-meta-trains-large-language-models-at-scale\/","url_meta":{"origin":322,"position":4},"title":"How Meta trains large language models at scale","date":"June 12, 2024","format":false,"excerpt":"As we continue to focus our AI research and development on solving increasingly complex problems, one of the most significant and challenging shifts we\u2019ve experienced is the sheer scale of computation required to train large language models (LLMs). Traditionally, our AI model training has involved training a massive number of\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":839,"url":"https:\/\/fde.cat\/index.php\/2024\/03\/18\/logarithm-a-logging-engine-for-ai-training-workflows-and-services\/","url_meta":{"origin":322,"position":5},"title":"Logarithm: A logging engine for AI training workflows and services","date":"March 18, 2024","format":false,"excerpt":"Systems and application logs play a key role in operations, observability, and debugging workflows at Meta. Logarithm is a hosted, serverless, multitenant service, used only internally at Meta, that consumes and indexes these logs and provides an interactive query interface to retrieve and view logs. 
In this post, we present\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/322","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=322"}],"version-history":[{"count":1,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/322\/revisions"}],"predecessor-version":[{"id":388,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/322\/revisions\/388"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=322"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=322"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=322"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}