{"id":342,"date":"2021-08-31T14:39:28","date_gmt":"2021-08-31T14:39:28","guid":{"rendered":"https:\/\/fde.cat\/?p=342"},"modified":"2021-08-31T14:39:28","modified_gmt":"2021-08-31T14:39:28","slug":"how-we-built-a-general-purpose-key-value-store-for-facebook-with-zippydb","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/how-we-built-a-general-purpose-key-value-store-for-facebook-with-zippydb\/","title":{"rendered":"How we built a general purpose key value store for Facebook with ZippyDB"},"content":{"rendered":"<p><span>ZippyDB is the largest strongly consistent, geographically distributed key-value store at Facebook. Since we first deployed ZippyDB in 2012, this key-value store has expanded rapidly, and today, ZippyDB serves a number of use cases, ranging from metadata for a distributed filesystem, counting events for both internal and external purposes, to product data that\u2019s used for various app features. ZippyDB offers a lot of flexibility to applications in terms of tunable durability, consistency, availability, and latency guarantees, which has made the service a popular choice within Facebook for storing both ephemeral and nonephemeral small key-value\u00a0<\/span><span>data. In this post, we are sharing for the first time the history and evolution of ZippyDB and some of the unique design choices and trade-offs made in building this service that addressed the majority of key-value store scenarios at Facebook.<\/span><\/p>\n<h2><span>History of ZippyDB<\/span><\/h2>\n<p><span>ZippyDB uses<\/span><a href=\"https:\/\/www.facebook.com\/notes\/facebook-engineering\/under-the-hood-building-and-open-sourcing-rocksdb\/10151822347683920\/\"> <span>RocksDB<\/span><\/a><span> as the underlying storage engine. Before ZippyDB, various teams across Facebook used RocksDB directly to manage their data. 
This resulted, however, in duplicated effort, with each team solving similar challenges such as consistency, fault tolerance, failure recovery, replication, and capacity management. To address the needs of these various teams, we built ZippyDB to provide a highly durable and consistent key-value data store that allowed products to move much faster by offloading the data, and the challenges of managing it at scale, to ZippyDB.<\/span><\/p>\n<p><span>One of the significant design decisions we made early during the development of ZippyDB was to reuse as much of the existing infrastructure as possible. Consequently, most of our initial efforts were focused on building a reusable and flexible data replication library called Data Shuttle. We built a fully managed distributed key-value store by combining Data Shuttle with a preexisting and well-established storage engine (RocksDB) and layering this on top of our existing shard management (<\/span><a href=\"https:\/\/engineering.fb.com\/2020\/08\/24\/production-engineering\/scaling-services-with-shard-manager\/\"><span>Shard Manager<\/span><\/a><span>)<\/span> <span>and <\/span><span>distributed <\/span><span>configuration service (built on <\/span><a href=\"https:\/\/zookeeper.apache.org\/\"><span>ZooKeeper<\/span><\/a><span>), which together solve load balancing, shard placement, failure detection, and service discovery.<\/span><\/p>\n<h2><span>Architecture<\/span><\/h2>\n<p><span>ZippyDB is deployed in units known as tiers. A tier consists of compute and storage resources spread across several geographic areas known as regions<\/span> <span>worldwide, which makes it resilient to failures. Only a handful of ZippyDB tiers exist today, including the default \u201cwildcard\u201d tier and specialized tiers for distributed filesystem metadata and other product groups within Facebook. Each tier hosts multiple use cases. 
Normally, use cases are created on the wildcard tier, which is our generic multitenant tier. This is the preferred tier because of its better utilization of hardware and lower operational overhead, but we occasionally bring up dedicated tiers if there is a need, usually due to stricter isolation requirements.<\/span><\/p>\n<p><span>The data belonging to a use case on a tier is split into units known as shards,<\/span> <span>which are<\/span> <span>the basic units of data management on the server side. Each shard is replicated across multiple regions (for fault tolerance) using Data Shuttle, which uses either <a href=\"https:\/\/dl.acm.org\/doi\/10.1145\/279227.279229\">Paxos<\/a> or<\/span> <span>async replication to<\/span> <span>replicate<\/span> <span>data, depending on the configuration. Within a shard, a subset of replicas are configured to be a part of the Paxos quorum group, also known as global scope,<\/span> <span>where data is synchronously replicated using Multi-Paxos to provide high durability and availability in case of failures. The remaining replicas, if any, are configured as followers.<\/span> <span>These are similar to learners in Paxos terminology and receive data asynchronously. Followers allow applications to have many in-region replicas to support low-latency reads with relaxed consistency, while keeping the quorum size small for lower write latency. This flexibility in replica role configuration within a shard allows applications to strike a balance between durability, write performance, and read performance depending on their needs.<\/span><\/p>\n<p><span>In addition to the sync or async replication strategy, applications also have the option to provide \u201chints\u201d to the service about the regions in which the replicas of a shard must be placed. 
These hints, also known as stickiness<\/span> <span>constraints, allow applications to have some control over the latency of reads and writes by having replicas built in regions from which they expect most accesses to come. ZippyDB also provides a caching layer and integrates with a pub-sub system allowing subscriptions to data mutations on shards, both of which are opt-in depending on the requirements of the use case.<\/span><\/p>\n<h2><span>Data model<\/span><\/h2>\n<p><span>ZippyDB supports a simple key-value<\/span> <span>data model with APIs to get, put, and delete keys along with their batch variants. It supports iterating over key prefixes and deleting a range of keys. These APIs are very similar to the API exposed by the underlying RocksDB storage engine. In addition, we support a test-and-set API for basic read-modify-write operations, as well as transactions and conditional writes for more generic read-modify-write operations (more about this later). This minimal API set has proved to be sufficient for most use cases to manage their data on ZippyDB. For ephemeral data, ZippyDB has native TTL support where the client can optionally specify the expiry time for an object at the time of the write. We piggyback on RocksDB\u2019s periodic compaction support to clean up all the expired keys efficiently while filtering out dead keys on the read side in between compaction runs. Many applications access data on ZippyDB through an ORM layer on top of ZippyDB, which translates these accesses into the ZippyDB API. Among other things, this layer serves to abstract the details of the underlying storage service.<\/span><\/p>\n<p><span>The shard is the unit of data management on the server side. The optimal assignment of shards to servers needs to take into account load, failure domains, user constraints, etc., and this is handled by ShardManager<\/span><span>. 
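<\/span><\/p>\n<p><span>To make the data model concrete, here is a minimal in-memory Python sketch of the API surface described above. It is a toy, not the ZippyDB client; all names are illustrative, and the TTL handling mimics the read-side filtering of expired keys between compaction runs.<\/span><\/p>\n

```python
import time

# Toy in-memory store mirroring the API shape described above:
# point ops, batch get, prefix iteration, range delete, TTL, and
# test-and-set. Illustrative only; not the ZippyDB client API.
class KVStore:
    def __init__(self):
        self._data = {}  # key -> (value, expiry_timestamp or None)

    def put(self, key, value, ttl=None):
        expiry = time.time() + ttl if ttl is not None else None
        self._data[key] = (value, expiry)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expiry = item
        if expiry is not None and time.time() >= expiry:
            # Filter out dead (expired) keys on the read side.
            del self._data[key]
            return None
        return value

    def delete(self, key):
        self._data.pop(key, None)

    def multi_get(self, keys):
        return {k: self.get(k) for k in keys}

    def iterate_prefix(self, prefix):
        return sorted(k for k in self._data if k.startswith(prefix))

    def delete_range(self, start, end):
        for k in [k for k in self._data if start <= k < end]:
            del self._data[k]

    def test_and_set(self, key, expected, new_value):
        # Basic read-modify-write: write only if the current value matches.
        if self.get(key) == expected:
            self.put(key, new_value)
            return True
        return False
```

\n<p><span>In the real service these calls are routed to the primary or replicas of the owning shard; the sketch shows only the API shape.<\/span><\/p>\n<p><span>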
<\/span><span>ShardManager<\/span> <span>is responsible for monitoring servers for load imbalance and failures, and for initiating shard movement between servers.\u00a0<\/span><\/p>\n<p><span>A shard, often referred to as a physical shard or p-shard, is a server-side concept and isn\u2019t exposed to applications directly. Instead, we allow use cases to partition their key space into smaller units of related data known as \u03bcshards (micro-shards)<\/span><span>. <\/span><span>A typical<\/span> <span>physical<\/span> <span>shard<\/span> <span>has a size of 50\u2013100 GB, hosting several tens of thousands of \u03bcshards. This additional layer of abstraction allows ZippyDB to reshard the data transparently without any changes on the client.<\/span><\/p>\n<p><span>ZippyDB supports two kinds of mappings from <\/span><span>\u03bc<\/span><span>shards to physical shards: compact mapping and Akkio mapping. Compact mapping<\/span> <span>is used<\/span> <span>when the assignment is fairly static and the mapping is changed only when there is a need to split shards that have become too large or hot. In practice, this is a fairly infrequent operation when compared with Akkio mapping, where the mapping of <\/span><span>\u03bc<\/span><span>shards is managed by a service known as<\/span><a href=\"https:\/\/engineering.fb.com\/2018\/10\/08\/core-data\/akkio\/\"> <span>Akkio<\/span><\/a><span>. Akkio splits use cases\u2019 key space into \u03bcshards and places these \u03bcshards in regions where the information is typically accessed. Akkio helps reduce data set duplication and provides a significantly more efficient solution for low-latency access than having to place data in every region.<\/span><\/p>\n<p><span>As we mentioned earlier, Data Shuttle uses Multi-Paxos to synchronously replicate data to all replicas in the global scope<\/span><span>. <\/span><span>Conceptually, time is subdivided into units known as epochs. 
Each epoch has a unique leader, whose role is assigned using an external shard management service called ShardManager. Once a leader is assigned, it has a lease for the entire duration of the epoch. Periodic heartbeats are used to keep a lease active until ShardManager bumps up the epoch on the shard (e.g., for failover or primary load balancing). When a failure occurs, ShardManager detects the failure, assigns a new leader with a higher epoch, and restores write availability. Within each epoch, the leader generates a total ordering of all writes to the shard by assigning each write a monotonically increasing sequence number. The writes are then written to a replicated durable log using Multi-Paxos<\/span> <span>to achieve consensus on the ordering. Once the writes have reached consensus, they are drained in order across all replicas.<\/span><\/p>\n<p><span>We chose to use an external service to detect failures and assign leaders to keep the design of the service simple in the initial implementation. However, in the future we plan to move towards detecting failures entirely within<\/span> <span>Data Shuttle (\u201cin-band\u201d) and re-electing leaders more proactively, without having to wait for ShardManager and incur delays.<\/span><\/p>\n<h2><span>Consistency<\/span><\/h2>\n<p><span>ZippyDB provides configurable consistency and durability levels to applications, which can be specified as options in read and write APIs. This allows applications to make durability, consistency, and performance trade-offs dynamically at a per-request level.<\/span><\/p>\n<p><span>By default, a write involves persisting the data on a majority of replicas\u2019 Paxos logs and writing the data to RocksDB on the primary before acknowledging the write to the client. With the default write mode, a read on the primary will always see the most recent write. 
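<\/span><\/p>\n<p><span>The default write path just described can be sketched as follows: the primary assigns each write a sequence number within its epoch, requires acknowledgment from a majority of replica logs, and applies the write locally before acking the client. This is an illustrative Python model with hypothetical names, not ZippyDB code.<\/span><\/p>\n

```python
# Illustrative model of the default write mode described above: the
# primary assigns each write a monotonically increasing sequence number
# within its epoch, the write must reach a majority of replica logs,
# and the primary applies it locally before acknowledging the client.
# Names are hypothetical; this is not ZippyDB code.
class Replica:
    def __init__(self):
        self.log = []  # stands in for the durable Paxos log

    def append(self, entry):
        self.log.append(entry)
        return True  # ack once the entry is durably logged

class Primary:
    def __init__(self, replicas, epoch):
        self.replicas = replicas
        self.epoch = epoch
        self.seq = 0     # total order of writes within this epoch
        self.store = {}  # stands in for RocksDB on the primary

    def write(self, key, value):
        self.seq += 1
        entry = (self.epoch, self.seq, key, value)
        acks = sum(1 for r in self.replicas if r.append(entry))
        if acks <= len(self.replicas) // 2:
            return False  # no majority: cannot acknowledge the write
        self.store[key] = value  # apply locally before acking the client
        return True
```

\n<p><span>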
Some applications cannot tolerate cross-region latencies for every write, so ZippyDB supports a fast-acknowledge<\/span> <span>mode, where writes are acknowledged as soon as they are enqueued on the primary for replication. The durability and consistency guarantees for this mode are necessarily lower, which is the trade-off for higher performance.<\/span><\/p>\n<p><span>On the read side, the three most popular consistency levels are eventual, read-your-writes, and strong. The eventual consistency level supported by ZippyDB is actually much stronger than the better-known eventual consistency. ZippyDB provides total ordering for all writes within a shard and ensures that reads aren\u2019t served by replicas that are lagging behind the primary\/quorum beyond a certain configurable threshold (heartbeats are used to detect lag), so eventual reads supported by ZippyDB are closer to bounded-staleness consistency in the literature.<\/span><\/p>\n<p><span>For read-your-writes, clients cache the latest sequence number returned by the server for their writes and use that version to run at-or-later<\/span> <span>queries<\/span> <span>while reading. The cache of versions is kept within the same client process.<\/span><\/p>\n<p><span>ZippyDB also provides strong consistency, or <\/span><a href=\"https:\/\/dl.acm.org\/doi\/10.1145\/78969.78972\"><span>linearizability<\/span><\/a><span>, <\/span><span>where<\/span> <span>clients can see the effects of the most recent writes regardless of where the writes or reads come from. Strong reads today are implemented by routing the reads to the primary in order to avoid the need to speak to a quorum, mostly for performance reasons. The primary relies on owning the lease to ensure that there is no other primary before serving reads. 
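<\/span><\/p>\n<p><span>The read-your-writes scheme amounts to the client caching the highest version its writes have returned and issuing at-or-later reads against it. A minimal Python sketch, with hypothetical names and a toy server standing in for a replica:<\/span><\/p>\n

```python
# Sketch of read-your-writes: the client caches the highest sequence
# number its writes returned and issues at-or-later reads against it.
# Hypothetical names; the toy Server stands in for a shard replica.
class Server:
    def __init__(self):
        self.applied_seq = 0
        self.data = {}

    def write(self, key, value):
        self.applied_seq += 1
        self.data[key] = value
        return self.applied_seq  # version assigned to this write

    def read(self, key, min_seq=0):
        if self.applied_seq < min_seq:
            raise RuntimeError('replica too stale for this read')
        return self.data.get(key)

class RYWClient:
    def __init__(self, server):
        self.server = server
        self.last_seen_seq = 0  # cached within this client process

    def put(self, key, value):
        seq = self.server.write(key, value)
        self.last_seen_seq = max(self.last_seen_seq, seq)

    def get(self, key):
        # At-or-later query: only accept a replica at least this fresh.
        return self.server.read(key, min_seq=self.last_seen_seq)
```

\n<p><span>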
In certain outlier cases, where the primary hasn\u2019t heard about the lease renewal, strong reads on the primary turn into a quorum check and read.<\/span><\/p>\n<h2><span>Transactions and conditional writes<\/span><\/h2>\n<p><span>ZippyDB supports transactions and conditional writes for use cases that need atomic read-modify-write operations on a set of keys.<\/span><\/p>\n<p><span>All transactions are serializable by default on a shard, and we don\u2019t support lower isolation levels. This simplifies the server-side implementation and the reasoning about correctness of concurrently executing transactions on the client side. Transactions use optimistic concurrency control to detect and resolve <\/span><a href=\"https:\/\/dl.acm.org\/doi\/10.1145\/568271.223787\"><span>conflicts<\/span><\/a><span>, which works as follows. Clients typically read all of the relevant data from a snapshot<\/span> <span>of the DB on a secondary, compose the write<\/span> <span>set, and send both the read and write sets to the primary to commit. Upon receiving the read and write sets and the snapshot against which reads were performed, the primary checks whether there were conflicting writes by other concurrently executing transactions that have already been admitted. The transaction is admitted only if there are no conflicts, after which the transaction is guaranteed to succeed, assuming no server failures. Conflict resolution on the primary relies on tracking all of the recent<\/span> <span>writes performed by previously admitted transactions during the same epoch on the primary. Transactions spanning epochs are rejected, as this simplifies write set<\/span> <span>tracking without requiring replication. The history of writes maintained on the primary is also periodically purged to keep the space usage low. 
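<\/span><\/p>\n<p><span>The conflict check described above can be sketched as follows; this illustrative Python model simplifies version tracking to one commit sequence number per key and is not ZippyDB code.<\/span><\/p>\n

```python
# Sketch of the optimistic concurrency control check described above:
# the primary tracks the commit sequence at which each key was last
# written and admits a transaction only if no key in its read set was
# rewritten after the transaction's snapshot. Illustrative only.
class OCCPrimary:
    def __init__(self):
        self.commit_seq = 0
        self.last_write_seq = {}  # key -> seq of most recent write
        self.store = {}

    def commit(self, read_set, write_set, snapshot_seq):
        for key in read_set:
            if self.last_write_seq.get(key, 0) > snapshot_seq:
                return False  # conflict: a read was invalidated
        self.commit_seq += 1
        for key, value in write_set.items():
            self.store[key] = value
            self.last_write_seq[key] = self.commit_seq
        return True
```

\n<p><span>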
Since the complete history isn\u2019t maintained, the primary needs to maintain a minimum tracked version<\/span> <span>and reject all transactions that have reads against a snapshot with a lower version, to guarantee serializability. Read-only transactions work exactly like read-write transactions, except that the write set is empty.<\/span><\/p>\n<p><span>Conditional writes are implemented using \u201cserver-side transactions\u201d. They provide a more user-friendly client-side API for use cases where clients want to atomically modify a set of keys based on some common preconditions such as key_present, key_not_present, and value_matches_or_key_not_present. When a primary receives a conditional write request, it sets up a transaction context and converts the preconditions and write set into a transaction on the server, reusing all of the machinery for transactions. The conditional-write API can be more efficient than the transaction API in cases where clients can compute the precondition without requiring a read.<\/span><\/p>\n<h2><span>The future of ZippyDB<\/span><\/h2>\n<p><span>Distributed key-value stores have many applications, and the need for them often comes up while building a variety of systems, from products to metadata storage for various infrastructure services. Building a scalable, strongly consistent, and fault-tolerant key-value store can be very challenging and often requires thinking through many trade-offs to provide a curated combination of system capabilities and guarantees that works well in practice for a variety of workloads. This blog post introduced ZippyDB, Facebook\u2019s biggest key-value store, which has been in production for more than six years serving many different workloads. Since its inception, the service has seen very steep adoption, mostly due to the flexibility that it offers in terms of making efficiency, availability, and performance trade-offs. 
The service also enables us to use engineering resources effectively as a company and use our key-value store capacity efficiently as a single pool. ZippyDB is still evolving and currently undergoing significant architectural changes, such as storage-compute disaggregation, fundamental changes to membership management, failure detection and recovery, and distributed transactions, in order to adapt to the changing ecosystem and product requirements.<\/span><\/p>\n<p>The post <a href=\"https:\/\/engineering.fb.com\/2021\/08\/06\/core-data\/zippydb\/\">How we built a general purpose key value store for Facebook with ZippyDB<\/a> appeared first on <a href=\"https:\/\/engineering.fb.com\/\">Facebook Engineering<\/a>.<\/p>\n<p><a href=\"https:\/\/engineering.fb.com\/2021\/08\/06\/core-data\/zippydb\/\">Read More<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>ZippyDB is the largest strongly consistent, geographically distributed key-value store at Facebook. Since we first deployed ZippyDB in 2012, this key-value store has expanded rapidly, and today, ZippyDB serves a number of use cases, ranging from metadata for a distributed filesystem, counting events for both internal and external purposes, to product data that\u2019s used for&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2021\/08\/31\/how-we-built-a-general-purpose-key-value-store-for-facebook-with-zippydb\/\">Continue reading <span class=\"screen-reader-text\">How we built a general purpose key value store for Facebook with 
ZippyDB<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-342","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":322,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/consolidating-facebook-storage-infrastructure-with-tectonic-file-system\/","url_meta":{"origin":342,"position":0},"title":"Consolidating Facebook storage infrastructure with Tectonic file system","date":"August 31, 2021","format":false,"excerpt":"What the research is:\u00a0 Tectonic, our data center scale distributed file system, enables better resource utilization, promotes simpler services, and requires less operational complexity than our previous approach. Our previous storage infrastructure consisted of a set of use-case specific storage systems. Clusters, or instances of these storage systems, used to\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":294,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/the-design-of-strongly-consistent-global-secondary-indexes-in-apache-phoenix%e2%80%8a-%e2%80%8apart-1\/","url_meta":{"origin":342,"position":1},"title":"The Design of Strongly Consistent Global Secondary Indexes in Apache Phoenix\u200a\u2014\u200aPart 1","date":"August 31, 2021","format":false,"excerpt":"The Design of Strongly Consistent Global Secondary Indexes in Apache Phoenix\u200a\u2014\u200aPart\u00a01Phoenix is a relational database with a SQL interface that uses HBase as its backing store. This combination allows it to leverage the flexibility and scalability of HBase, which is a distributed key-value store. 
Phoenix provides additional functionality on top\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":462,"url":"https:\/\/fde.cat\/index.php\/2021\/09\/20\/how-whatsapp-is-enabling-end-to-end-encrypted-backups\/","url_meta":{"origin":342,"position":2},"title":"How WhatsApp is enabling end-to-end encrypted backups","date":"September 20, 2021","format":false,"excerpt":"For years, in order to safeguard the privacy of people\u2019s messages, WhatsApp has provided end-to-end encryption by default \u200b\u200bso messages can be seen only by the sender and recipient, and no one in between. Now, we\u2019re planning to give people the option to protect their WhatsApp backups using end-to-end encryption\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":839,"url":"https:\/\/fde.cat\/index.php\/2024\/03\/18\/logarithm-a-logging-engine-for-ai-training-workflows-and-services\/","url_meta":{"origin":342,"position":3},"title":"Logarithm: A logging engine for AI training workflows and services","date":"March 18, 2024","format":false,"excerpt":"Systems and application logs play a key role in operations, observability, and debugging workflows at Meta. Logarithm is a hosted, serverless, multitenant service, used only internally at Meta, that consumes and indexes these logs and provides an interactive query interface to retrieve and view logs. 
In this post, we present\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":513,"url":"https:\/\/fde.cat\/index.php\/2021\/12\/09\/using-redis-hash-instead-of-set-to-reduce-cache-size-and-operating-costs\/","url_meta":{"origin":342,"position":4},"title":"Using Redis HASH instead of SET to reduce cache size and operating costs","date":"December 9, 2021","format":false,"excerpt":"What if we told you that there was a way to dramatically reduce the cost to operate on cloud providers? That\u2019s what we found when we dug into the different data structures offered in Redis. Before we committed to one, we did some research into the difference in memory usage\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":547,"url":"https:\/\/fde.cat\/index.php\/2022\/03\/07\/augmenting-flexible-paxos-in-logdevice-to-improve-read-availability\/","url_meta":{"origin":342,"position":5},"title":"Augmenting Flexible Paxos in LogDevice to improve read availability","date":"March 7, 2022","format":false,"excerpt":"We\u2019ve improved read availability in LogDevice, Meta\u2019s scalable distributed log storage system, by removing a fundamental trade-off in Flexible Paxos, the algorithm used to gain consensus among our distributed systems. 
At Meta\u2019s scale, systems need to be reliable, even in the face of organic failures like power loss events, or\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/342","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=342"}],"version-history":[{"count":1,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/342\/revisions"}],"predecessor-version":[{"id":368,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/342\/revisions\/368"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=342"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=342"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=342"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}