Being able to distribute large, widely -consumed objects (so-called hot content) efficiently to hosts is becoming increasingly important within Meta’s private cloud. These are commonly distributed content types such as executables, code artifacts, AI models, and search indexes that help enable our software systems.
Owl is a new system for high-fanout distribution of large data objects to hosts in Meta’s private cloud. It distributes over 700 petabytes of data per day to millions of client processes and has improved download speeds and cache hit rate by a factor of 2-3x over BitTorrent and other prior systems used for distribution at Meta.
The distribution challenge: Why we built Owl
There are three dimensions to distribution:
Fanout: The same content may be read by a handful of clients or by millions of processes running in data centers around the globe.
Size: Objects to be distributed range from around 1 MB to a few TBs.
Hotness: Clients may read an object within a few seconds of each other, or their reads may be spread over hours.
Distribution requirements are exacting. First, content distribution must be fast. Slow executable delivery, for example, increases downtime and delays fixes. We strive to provide data at a rate bounded by either available network bandwidth of the reading host or by the available write bandwidth of its storage media.
Second, content distribution must be efficient. One dimension of efficiency is scalability, i.e., the number of clients that can have their distribution needs met by a given number of servers. Another dimension involves network usage. We measure this in terms of both bytes transmitted and communication locality (e.g., an in-rack data transfer is more efficient than a cross-region transfer). A final dimension of efficiency includes resource usage on client machines, such as CPU cycles, memory, and disk I/O. Not only should we use as few resources as possible, but we also should also be able to adjust for their relative importance on different clients. Some services are memory constrained, while others cannot write to disk.
Finally, content distribution must be reliable. We measure reliability as the percentage of download requests that the distribution system satisfies within a latency service level agreement (SLA). Operational ease-of-management is another facet that often gets overlooked as a prerequisite for high reliability. In a production environment, workloads may change. Additionally, misses in dependency SLAs, including partial outages or performance faults, can occur. To maintain a high SLA for distribution during such events, engineers must be alerted quickly and have a clear picture of the operational health of each client. Finally, when reliability, speed, or efficiency start to degrade, they need simple knobs that adjust behavior and quickly restore operational health.
Prior solutions to Owl
Prior to Owl, Meta had at least three different systems for large content distribution that had grown organically over time to cater to this challenge in separate domains: executable distribution, model distribution, and index distribution. However, no prior solution met all the above requirements.
We identified two root causes of this:
No prior system had struck the correct balance between decentralization and centralization.
No prior system had sufficient flexibility to meet the requirements of the varied service types at Meta that require content distribution.
The first implementation provided highly centralized distribution via hierarchical caching. Clients downloaded content from first-level caches on remote hosts. These caches, in turn, handled cache misses by reading from other caches, with the final layer of the hierarchy being a distributed storage system.
Hierarchical caching was inefficient for hot content distribution and had difficulty scaling. The cache hierarchy needed a large number of dedicated hosts, with the number of hosts increasing to keep pace with growth in workloads from services consuming data and the number of reading clients. Load spikes caused by hot content presented a continual problem: Strict quotas were necessary to protect the centralized caches. However, readers of hot content were frequently throttled due to the bursty request pattern. In general, provisioning for transient spikes caused by hot content and setting quotas appropriately was quite challenging.
Meta also used two highly decentralized systems: a location-aware BitTorrent implementation and a static peer-to-peer distribution tree based on consistent hashing. In both cases, a peer is any process that wishes to download data (millions of such processes exist at Meta).
The decentralized systems scaled much better than hierarchical caching. But they brought their own problems. First, each peer made distribution decisions based on local information, meaning resource efficiency and tail latency could be poor. With each peer making independent caching decisions, the collection of peers could retain either more or fewer copies of a data object than necessary. More important, operating these decentralized solutions was difficult. Engineers struggled to get a clear picture of overall health and status without aggregating data from large numbers of peers, given that each peer had a different and limited view of the state of distribution. In general, reasoning about systemwide correctness and efficiency proved very difficult.
In summary, highly decentralized systems turned out to be inefficient and difficult to operate, while highly centralized systems scaled poorly. As a result, we chose to create a new, split design with a decentralized data plane and a centralized control plane. The decentralized data plane streams data from sources to clients via a distribution tree built by the centralized control plane. However, these trees are ephemeral and per-data-chunk. Each edge in a tree persists only while a chunk is being transferred from a source to a peer.
The design realizes a mechanism-policy split. Peers are simple and provide the mechanism for caching and transferring data chunks. The centralized control plane makes fine grained policy decisions about distribution. This includes identifying sources from where peers should get each chunk of content, when and how they should cache fetched content, and how they should retry failed downloads. The control plane consists of a handful of trackers (borrowing terminology from BitTorrent). They have a complete picture of the distribution state, what data each peer is downloading, and a list of chunks in each peer’s cache.
This fine-grained state allows the trackers to make optimal decisions about data placement and distribution that minimize network hops and maximize cache hit rate. Centralizing the control plane has also made distribution easy to operate and debug: Engineers can understand what decisions led to low availability, high latency, or poor cache hit rate because a tracker with a consistent view of the distribution state makes these decisions.
The second major challenge faced by prior systems was a lack of flexibility. At Meta, clients have vastly different resources to spare for distribution. Some clients can dedicate gigabytes of memory or disk for peer-to-peer caching. Others have no resources to spare. Clients have very different access patterns and scale.
Finally, the objectives for distribution can differ. Some clients need low latency, while others wish to reduce the load on external storage to avoid throttling. Each of the previous solutions was customized for a subset of the use cases optimizing for its client’s needs. To unify the disparate distribution solutions, we could not simply provide a one-size-fits-all solution. It was critical that we did not regress any client’s key metrics.
We therefore chose to make customization a first-class design priority. Trackers implement modular interfaces for specifying different policies for caching and fetching data. Further, each policy is itself configurable to allow for different trade-offs across client types and responses to changing workloads. We use trace-driven emulation to search through the space of possible customizations. Doing so lets us find the best policies and configurations for each observed workload.
An overview of Owl
Our new solution, Owl, is a highly customizable data distribution system with a centralized control plane and a decentralized data plane. Owl has been in production at Meta for two years and has scaled out rapidly (production traffic increased by almost 200x in 2021). Currently, Owl has over 10 million unique clients (binaries concurrently using the library), and it downloads up to 800 petabytes of data per day. Owl supports 106 unique types of clients and has customized policies for 55 of these.
Owl has two basic components:
Peer libraries: Libraries linked into every binary that uses Owl to download data. If a binary is using Owl, we refer to it as a peer.
Trackers: Dedicated Owl services that manage the control plane for a group of peers.
Owl’s peer library provides a simple API for downloading data. Client processes fetch content from a source object by specifying a range of data to read and a unique identifier for the object being read. The caller can optionally specify a deadline and classes that check data integrity or decrypt provided data.
These peers can cache data in memory and/or on disk. The caches may be shared with the client binary if the client does not modify downloaded data. Owl uses the caches to serve content requests from other peers. Owl policies usually prefer to fetch data from a peer rather than an external data source. As a result, peer-to-peer distribution satisfies most requests.
A tracker manages the download state for a set of peers. Typically, peers and trackers are grouped by region (a region is several colocated data centers), with 3-4 trackers per region providing scale and redundancy. Trackers are multitenant. In general, each tracker supports all Owl buckets. However, we use a separate set of trackers for binary distribution to provide strict performance isolation for this sensitive workload.
Trackers associate data and peers. For each chunk, tracker metadata specifies which peers are caching the chunk and which are downloading it. Tracker metadata also specifies the source of each peer’s download e.g., an external source or another peer. For each peer, the tracker metadata specifies the peer’s location (host, rack, region, etc.) and its cache state (the chunks in the cache, last access time, and so on). In contrast with highly decentralized systems like BitTorrent, Owl trackers can maintain such detailed up-to-date states because trackers make all major decisions about caching and downloading chunks on behalf of peers.
Peers associate with one tracker. Each peer picks a random instance from the set of available trackers and registers by sending a remote procedure call (RPC). Peers register with a new random tracker if their association with the current tracker fails.
One of Owl’s primary design principles is to keep peers as simple as possible. This is achieved via a mechanism-policy split. The peers provide the mechanism to perform simple actions. These include downloading a chunk from a given source, caching or evicting a chunk from cache, or providing cached data in response to a request from another peer. To download content, peers ask trackers to decide from where they should fetch content, how they should retry failed downloads, and even what chunks they should cache locally.
This design principle has proven invaluable for operational simplicity. The team can change configuration values on trackers within seconds if necessary and roll out new policies regularly without updating any of the peers. We associate each peer with a bucket that uniquely identifies the type of the client binary with which the library is linked. The bucket provides a way to customize Owl behavior for each type of client. It also lets us monitor usage, performance, and reliability for each Owl customer individually.
Our paper, “Owl: Scale and flexibility in distribution of hot content,” recently appeared at the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’22)and describes the design and implementation of Owl in more detail. The paper discusses the challenges we faced in scaling Owl to handle workloads at Meta, and how we use Owl’s flexibility to customize its behavior for each type of client. Finally, it relates some additional challenges we faced during deployment and the insights that allowed us to overcome those challenges.
Engineering at Meta