{"id":328,"date":"2021-08-31T14:39:51","date_gmt":"2021-08-31T14:39:51","guid":{"rendered":"https:\/\/fde.cat\/?p=328"},"modified":"2021-08-31T14:39:51","modified_gmt":"2021-08-31T14:39:51","slug":"ribbon-filter-practically-smaller-than-bloom-and-xor","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/ribbon-filter-practically-smaller-than-bloom-and-xor\/","title":{"rendered":"Ribbon filter: Practically smaller than Bloom and Xor"},"content":{"rendered":"<h2><span>What the research is:<\/span><\/h2>\n<p><span>The Ribbon filter is a new data structure that is more space-efficient than the popular Bloom filters that are widely used for optimizing data retrieval. One of the ways that Bloom, and now Ribbon, filters solve real engineering problems is by providing smooth configurability unmatched by other filters. Bloom filters work by overapproximating a set of keys associated with some data resource. With a Bloom filter, almost all negative queries to that resource can be skipped (filtered) because the Bloom filter rejects almost all query keys not associated with the resource.<\/span><\/p>\n<p><span>With proper data layout and design, the Ribbon filter is the first Bloom alternative to match the near-continuous, hazard-free, accuracy-versus-space trade-off provided by Bloom filters.<\/span><\/p>\n<p><span>Here, near-continuous means efficiently utilizing any amount of memory to represent any number of keys, so that wasted memory such as in internal fragmentation can be minimized to zero. The typical hazard to the accuracy-versus-space trade-off is bit alignment, where some data sizes (e.g., 4, 8, or 16 bits per key) are faster to access than others. Like Bloom, our data layout for Ribbon does not suffer this hazard in access times so it is more freely configurable. And Ribbon filters add new freedom of configurability to the space-versus-time trade-off.<\/span><\/p>\n<p><span>Building on some prior lines of research, the Ribbon filter combines a simplified, faster, and more flexible construction algorithm; a data layout optimized for filter queries; and near-continuous configurability to make a practical alternative to static (immutable) Bloom filters.<\/span><\/p>\n<p><span>While well-engineered Bloom filters are extremely fast, they use roughly 50 percent more space (overhead) than the information-theoretic lower bound for filters on arbitrary keys. When Bloom filters cannot meet an application\u2019s space efficiency targets, Ribbon filter variants dominate in space-versus-time trade-offs with near continuous configurability and space overhead as low as 1 percent or less. Ribbon filters have O(1) query times and save roughly 1\/3 of memory compared with Bloom filters.<\/span><\/p>\n<h2><span>How it works:\u00a0<\/span><\/h2>\n<p><span>Like some related immutable structures used for perfect hashing and maps, Ribbon filters are constructed by solving a linear system given by hash functions applied to a set of keys. Each row in the linear system expresses that querying as some key, which involves XOR-ing the values at some set of array indices, must yield a prescribed value to indicate it is \u201cin\u201d the set of keys.\u00a0<\/span><\/p>\n<p><span>Despite using Boolean \u2014 GF(2) \u2014 arithmetic, the approach to solving this logical system of equations is to use Gaussian elimination, which fundamentally means subtracting equations from one another until you can isolate variables (unknowns). If a solution exists, this approach will find it.<\/span><\/p>\n<p><span>The name Ribbon has dual meanings. First, we use a linear system from Dietzfelbinger and Walzer whose sorted coefficient matrix resembles a physical ribbon, or a wavy approximation of a band matrix. Gaussian elimination is fundamentally more efficient on this system because it is already close to a reduced form.<\/span><\/p>\n<p><span>Ribbon also stands for Rapid Incremental Boolean Banding ON the fly, which is the name of our fast and flexible new Gaussian solver. Through an approach resembling insertion into a linear-probed hash table, Ribbon does Gaussian elimination on the fly. This saves time and space in construction because row reductions can be done in registers rather than in memory, and because the reduced form of the ribbon coefficient matrix \u2014 a band matrix \u2014 is more space-efficient than explicitly representing the ribbon form. On-the-fly construction also unlocks a solution to the core challenge of the Ribbon approach: scaling its space efficiency to very large numbers of keys.\u00a0\u00a0<\/span><\/p>\n<h2><span>Why it matters:\u00a0<\/span><\/h2>\n<p><span>At Facebook\u2019s scale, we expect Ribbon filters to save several percent of RAM resources, with a tiny increase in CPU usage for some major storage systems. However, we do not implement efficiency gains at all engineering costs, so it\u2019s also important to have a user-friendly data structure. This issue stalled implementation of other Bloom alternatives offering some space savings.\u00a0<\/span><\/p>\n<p><span>The Ribbon filter opens these new trade-offs without introducing notable discontinuities or hazards in the configuration space. In other words, there is some complexity to make Ribbon filters general and highly configurable, but these details can be hidden behind a relatively simple API. You have essentially free choice over any three of the four core performance dimensions \u2014 number of keys added to the set, memory usage, CPU efficiency, and accuracy \u2014 and the accuracy is automatically well optimized.<\/span><\/p>\n<h2><span>Read the full paper:\u00a0<\/span><\/h2>\n<p><a href=\"https:\/\/arxiv.org\/abs\/2103.02515\"><span>Ribbon filter: Practically smaller than Bloom and Xor<\/span><\/a><\/p>\n<p>The post <a href=\"https:\/\/engineering.fb.com\/2021\/07\/09\/data-infrastructure\/ribbon-filter\/\">Ribbon filter: Practically smaller than Bloom and Xor<\/a> appeared first on <a href=\"https:\/\/engineering.fb.com\/\">Facebook Engineering<\/a>.<\/p>\n<p><a href=\"https:\/\/engineering.fb.com\/2021\/07\/09\/data-infrastructure\/ribbon-filter\/\">Read More<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>What the research is: The Ribbon filter is a new data structure that is more space-efficient than the popular Bloom filters that are widely used for optimizing data retrieval. One of the ways that Bloom, and now Ribbon, filters solve real engineering problems is by providing smooth configurability unmatched by other filters. Bloom filters work&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2021\/08\/31\/ribbon-filter-practically-smaller-than-bloom-and-xor\/\">Continue reading <span class=\"screen-reader-text\">Ribbon filter: Practically smaller than Bloom and Xor<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-328","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":839,"url":"https:\/\/fde.cat\/index.php\/2024\/03\/18\/logarithm-a-logging-engine-for-ai-training-workflows-and-services\/","url_meta":{"origin":328,"position":0},"title":"Logarithm: A logging engine for AI training workflows and services","date":"March 18, 2024","format":false,"excerpt":"Systems and application logs play a key role in operations, observability, and debugging workflows at Meta. Logarithm is a hosted, serverless, multitenant service, used only internally at Meta, that consumes and indexes these logs and provides an interactive query interface to retrieve and view logs. In this post, we present\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":510,"url":"https:\/\/fde.cat\/index.php\/2021\/11\/30\/restriction-rules\/","url_meta":{"origin":328,"position":1},"title":"Restriction Rules","date":"November 30, 2021","format":false,"excerpt":"Restriction Rules: Complementing Salesforce\u2019s Record Access Control Mechanism Introduction If you\u2019ve taken Salesforce Admin 201 training, you might remember learning about Sharing Settings. Sharing Settings include Sharing Models, Criteria Sharing Rules, Manual Sharing, and more. I\u2019m a software engineer on the Record Access Experience team here at Salesforce. When I\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":704,"url":"https:\/\/fde.cat\/index.php\/2023\/04\/17\/a-fine-grained-network-traffic-analysis-with-millisampler\/","url_meta":{"origin":328,"position":2},"title":"A fine-grained network traffic analysis with Millisampler","date":"April 17, 2023","format":false,"excerpt":"What the research is:\u00a0 Millisampler is one of Meta\u2019s latest characterization tools and allows us to observe, characterize, and debug network performance at high-granularity timescales efficiently. This lightweight network traffic characterization tool for continual monitoring operates at fine, configurable timescales. It collects time series of ingress and egress traffic volumes,\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":278,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/a-peek-at-datoramas-virtual-sandbox\/","url_meta":{"origin":328,"position":3},"title":"A Peek at Datorama\u2019s Virtual Sandbox","date":"August 31, 2021","format":false,"excerpt":"What is a\u00a0Sandbox?A Sandbox is an isolated environment where you can safely experiment and test changes that you intend to apply to your live environment. The idea is to learn about the implications of these intended changes in advance, in order to decide on how to best conduct them, without\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":496,"url":"https:\/\/fde.cat\/index.php\/2021\/10\/26\/kangaroo-a-new-flash-cache-optimized-for-tiny-objects\/","url_meta":{"origin":328,"position":4},"title":"Kangaroo: A new flash cache optimized for tiny objects","date":"October 26, 2021","format":false,"excerpt":"What the research is:\u00a0 Kangaroo is a new flash cache that enables more efficient caching of tiny objects (objects that are ~100 bytes or less) and overcomes the challenges presented by existing flash cache designs. Since Kangaroo is implemented within CacheLib, Facebook\u2019s open source caching engine, developers can use Kangaroo\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":181,"url":"https:\/\/fde.cat\/index.php\/2020\/12\/10\/coral-a-sql-translation-analysis-and-rewrite-engine-for-modern-data-lakehouses\/","url_meta":{"origin":328,"position":5},"title":"Coral: A SQL translation, analysis, and rewrite engine for modern data lakehouses","date":"December 10, 2020","format":false,"excerpt":"Co-authors: Walaa Eldin Moustafa, Wenye Zhang, Sushant Raikar, Raymond Lam, Ron Hu, Shardul Mahadik, Laura Chen, Khai Tran, Chris Chen, and Nagarathnam Muthusamy Introduction At LinkedIn, our big data compute infrastructure continually grows over time, not only to keep pace with the growth in the number of data applications, or\u2026","rel":"","context":"In &quot;External&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/328","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=328"}],"version-history":[{"count":1,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/328\/revisions"}],"predecessor-version":[{"id":382,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/328\/revisions\/382"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=328"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=328"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=328"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}