{"id":174,"date":"2021-01-28T19:59:00","date_gmt":"2021-01-28T19:59:00","guid":{"rendered":"https:\/\/fde.cat\/index.php\/2021\/01\/28\/taming-memory-fragmentation-in-venice-with-jemalloc\/"},"modified":"2021-02-02T13:41:44","modified_gmt":"2021-02-02T13:41:44","slug":"taming-memory-fragmentation-in-venice-with-jemalloc","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2021\/01\/28\/taming-memory-fragmentation-in-venice-with-jemalloc\/","title":{"rendered":"Taming memory fragmentation in Venice with Jemalloc"},"content":{"rendered":"<div class=\"resourceParagraph section\">\n<div class=\"component-anchor-container\">\n  <a class=\"component-anchor\" name=\"post_par_resourceparagraph\"><\/a>\n <\/div>\n<div class=\"resource-text-section\">\n<div class=\"resource-paragraph rich-text\">\n<p>Sometimes, an engineering problem arises that might make us feel like maybe we don&#8217;t know what we&#8217;re doing, or at the very least, forces us out of the comfort zone of our area of expertise. That day came for the Venice team at LinkedIn when we began to notice that some <a href=\"https:\/\/engineering.linkedin.com\/blog\/topic\/venice\" target=\"_blank\" rel=\"noopener\">Venice<\/a> processes would consume all available memory and crash if given enough time to run. This ended up being due to memory fragmentation. In this blog post, we&#8217;ll explain the different ways we tried to diagnose the problem, and how we solved it.<\/p>\n<h2>What is Venice?<\/h2>\n<p><a href=\"https:\/\/engineering.linkedin.com\/blog\/topic\/venice\" target=\"_blank\" rel=\"noopener\">Venice<\/a> is LinkedIn&#8217;s platform for serving derived data. Venice was designed for very low latency lookups while also supporting high throughput ingestion of data. 
Reads are served directly from Venice, while writes enter the system asynchronously, either pushed as full versioned data sets, or as incremental updates coming from either Hadoop or stream sources like <a href=\"http:\/\/samza.apache.org\/\" target=\"_blank\" rel=\"noopener\">Samza<\/a>. At its core, Venice is sharded, multi-tenant clustering software that leverages <a href=\"https:\/\/kafka.apache.org\/\" target=\"_blank\" rel=\"noopener\">Kafka<\/a> and <a href=\"https:\/\/rocksdb.org\/\" target=\"_blank\" rel=\"noopener\">RocksDB<\/a> to power high throughput lookups for LinkedIn features (like People You May Know).<\/p>\n<p>You can learn more about Venice <a href=\"https:\/\/engineering.linkedin.com\/blog\/2017\/02\/building-venice-with-apache-helix\" target=\"_blank\" rel=\"noopener\">here<\/a>.<\/p>\n<h2>The symptom of the problem<\/h2>\n<p>As mentioned earlier, we noticed that if we left a server node alone for long enough, it would end up crashing. We generally try to release Venice to production on a weekly cadence to align with our development cycle, but we began to notice that if we slipped and left the software running for longer (a few weeks), things started to get scary. Nodes would go offline across clusters in batches. OS metrics and logs seemed to point to a growing lack of available memory. Moreover, we had error messages in backtraces from generated core dumps and <span class=\"monospace\">hs_err_pid<\/span> logs that all seemed to indicate failures on allocations, as well as sometimes having lines from <a href=\"https:\/\/man7.org\/linux\/man-pages\/man1\/dmesg.1.html\" target=\"_blank\" rel=\"noopener\">dmesg<\/a> telling us that the OOM killer had stepped in. With some added metrics, we could see that, over time, resident memory (RSS) would trend steadily upward along with virtual memory (VSS).<\/p>\n<p><b>A memory leak!<br \/> <\/b>A lot of software engineers have played this game before. 
There are plenty of tools and strategies out there for puzzling out memory leaks, and many an article offering succor for such maladies. I am a Java developer, and I&#8217;m used to working with tools like <a href=\"https:\/\/visualvm.github.io\/\" target=\"_blank\" rel=\"noopener\">visualvm<\/a> and <a href=\"http:\/\/cr.openjdk.java.net\/~sundar\/8022483\/webrev.01\/raw_files\/new\/src\/share\/classes\/com\/sun\/tools\/hat\/resources\/oqlhelp.html\" target=\"_blank\" rel=\"noopener\">OQL<\/a>. This influenced our initial track of investigation. We grabbed some heap dumps and got to digging.<\/p>\n<p>visualvm has a nice feature where you can compare two heap dumps and see what the difference is in memory footprint and object counts. If hooking up a profiling tool to a production service is problematic, a handy alternative can be to start the service, grab a heap dump, let it run for a while until you see that the process has hit its \u201cdegraded\u201d state, and grab another heap dump at that point. You can then plug both dumps into visualvm and get a report on the difference of objects.<\/p>\n<p>But we didn\u2019t find anything. Unfortunately, this process didn&#8217;t tell us very much. We were not seeing an accumulation of objects in the heap that would indicate that we were leaving around an increasing volume of objects. Moreover, the heap sizes weren&#8217;t actually getting bigger over time.<\/p>\n<p><b>Could it be something off-heap?<br \/> <\/b><a href=\"https:\/\/docs.oracle.com\/javase\/8\/docs\/technotes\/guides\/troubleshoot\/tooldescr006.html\" target=\"_blank\" rel=\"noopener\">jcmd<\/a> is another great tool that comes bundled with Java. It has utilities for tracking native memory allocations (i.e., off-heap memory allocations made by the JVM). You can execute it on your process with the <span class=\"monospace\">VM.native_memory<\/span> command, and get a report on the memory usage of your process. 
You can also add the <span class=\"monospace\">baseline<\/span> argument to set a starting point for the memory usage at time of execution, so you can come back later and see what the difference is. You&#8217;ll get a nice report that looks like the following:<\/p>\n<p><span class=\"monospace\">jcmd &lt;pid&gt; VM.native_memory<\/span><\/p>\n<\/p><\/div>\n<\/p><\/div>\n<\/div>\n<div class=\"resourceImageBlock section\">\n<div class=\"component-anchor-container\">\n  <a class=\"component-anchor\" name=\"post_par_resourceimageblock\"><\/a>\n <\/div>\n<ul class=\"resource-image-block single\">\n<li class=\"resource-image\"> <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/content.linkedin.com\/content\/dam\/engineering\/site-assets\/images\/blog\/posts\/2021\/01\/venicememory2.png?resize=750%2C356&#038;ssl=1\" alt=\"memory-allocation-report-example\" height=\"356\" width=\"750\"  data-recalc-dims=\"1\"> <\/li>\n<\/ul>\n<\/div>\n<div class=\"resourceParagraph section\">\n<div class=\"component-anchor-container\">\n  <a class=\"component-anchor\" name=\"post_par_resourceparagraph_1558787047\"><\/a>\n <\/div>\n<div class=\"resource-text-section\">\n<div class=\"resource-paragraph rich-text\">\n<p>jcmd was more succinct than the initial heap dump approach in telling us that there wasn&#8217;t anything there to help us find our memory leak, but it did get us incrementally closer. At this point, we knew two things:<\/p>\n<ol>\n<li>\n<p>The process memory footprint was growing.<\/p>\n<\/li>\n<li>\n<p>The source of the growth was not something tracked by the JVM.<\/p>\n<\/li>\n<\/ol>\n<h2>Native libraries<\/h2>\n<p>This is where things got uncomfortable. As the investigation progressed, I had joked to a colleague that being a professional Java developer should have meant that I shouldn&#8217;t need to go this deep. 
He responded with, \u201cThat\u2019s like saying Java developers are only meant to write programs, and whether those programs actually run is someone else&#8217;s problem.\u201d<\/p>\n<p>We knew we had to start looking at allocations related to libraries that used native code. In Java, any function labeled with \u201c<a href=\"https:\/\/docs.oracle.com\/en\/java\/javase\/11\/docs\/specs\/jni\/design.html\" target=\"_blank\" rel=\"noopener\">native\u201d<\/a> can have a link to code that is executed outside the bounds of the JVM. Any resources consumed by such code are not tracked by the JVM, nor are they garbage collected the way JVM objects are. Perhaps even more troubling, an object on the JVM heap that references an object in native code might be easy to miss if you&#8217;re using things like heap dumps or JVM profilers. As an example, in RocksDB an LRUCache object in Java might look to be sized on the order of bytes, but in fact can be measured on the order of gigabytes in total size outside the JVM.<\/p>\n<p>RocksDB is the most prominent native library used in Venice\u2014Venice is a storage service, and RocksDB handles the storage part. When we budget resources for our server, a majority of it goes to the RocksDB block cache. Moreover, we have configured RocksDB (based on existing documentation) to help us be sure that significant memory allocations are billed to the memory usage limit we have specified for the block cache.<\/p>\n<p>RocksDB is open source software with a fairly dense code base. At this juncture, we began deep-diving and trying to figure out if anyone else in the RocksDB community had any similar trouble. We ended up finding a few tickets in the RocksDB community that talked about resource over-usage of one kind or another. 
Many of them seemed to converge on leveraging jemalloc to get to the bottom of the issue.<\/p>\n<h2>Jemalloc and Plumber<\/h2>\n<p><a href=\"https:\/\/metacpan.org\/pod\/Devel::Plumber\" target=\"_blank\" rel=\"noopener\">Plumber<\/a>, written by <a href=\"https:\/\/www.linkedin.com\/in\/gregnbanks\/\" target=\"_blank\" rel=\"noopener\">Greg Banks<\/a> (who happens to be the colleague who mocked me earlier), is a tool written in Perl that can look at process core dumps and help categorize blocks allocated with <a href=\"https:\/\/www.gnu.org\/software\/libc\/\" target=\"_blank\" rel=\"noopener\">glibc<\/a>&#8216;s <a href=\"https:\/\/en.cppreference.com\/w\/c\/memory\/malloc\" target=\"_blank\" rel=\"noopener\">malloc<\/a>. Plumber walks the data structures in glibc and categorizes memory allocations as being one of the following: \u201cFree\u201d (unused), \u201cLeaked\u201d (no pointers to this assigned block), \u201cMaybe Leaked\u201d (pointers point to some part of the assigned block, but not to the beginning), or \u201cReached\u201d (the block is addressable).<\/p>\n<p><a href=\"http:\/\/jemalloc.net\/\" target=\"_blank\" rel=\"noopener\">Jemalloc<\/a> is a malloc implementation developed by Jason Evans (the \u201cJE\u201d part of the jemalloc name). It comes with an impressive set of bells and whistles out of the box; most importantly for our purposes, it includes a set of tools for profiling memory allocations through the call stack. You can configure jemalloc to dump stats at intervals based on time, intervals based on allocations, or whenever a new high watermark of memory has been breached. You can search and categorize these stats on the command line with <a href=\"https:\/\/github.com\/jemalloc\/jemalloc\/wiki\/Use-Case%3A-Leak-Checking\" target=\"_blank\" rel=\"noopener\">jeprof<\/a>, or have jeprof paint you a nice picture to show the call stacks of where memory is going. 
An abbreviated example looks like:<\/p>\n<\/p><\/div>\n<\/p><\/div>\n<\/div>\n<div class=\"resourceImageBlock section\">\n<div class=\"component-anchor-container\">\n  <a class=\"component-anchor\" name=\"post_par_resourceimageblock_1567969950\"><\/a>\n <\/div>\n<ul class=\"resource-image-block single\">\n<li class=\"resource-image\"> <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/content.linkedin.com\/content\/dam\/engineering\/site-assets\/images\/blog\/posts\/2021\/01\/venicememory3.png?resize=750%2C670&#038;ssl=1\" alt=\"flow-chart-showing-memory-usage\" height=\"670\" width=\"750\"  data-recalc-dims=\"1\"> <\/li>\n<\/ul>\n<\/div>\n<div class=\"resourceParagraph section\">\n<div class=\"component-anchor-container\">\n  <a class=\"component-anchor\" name=\"post_par_resourceparagraph_2046212649\"><\/a>\n <\/div>\n<div class=\"resource-text-section\">\n<div class=\"resource-paragraph rich-text\">\n<p>But we didn\u2019t find anything. There&#8217;s a <a href=\"https:\/\/medium.com\/swlh\/native-memory-the-silent-jvm-killer-595913cba8e7\" target=\"_blank\" rel=\"noopener\">Medium article<\/a> in which the author describes having a few weeks of \u201cexistential crisis\u201d followed by a \u201csatisfying conclusion\u201d when they managed to use jemalloc to get to the bottom of their issue. But this sweet relief eluded us. The trials we performed did not seem to indicate any obvious code leaks in our usage of RocksDB or in RocksDB&#8217;s internals. In fact, the amount of memory purportedly in use was well within parameters. So what was the source of our problem!?<\/p>\n<p><b>Maybe a breakthrough?<br \/> <\/b>We did find something though. When we used jemalloc, the problem went away. Resident Memory was no longer perpetually increasing and remained stable. This is why the profiling tools made available through jemalloc didn&#8217;t find anything\u2014there was no longer any problem to find. 
Resident Memory had stabilized when we used jemalloc, and when we canary tested it in production, we saw that nodes which had started using jemalloc were doing significantly better than their cousins which had not.<\/p>\n<\/p><\/div>\n<\/p><\/div>\n<\/div>\n<div class=\"resourceImageBlock section\">\n<div class=\"component-anchor-container\">\n  <a class=\"component-anchor\" name=\"post_par_resourceimageblock_1027820401\"><\/a>\n <\/div>\n<ul class=\"resource-image-block single\">\n<li class=\"resource-image\"> <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/content.linkedin.com\/content\/dam\/engineering\/site-assets\/images\/blog\/posts\/2021\/01\/venicememory4.png?resize=750%2C333&#038;ssl=1\" alt=\"graph-showing-memory-usage-with-versus-without-jemalloc\" height=\"333\" width=\"750\"  data-recalc-dims=\"1\"> <\/li>\n<\/ul>\n<\/div>\n<div class=\"resourceParagraph section\">\n<div class=\"component-anchor-container\">\n  <a class=\"component-anchor\" name=\"post_par_resourceparagraph_1681653260\"><\/a>\n <\/div>\n<div class=\"resource-text-section\">\n<div class=\"resource-paragraph rich-text\">\n<p>While we could all exchange high-fives and go home, that didn\u2019t seem like it would fully solve our problem. Why was it that using jemalloc made this problem go away? To understand why it helped, we needed to look at what changed.<\/p>\n<p>Our servers run a Linux distribution that has glibc as the default allocator. This isn&#8217;t out of the ordinary. In fact, glibc is probably the most pervasive malloc implementation out there, and there are some great resources for understanding how it works. What makes glibc and jemalloc different, though, are their respective design goals. Most importantly, jemalloc tries very hard both to put an upper bound on memory fragmentation and to return memory back to the operating system.<\/p>\n<h2>Returning memory to the OS<\/h2>\n<p>This was a new notion to me. 
Surely a good piece of code, which behaves itself and puts away its toys when it&#8217;s done, will keep its resource usage to a minimum? As it turns out, that isn&#8217;t exactly the case. It really comes down to the behavior of the allocator, and most allocators won&#8217;t necessarily return memory right away to the OS. Jemalloc has a few ways this can be tuned, but by default it uses a time-based strategy for returning memory back to the OS. Memory allocations are serviced first by memory already held by the process that is free for reuse, and then failing that, new memory is acquired from the OS. Given enough time, should a chunk of memory not have anything assigned to it, it will be returned (and if returning resources quickly is important to you, you can configure jemalloc to return the memory immediately).<\/p>\n<p>Now let\u2019s look at how glibc malloc works. It uses <span class=\"monospace\"><a href=\"https:\/\/man7.org\/linux\/man-pages\/man2\/mmap.2.html\" target=\"_blank\" rel=\"noopener\">mmap<\/a>()<\/span> and <span class=\"monospace\"><a href=\"https:\/\/man7.org\/linux\/man-pages\/man2\/brk.2.html\" target=\"_blank\" rel=\"noopener\">brk<\/a>()\/sbrk()<\/span> calls in order to get more memory from the OS. <span class=\"monospace\">mmap()<\/span> will provision a chunk of memory of a given size and return a starting address. <span class=\"monospace\">brk()<\/span> (and <span class=\"monospace\">sbrk()<\/span>) will change the location of the program break, which defines the end of the process&#8217;s data segment. Memory assigned to a process in this way can only be returned back to the OS if <span class=\"monospace\">sbrk()<\/span> is called in such a way as to rein in the end of the data segment (shrink it), or if <span class=\"monospace\">madvise(MADV_DONTNEED)<\/span> or <span class=\"monospace\">munmap<\/span> is called on an <span class=\"monospace\">mmap()<\/span>&#8216;d range. 
This is illustrated in the below diagram.<\/p>\n<\/p><\/div>\n<\/p><\/div>\n<\/div>\n<div class=\"resourceImageBlock section\">\n<div class=\"component-anchor-container\">\n  <a class=\"component-anchor\" name=\"post_par_resourceimageblock_1316558244\"><\/a>\n <\/div>\n<ul class=\"resource-image-block single\">\n<li class=\"resource-image\"> <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/content.linkedin.com\/content\/dam\/engineering\/site-assets\/images\/blog\/posts\/2021\/01\/venicememory5.png?resize=750%2C387&#038;ssl=1\" alt=\"diagram-illustrating-memory-allocation-in-glibc-malloc\" height=\"387\" width=\"750\"  data-recalc-dims=\"1\"> <\/li>\n<\/ul>\n<\/div>\n<div class=\"resourceParagraph section\">\n<div class=\"component-anchor-container\">\n  <a class=\"component-anchor\" name=\"post_par_resourceparagraph_1950357469\"><\/a>\n <\/div>\n<div class=\"resource-text-section\">\n<div class=\"resource-paragraph rich-text\">\n<p>These can only be used safely so long as the address range being returned no longer contains any active allocations in use by the process. If you run <a href=\"https:\/\/man7.org\/linux\/man-pages\/man1\/pmap.1.html\" target=\"_blank\" rel=\"noopener\"><span class=\"monospace\">pmap<\/span><\/a> on a process, you can see where these two mechanisms come into play in the process&#8217;s virtual address space. The first portion starts at a low address, and then after a few lines, it&#8217;ll suddenly go to a high address. High addresses are <span class=\"monospace\">mmap<\/span>&#8216;d and <span class=\"monospace\">brk()<\/span>\/<span class=\"monospace\">sbrk()<\/span> control the lower range.<\/p>\n<p>Smaller allocations are done via <span class=\"monospace\">brk()<\/span>, while larger ones are handled via an <span class=\"monospace\">mmap()<\/span> call. 
Knowing how this worked, I created a bit of simple C code with a pathological pattern. It would allocate objects in two waves, and then it would delete the earlier allocations while keeping around the ones that came later. This renders the address range where they would be allocated only half-utilized by the process, but still uses the memory of all allocations made thus far.<\/p>\n<p>You can watch this in action with the following code and top (though the behavior may differ slightly depending on your system\u2019s allocator):<\/p>\n<\/p><\/div>\n<\/p><\/div>\n<\/div>\n<div class=\"resourceEmbedBlock section\">\n<div class=\"component-anchor-container\">\n  <a class=\"component-anchor\" name=\"post_par_resourceembedblock\"><\/a>\n <\/div>\n<div class=\"resource-embedded-media-container\">\n<div class=\"resource-embedded-media full-width\">\n<div class=\"github-gist\" data-gist-src=\"https:\/\/gist.github.com\/ZacAttack\/8c67b998c90afdb19c715dfe327112d2.js\" data-gist-iframe=\"https:\/\/nonprofit.linkedin.com\/content\/dam\/static-sites\/thirdPartyJS\/github-gists\"><\/div>\n<\/p><\/div>\n<\/p><\/div>\n<\/div>\n<div class=\"resourceParagraph section\">\n<div class=\"component-anchor-container\">\n  <a class=\"component-anchor\" name=\"post_par_resourceparagraph_209698438\"><\/a>\n <\/div>\n<div class=\"resource-text-section\">\n<div class=\"resource-paragraph rich-text\">\n<p>Running the above and charting the memory usage produces the following graph (where each data point is a stop point in the above code).<\/p>\n<\/p><\/div>\n<\/p><\/div>\n<\/div>\n<div class=\"resourceImageBlock section\">\n<div class=\"component-anchor-container\">\n  <a class=\"component-anchor\" name=\"post_par_resourceimageblock_978416107\"><\/a>\n <\/div>\n<ul class=\"resource-image-block single\">\n<li class=\"resource-image\"> <img loading=\"lazy\" decoding=\"async\" 
src=\"https:\/\/i0.wp.com\/content.linkedin.com\/content\/dam\/engineering\/site-assets\/images\/blog\/posts\/2021\/01\/venicememory6.png?resize=750%2C547&#038;ssl=1\" alt=\"chart-showing-memory-usage-with-default-system-allocator-(glibc)\" height=\"547\" width=\"750\"  data-recalc-dims=\"1\"> <\/li>\n<\/ul>\n<\/div>\n<div class=\"resourceParagraph section\">\n<div class=\"component-anchor-container\">\n  <a class=\"component-anchor\" name=\"post_par_resourceparagraph_107949156\"><\/a>\n <\/div>\n<div class=\"resource-text-section\">\n<div class=\"resource-paragraph rich-text\">\n<p><i>Memory usage with default system allocator (glibc)<\/i><\/p>\n<p>The second and third points are after allocating some objects. The last data point is after freeing the first set of allocated objects.<\/p>\n<p>What we found was that, as allocations went up, memory would also go up. However, as objects were deleted, memory would not go back down unless all objects created at the top of the address range were also removed, exposing the stack-like behavior of the glibc allocator. In order to avoid this, you would need to make sure that any allocations that you expected to stick around would not be assigned to a high-order address space. If they were, that memory could not be reclaimed and would stick around. Running the same program with jemalloc shows more predictable results. As objects are allocated, memory goes up, but as objects are deallocated, memory goes down. 
If you download jemalloc or have it installed on your system, you can try it out with the following command (and the above code):<\/p>\n<\/p><\/div>\n<\/p><\/div>\n<\/div>\n<div class=\"resourceEmbedBlock section\">\n<div class=\"component-anchor-container\">\n  <a class=\"component-anchor\" name=\"post_par_resourceembedblock_1123531590\"><\/a>\n <\/div>\n<div class=\"resource-embedded-media-container\">\n<div class=\"resource-embedded-media full-width\">\n<div class=\"github-gist\" data-gist-src=\"https:\/\/gist.github.com\/ZacAttack\/4c680aedf46852bf706d673e4848ccaf.js\" data-gist-iframe=\"https:\/\/nonprofit.linkedin.com\/content\/dam\/static-sites\/thirdPartyJS\/github-gists\"><\/div>\n<\/p><\/div>\n<\/p><\/div>\n<\/div>\n<div class=\"resourceParagraph section\">\n<div class=\"component-anchor-container\">\n  <a class=\"component-anchor\" name=\"post_par_resourceparagraph_1545929251\"><\/a>\n <\/div>\n<div class=\"resource-text-section\">\n<div class=\"resource-paragraph rich-text\">\n<p>Plotting the memory usage again produces the below graph.<\/p>\n<\/p><\/div>\n<\/p><\/div>\n<\/div>\n<div class=\"resourceImageBlock section\">\n<div class=\"component-anchor-container\">\n  <a class=\"component-anchor\" name=\"post_par_resourceimageblock_1044891732\"><\/a>\n <\/div>\n<ul class=\"resource-image-block single\">\n<li class=\"resource-image\"> <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/content.linkedin.com\/content\/dam\/engineering\/site-assets\/images\/blog\/posts\/2021\/01\/venicememory7.png?resize=750%2C508&#038;ssl=1\" alt=\"chart-showing-memory-usage-with-the-jemalloc-allocator\" height=\"508\" width=\"750\"  data-recalc-dims=\"1\"> <\/li>\n<\/ul>\n<\/div>\n<div class=\"resourceParagraph section\">\n<div class=\"component-anchor-container\">\n  <a class=\"component-anchor\" name=\"post_par_resourceparagraph_185278008\"><\/a>\n <\/div>\n<div class=\"resource-text-section\">\n<div class=\"resource-paragraph rich-text\">\n<p><i>Memory 
usage with the jemalloc allocator<\/i><\/p>\n<p>The dip is after deallocating the first batch of objects.<\/p>\n<h2>Memory fragmentation<\/h2>\n<p>Memory fragmentation is a bit of a loaded term and can mean a lot of different things depending on the context: you can have fragmentation of physical memory (RSS), of virtual memory (VSS), or of memory managed by user code. First, let&#8217;s parse what fragmentation is.<\/p>\n<p>Here\u2019s an allegory for the situation. Let&#8217;s say you&#8217;re a city planner and you are trying to divide up a plot of land into houses with addresses. As families show up, you ask Mayor Kernel for some land, which they give to you in a single, large, contiguous block with addresses for the land plots. You start assigning plots from the lowest numbered address. Depending on the size of the family you have to accommodate, you combine adjacent plots into bigger houses for bigger families. Initially this works fine\u2014folks are showing up and everything is filled. At some point, you&#8217;ve assigned all the land, so you ask Mayor Kernel for more. But now, people start leaving even as more come in. Smaller families move out, and if you have a family coming in of the same size, that plot can be reused. However, if a large family arrives, you need adjacent addresses to combine into a house to fit them. The only place this is likely to be available is at the higher-numbered addresses (newer land plots). After a while, you might begin to notice that even though you have a large number of empty houses\/addresses in total, you still need more land, because no remaining stretch of adjacent plots is large enough to fit the incoming families. Families cannot be split up (as allocated memory must be in a contiguous address space, also that would be cruel).<\/p>\n<p>This describes \u201cexternal\u201d fragmentation, but there is another flavor called \u201cinternal\u201d fragmentation. 
Internal fragmentation occurs when space that was assigned in a block is underutilized. Extending the neighborhood allegory, it would be as if we started putting small families in larger, previously-allocated houses. A house that was sized for six originally gets used for a family of four, meaning we\u2019re now underutilizing the space.<\/p>\n<p>This is what is happening when we talk about fragmentation of memory. Memory fragmentation is when your memory is allocated in a large number of non-sequential blocks with gaps that can&#8217;t be used for new allocations due to size differences. The effect is that your process will be assigned an amount of memory where a good percentage is unusable.<\/p>\n<p>A major difference between glibc malloc and jemalloc is that jemalloc is designed to give an &#8220;upper bound&#8221; fragmentation rate. In jemalloc, allocations are categorized by size, and bins are assigned to allocations based on what fits the size requirement the best; it guarantees that internal fragmentation will never be more than <a href=\"http:\/\/jemalloc.net\/jemalloc.3.html#size_classes\" target=\"_blank\" rel=\"noopener\">20% for a size range<\/a>. In glibc there are configurations that can be used to mitigate this (<span class=\"monospace\">mmap()<\/span> threshold, for example), but to be used effectively they require that you know the size ranges of the allocations going on in your process.<\/p>\n<p>So between what memory is being held onto, and what is getting fragmented, how can we figure out what&#8217;s getting left on the floor in our process?<\/p>\n<p><b>Using gdb-heap<br \/> <\/b><a href=\"https:\/\/github.com\/rogerhu\/gdb-heap\" target=\"_blank\" rel=\"noopener\">gdb-heap<\/a> is a fantastic resource. 
It leverages gdb&#8217;s Python shell for executing macros that know how to parse and walk the structures of glibc for either a running process or a core dump. With it you can get all kinds of insight into what&#8217;s happening in glibc&#8217;s malloc for a given process, including things like stats on all the individual arenas and their heaps, as well as allocated chunks.<\/p>\n<p>With some slight tweaks, we can create a simple function borrowing the existing code that prints out all the free and available chunks across all arenas. By writing a small amount of aggregation code, we can build a simple report like the following:<\/p>\n<\/p><\/div>\n<\/p><\/div>\n<\/div>\n<div class=\"resourceEmbedBlock section\">\n<div class=\"component-anchor-container\">\n  <a class=\"component-anchor\" name=\"post_par_resourceembedblock_1149148410\"><\/a>\n <\/div>\n<div class=\"resource-embedded-media-container\">\n<div class=\"resource-embedded-media full-width\">\n<div class=\"github-gist\" data-gist-src=\"https:\/\/gist.github.com\/ZacAttack\/cb44d2c04a37852011381e7a7b6ba680.js\" data-gist-iframe=\"https:\/\/nonprofit.linkedin.com\/content\/dam\/static-sites\/thirdPartyJS\/github-gists\"><\/div>\n<\/p><\/div>\n<\/p><\/div>\n<\/div>\n<div class=\"resourceParagraph section\">\n<div class=\"component-anchor-container\">\n  <a class=\"component-anchor\" name=\"post_par_resourceparagraph_1271696941\"><\/a>\n <\/div>\n<div class=\"resource-text-section\">\n<div class=\"resource-paragraph rich-text\">\n<p>This was from a lab node. We found that with glibc, the total amount of unused memory could go as high as 30% in some cases. This was where our memory growth was coming from!<\/p>\n<h2>Conclusion and future work<\/h2>\n<p>We had originally picked jemalloc in order to leverage its profiling capabilities, but we ended up finding it to be a complete solution. Following this analysis, we now have installed jemalloc across Venice servers at LinkedIn. 
Since picking it up, we no longer see the steady RSS growth in our servers. This post only encompasses part of what we ended up tuning and analyzing to get the most out of our systems&#8217; resources, though. It would be remiss not to have a follow-up talking about how we configured Linux, RocksDB, and the JVM to get the most bang for our buck, and our efforts to tame and track other resource hogs in the system. Keep an eye out for more posts to follow!<\/p>\n<h2>Acknowledgements<\/h2>\n<p>This took an immense amount of work across a number of teams at LinkedIn. Specific shout outs to <a href=\"https:\/\/www.linkedin.com\/in\/olivierlecomte\/\" target=\"_blank\" rel=\"noopener\">Olivier Lecomte<\/a>, <a href=\"https:\/\/www.linkedin.com\/in\/gregnbanks\/\" target=\"_blank\" rel=\"noopener\">Greg Banks<\/a>, and <a href=\"https:\/\/www.linkedin.com\/in\/asinghai\/\" target=\"_blank\" rel=\"noopener\">Ashish Singhai<\/a> for their immense knowledge on this topic and for giving very helpful suggestions for avenues of investigation; <a href=\"https:\/\/www.linkedin.com\/in\/alipoursamadi\/\" target=\"_blank\" rel=\"noopener\">Ali Poursamadi<\/a>, <a href=\"https:\/\/www.linkedin.com\/in\/gaojie-liu-99b39768\/\" target=\"_blank\" rel=\"noopener\">Gaojie Liu<\/a>, <a href=\"https:\/\/www.linkedin.com\/in\/kchung\/\" target=\"_blank\" rel=\"noopener\">Kian Chung<\/a>, <a href=\"https:\/\/www.linkedin.com\/in\/zoniaharris\/\" target=\"_blank\" rel=\"noopener\">Zonia Harris<\/a>, <a href=\"https:\/\/www.linkedin.com\/in\/salil-gokhale-b8a0554\/\" target=\"_blank\" rel=\"noopener\">Salil Gokhale<\/a>, and <a href=\"https:\/\/www.linkedin.com\/in\/vgovindaraj\/\" target=\"_blank\" rel=\"noopener\">Vinoth Govindaraj<\/a> for being all hands on deck for methodically working through this investigation as well as helping to keep the site up and healthy; and also to <a href=\"https:\/\/www.linkedin.com\/in\/yun-sun-87254432\/\" target=\"_blank\" rel=\"noopener\">Yun Sun<\/a> and 
again to <a href=\"https:\/\/www.linkedin.com\/in\/olivierlecomte\/\" target=\"_blank\" rel=\"noopener\">Olivier Lecomte<\/a> and <a href=\"https:\/\/www.linkedin.com\/in\/asinghai\/\" target=\"_blank\" rel=\"noopener\">Ashish Singhai<\/a>, who also helped review this article.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<p><a href=\"https:\/\/engineering.linkedin.com\/blog\/2021\/taming-memory-fragmentation-in-venice-with-jemalloc\" target=\"_blank\" rel=\"noopener\">Read More<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Sometimes, an engineering problem arises that might make us feel like maybe we don&#8217;t know what we&#8217;re doing, or at the very least, forces us out of the comfort zone of our area of expertise. That day came for the Venice team at LinkedIn when we began to notice that some Venice processes would consume&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2021\/01\/28\/taming-memory-fragmentation-in-venice-with-jemalloc\/\">Continue reading <span class=\"screen-reader-text\">Taming memory fragmentation in Venice with Jemalloc<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[1,7],"tags":[],"class_list":["post-174","post","type-post","status-publish","format-standard","hentry","category-external","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":869,"url":"https:\/\/fde.cat\/index.php\/2024\/05\/22\/composable-data-management-at-meta\/","url_meta":{"origin":174,"position":0},"title":"Composable data management at Meta","date":"May 22, 2024","format":false,"excerpt":"In recent years, Meta\u2019s data management systems have evolved into a composable architecture that creates interoperability, promotes reusability, and improves engineering efficiency.\u00a0 We\u2019re sharing how we\u2019ve achieved this, in 
part, by leveraging Velox, Meta\u2019s open source execution engine, as well as work ahead as we continue to rethink our data\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":328,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/ribbon-filter-practically-smaller-than-bloom-and-xor\/","url_meta":{"origin":174,"position":1},"title":"Ribbon filter: Practically smaller than Bloom and Xor","date":"August 31, 2021","format":false,"excerpt":"What the research is: The Ribbon filter is a new data structure that is more space-efficient than the popular Bloom filters that are widely used for optimizing data retrieval. One of the ways that Bloom, and now Ribbon, filters solve real engineering problems is by providing smooth configurability unmatched by\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":506,"url":"https:\/\/fde.cat\/index.php\/2021\/11\/18\/measuring-the-memory-impact-for-hybrid-apps\/","url_meta":{"origin":174,"position":2},"title":"Measuring the Memory Impact for Hybrid Apps","date":"November 18, 2021","format":false,"excerpt":"Memory problems are always challenging to detect and fix for mobile applications, particularly on Android, due to many hardware profiles, OS versions, and OEM skins. With proper memory reporting and analysis, most issues are caught during the development lifecycle. 
Yet if your application is delivering an entire platform, such as\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":632,"url":"https:\/\/fde.cat\/index.php\/2022\/09\/12\/memlab-an-open-source-framework-for-finding-javascript-memory-leaks\/","url_meta":{"origin":174,"position":3},"title":"MemLab: An open source framework for finding JavaScript memory leaks","date":"September 12, 2022","format":false,"excerpt":"We\u2019ve open-sourced MemLab, a JavaScript memory testing framework that automates memory leak detection. Finding and addressing the root cause of memory leaks is important for delivering a quality user experience on web applications. MemLab has helped engineers and developers at Meta improve user experience and make significant improvements in memory\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":748,"url":"https:\/\/fde.cat\/index.php\/2023\/08\/15\/introducing-immortal-objects-for-python\/","url_meta":{"origin":174,"position":4},"title":"Introducing Immortal Objects for Python","date":"August 15, 2023","format":false,"excerpt":"Instagram has introduced Immortal Objects \u2013 PEP-683 \u2013 to Python. Now, objects can bypass reference count checks and live throughout the entire execution of the runtime, unlocking exciting avenues for true parallelism. At Meta, we use Python (Django) for our frontend server within Instagram. 
To handle parallelism, we rely on\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":601,"url":"https:\/\/fde.cat\/index.php\/2022\/06\/20\/transparent-memory-offloading-more-memory-at-a-fraction-of-the-cost-and-power\/","url_meta":{"origin":174,"position":5},"title":"Transparent memory offloading: more memory at a fraction of the cost and power","date":"June 20, 2022","format":false,"excerpt":"-Transparent memory offloading (TMO) is Meta\u2019s data center solution for offering more memory at a fraction of the cost and power of existing technologies -In production since 2021, TMO saves 20 percent to 32 percent of memory per server across millions of servers in our data center fleet We are\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/174","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=174"}],"version-history":[{"count":1,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/174\/revisions"}],"predecessor-version":[{"id":201,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/174\/revisions\/201"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=174"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=174"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=174"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","t
emplated":true}]}}