{"id":301,"date":"2021-08-31T14:40:03","date_gmt":"2021-08-31T14:40:03","guid":{"rendered":"https:\/\/fde.cat\/?p=301"},"modified":"2021-08-31T14:40:03","modified_gmt":"2021-08-31T14:40:03","slug":"reverse-debugging-at-scale","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/reverse-debugging-at-scale\/","title":{"rendered":"Reverse debugging at scale"},"content":{"rendered":"<p><span>Say you receive an email notification that a service is crashing just after your last code change deploys. The crash happens in only 0.1 percent of the servers where it runs. But you\u2019re at a large-scale company, so 0.1 percent equals thousands of servers \u2014 and\u00a0this issue is going to be hard to reproduce. Several hours later, you still can\u2019t reproduce it, and you have spent a full day chasing this issue.<\/span><\/p>\n<p><span>This is where reverse debugging kicks in. Existing methods allow engineers to record a paused (or crashed) program, then rewind and replay to locate the root cause. However, for large-scale companies like Facebook, these solutions impose too much overhead to be practical in production. To address this, we developed a new technique that allows engineers to trace a failing run and inspect its history to find the root cause without the hassle of rerunning the program, which saves an enormous amount of time. We do this by efficiently tracing the CPU activity on our servers and, upon crashes, saving the process history, which is later displayed in a human-readable format with the help of the <\/span><a href=\"https:\/\/lldb.llvm.org\/\"><span>LLDB<\/span><\/a><span> debugger, offering everything from instruction history views to reverse debugging.<\/span><\/p>\n<h2><span>Sounds great, but how does it work?<\/span><\/h2>\n<p><span>To help develop this, we leveraged <\/span><a href=\"https:\/\/software.intel.com\/content\/www\/us\/en\/develop\/blogs\/processor-tracing.html\"><span>Intel Processor Trace<\/span><\/a><span> (Intel PT), which uses specialized hardware to accelerate the collection of the steps in a program, allowing us to use it in production and at scale. However, tracing a program efficiently is just the first step.<\/span><\/p>\n<h3><span>Continuous collection and quick storage in production<\/span><\/h3>\n<p><span>As we don\u2019t know when a crash will happen nor which process will crash, we need to continuously collect an Intel PT trace of all running processes. In order to bound the memory required to operate, we store the trace in a circular buffer where new data overwrites old data.<\/span><\/p>\n<p>When multiple processes (A and B) are running concurrently, the trace data is stored in the buffer. At t+8, Process B\u2019s data begins to overwrite the oldest data (Process A\u2019s) in the buffer.<\/p>\n<p><span>In large servers, like<\/span><a href=\"https:\/\/engineering.fb.com\/2019\/03\/14\/data-center-engineering\/accelerating-infrastructure\/\"> <span>the ones used to train AI models<\/span><\/a><span>, Intel PT can generate hundreds of megabytes of data per second even in its compressed format. When a process crashes, the collector must pause the collection and copy the content of the buffer for the crashed process. To do this, the operating system needs to notify our collector that a process has crashed, which takes time. 
<p><span>To address this, we designed an <\/span><a href=\"https:\/\/ebpf.io\/\"><span>eBPF<\/span><\/a><span> kernel probe, which is a program that is triggered upon certain kernel events, to notify our collector within a few milliseconds that a crash happened, thus maximizing the amount of information in the trace relevant to the crash. We tried several approaches, but none was as fast as this one. An additional design consideration is that crashes often leave the computer in a bad state, which prevents us from doing any analysis of the collected trace on the fly. That\u2019s why we decided to store the traces and the corresponding binaries of the crashed processes in our data centers, so that they can be analyzed later on a different machine.<\/span><\/p>\n<h3><span>Decoding and symbolication<\/span><\/h3>\n<p><span>How do we transform this data into instructions that can be analyzed? Raw traces are undecipherable to engineers. Humans need context such as source line information and function names. To add context, we built a component in LLDB that is able to receive a trace and reconstruct the instructions, along with their symbols. We recently started open-sourcing this work. You can find it <\/span><a href=\"https:\/\/github.com\/llvm\/llvm-project\/tree\/main\/lldb\/source\/Plugins\/Trace\/intel-pt\"><span>here<\/span><\/a><span>.<\/span><\/p>\n<p>A trace primarily contains information about which branches were taken and which weren\u2019t. We compare that information with all the instructions from the original binary and reconstruct the instructions that were executed by the program. Later, with the help of LLDB\u2019s symbolication stack, we can obtain the corresponding source code information and show it to the engineer in a readable fashion.<\/p>\n<p><span>With this flow, we are able to convert a raw trace file into a symbolicated, human-readable instruction history.<\/span><\/p>\n<p><span>Symbolicating the instructions is only the first part of LLDB\u2019s work. The next part is reconstructing the function call history. For this, we built an algorithm that is able to walk through the instructions and construct a tree of function calls, i.e., which function invoked which other function and when. The most interesting part of it is that traces can start at an arbitrary moment in the program execution, and not necessarily at the beginning.<\/span><\/p>\n<p>Sample function call tree, where each vertical segment contains instructions and call sites are displayed with arrows.<\/p>\n<p><span>This tree can help us quickly answer questions like what the stack trace is at a given point in history, which is solved simply by walking up the tree. On the other hand, a more complicated question is figuring out where a reverse-step-over stops.<\/span><\/p>\n<p><span>Imagine that you are at line 16 of the little snippet sketched below. If you reverse-step-over, you should stop at either line 13 or 15, depending on whether the <\/span><span>if <\/span><span>was taken or not. A naive approach would be to inspect the instructions in the history one by one until one of these lines is found, but this can be very inefficient, as the function <\/span><span>foo<\/span><span> can include millions of instructions. Instead, we can use the aforementioned tree to move across lines without jumping into calls. Unlike the previous approach, traversing the tree is almost trivial.<\/span><\/p>
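<p><span>A hypothetical snippet with that shape might look like the following, with the relevant lines marked in comments (<\/span><span>foo<\/span><span> stands in for any call that executes a huge number of instructions):<\/span><\/p>\n<pre><code>#include &lt;cstdint&gt;

// Hypothetical example: foo() stands in for a call that may execute
// millions of instructions before returning.
std::uint64_t foo(std::uint64_t n) {
  std::uint64_t total = 0;
  for (std::uint64_t i = 0; i != n; ++i)
    total += i;
  return total;
}

std::uint64_t process(bool flag) {
  std::uint64_t value;
  if (flag)
    value = foo(1000000);  // line 13: the if branch ends with a call to foo()
  else
    value = foo(2000000);  // line 15: the else branch also ends with a call to foo()
  return value;            // line 16: a reverse-step-over from here should stop at
                           // line 13 or line 15, depending on which branch ran
}

int main() {
  process(true);
  return 0;
}
<\/code><\/pre>\n<p><span>Because each call to <\/span><span>foo<\/span><span> is a single subtree in the reconstructed call tree, the debugger can skip over it in one step instead of scanning every one of its instructions.<\/span><\/p>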
<p><span>Breakpoint support has also been added. Suppose you are in the middle of a debug session and you want to go back in time to the most recent call to <\/span><span>function_a<\/span><span>. You can set a breakpoint on <\/span><span>function_a<\/span><span> and continue backward; the debugger stops at the most recent invocation recorded in the trace.<\/span><\/p>\n<p><span>Finally, we are planning to integrate this flow with VSCode at some point, for a rich visual reverse debugging experience.<\/span><\/p>\n<h3><span>Latency analysis and profiling<\/span><\/h3>\n<p><span>An execution trace contains more information than other representations of control flow, like call stacks. Furthermore, as part of our traces we are collecting timing information with great accuracy, allowing engineers to know not only the sequence of operations (or functions) executed but also their timestamps, enabling a new use case: latency analysis.<\/span><\/p>\n<p><span>Imagine a service that handles different types of requests, each of which fetches data through an internal cache. Under certain conditions, the cache must be flushed, and the next fetch call will take much longer.<\/span><\/p>\n<p><span>The server receives a large number of these requests, and you want to understand the long tail of the latency in your service (e.g., P99). You get your profiler and collect some sampled call stacks for the two main types of requests. The resulting profiles do not tell you how the two request types differ, because call stack sampling shows only aggregated numbers. We need to convert traces into execution path information and symbolicate the instructions along the path.<\/span><\/p>\n<p><span>An execution trace, in contrast, makes it easy to understand what\u2019s happening: Request B flushes data before getting it.<\/span><\/p>\n<p><span>While a debugger can help you move within the trace step-wise, a visualization can help you see the big picture easily. That\u2019s why we are also working to add this kind of call stack trace view to our performance analysis tool,<\/span><a href=\"https:\/\/www.facebook.com\/atscaleevents\/videos\/996197807391867\/\"> <span>Tracery<\/span><\/a><span>. We are building a tool that can produce this tracing and timing data, and combine it with the symbolication produced by LLDB. Our goal is to offer developers a way to see all this information at a glance, and then zoom in and out to get the data they need at the level of detail they need.<\/span><\/p>\n<p><span>Remember the undesirable scenario at the beginning of this post? Now imagine that you receive that email notification telling you that your service is crashing in 0.1 percent of machines, but this time it comes with a \u201cReverse debug on VSCode\u201d button. You click it and then move around the history of the program until you find the function call that shouldn\u2019t have happened or an if-clause that shouldn\u2019t have executed, and get it fixed within a few minutes.<\/span><\/p>\n<p><span>This kind of technology enables us to solve problems at a level we didn\u2019t consider possible before. Most of our earlier work in this space incurred considerable overhead, which limited our ability to scale. We are now building interactive, visual debugging tools that can help developers solve production and development issues in a much more efficient way. 
This is an exciting endeavor that will help us scale even more.<\/span><\/p>\n<p>The post <a href=\"https:\/\/engineering.fb.com\/2021\/04\/27\/developer-tools\/reverse-debugging\/\">Reverse debugging at scale<\/a> appeared first on <a href=\"https:\/\/engineering.fb.com\/\">Facebook Engineering<\/a>.<\/p>\n<p><a href=\"https:\/\/engineering.fb.com\/2021\/04\/27\/developer-tools\/reverse-debugging\/\">Read More<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Say you receive an email notification that a service is crashing just after your last code change deploys. The crash happens in only 0.1 percent of the servers where it runs. But you\u2019re at a large-scale company, so 0.1 percent equals thousands of servers \u2014 and\u00a0this issue is going to be hard to reproduce. Several&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2021\/08\/31\/reverse-debugging-at-scale\/\">Continue reading <span class=\"screen-reader-text\">Reverse debugging at scale<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-301","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":806,"url":"https:\/\/fde.cat\/index.php\/2023\/12\/19\/ai-debugging-at-meta-with-hawkeye\/","url_meta":{"origin":301,"position":0},"title":"AI debugging at Meta with HawkEye","date":"December 19, 2023","format":false,"excerpt":"HawkEye is the powerful toolkit used internally at Meta for monitoring, observability, and debuggability of the end-to-end machine learning (ML) workflow that powers ML-based products. HawkEye supports recommendation and ranking models across several products at Meta. Over the past two years, it has facilitated order of magnitude improvements in the\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":263,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/minesweeper-automates-root-cause-analysis-as-a-first-line-defense-against-bugs\/","url_meta":{"origin":301,"position":1},"title":"Minesweeper automates root cause analysis as a first-line defense against bugs","date":"August 31, 2021","format":false,"excerpt":"Root cause analysis (RCA) is an important part of fixing any bug. After all, you can\u2019t solve a problem without getting to the heart of it. But RCA isn\u2019t always simple, especially at a scale like Facebook\u2019s. When billions of people are using an app on a variety of platforms\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":271,"url":"https:\/\/fde.cat\/index.php\/2021\/08\/31\/faster-more-efficient-systems-for-finding-and-fixing-regressions\/","url_meta":{"origin":301,"position":2},"title":"Faster, more efficient systems for finding and fixing regressions","date":"August 31, 2021","format":false,"excerpt":"Every workday, Facebook engineers commit thousands of diffs (which is a change consisting of one or more files) into production. This code velocity allows us to rapidly ship new features, deliver bug fixes and optimizations, and run experiments. 
However, a natural downside to moving quickly in any industry is the\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":492,"url":"https:\/\/fde.cat\/index.php\/2021\/10\/20\/facebook-engineers-receive-2021-ieee-computer-society-cybersecurity-award-for-static-analysis-tools\/","url_meta":{"origin":301,"position":3},"title":"Facebook engineers receive 2021 IEEE Computer Society Cybersecurity Award for static analysis tools","date":"October 20, 2021","format":false,"excerpt":"Until recently, static analysis tools weren\u2019t seen by our industry as a reliable element of securing code at scale. After nearly a decade of investing in refining these systems, I\u2019m so proud to celebrate our engineering teams today for being awarded the IEEE Computer Society\u2019s Cybersecurity Award for Practice for\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":886,"url":"https:\/\/fde.cat\/index.php\/2024\/06\/24\/leveraging-ai-for-efficient-incident-response\/","url_meta":{"origin":301,"position":4},"title":"Leveraging AI for efficient incident response","date":"June 24, 2024","format":false,"excerpt":"We\u2019re sharing how we streamline system reliability investigations using a new AI-assisted root cause analysis system. The system uses a combination of heuristic-based retrieval and large language model-based ranking to speed up root cause identification during investigations. Our testing has shown this new system achieves 42% accuracy in identifying root\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":767,"url":"https:\/\/fde.cat\/index.php\/2023\/10\/03\/automated-server-decommissioning-achieving-carbon-reduction-at-scale-for-a-greener-future\/","url_meta":{"origin":301,"position":5},"title":"Automated Server Decommissioning: Achieving Carbon Reduction at Scale for a Greener Future","date":"October 3, 2023","format":false,"excerpt":"By Kristina Fronczyk, Emily Collier, and Scott Nyberg The fossil fuels sector is largely linked to high carbon footprints and greenhouse gas emissions. 
However, the tech sector too plays a major role in energy usage and carbon emissions \u2014 projected to account for up to 20% of global energy demands\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/301","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=301"}],"version-history":[{"count":1,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/301\/revisions"}],"predecessor-version":[{"id":409,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/301\/revisions\/409"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=301"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=301"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=301"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}