{"id":570,"date":"2022-05-02T16:30:52","date_gmt":"2022-05-02T16:30:52","guid":{"rendered":"https:\/\/fde.cat\/index.php\/2022\/05\/02\/how-the-cinder-jits-function-inliner-helps-us-optimize-instagram\/"},"modified":"2022-05-02T16:30:52","modified_gmt":"2022-05-02T16:30:52","slug":"how-the-cinder-jits-function-inliner-helps-us-optimize-instagram","status":"publish","type":"post","link":"https:\/\/fde.cat\/index.php\/2022\/05\/02\/how-the-cinder-jits-function-inliner-helps-us-optimize-instagram\/","title":{"rendered":"How the Cinder JIT\u2019s function inliner helps us optimize Instagram"},"content":{"rendered":"<p><span>Since Instagram runs one of the world\u2019s largest deployments of the Django web framework, we have natural interest in finding ways to optimize Python so we can speed up our production application. As part of this effort, we\u2019ve recently open-sourced <\/span><a href=\"https:\/\/github.com\/facebookincubator\/cinder\"><span>Cinder<\/span><\/a><span>, our Python runtime that is a fork of CPython. Cinder includes optimizations like immortal objects, shadowcode (our term for inline caching and quickening), <a href=\"https:\/\/github.com\/facebookincubator\/cinder#static-python\">Static Python<\/a>, and <a href=\"https:\/\/github.com\/facebookincubator\/cinder#strict-modules\">Strict Modules<\/a>. But this blog focuses on the just-in-time (JIT) compiler and its recently released <\/span><a href=\"https:\/\/github.com\/facebookincubator\/cinder\/commit\/f3c50b3938906149b32c9cd36c3a41f0e898b52d\"><span>function inliner<\/span><\/a><span>.<\/span><\/p>\n<p><span>Even with Static Python and the shadowcode enabled in the bytecode interpreter, there is still some overhead that we can get rid of by compiling to native code. There is overhead present in the bytecode dispatch (the switch\/case), the stack model (pushing and popping to pass objects between opcodes), and also in a very generic interpreter \u2014 at least in CPython 3.8. 
So we wrote a JIT compiler to remove a lot of this overhead.<\/span><\/p>\n<h2><span>A bit about the JIT<\/span><\/h2>\n<p><span>Note: If you are already familiar with Python bytecode and JITs, you might want to skip down to \u201c<a href=\"https:\/\/engineering.fb.com\/#inlining\">Inlining and its benefits.\u201d<\/a><\/span><\/p>\n<p><span>The JIT compiles functions one at a time, translating from bytecode to a control-flow graph (CFG), to high-level intermediate representation (HIR), to static single assignment (SSA) HIR, to low-level intermediate representation (LIR), to register-allocated LIR, to assembly.<\/span><\/p>\n<p><span>Translating from bytecode to a register-based HIR removes the stack overhead. Translating to native code removes the dispatch overhead. And several compiler passes, including type inference, specialize the code from its previous generic form. <\/span><span>Let\u2019s walk through how the JIT compiles two Python functions from beginning to end. Take a look at this Python code example<\/span><span>:<\/span><\/p>\n<p>def callee(x):<br \/>\n\treturn x + 1<br \/>\ndef caller():<br \/>\n\treturn callee(3)<\/p>\n<p><span>callee<\/span><span> is a very generic function. It has no idea what the type of <\/span><span>x<\/span><span> is, so it cannot specialize the addition operation. In <\/span><span>caller<\/span><span>, the lookup for the global <\/span><span>callee<\/span><span> must happen anew every time because in Python global variable bindings are mutable. 
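The bytecode listings shown below can be reproduced yourself with the standard dis module (a minimal sketch; exact opcodes vary by CPython version, e.g., BINARY_ADD became BINARY_OP in 3.11, and the listings in this post are from CPython 3.8):

```python
# Reproducing the bytecode listings below with the standard dis module.
# Exact opcodes vary by CPython version; this post shows CPython 3.8 output.
import dis

def callee(x):
    return x + 1

def caller():
    return callee(3)

dis.dis(callee)  # prints callee's bytecode
dis.dis(caller)  # prints caller's bytecode
```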
To get an idea of what this looks like, we can look at the bytecode.<\/span><\/p>\n<p><span>This Python code gets compiled to the following CPython 3.8 bytecode:<\/span><\/p>\n<p>callee:<br \/>\n          \t0 LOAD_FAST            \t0 (x)<br \/>\n          \t2 LOAD_CONST           \t1 (1)<br \/>\n          \t4 BINARY_ADD<br \/>\n          \t6 RETURN_VALUE<\/p>\n<p>caller:<br \/>\n          \t0 LOAD_GLOBAL          \t0 (callee)<br \/>\n          \t2 LOAD_CONST           \t1 (3)<br \/>\n          \t4 CALL_FUNCTION        \t1<br \/>\n          \t6 RETURN_VALUE <\/p>\n<p><span>In this representation, produced by the <\/span><span>dis<\/span><span> module, there are four columns. The first contains the bytecode offset (0, 2, 4, \u2026) inside each function. The second contains the human-readable representation of the opcode. The third contains the byte-wide argument to the opcode. The fourth contains the human-readable, context-aware meaning of the opcode argument. In most cases, these are read off an auxiliary structure in the <\/span><span>PyCodeObject<\/span><span> struct and are not present in the bytecode.<\/span><\/p>\n<p><span>This bytecode normally gets executed in the interpreter. In the bytecode interpreter, the function call path involves a lot of machinery to query properties of the callee function and the call site. This machinery checks that the argument count and parameter count match up, fills in defaults, resolves <\/span><span>__call__<\/span><span> methods for nonfunctions, heap-allocates a call frame, and more. This is rather involved and not always necessary. The JIT often knows more information at compile time and can elide checks that it knows are unnecessary. 
It also can often avoid dynamic lookup for global variables, inline constants into the instruction stream, and use<\/span><a href=\"https:\/\/github.com\/facebookincubator\/cinder\/blob\/1642fffb42a3a5914386d029bc538a79c435d31b\/Include\/internal\/pycore_shadow_frame_struct.h\"> <span>shadow frames<\/span><\/a><span>.\u00a0<\/span>A shadow frame consists of two stack-allocated words containing metadata we can use to reify <span>PyFrameObject<\/span>s. They are pushed and popped in every function prologue and epilogue.<\/p>\n<p><span>Before the JIT can optimize this Python code, it must be transformed from bytecode into a control flow graph. To do this, we first discover basic block boundaries. Jumps, returns, and <\/span><span>raise<\/span><span> terminate basic blocks, which means that the functions above only have one basic block each. Then the JIT does abstract interpretation on the stack-based bytecode to turn it into our infinite register IR.<\/span><\/p>\n<p><span>Below is the initial HIR when translated straight off the bytecode:<\/span><\/p>\n<p># Initial HIR<br \/>\nfun __main__:callee {<br \/>\n  bb 0 {<br \/>\n\tv0 = LoadArg&lt;0; &#8220;x&#8221;&gt;<br \/>\n\tv0 = CheckVar&lt;&#8220;x&#8221;&gt; v0<br \/>\n\tv1 = LoadConst&lt;MortalLongExact[1]&gt;<br \/>\n\tv2 = BinaryOp&lt;Add&gt; v0 v1<br \/>\n\tReturn v2<br \/>\n  }<br \/>\n}<\/p>\n<p>fun __main__:caller {<br \/>\n  bb 0 {<br \/>\n\tv0 = LoadGlobalCached&lt;0; &#8220;callee&#8221;&gt;<br \/>\n\tv0 = GuardIs&lt;0xdeadbeef&gt; v0<br \/>\n\tv1 = LoadConst&lt;MortalLongExact[3]&gt;<br \/>\n\tv2 = VectorCall&lt;1&gt; v0 v1<br \/>\n\tReturn v2<br \/>\n  }<br \/>\n}<\/p>\n<p><span>It has some additional information not present in the bytecode. For starters, it has objects embedded into the instructions (with all pointers replaced with <\/span><span>0xdeadbeef<\/span><span>). 
<\/span><span>LoadConst<\/span><span> is parameterized by the type <\/span><span>MortalLongExact[1]<\/span><span>, which describes precisely the <\/span><span>PyObject*<\/span><span> for <\/span><span>1<\/span><span>. It also has this new <\/span><span>GuardIs<\/span><span> instruction, which has an address. This is automatically inserted after <\/span><span>LoadGlobalCached<\/span><span> and is based on the assumption that globals change infrequently. Global variables are normally stored in a big module-scope dictionary, mapping global names to values. Instead of reloading from the dictionary each time, we can do a fast pointer comparison and leave the JIT (deoptimize) if it fails.<\/span><\/p>\n<p><span>This representation is useful but not entirely what we need. Before we can run our other optimization passes on the HIR, it needs to be converted to SSA. So we run the SSA pass on it, which also does basic flow typing:<\/span><\/p>\n<p># SSA HIR<br \/>\nfun __main__:callee {<br \/>\n  bb 0 {<br \/>\n\tv3:Object = LoadArg&lt;0; &#8220;x&#8221;&gt;<br \/>\n\tv4:Object = CheckVar&lt;&#8220;x&#8221;&gt; v3<br \/>\n\tv5:MortalLongExact[1] = LoadConst&lt;MortalLongExact[1]&gt;<br \/>\n\tv6:Object = BinaryOp&lt;Add&gt; v4 v5<br \/>\n\tReturn v6<br \/>\n  }<br \/>\n}<\/p>\n<p>fun __main__:caller {<br \/>\n  bb 0 {<br \/>\n\tv3:OptObject = LoadGlobalCached&lt;0; &#8220;callee&#8221;&gt;<br \/>\n\tv4:MortalFunc[function:0xdeadbeef] = GuardIs&lt;0xdeadbeef&gt; v3<br \/>\n\tv5:MortalLongExact[3] = LoadConst&lt;MortalLongExact[3]&gt;<br \/>\n\tv6:Object = VectorCall&lt;1&gt; v4 v5<br \/>\n\tReturn v6<br \/>\n  }<br \/>\n}<\/p>\n<p><span>You can see that now variable definitions are annotated with the type that the JIT has inferred. For <\/span><span>LoadConst<\/span><span>, this is the type of the constant. For other operations like <\/span><span>LoadGlobalCached<\/span><span>, it is the type of the global variable when the function was compiled. 
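The guard is necessary because module-level bindings really can change under a cached value. A minimal plain-Python demonstration (not Cinder-specific code):

```python
# Why a cached global needs a guard: module-level bindings are mutable,
# so a cached call target can be invalidated at any time.
def callee(x):
    return x + 1

def caller():
    return callee(3)

assert caller() == 4
callee = lambda x: x * 10  # rebinding the global; a JIT guard would now fail
assert caller() == 30      # the interpreter re-resolves the name on each call
```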
Because of our assumption about the stability of module globals, the JIT can infer function call targets and burn in addresses to generated code (see <\/span><span>MortalFunc[function:0xdeadbeef]<\/span><span> above) after the guard.<\/span><\/p>\n<p class=\"line-numbers\"><span>After SSA, the JIT will pass the HIR to the optimizer. Our current set of optimization passes will remove the <\/span><span>CheckVar<\/span><span> (CPython guarantees arguments will not be null), but that\u2019s about it for these two functions. We can\u2019t optimize away the generic binary operation (<\/span><span>BinaryOp&lt;Add&gt;<\/span><span>) or the generic function call (<\/span><span>VectorCall&lt;1&gt;<\/span><span>). So we get this:<\/span><\/p>\n<p># Final HIR (without inlining)<br \/>\nfun __main__:callee {<br \/>\n  bb 0 {<br \/>\n\tv3:Object = LoadArg&lt;0; &#8220;x&#8221;&gt;<br \/>\n\tv5:MortalLongExact[1] = LoadConst&lt;MortalLongExact[1]&gt;<br \/>\n\tv6:Object = BinaryOp&lt;Add&gt; v3 v5<br \/>\n\tReturn v6<br \/>\n  }<br \/>\n}<\/p>\n<p>fun __main__:caller {<br \/>\n  bb 0 {<br \/>\n\tv3:OptObject = LoadGlobalCached&lt;0; &#8220;callee&#8221;&gt;<br \/>\n\tv4:MortalFunc[function:0xdeadbeef] = GuardIs&lt;0xdeadbeef&gt; v3<br \/>\n\tv5:MortalLongExact[3] = LoadConst&lt;MortalLongExact[3]&gt;<br \/>\n\tv6:Object = VectorCall&lt;1&gt; v4 v5<br \/>\n\tReturn v6<br \/>\n  }<br \/>\n}<\/p>\n<p><span>We can\u2019t optimize these generic operations away because we lack type information. And we lack type information because the JIT is method-at-a-time (as opposed to tracing or some kind of global optimization). Type information and specialization happen only within a function. 
Additionally, functions are compiled prefork, before they are ever run.<\/span><\/p>\n<p><span>But what if we had more information?<\/span><\/p>\n<h2><span><a><\/a>Inlining and its benefits<\/span><\/h2>\n<p><span>If we could inline the body of <\/span><span>callee<\/span><span> into <\/span><span>caller<\/span><span>, we would get more information about the argument to <\/span><span>callee<\/span><span>. It also would remove the function call overhead. For our code here, that is more than enough. On other code, it has other benefits, as well, like removing inline cache pressure (monomorphizing caches in callees) and reducing register stack spills due to the native code calling conventions.<\/span><\/p>\n<p class=\"line-numbers\"><span>If we hypothetically inline <\/span><span>callee<\/span><span> into <\/span><span>caller<\/span><span> manually, it might look something like the following:<\/span><\/p>\n<p># Hypothetical inlined HIR<br \/>\nfun __main__:caller {<br \/>\n  bb 0 {<br \/>\n\tv3:OptObject = LoadGlobalCached&lt;0; &#8220;callee&#8221;&gt;<br \/>\n\tv4:MortalFunc[function:0xdeadbeef] = GuardIs&lt;0xdeadbeef&gt; v3<br \/>\n\tv5:MortalLongExact[3] = LoadConst&lt;MortalLongExact[3]&gt;<br \/>\n\t# Inlined &#8220;callee&#8221;<br \/>\n\tv13:MortalLongExact[1] = LoadConst&lt;MortalLongExact[1]&gt;<br \/>\n\tv16:Object = BinaryOp&lt;Add&gt; v5 v13<br \/>\n\t# End inlined &#8220;callee&#8221;<br \/>\n\tReturn v16<br \/>\n  }<br \/>\n}<\/p>\n<p class=\"line-numbers\"><span>Now we have a lot more information about the types to <\/span><span>BinaryOp<\/span><span>. 
An optimization pass can now specialize this to an opcode called <\/span><span>LongBinaryOp<\/span><span>, which calls <\/span><span>int.__add__<\/span><span> directly:<\/span><\/p>\n<p># Hypothetical inlined+optimized HIR<br \/>\nfun __main__:caller {<br \/>\n  bb 0 {<br \/>\n\tv3:OptObject = LoadGlobalCached&lt;0; &#8220;callee&#8221;&gt;<br \/>\n\tv4:MortalFunc[function:0xdeadbeef] = GuardIs&lt;0xdeadbeef&gt; v3<br \/>\n\tv5:MortalLongExact[3] = LoadConst&lt;MortalLongExact[3]&gt;<br \/>\n\t# Inlined &#8220;callee&#8221;<br \/>\n\tv13:MortalLongExact[1] = LoadConst&lt;MortalLongExact[1]&gt;<br \/>\n\tv16:LongExact = LongBinaryOp&lt;Add&gt; v5 v13<br \/>\n\t# End inlined &#8220;callee&#8221;<br \/>\n\tReturn v16<br \/>\n  }<br \/>\n}<\/p>\n<p><span>This lets us reason better about the memory effects of the binary operation: We know precisely what built-in function it\u2019s calling. Or \u2014 even better \u2014 we could constant fold it completely:<\/span><\/p>\n<p># Hypothetical inlined+optimized HIR II<br \/>\nfun __main__:caller {<br \/>\n  bb 0 {<br \/>\n\tv3:OptObject = LoadGlobalCached&lt;0; &#8220;callee&#8221;&gt;<br \/>\n\tv4:MortalFunc[function:0xdeadbeef] = GuardIs&lt;0xdeadbeef&gt; v3<br \/>\n\t# Inlined &#8220;callee&#8221;<br \/>\n\tv17:MortalLongExact[4] = LoadConst&lt;MortalLongExact[4]&gt;<br \/>\n\t# End inlined &#8220;callee&#8221;<br \/>\n\tReturn v17<br \/>\n  }<br \/>\n}<\/p>\n<p><span>This is pretty neat. With one compiler pass adding more information, the other passes reduced our function call to a constant. 
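The folded result is easy to sanity-check from plain Python. This is an analogy, not Cinder internals: the specialized operation on exact ints behaves like calling int.__add__ directly, which is what makes folding the call to a constant sound:

```python
# Plain-Python sanity check of the folded result: adding exact ints is
# equivalent to calling int.__add__ directly.
def callee(x):
    return x + 1

assert callee(3) == int.__add__(3, 1) == 4  # what the JIT folds to a constant
```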
For now, we still need the <\/span><span>LoadGlobalCached<\/span><span> and <\/span><span>GuardIs<\/span><span> in case somebody changes the definition of <\/span><span>callee<\/span><span>, but they do not take much time.<\/span><\/p>\n<p><span>Now that we have seen what inlining can do, let\u2019s take a look at the concrete implementation inside Cinder.<\/span><\/p>\n<h2><span>How the inliner compiler pass works<\/span><\/h2>\n<p class=\"line-numbers\"><span>Let\u2019s go back to the original nonhypothetical optimized HIR for <\/span><span>caller<\/span><span>. The inliner is a compiler pass that will receive HIR, which looks roughly like this:<\/span><\/p>\n<p># Original HIR, pre-inlining<br \/>\nfun __main__:caller {<br \/>\n  bb 0 {<br \/>\n    v3:OptObject = LoadGlobalCached&lt;0; &#8220;callee&#8221;&gt;<br \/>\n    v4:MortalFunc[function:0xdeadbeef] = GuardIs&lt;0xdeadbeef&gt; v3<br \/>\n    v5:MortalLongExact[3] = LoadConst&lt;MortalLongExact[3]&gt;<br \/>\n    v6:Object = VectorCall&lt;1&gt; v4 v5<br \/>\n    Return v6<br \/>\n  }<br \/>\n}<\/p>\n<p><span>It iterates over all the <\/span><span>VectorCall<\/span><span>s and collects the calls for which the target is known. In this case, <\/span><span>v4<\/span><span> is known to be a particular <\/span><span>function<\/span><span>. We collect all the call sites ahead of time so we are not modifying the CFG as we iterate.<\/span><\/p>\n<p><span>Then, for each call, if the callee can be inlined, we inline the callee into the caller. A function might not be inlinable if, for example, the arguments don\u2019t match the parameters. This means a couple of separate steps:<\/span><\/p>\n<p>1.<span> Construct HIR of the target inside the caller\u2019s CFG, but keep the graphs separate. The caller is already in SSA form, and we need to maintain that invariant, so we SSA-ify the callee\u2019s graph separately. 
We don\u2019t support running SSA on a program twice; otherwise, we would probably run SSA on the whole CFG post-inline. We also rewrite all the <\/span><span>Return<\/span><span> instructions into a single return. This ensures that we only have one entry point into the callee and one exit from the callee.<\/span><\/p>\n<p>fun __main__:caller {<br \/>\n  bb 0 {<br \/>\n    v3:OptObject = LoadGlobalCached&lt;0; &#8220;callee&#8221;&gt;<br \/>\n    v4:MortalFunc[function:0xdeadbeef] = GuardIs&lt;0xdeadbeef&gt; v3<br \/>\n    v5:MortalLongExact[3] = LoadConst&lt;MortalLongExact[3]&gt;<br \/>\n    v6:Object = VectorCall&lt;1&gt; v4 v5<br \/>\n    Return v6<br \/>\n  }<\/p>\n<p>  # Non-linked callee<br \/>\n  bb 1 {<br \/>\n    v7 = LoadArg&lt;0; &#8220;x&#8221;&gt;<br \/>\n    v8 = CheckVar&lt;&#8220;x&#8221;&gt; v7<br \/>\n    v9 = LoadConst&lt;MortalLongExact[1]&gt;<br \/>\n    v10 = BinaryOp&lt;Add&gt; v8 v9<br \/>\n    Return v10<br \/>\n  }<br \/>\n}<\/p>\n<p><strong>2.<\/strong> Split the basic block containing the call instruction after the call instruction. In our example, split <span>bb 0<\/span><span> into:<\/span><\/p>\n<p>fun __main__:caller {<br \/>\n  bb 0 {<br \/>\n\tv3:OptObject = LoadGlobalCached&lt;0; &#8220;callee&#8221;&gt;<br \/>\n\tv4:MortalFunc[function:0xdeadbeef] = GuardIs&lt;0xdeadbeef&gt; v3<br \/>\n\tv5:MortalLongExact[3] = LoadConst&lt;MortalLongExact[3]&gt;<br \/>\n\tv6:Object = VectorCall&lt;1&gt; v4 v5<br \/>\n  }<\/p>\n<p>  # Non-linked callee<br \/>\n  bb 1 {<br \/>\n\tv7 = LoadArg&lt;0; &#8220;x&#8221;&gt;<br \/>\n\tv8 = CheckVar&lt;&#8220;x&#8221;&gt; v7<br \/>\n\tv9 = LoadConst&lt;MortalLongExact[1]&gt;<br \/>\n\tv10 = BinaryOp&lt;Add&gt; v8 v9<br \/>\n\tReturn v10<br \/>\n  }<\/p>\n<p>  bb 2 {<br \/>\n\tReturn v6<br \/>\n  }<br \/>\n}<\/p>\n<p><strong>3. <\/strong>Add bookkeeping instructions and a branch instruction to the callee constructed in step one. Remember shadow frames? 
We use <span>BeginInlinedFunction<\/span><span> and <\/span><span>EndInlinedFunction<\/span><span> to push and pop shadow frames for inlined functions. We also remove the call.<\/span><\/p>\n<p>fun __main__:caller {<br \/>\n  bb 0 {<br \/>\n\tv3:OptObject = LoadGlobalCached&lt;0; &#8220;callee&#8221;&gt;<br \/>\n\tv4:MortalFunc[function:0xdeadbeef] = GuardIs&lt;0xdeadbeef&gt; v3<br \/>\n\tv5:MortalLongExact[3] = LoadConst&lt;MortalLongExact[3]&gt;<br \/>\n\tBeginInlinedFunction<br \/>\n\tBranch&lt;1&gt;<br \/>\n  }<\/p>\n<p>  # Linked callee<br \/>\n  bb 1 (preds 0) {<br \/>\n\tv7 = LoadArg&lt;0; &#8220;x&#8221;&gt;<br \/>\n\tv8 = CheckVar&lt;&#8220;x&#8221;&gt; v7<br \/>\n\tv9 = LoadConst&lt;MortalLongExact[1]&gt;<br \/>\n\tv10 = BinaryOp&lt;Add&gt; v8 v9<br \/>\n\tReturn v10<br \/>\n  }<\/p>\n<p>  bb 2 {<br \/>\n\tEndInlinedFunction<br \/>\n\tReturn v6<br \/>\n  }<br \/>\n}<\/p>\n<p><strong>4.<\/strong> <span>Since <\/span><span>LoadArg<\/span><span> does not make sense for the callee \u2014 there is no function call anymore, so no more args \u2014 rewrite it to an <\/span><span>Assign<\/span><span>. 
Since we checked the arguments against the parameters earlier, we assign directly from the register that held the argument.<\/span><\/p>\n<p>fun __main__:caller {<br \/>\n  bb 0 {<br \/>\n\tv3:OptObject = LoadGlobalCached&lt;0; &#8220;callee&#8221;&gt;<br \/>\n\tv4:MortalFunc[function:0xdeadbeef] = GuardIs&lt;0xdeadbeef&gt; v3<br \/>\n\tv5:MortalLongExact[3] = LoadConst&lt;MortalLongExact[3]&gt;<br \/>\n\tBeginInlinedFunction<br \/>\n\tBranch&lt;1&gt;<br \/>\n  }<\/p>\n<p>  # Linked callee with rewritten LoadArg<br \/>\n  bb 1 (preds 0) {<br \/>\n\tv7 = Assign v5<br \/>\n\tv8 = CheckVar&lt;&#8220;x&#8221;&gt; v7<br \/>\n\tv9 = LoadConst&lt;MortalLongExact[1]&gt;<br \/>\n\tv10 = BinaryOp&lt;Add&gt; v8 v9<br \/>\n\tReturn v10<br \/>\n  }<\/p>\n<p>  bb 2 {<br \/>\n\tEndInlinedFunction<br \/>\n\tReturn v6<br \/>\n  }<br \/>\n}<\/p>\n<p><strong>5.<\/strong> <span>Now we rewrite the inlined <\/span><span>Return<\/span><span> to an <\/span><span>Assign<\/span><span> to the output of the original <\/span><span>VectorCall<\/span><span> instruction. 
Since we only have one <\/span><span>Return<\/span><span> point and do not need to join many of them (they have already been joined in the callee), we can reuse the output of the call instruction.<\/span><\/p>\n<p>fun __main__:caller {<br \/>\n  bb 0 {<br \/>\n\tv3:OptObject = LoadGlobalCached&lt;0; &#8220;callee&#8221;&gt;<br \/>\n\tv4:MortalFunc[function:0xdeadbeef] = GuardIs&lt;0xdeadbeef&gt; v3<br \/>\n\tv5:MortalLongExact[3] = LoadConst&lt;MortalLongExact[3]&gt;<br \/>\n\tBeginInlinedFunction<br \/>\n\tBranch&lt;1&gt;<br \/>\n  }<\/p>\n<p>  # Linked callee with rewritten Return<br \/>\n  bb 1 (preds 0) {<br \/>\n\tv7 = Assign v5<br \/>\n\tv8 = CheckVar&lt;&#8220;x&#8221;&gt; v7<br \/>\n\tv9 = LoadConst&lt;MortalLongExact[1]&gt;<br \/>\n\tv10 = BinaryOp&lt;Add&gt; v8 v9<br \/>\n\tv6 = Assign v10<br \/>\n\tBranch&lt;2&gt;<br \/>\n  }<\/p>\n<p>  bb 2 (preds 1) {<br \/>\n\tEndInlinedFunction<br \/>\n\tReturn v6<br \/>\n  }<br \/>\n}<\/p>\n<p><strong>6.<\/strong> <span>You will notice that despite being straight-line code, we have some unnecessary branches. We run a <\/span><span>CleanCFG<\/span><span> pass to take care of this for us.<\/span><\/p>\n<p>fun __main__:caller {<br \/>\n  bb 0 {<br \/>\n\tv3:OptObject = LoadGlobalCached&lt;0; &#8220;callee&#8221;&gt;<br \/>\n\tv4:MortalFunc[function:0xdeadbeef] = GuardIs&lt;0xdeadbeef&gt; v3<br \/>\n\tv5:MortalLongExact[3] = LoadConst&lt;MortalLongExact[3]&gt;<br \/>\n\tBeginInlinedFunction<br \/>\n\tv7 = Assign v5<br \/>\n\tv8 = CheckVar&lt;&#8220;x&#8221;&gt; v7<br \/>\n\tv9 = LoadConst&lt;MortalLongExact[1]&gt;<br \/>\n\tv10 = BinaryOp&lt;Add&gt; v8 v9<br \/>\n\tv6 = Assign v10<br \/>\n\tEndInlinedFunction<br \/>\n\tReturn v6<br \/>\n  }<br \/>\n} <\/p>\n<p><strong>7.\u00a0<\/strong><span>We have now added new untyped code to our typed CFG. 
To run other optimization passes, we need to do another round of type inference and reflow the types.<\/span><\/p>\n<p>fun __main__:caller {<br \/>\n  bb 0 {<br \/>\n\tv3:OptObject = LoadGlobalCached&lt;0; &#8220;callee&#8221;&gt;<br \/>\n\tv4:MortalFunc[function:0xdeadbeef] = GuardIs&lt;0xdeadbeef&gt; v3<br \/>\n\tv5:MortalLongExact[3] = LoadConst&lt;MortalLongExact[3]&gt;<br \/>\n\tBeginInlinedFunction<br \/>\n\tv7:MortalLongExact[3] = Assign v5<br \/>\n\tv8:MortalLongExact[3] = CheckVar&lt;&#8220;x&#8221;&gt; v7<br \/>\n\tv9:MortalLongExact[1] = LoadConst&lt;MortalLongExact[1]&gt;<br \/>\n\tv10:Object = BinaryOp&lt;Add&gt; v8 v9<br \/>\n\tv6:Object = Assign v10<br \/>\n\tEndInlinedFunction<br \/>\n\tReturn v6<br \/>\n  }<br \/>\n}<\/p>\n<p><strong>8.<\/strong> <span>Now that we once more have fully typed code, we can run more optimization passes. <\/span><span>CopyPropagation<\/span><span> will take care of the useless <\/span><span>Assign<\/span><span>s, <\/span><span>Simplify<\/span><span> will take care of the unnecessary <\/span><span>CheckVar<\/span><span>, and then we are done!<\/span><\/p>\n<p>fun __main__:caller {<br \/>\n  bb 0 {<br \/>\n\tv3:OptObject = LoadGlobalCached&lt;0; &#8220;callee&#8221;&gt;<br \/>\n\tv4:MortalFunc[function:0xdeadbeef] = GuardIs&lt;0xdeadbeef&gt; v3<br \/>\n\tv5:MortalLongExact[3] = LoadConst&lt;MortalLongExact[3]&gt;<br \/>\n\tBeginInlinedFunction<br \/>\n\tv9:MortalLongExact[1] = LoadConst&lt;MortalLongExact[1]&gt;<br \/>\n\tv10:LongExact = LongBinaryOp&lt;Add&gt; v5 v9<br \/>\n\tEndInlinedFunction<br \/>\n\tReturn v10<br \/>\n  }<br \/>\n}<\/p>\n<p><span>Here we have compiled <\/span><span>callee<\/span><span> in the context of <\/span><span>caller<\/span><span>, but <\/span><span>callee<\/span><span> might not always be inlined into its other callers. 
It can still be compiled as a normal standalone function.<\/span><\/p>\n<h2><span>What makes inlining tricky<\/span><\/h2>\n<p><span>Inlining is not all fun and graph transformations. There are Python APIs and profiling tools that rely on having an accurate view of the Python stack, as if the inlining never happened.<\/span><\/p>\n<p>Sampling profilers:<span> We have a sampling stack profiler that cannot run any code and needs to be able to walk a chain of pointers and discover what functions are currently running. We do this with shadow frames.<\/span><\/p>\n<p>Deoptimization metadata:<span> When a function raises an exception or otherwise transfers control to the interpreter (deoptimization), it needs to materialize a <\/span><span>PyFrameObject<\/span><span> with all the variables that should have existed at the time (but might have been erased by the JIT), line numbers, etc. This requires information about what any given point in the machine code refers back to in the Python code.<\/span><\/p>\n<p>Coroutines:<span> Inlining normal functions into coroutines is fine because functions execute by starting at the top and continuing until they are finished. We can replace a <\/span><span>Call<\/span><span> with its call target. But coroutines have to yield control every so often and also materialize a <\/span><span>coroutine<\/span><span> or <\/span><span>generator<\/span><span> object when called. Inlining coroutines themselves is therefore a little trickier, and we will tackle that in the future. Inlining normal functions into coroutines is not exactly tricky, but it is more work because coroutines have a slightly different frame layout and we have not yet modified the frame to support multiple shadow frames.<\/span><\/p>\n<p>Frame materialization outside the interpreter:<span> It turns out Python allows both Python programmers and C extension developers to get a Python frame whenever they want. 
For Python programmers, CPython exposes (as an implementation detail, not a guarantee, to be fair) <\/span><span>sys._getframe<\/span><span>. For C extension programmers and standard library developers, <\/span><span>PyEval_GetFrame<\/span><span> is available. This means that even though there might be no deoptimization events in the middle of an inlined function, some managed or native code might decide to request that a frame be created anyway. And the JIT-ed code, which otherwise would have expected the shadow frames to still be around, would also have to handle the case where they have been replaced by real Python frames.<\/span><\/p>\n<p>When to inline:<span> Inlining every callee into its caller is not necessarily the best move. Some callees never get called, and inlining them would increase the code size. Even if they do get called, it may still make more sense to leave the function call than to bloat the caller\u2019s code. Runtimes often have heuristics and rules about when to inline functions, and people spend years tuning them for different workloads to optimize performance.<\/span><\/p>\n<p>But what if someone changes the target\u2019s __code__?<span> We have a hook to detect this from Python code. In this case, we invalidate the JIT-ed code for that function. For inlined functions, we can either check that it hasn\u2019t changed before every execution (slow) or react to changes by patching our generated code. Neither of these exists yet, but they are not so hard to implement. For native (C) extensions, we have to trust that the extensions are well behaved and will notify us after messing with the <\/span><span>PyFunctionObject<\/span><span> fields.<\/span><\/p>\n<h3>Surprisingly not-tricky things<\/h3>\n<p>What if the callee changes?<span> Python is a very dynamic language. Variables can change type over time, programmers can write to global variables in other modules, etc. But the inliner does not have to deal with this at all. 
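Both kinds of dynamism described above are observable from plain CPython (the function names here are illustrative):

```python
# Two dynamic behaviors that JIT-ed code must respect, in plain CPython:
# frames can be requested at any time, and __code__ can be swapped at runtime.
import sys

def f():
    # Any Python code can ask for its own frame (a CPython implementation detail).
    return sys._getframe().f_code.co_name

def g():
    return 1

def h():
    return 2

assert f() == "f"
g.__code__ = h.__code__  # rebinding a function's code object at runtime
assert g() == 2          # g now runs h's code
```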
The inliner starts with the knowledge that the callee is known and works from there. If that value changes at runtime, the guard instruction maintains the invariant for the native code by deoptimizing into the interpreter.<\/span><\/p>\n<h2>Looking forward<\/h2>\n<p><span>It\u2019s time to collect data about the performance characteristics of our workload and figure out whether we can develop good heuristics about which functions to inline. There are papers to read and evaluate.\u00a0<\/span><\/p>\n<p><span>Take a look at our <\/span><a href=\"https:\/\/github.com\/facebookincubator\/cinder\"><span>GitHub repo<\/span><\/a><span>, and play around with Cinder. We have included a Dockerfile and <\/span><a href=\"https:\/\/github.com\/facebookincubator\/cinder\/pkgs\/container\/cinder\"><span>prebuilt Docker image<\/span><\/a><span> to make this easier.<\/span><\/p>\n<p>The post <a href=\"https:\/\/engineering.fb.com\/2022\/05\/02\/open-source\/cinder-jits-instagram\/\">How the Cinder JIT\u2019s function inliner helps us optimize Instagram<\/a> appeared first on <a href=\"https:\/\/engineering.fb.com\/\">Engineering at Meta<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Since Instagram runs one of the world\u2019s largest deployments of the Django web framework, we have natural interest in finding ways to optimize Python so we can speed up our production application. As part of this effort, we\u2019ve recently open-sourced Cinder, our Python runtime that is a fork of CPython. 
Cinder includes optimizations like immortal&hellip; <a class=\"more-link\" href=\"https:\/\/fde.cat\/index.php\/2022\/05\/02\/how-the-cinder-jits-function-inliner-helps-us-optimize-instagram\/\">Continue reading <span class=\"screen-reader-text\">How the Cinder JIT\u2019s function inliner helps us optimize Instagram<\/span><\/a><\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":""},"categories":[7],"tags":[],"class_list":["post-570","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":768,"url":"https:\/\/fde.cat\/index.php\/2023\/10\/05\/meta-contributes-new-features-to-python-3-12\/","url_meta":{"origin":570,"position":0},"title":"Meta contributes new features to Python 3.12","date":"October 5, 2023","format":false,"excerpt":"Python 3.12 is out! It includes new features and performance improvements \u2013 some contributed by Meta \u2013 that we believe will benefit all Python users. We\u2019re sharing details about these new features that we worked closely with the Python community to develop. This week\u2019s release of Python 3.12 marks a\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":824,"url":"https:\/\/fde.cat\/index.php\/2024\/02\/12\/meta-loves-python\/","url_meta":{"origin":570,"position":1},"title":"Meta loves Python","date":"February 12, 2024","format":false,"excerpt":"By now you\u2019re already aware that Python 3.12 has been released. But did you know that several of its new features were developed by Meta? 
Meta engineer Pascal Hartig (@passy) is joined on the Meta Tech Podcast by Itamar Oren and Carl Meyer, two software engineers at Meta, to discuss\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":814,"url":"https:\/\/fde.cat\/index.php\/2024\/01\/18\/lazy-is-the-new-fast-how-lazy-imports-and-cinder-accelerate-machine-learning-at-meta\/","url_meta":{"origin":570,"position":2},"title":"Lazy is the new fast: How Lazy Imports and Cinder accelerate machine learning at Meta","date":"January 18, 2024","format":false,"excerpt":"At Meta, the quest for faster model training has yielded an exciting milestone: the adoption of Lazy Imports and the Python Cinder runtime. The outcome? Up to 40 percent time to first batch (TTFB) improvements, along with a 20 percent reduction in Jupyter kernel startup times. This advancement facilitates swifter\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":897,"url":"https:\/\/fde.cat\/index.php\/2024\/07\/16\/ai-lab-the-secrets-to-keeping-machine-learning-engineers-moving-fast\/","url_meta":{"origin":570,"position":3},"title":"AI Lab: The secrets to keeping machine learning engineers moving fast","date":"July 16, 2024","format":false,"excerpt":"The key to developer velocity across AI lies in minimizing time to first batch (TTFB) for machine learning (ML) engineers. AI Lab is a pre-production framework used internally at Meta. 
It allows us to continuously A\/B test common ML workflows \u2013 enabling proactive improvements and automatically preventing regressions on TTFB.\u00a0\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":795,"url":"https:\/\/fde.cat\/index.php\/2023\/11\/21\/writing-and-linting-python-at-scale\/","url_meta":{"origin":570,"position":4},"title":"Writing and linting Python at scale","date":"November 21, 2023","format":false,"excerpt":"Python plays a big part at Meta. It powers Instagram\u2019s backend and plays an important role in our configuration systems, as well as much of our AI work. Meta even made contributions to Python 3.12, the latest version of Python. On this episode of the\u00a0Meta Tech Podcast, Meta engineer Pascal\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":748,"url":"https:\/\/fde.cat\/index.php\/2023\/08\/15\/introducing-immortal-objects-for-python\/","url_meta":{"origin":570,"position":5},"title":"Introducing Immortal Objects for Python","date":"August 15, 2023","format":false,"excerpt":"Instagram has introduced Immortal Objects \u2013 PEP-683 \u2013 to Python. Now, objects can bypass reference count checks and live throughout the entire execution of the runtime, unlocking exciting avenues for true parallelism. At Meta, we use Python (Django) for our frontend server within Instagram. 
To handle parallelism, we rely on\u2026","rel":"","context":"In &quot;Technology&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/570","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/comments?post=570"}],"version-history":[{"count":0,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/posts\/570\/revisions"}],"wp:attachment":[{"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/media?parent=570"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/categories?post=570"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fde.cat\/index.php\/wp-json\/wp\/v2\/tags?post=570"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}