June 8, 2014

Fun with Python bytecode

The Python programming language is now present everywhere. While significantly slower than lower-level languages like C/C++, its ease of use, its power and its many libraries make it an ideal choice for many projects. Yet in some cases, performance does matter, and many solutions have been proposed to address this. One of the most common is the use of an underlying, heavily optimized C library along with Python bindings (numpy, scipy, opencv, etc.). Another is the addition of performance-related features to the interpreter, like the JIT or stackless mode of PyPy. But what do you do when you can apply neither of them, yet still want more performance without sacrificing Python's zen and simplicity?

One of the answers, at least in the present case, is to roll up our sleeves, fasten our seat belts, and dive into the mysteries of the Python bytecode.

Wait, what?

You have probably already heard that Python, unlike Fortran, C++ or Go, is an interpreted language. While not completely wrong, that is not entirely true either. You could create a Python interpreter that works in a purely interpreted way, but that is not how it is done in most current Python implementations. The Python code is first compiled into a series of simpler instructions. These instructions are then interpreted by a virtual machine. The main reason for this design choice is speed: directly interpreting high-level Python code each time it is executed (think for instance of a loop or a function executed many times) would be inefficient. Moreover, different Python constructs can turn out to give the same result. For instance, an if/elif statement produces exactly the same result as an if inside the else clause of another condition. A list comprehension has the same behavior as a traditional loop – except for some underlying details in the current implementations, but it could be exactly the same thing. The use of a simplified, intermediate language allows the actual interpreter to be simpler, easier to understand and maintain, and faster, while remaining entirely hidden from the user in most cases. With CPython, the most visible effect is the creation of *.pyc files, used to store the bytecode in order to avoid a recompilation step each time a file is executed.

It is to be noted that this design is not Python-specific: many other prominent languages use it too (Java, Ruby, etc.). All these designs are more or less identical, the steps involved being a) the parsing of the high-level language into an intermediate, simpler representation, and b) the interpretation of this intermediate representation to actually run the program. One could note that Python makes use of a fancier design, translating the Python code into an Abstract Syntax Tree (AST) before creating the bytecode, but that is out of the scope of this post. However, those interested in CPython internals can refer to the CPython compiler developer documentation.

In CPython 3.4 (the reference interpreter), the bytecode is based on a stack representation and 101 different opcodes. Python 2.7 uses 18 more, mainly for slicing operations, which were separate from the item-based operations, and for the print and exec statements, which became functions in Python 3. For the rest of this post, we will use CPython 3.4. While the explanations provided here mostly apply to other versions and interpreters too, they may not be totally accurate for them.

Let's take a look at an actual bytecode snippet. Suppose you have the following simple function:
def myMathOperator(a, b, c):
    return a * (b + c) ** 2
We can observe the bytecode generated by Python by using the __code__ attribute:
In [9]: myMathOperator.__code__.co_code
Out[9]: b'|\x00\x00|\x01\x00|\x02\x00\x17d\x01\x00\x13\x14S'

Ok... While expected (since the bytecode is basically a sequence of bytes), the result is not exactly readable. Fortunately, there is a helper module in the Python standard library which can render it in a much more understandable form. This module is named dis, for disassembly. Let's see what it can do:
In [12]: import dis
In [13]: print(dis.Bytecode(myMathOperator.__code__).dis())
  2           0 LOAD_FAST                0 (a)
              3 LOAD_FAST                1 (b)
              6 LOAD_FAST                2 (c)
              9 BINARY_ADD
             10 LOAD_CONST               1 (2)
             13 BINARY_POWER
             14 BINARY_MULTIPLY
             15 RETURN_VALUE

Well, that's definitely better! The dis module is able to translate the bytecode into its opcodes and their arguments, which are far more readable. However, to understand what an opcode does, we must place ourselves in a stack context. Most opcodes modify the stack by either adding or removing elements (or both), or by changing the value of the top-of-stack pointer. For instance, LOAD_FAST is described as:
Pushes a reference to the local co_varnames[var_num] onto the stack.

For now, we can forget the reference to co_varnames (we will talk about it later on), and just retain that this opcode fetches a variable and puts it onto the stack. In our example, the stack is initially empty since it is the beginning of the function (we will represent it as [ ]). Supposing that the values of the arguments a, b and c are respectively 12, 7 and 1, the stack will contain [12, 7, 1] after the first three instructions are executed.

On the next line, we reach a new opcode, BINARY_ADD. The documentation says that it:
Implements TOS = TOS1 + TOS

Ok, that's a bit less clear. Here, TOS means “Top Of the Stack”. So, basically, it takes the value at the top of the stack and the second value from the top (TOS1), adds them, and pushes the result back on the top of the stack. Applying this operation to our [12, 7, 1] stack, we obtain [12, 8].

Moving on to the next opcode, we find a LOAD_CONST. Basically, it does the same job as LOAD_FAST, except that it loads constants instead of variables (in our case, the constant loaded is the 2 used as exponent). So our stack now contains [12, 8, 2].

The next opcode, BINARY_POWER, does, according to the documentation:
TOS = TOS1 ** TOS

As for the addition, we take the two topmost items, raise the second to the power of the first, and push the result back on the stack, which now contains [12, 64].

The next opcode is BINARY_MULTIPLY, which works similarly to BINARY_POWER and BINARY_ADD, and our stack now contains the product of 12 and 64, that is [768]. Finally, the RETURN_VALUE operation is said to:
Returns with TOS to the caller of the function.

That is, it pops the value on the top of the stack and returns it to the caller. In our case, the answer (768) is effectively returned, and we're done with this example.
There are of course many other opcodes, but their behavior is essentially the same, popping and pushing values and references on the stack. As one can see, interpreting (and even writing) Python bytecode is fairly simple once you get the hang of it.

A last twist before the next step: some opcodes need arguments. This is the case for LOAD_FAST, which needs the index of the variable to grab. This is achieved very simply, by using up to three bytes for each operation. The first byte is the opcode itself, and the two remaining ones can be used for parameter passing. If there is no parameter to pass (as for BINARY_ADD), only one byte is used. The attentive reader will have noticed the curious indexes in the bytecode snippet above: they precisely mark the byte offset of each opcode.
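To make this layout concrete, here is a minimal hand-rolled decoder (a sketch, not how dis is actually implemented) that walks the raw bytes of our function and prints each opcode along with its two-byte, little-endian argument:

import opcode

raw = myMathOperator.__code__.co_code
i = 0
while i < len(raw):
    op = raw[i]
    if op >= opcode.HAVE_ARGUMENT:
        # The argument occupies the two bytes following the opcode, low byte first
        arg = raw[i + 1] | (raw[i + 2] << 8)
        print(i, opcode.opname[op], arg)
        i += 3
    else:
        print(i, opcode.opname[op])
        i += 1

Running it reproduces the offsets and opcode names shown by dis above.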

Great, so we are now ready! Let's write bytecode instead of Python code!

Wait, why?

At this point, a reasonable mind might say: why in the world would you bother to write bytecode, a limited and difficult language, instead of actual Python code? Well, because the compilation procedure is too slow and induces a significant overhead! To this answer, the same reasonable mind would probably reply: who cares about compilation time? It is not so slow (a few tenths of a second in most cases), and, even if it were, it has to be done only once – the next times, you can just reload the *.pyc and go happily!

In most cases, this clever mind would be right. But there is a specific use case which has to be taken into account. Its name is GP, standing for Genetic Programming.

Genetic programming is a hyper-heuristic belonging to the larger family of evolutionary algorithms. Basically, it uses natural selection, mutations and other evolutionary concepts to evolve trees (in the computer science sense). As any program can be represented as a tree, genetic programming is actually able to evolve programs, that is, to learn on its own the best algorithm for a given problem, like getting out of a labyrinth, solving the Rubik's Cube, etc. We do not intend to dive into deeper explanations of GP here, but the interested reader can find introductory and advanced material about it at www.gp-field-guide.org.uk.

As a DEAP contributor and user, I frequently have to work with GP and its DEAP implementation (DEAP is a generic evolutionary algorithms framework in Python, which implements GP among other techniques). Here, the compilation time becomes a problem. GP uses a population of several hundred or even thousands of different programs. This is mandatory to keep a sufficient level of diversity. But in order to see the results produced by each of these programs, we must execute them, and, as they change during the evolution, we have to compile them every time. Consider the next figure, produced with gprof2dot and the Python profile module, which represents the execution of a typical evolution with DEAP. One can see that the compilation time, represented by the gp.compile function in light green, takes almost 50% of the total computation time. In other words, we spend half our computational effort just compiling the programs!

A few solutions have been proposed to address the problem. The compilation of the individuals with LLVM, as shown here, is one of them. Since it compiles down to native x86 code, it indeed reduces the execution time, but at the price of an increase in compilation time, the penalty being more severe if we also add compiler optimizations. Likewise, the use of an optimized Python interpreter like PyPy is not of great help with GP – it actually decreases performance by a factor of 2. In short, this can be explained by the JIT design itself: just-in-time compilers are efficient with long loops and repetitive code segments, since the whole idea is to make up for the compilation time with an important speedup in upcoming executions of the same code. With GP, the code is constantly changing, so the JIT either does nothing (because it detects that it would not be of any use) or deteriorates the processing time by trying to compile every individual at each evaluation. Finally, the use of a heavily optimized C implementation of the basic functions used in the GP program does not help either, since Python still has to compile the high-level code before calling this C implementation!

That's where our bytecode hobby comes on stage. What if we directly evolved Python bytecode? Well (spoiler alert), the results obtained with a preliminary implementation are given in the next figure. It should be noted that we use the exact same problem, configuration and random seed for our standard and optimized implementations. The results are impressive: the compile time is now negligible (less than 1% of the total execution time), and the computation time is divided by more than two (47 seconds vs. 107 for the standard implementation), without altering the results in any way! Note that this is close to the maximum speedup achievable, since the standard compilation procedure took about half of the program execution time; the small additional edge is a side effect of more efficient implementations of other compile-related methods.

Ok, great, so that was easy! Another problem solved!

Wait, how?

In order to understand the intricacies of the solution, we have to learn a bit about how GP is implemented in DEAP. As we said earlier, GP works by evolving trees. Since there is no tree implementation in the Python standard library, DEAP includes one specifically targeted at GP. It works by storing the trees in an underlying list, along with information about how many children each node has. For instance, the mathematical function f(x) = (x-1) + x*x can be represented by the following tree:

and will be stored in the following list:
['add', 'subtract', 'x', 1, 'multiply', 'x', 'x']

Because DEAP knows the number of arguments of each function (or, in genetic programming gibberish, their arity), it can reconstitute the tree. Thereafter, when the evaluation step comes, the following procedure is used:
  1. The tree is converted into its string representation. Basically, this is the program one would write by hand to produce the result of the tree. For instance, the string representation of the previous tree would be add(subtract(x, 1), multiply(x, x)).
  2. This string is passed to eval(). This function is a powerful (and dangerous!) tool which allows executing arbitrary code from its string representation. For instance, the following code would write “Hello world!” to the standard output:
    eval('print("Hello world!")')
    It is in this step that the Python code is actually compiled into its executable form.
  3. The compiled function is returned, so it can be evaluated.
The reader will have understood that the last step is mandatory and cannot be removed, which is why we focus on the first two steps with our bytecode hack.
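For reference, here is roughly what this string-based procedure boils down to (a sketch, not DEAP's exact code; the context dictionary below is a stand-in for the symbol table DEAP maintains):

def add(a, b): return a + b
def subtract(a, b): return a - b
def multiply(a, b): return a * b

context = {'add': add, 'subtract': subtract, 'multiply': multiply}

expr = 'add(subtract(x, 1), multiply(x, x))'  # step 1: the string representation
func = eval('lambda x: ' + expr, context)     # step 2: compilation through eval()
print(func(3))                                # step 3: evaluation -> (3-1) + 3*3 = 11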

First, one useful simplification: the resulting program is merely a sequence of calls to different functions. We never actually have to understand what is going on inside these functions. This means that our implementation does not restrict the choice of functions at all: if it works in Python, it will work with our approach! It also reduces the complexity of our solution, since we only have to deal with a few opcodes. Basically, we will need the following:
  • LOAD_FAST : this will be needed to access our program arguments.
  • LOAD_CONST : similarly, this opcode is required to use constant values (numerical or not).
  • LOAD_GLOBAL : this one will be used to retrieve references to the function objects of our tree. As its name suggests, instead of grabbing a symbol from the local dictionary like LOAD_FAST, it uses the global dictionary.
  • CALL_FUNCTION : obviously, this is mandatory to actually call the functions previously loaded. It takes as its parameter the number of arguments to pass to the function. All these arguments must be on the stack at the moment of the call, with, beneath them, a reference to the function object itself. This particular structure will prove very useful later on.
  • RETURN_VALUE : this one will actually be needed only once, to return the final result value.
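Before going further, we can check that an ordinary nested call indeed compiles down to exactly these opcodes:

import dis

def g(x):
    return add(subtract(x, 1), multiply(x, x))

dis.dis(g)
# LOAD_GLOBAL (add), LOAD_GLOBAL (subtract), LOAD_FAST (x), LOAD_CONST (1),
# CALL_FUNCTION 2, LOAD_GLOBAL (multiply), LOAD_FAST (x), LOAD_FAST (x),
# CALL_FUNCTION 2, CALL_FUNCTION 2, RETURN_VALUE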
Before we can go on with some code, it is time to learn how the functions, constants and arguments are actually described in a code object. The LOAD_* opcodes have an almost common description:
Loads the {global/local/constant} named {co_names/co_varnames/co_consts}[arg] onto the stack

Fair enough, but what are these co_names, co_varnames and co_consts fields? Well, they are simply tuple objects containing all the needed symbols. For instance, if co_names = ('add', 'divide', 'subtract') and we want the subtract function, we will write LOAD_GLOBAL 2, that is, we put the index of the wanted function in the co_names tuple as the operation argument. As explained in the first section, the argument is simply the value of the two bytes following the opcode in the bytecode.
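These tuples are easy to inspect on any ordinary function; for instance, for a function calling a global subtract:

def f(x):
    return subtract(x, 1)

print(f.__code__.co_names)     # ('subtract',) : the globals
print(f.__code__.co_varnames)  # ('x',)        : the local variables
print(f.__code__.co_consts)    # (None, 1)     : the constants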

For the sake of simplicity, we will consider in the following that we already have these tuples. Let's now see how to convert a list representation of our tree into its bytecode representation. The important point is to notice the similarities between the Python bytecode and our list representation. The following figure shows this quite clearly.

As one can see, the conversion is merely an iteration through our list, with the appending of a corresponding LOAD_* for each node, the relative order of the nodes staying the same (for the record, this order is called depth-first). The only tricky part is to add the CALL_FUNCTION opcode after the last argument of each function. For this purpose, we keep a list of the number of arguments of each added node. Whenever we bump into a terminal node (a leaf), we decrement the argument count of its parent node. If this count reaches 0, then we know that we are done with this function, and we add the appropriate CALL_FUNCTION. Using this algorithm, we can write a simplified conversion function.
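Here is a minimal sketch of such a function (not DEAP's actual implementation). It assumes each function node comes as a (name, arity) pair with arity >= 1, and that terminals are found either in varnames (arguments) or consts (constants):

import opcode

def tree_to_bytecode(tree, consts, varnames, names):
    code = bytearray()
    pending = []  # [arguments still missing, total arity] for each open call

    def emit(op, arg=None):
        code.append(opcode.opmap[op])
        if arg is not None:
            code.append(arg & 0xFF)         # low byte of the argument
            code.append((arg >> 8) & 0xFF)  # high byte

    def close_calls():
        # Each value pushed fills one argument slot of the innermost open
        # call; emit CALL_FUNCTION for every call that becomes complete,
        # cascading upwards since a completed call is itself a value.
        while pending:
            pending[-1][0] -= 1
            if pending[-1][0] > 0:
                return
            emit('CALL_FUNCTION', pending.pop()[1])

    for node in tree:
        if isinstance(node, tuple):            # a function node: (name, arity)
            name, arity = node
            emit('LOAD_GLOBAL', names.index(name))
            pending.append([arity, arity])
        elif node in varnames:                 # an argument (terminal)
            emit('LOAD_FAST', varnames.index(node))
            close_calls()
        else:                                  # a constant (terminal)
            emit('LOAD_CONST', consts.index(node))
            close_calls()

    emit('RETURN_VALUE')
    return code  # kept as a bytearray; frozen into bytes later on

For the tree of the previous section:

bytecode = tree_to_bytecode(
    [('add', 2), ('subtract', 2), 'x', 1, ('multiply', 2), 'x', 'x'],
    consts=(1,), varnames=('x',), names=('add', 'subtract', 'multiply'))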
This function takes a list as argument and returns the corresponding bytecode. Note that a bytecode must ultimately be of type bytes, but this type is not suitable for our manipulations, because it is immutable. Therefore, we work with a bytearray, a mutable equivalent of the bytes object. The opcode module contains various tools to assist with bytecode creation. In this case, we use the opmap dictionary, which allows us to write the opcodes in plain text ('CALL_FUNCTION' being far more readable than 0x83).
We now have to make Python understand that it has to execute this bytecode. First, we must create a complete code object with it. While the bytecode obviously plays an important role in a code object, there is plenty of other information we have to give to Python:
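A sketch of this step using types.CodeType; the argument list follows the CPython 3.4 signature, and the concrete values below are assumptions chosen for our single-argument example:

import types

code_obj = types.CodeType(
    1,                                # argcount: the evolved function takes one argument
    0,                                # kwonlyargcount
    1,                                # nlocals: just the argument itself
    100,                              # stacksize: a generous upper bound (see below)
    67,                               # flags: 64 + 2 + 1 (see below)
    bytes(bytecode),                  # our bytearray, frozen into bytes
    (1,),                             # co_consts
    ('add', 'subtract', 'multiply'),  # co_names
    ('x',),                           # co_varnames
    '<gp>',                           # filename (arbitrary)
    'evolved_func',                   # name (arbitrary)
    1,                                # firstlineno
    b'',                              # lnotab (line-number table, unused here)
)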

That's a bunch of information! Most of these parameters are self-explanatory, but a few deserve more explanation. The stacksize parameter controls how many items can be put onto the stack at the same time. For instance, a recursive call will add elements to the stack each time. While its exact value is not really important, one must take care not to exceed it, on pain of segfaulting the Python interpreter! The flags argument controls various aspects of the way the code is handled, yet there is not much documentation about it. The value of 67 (64+2+1) comes from the actual value given by Python to the executable objects in the standard GP implementation.
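One can check these values on any plain function; for our example from the first section:

print(myMathOperator.__code__.co_flags)      # 67
print(myMathOperator.__code__.co_stacksize)  # 3: at most three values on the stack at once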

Now that we have the code object, we must associate it with a function. There are various ways to do it, but one of the most obvious in Python 3 is to use the types module again, this time to create a function object:
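A sketch, where context is the dictionary mapping each symbol of co_names to the corresponding function object (like the one we used with eval earlier):

import types

func = types.FunctionType(code_obj, context)
print(func(3))  # evaluates the evolved tree: (3-1) + 3*3 = 11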
The second argument (a dictionary) is mandatory; it tells Python which function object actually corresponds to each symbol. Fortunately, its generation was already implemented in DEAP, since the same mechanism applies when using the eval function.

Another interesting aspect of this approach is that it allows for easy genetic manipulation, like crossovers or mutations. In a few words, these operations change the content of the tree by modifying or exchanging branches. In the standard list implementation, a branch replacement can be done with a simple slicing operation, since each branch is stored in a contiguous way. Well, that rationale also applies to our bytecode representation! In order to identify a subtree, we just need the position of the LOAD_GLOBAL of its root node, and then look for the corresponding CALL_FUNCTION. These two positions give us the indices needed to construct the slice.
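A sketch of this search (a hypothetical helper, not part of DEAP): it scans forward from the root's LOAD_GLOBAL, counting the calls that open and close, and stops when the root's own CALL_FUNCTION has been consumed:

import opcode

LOAD_GLOBAL = opcode.opmap['LOAD_GLOBAL']
CALL_FUNCTION = opcode.opmap['CALL_FUNCTION']

def subtree_slice(code, start):
    i, depth = start, 0
    while True:
        op = code[i]
        if op == LOAD_GLOBAL:
            depth += 1       # one more call waiting for its CALL_FUNCTION
        elif op == CALL_FUNCTION:
            depth -= 1       # the innermost open call is now complete
        # opcodes with an argument take 3 bytes, the others only 1
        i += 3 if op >= opcode.HAVE_ARGUMENT else 1
        if depth == 0:
            return slice(start, i)

# A crossover is then a plain slice swap between two bytearrays:
# s1, s2 = subtree_slice(code1, p1), subtree_slice(code2, p2)
# code1[s1], code2[s2] = code2[s2], code1[s1]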

The following table and figure show the time required to evaluate trees of different lengths and the relative speedup. The bytecode implementation reaches its optimal speedup around 250 nodes, but provides a considerable gain even for smaller lengths. The speedup value is easily understood if we refer to the evaluation procedure described at the beginning of this section: the first two steps, which were executed in O(n), are now done in O(1). Of course, the result computation itself has not changed, and still has, in our test case, an O(n) complexity. Overall, loosely counting the constants, we went from roughly 3n units of work down to n+2, which is coherent with the observed speedups.

# of nodes   Standard (s)   Bytecode (s)   Speedup
         1     4.2674E-05     2.3882E-05     1.787
         3     5.9971E-05     3.0232E-05     1.984
         7     8.1913E-05     3.6364E-05     2.253
        15     1.5890E-04     7.4422E-05     2.135
        31     2.5088E-04     9.6281E-05     2.606
        63     4.5313E-04     1.5996E-04     2.833
       127     8.6494E-04     2.9974E-04     2.886
       255     1.6832E-03     5.0897E-04     3.307
       511     3.3082E-03     1.0151E-03     3.259
      1023     6.5782E-03     1.9996E-03     3.290
      2047     1.3001E-02     3.8225E-03     3.401
      4095     2.6315E-02     7.8760E-03     3.341
      8191     5.4996E-02     1.5786E-02     3.484
     16383     1.0808E-01     3.1520E-02     3.429
     32767     2.1563E-01     6.2349E-02     3.459
     65535     4.3286E-01     1.2690E-01     3.411

Conclusion

We have presented a simple way to speed up a genetic programming evolution by directly evolving Python bytecode, without intermediate representations. This roughly halves the computation time needed to perform an evolution, which is non-negligible with real-world problems. This is especially important considering that, like most stochastic methods, evolutionary algorithms need to be run at least a couple of times to produce statistically significant results. The complete code is available at github.com/mgard/deap. However, it should be noted that this code includes many hacks to make it fully compliant with the DEAP API (so the average user does not have to worry about which tree implementation he is using), and is still in development to further improve performance and reliability – one funny characteristic of our approach is that it goes so deep into Python internals that it can actually segfault the interpreter whenever an error occurs...

If the clever mind from section 3 has not run away already (which would probably be the reasonable thing to do instead of reading an article describing a weird and unclean optimization technique targeting a very narrow topic of an already narrow field of a specific language), he might notice one interesting thing in the second profiling figure. The evaluation itself now takes about two thirds of the total computation time. Moreover, even basic arithmetic functions like multiply or divide are called so often (almost 100 million times!) that they take up to 20% of the total execution time! That's clearly where we should focus in order to further improve performance. But how could we do it?

Well, in this post, we had fun with Python bytecode. Maybe it's time to take the next step, and use a different type of bytecode, even more low-level. But that will be the subject of another story...